Xöömei for linguists
How does a soloist sing a duet?

Jason Wells-Jensen, Kent State University
Adapted from a Linguistics Brown Bag talk
at Bowling Green State University
October 20, 2006

Disclaimers and Acknowledgements

I think it's important to clarify from the outset that these notes do not represent original research, and in fact virtually none of the ideas or analysis presented here or in my BGSU talk are original; my goal is simply to present a basic introduction to overtone singing beginning from the perspective of linguistic phonetics, and ideally to demystify the phenomenon by showing its direct relationship to the articulation and perception of ordinary speech. [continued below]


There is a class of vocal techniques known collectively as harmonic singing, overtone singing, or xöömei - a Mongolian word, usually translated as "throat singing" in English. The singer emphasizes or reinforces certain harmonics or overtones in the voice, while muting others, with the result that he or she seems to be singing two or more distinct pitches at the same time.

I'll be drawing specific examples mainly from two distinct Tuvan styles of xöömei called sygyt and kargyraa. Tuva, by the way, is part of the Russian Federation and is located in southern Siberia, on the Mongolian border; it contains a monument purporting to mark the geographic center of Asia. I happen to be from North Dakota, which not only resembles Tuva in terms of size, population, and climate, but also contains a monument marking the center of North America!

For the ethnomusicologists and other scholars out there: I'm afraid I can't begin to do justice to the real "musical" traditions of Central Asia (not to mention the anthropological or sociopolitical situations), so I won't even try. Also, although I will go off on some tangents, I won't really talk about some obviously related topics like Tibetan Buddhist harmonic chanting, or Euro-American approaches to overtone singing (e.g. David Hykes or Karlheinz Stockhausen).

Here are two musical staves illustrating the approximate pitch ranges of the two Tuvan styles I'll be talking about. In the sygyt style, the fundamental drone generally lies somewhere in the comfortable middle range of the voice, while the 'melody' consists of whistle-like reinforced harmonics several octaves above that. Kargyraa style employs a characteristic low growling tone, produced by other soft tissues in the vocal tract vibrating sympathetically an octave below the frequency of the vocal folds. This, among other things, gives the singer more harmonics to work with in a given range of frequencies, for reasons that should become somewhat clearer later on.

Musical staff showing harmonic series and corresponding 'sygyt' scale
Musical staff showing harmonic series and corresponding 'kargyraa' scale

My main point is going to be this: Overtone singing involves using universal cognitive, physical, and auditory resources - although it combines them in a relatively unusual way. It seems exotic and foreign at first, but there's really nothing unearthly about it. I'm sure, like most things, it's easier to learn if you've been immersed in the relevant culture from birth. But there's nothing to stop an individual with talent, interest, and dedication from learning to do it.

For example, here's a clip from a radio interview (excerpted from "Brownian Motion" on KZSU radio) with voice actor Billy West, a native of Detroit. He replaced Mel Blanc as the voice of Bugs Bunny and Porky Pig, and he has also done many voices on TV shows like Ren and Stimpy and Futurama.

Billy West interview (He talks about Tuvan throat singing and the voice of Popeye in track 5)

Billy West clearly knows his business as a voice artist and entertainer. But I'd like to give you a slightly fairer representation of the film, Genghis Blues, and of Tuvan music and culture. The film was nominated for an Oscar as best documentary in 2000, and the adventure that it documents was inspired and made possible in part by the late physicist and Nobel laureate Richard Feynman, who had collected Tuvan postage stamps as a child, and who founded the "Friends of Tuva" with his close friend Ralph Leighton.

During my talk at BGSU, I showed several clips from Genghis Blues; I recommend seeing the whole movie if you have a chance.

The main characters are Tuvan People's Singer Kongar-ol Ondar and Paul "Earthquake" Pena, who died in October 2005. Pena was an American blues musician and the composer of Steve Miller's hit song "Jet Airliner" - and one of the first Americans to learn Tuvan throat-singing, in the mid-1980s. The movie focuses on his trip to Tuva in 1995 to participate in the national xöömei symposium.

By the way, some of you might be particularly interested in the sociolinguistic issues illustrated here; the Tuvan language is Turkic (with agglutinating suffixes and vowel harmony), but Tuva was for many decades part of the Soviet Union, and is now part of the Russian Federation, so there is a great deal of Russian spoken in the film as well, especially in "official" contexts such as the introductions at the symposium.

Paul Pena had learned as much of the Tuvan language as he could under the circumstances (using Tuvan-Russian and Russian-English phrasebooks), but still had to interact mainly through Russian interpreters during his trip. He was also blind and on medication for depression, which made some parts of the adventure even more complicated and dramatic. This film is not generally in stock at Blockbuster, but it's worth tracking down - try Interlibrary Loan!

Of course, one of the reasons this story was interesting enough to make a film about is that Paul Pena was not a traditional Tuvan throat-singer; Ted Levin's Where Rivers and Mountains Sing DVD and Steve Sklar's khoomei.com web site contain some samples of real Tuvan musicians who exemplify really masterful throat-singing.

One of the best examples of multiple styles on khoomei.com is the "Orphan's Lament" video clip featuring Kaigal-ool Xovalyg of Huun-Huur-Tu.

Ted Levin emphasizes the notion that until very recently throat-singing had developed among nomadic herders as a private activity -- a kind of "whistle while you work" phenomenon, perhaps, as opposed to the performance of music as "art" or as an "entertainment product". Of course, many contemporary Tuvan singers (including Huun-Huur-Tu) do pursue their craft in the context of both art and entertainment, but those concepts historically were largely imposed from the outside.

This makes me think back to Billy West, who, as a voice actor, is really someone I think of as an artist, but his "art" has no sanctioned venue in our culture except in the context of "entertainment".

Also, Levin points out that overtone singing need not necessarily be a "musical" activity - it is also part of a larger culture of what he calls "sound mimesis": the imitation and representation of sounds from the environment.

Traditionally, it is said that women did not throat-sing; however, Genghis Blues contains footage showing both boys and girls studying at Ondar's arts academy, and Levin's Where Rivers and Mountains Sing DVD includes clips of an all-female Tuvan ensemble called Tyva Kyzy as well as a female Altai singer, Raisa Modorova, who sings in a style similar to kargyraa.

So, some of you are still wondering, how is it done?

Again, my main point is basically this: After you scratch the surface, lots of things about throat-singing look familiar, both from the musical point of view and the linguistic-phonetic.

The Source-Filter Model

Vibrating objects tend to vibrate at many frequencies simultaneously -- the "fundamental" or lowest frequency (F0 = "F zero") is strongly correlated with the perceived pitch. For strings (including, loosely speaking, the vocal folds) the higher frequencies tend to be nearly integral multiples of F0. In most cases, these "overtones" or "harmonics" are much lower in amplitude than the fundamental, and are perceived as part of the "timbre" or "color" or "quality" of the overall sound -- not as a separate pitch component. So, the truth is that we're all singing lots of different notes simultaneously all the time - we just don't hear it that way!

As the source wave passes through a filter (in this case, the vocal tract), some frequencies are passed along or amplified while others are muted, depending on various aspects of the size and shape of the chamber. The resonant frequencies -- those frequencies at which acoustic energy is passed along or amplified, or at least not muted -- are called formants.
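To make the source-filter idea a bit more concrete, here is a minimal Python sketch. Nearly everything numeric in it is an assumption chosen for illustration: the formant values are rough textbook figures for an adult male voice, the 100 Hz bandwidth and the single-peak resonance shape are idealizations, and the 1/n roll-off of the source harmonics is only a crude stand-in for a real voice source. The function names are hypothetical.

```python
import math

# Rough textbook values (Hz) for the two lowest formants of an adult male
# voice. These are illustrative assumptions, not measurements from this talk.
FORMANTS = {"i": (270.0, 2290.0), "a": (730.0, 1090.0), "u": (300.0, 870.0)}


def resonance_gain(freq, center, bandwidth=100.0):
    """Gain of one idealized resonance peak: 1.0 at its center, less elsewhere."""
    return 1.0 / math.sqrt(1.0 + ((freq - center) / (bandwidth / 2.0)) ** 2)


def filter_gain(freq, formants):
    """Treat the vocal tract as a set of peaks and take the strongest one."""
    return max(resonance_gain(freq, center) for center in formants)


def output_spectrum(f0, formants, max_freq=3000.0):
    """Map harmonic number -> output amplitude (source amplitude x filter gain)."""
    spectrum = {}
    n = 1
    while n * f0 <= max_freq:
        source_amplitude = 1.0 / n  # assumed roll-off of the voice source
        spectrum[n] = source_amplitude * filter_gain(n * f0, formants)
        n += 1
    return spectrum


spectrum = output_spectrum(100.0, FORMANTS["a"])
strongest = max(spectrum, key=spectrum.get)
print("strongest harmonic:", strongest, "at", strongest * 100.0, "Hz")
```

With these made-up numbers, the seventh harmonic (700 Hz) comes out strongest for [a] on a 100 Hz source, because it sits closest to the lowest formant, even though the source itself is strongest at the fundamental.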

The vocal tract, of course, can change its size and shape -- you can open and close your jaw, move your tongue around, constrict or open your throat, etc. -- which allows continuous modulation of timbre. This makes human speech possible, because the difference between one vowel sound and another is quite simply a difference in vocal timbre or sound color -- and remember, "timbre" in turn is really the result of a combination of frequencies.

The late phonetician Peter Ladefoged set up a nice tidy speech synthesis demonstration that allows us to listen independently to (synthesized versions of) individual frequency components of speech, and then hear them put together.

We can see this fairly clearly by comparing an audio recording with a spectrogram, where the horizontal axis represents time, the vertical axis represents frequency, and darkness represents amplitude - here, then, the formant frequencies show up as characteristic dark bands. The two lowest formants (usually labeled "F1" and "F2") are generally the most important for distinguishing vowels. The F1 frequency is related to jaw opening, while F2 is related to the forward and backward movement of the tongue body.
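One column of a spectrogram is just the magnitude spectrum of a short slice of audio. As a rough sketch, assuming a synthetic three-harmonic signal rather than a real voice, and using a deliberately naive discrete Fourier transform (real analysis tools use a windowed FFT), this is roughly what gets computed for each time slice. The sample rate, frame length, and amplitudes are arbitrary choices that make the harmonics land exactly on analysis bins.

```python
import cmath
import math


def synthesize(f0, amplitudes, sample_rate, n_samples):
    """Sum of sine harmonics of f0; amplitudes[0] belongs to the fundamental."""
    return [
        sum(
            amp * math.sin(2 * math.pi * (k + 1) * f0 * t / sample_rate)
            for k, amp in enumerate(amplitudes)
        )
        for t in range(n_samples)
    ]


def magnitude_spectrum(samples, sample_rate):
    """Naive DFT: list of (frequency in Hz, magnitude) below the Nyquist limit."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        spectrum.append((k * sample_rate / n, 2 * abs(coeff) / n))
    return spectrum


# One "time slice": a 200 Hz tone with two weaker overtones.
frame = synthesize(f0=200, amplitudes=[1.0, 0.5, 0.25], sample_rate=8000, n_samples=400)
column = magnitude_spectrum(frame, 8000)

# Keep only the bins with real energy; these would be the dark spots
# in a single column of the spectrogram.
peaks = [(freq, round(mag, 3)) for freq, mag in column if mag > 0.1]
print(peaks)
```

Stacking many such columns side by side, one per time slice, gives the time-by-frequency picture described above.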

The essence of formants, from one point of view, is this: For a given combination of jaw and tongue position (among other things), there are frequencies at which your vocal tract likes to vibrate, and other frequencies at which it doesn't -- this is theoretically independent of your vocal fold frequency.

Here, for example, are the vowels [i a u], pronounced by a human subject (who shall remain nameless), each spoken with relatively normal falling intonation. As we will see later, the dark formant bands for these three vowels appear fairly consistently in these patterns, even when the subject is doing quite different things with his voice source.

Spectrogram of vowels [iau] spoken with falling intonation

We usually perceive a given vocal formant pattern as a particular vowel, regardless of the actual frequencies of vibration in the source wave. As I mentioned earlier, F0 and overtones of the vocal fold vibrations are (theoretically) independent of the characteristic resonances or formants of the vocal tract -- but in real speech or singing they have to work together. If the source contains vibrations at a frequency transmitted by the filter, great! But if not, that vibration will be inaudible.

Here are more samples of Speaker X, pronouncing the vowels [i a u] with both falling and rising intonation. In the spectrogram, we can see the individual harmonics falling and rising, while the overall resonance pattern for each vowel is relatively constant. When the harmonic rises or falls out of a formant band, it becomes inaudible, but in most cases another harmonic comes along to take its place.

Spectrogram of vowels [iau] spoken with falling intonation
Spectrogram of vowels [iau] spoken with rising intonation
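The way harmonics drift in and out of a fixed formant band as the pitch glides can be sketched in a few lines. This is only an illustration under stated assumptions: the band (730 Hz, plus or minus 50 Hz) and the F0 values are invented, a real formant has soft edges rather than a hard cutoff, and the function name is hypothetical.

```python
import math


def harmonics_in_band(f0, center, half_width):
    """Return the harmonic numbers n whose frequency n * f0 lies inside the band."""
    lowest = max(1, math.ceil((center - half_width) / f0))
    highest = math.floor((center + half_width) / f0)
    return list(range(lowest, highest + 1))


# A falling pitch glide, checked against one fixed (assumed) formant band.
for f0 in (160, 150, 140, 120, 100):
    print(f0, "Hz ->", harmonics_in_band(f0, center=730, half_width=50))
```

With these numbers, no harmonic falls inside the band at 160 Hz; as the pitch drops, the fifth harmonic enters the band, then the sixth replaces it, and by 100 Hz the seventh has taken over. The band itself never moves.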

Here is a spectrogram of the same subject singing these vowels on a simple melody. Here we can see that, even though the vocal frequencies are changing dramatically -- with more closely spaced harmonics for the low notes and fewer, more widely spaced harmonics for the high notes -- the formant pattern for each vowel is still basically constant.

Spectrogram of vowels [iau] sung in 'normal' voice

Finally, here is the same poor subject singing these three vowels in kargyraa voice, where the harmonics are even more closely-spaced (because the fundamental is an octave lower) and the formant bands are even more distinct. There are really two sound sources in kargyraa singing: the fundamental vibration of the vocal folds is supplemented by facilitating the sympathetic vibration of other soft tissues in the vocal tract, typically the ventricular or "false" vocal folds. It's relatively easy to compel these sympathetic vibrations to occur at about half the frequency of the vocal fold vibrations, yielding the "Popeye" effect we heard Billy West talk about earlier.

Spectrogram of vowels [iau] sung in 'kargyraa' voice
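The claim that an octave-lower fundamental gives the singer more harmonics to work with is simple arithmetic, sketched here. The two fundamentals (130 Hz and 65 Hz) and the 500 to 2000 Hz window are assumptions picked for illustration, and the function name is hypothetical.

```python
import math


def count_harmonics(f0, low_hz, high_hz):
    """Count the harmonics of f0 that fall between low_hz and high_hz inclusive."""
    first = max(1, math.ceil(low_hz / f0))
    last = math.floor(high_hz / f0)
    return max(0, last - first + 1)


normal = count_harmonics(130.0, 500.0, 2000.0)    # ordinary voice
kargyraa = count_harmonics(65.0, 500.0, 2000.0)   # fundamental an octave lower
print(normal, kargyraa)
```

Halving the fundamental halves the spacing between harmonics, so roughly twice as many of them fit into the same stretch of frequencies.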

Getting back to the harmonic series - although the overtones of a given fundamental frequency are integral multiples (e.g. 100 Hz, 200 Hz, 300 Hz, etc.), humans perceive musical pitch according to a logarithmic transformation of frequency -- meaning we perceive the lower harmonics as further apart and the higher harmonics as closer together.
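Here is a small sketch of that logarithmic relationship, assuming the usual convention of 12 equal-tempered semitones per octave (a doubling of frequency). The function name is hypothetical.

```python
import math


def interval_in_semitones(n):
    """Musical distance from harmonic n up to harmonic n + 1."""
    return 12 * math.log2((n + 1) / n)


# Adjacent harmonics are equally spaced in Hz but shrink as musical intervals.
for n in range(1, 13):
    print(f"harmonic {n:2d} -> {n + 1:2d}: {interval_in_semitones(n):5.2f} semitones")
```

The first step is a full octave (12 semitones), the next is about a fifth, and by the time we reach the range an overtone singer actually uses, neighboring harmonics are only a step or so apart.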

So here's what a harmonic series looks like on a musical staff -- this will look familiar to brass players, since it's essential to how their instruments work. It establishes the "scale" that's playable without moving the slide or changing valves. Similarly, some portion of the harmonic series constitutes the scale an overtone singer can use with a given fundamental. In this sense, throat singing is very much like playing an instrument.

Musical staff showing harmonic series of C (130 Hertz)

The actual usable range within the harmonic series depends on the combination of the fundamental itself and the dimensions of the individual instrument (or vocal tract) -- i.e. how low and how high can your formants go? Generally, a singer chooses a fundamental whose 6th to 13th harmonics are within the range of a formant -- yielding a very familiar pentatonic scale (from harmonics 6, 8, 9, 10, and 12).
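To see where the pentatonic scale comes from, this sketch names the nearest equal-tempered note for harmonics 6 through 13 of a C fundamental, along with how far off each one is in cents (hundredths of a semitone). Comparing against equal temperament is my own assumption for illustration, and the helper name is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_note(n):
    """Nearest equal-tempered note to harmonic n of a C fundamental.

    Returns (note name, deviation in cents); positive means the harmonic is sharp.
    """
    semitones = 12 * math.log2(n)  # distance above the fundamental
    nearest = round(semitones)
    cents = 100 * (semitones - nearest)
    return NOTE_NAMES[nearest % 12], cents


for n in range(6, 14):
    name, cents = nearest_note(n)
    print(f"harmonic {n:2d}: {name:2s} {cents:+.0f} cents")
```

Harmonics 6, 8, 9, 10, and 12 land within about 15 cents of G, C, D, E, and G, which is a pentatonic scale. Harmonics 7, 11, and 13 fall much further from any equal-tempered note.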

Comparing different vocalizations: Source vs. filter characteristics

Now that we've examined some speech spectrograms, and looked at the harmonic series in schematic terms, here's a spectrogram of some amateur kargyraa singing by Speaker X. This actually looks pretty familiar, because kargyraa is articulated a lot like speech -- and, in fact, we already know you can talk in kargyraa voice: Billy West, as Popeye, uses a relatively high-frequency, tense type of kargyraa voice, and Frank Oz seems to use a somewhat more relaxed, lower-frequency ventricular vibration for the voices of characters like Cookie Monster, Grover, and Yoda.

Spectrogram of 'kargyraa' singing

In kargyraa singing, as in this example, F1 and F2 usually move in parallel octaves, while the two static "fundamental" tones are also an octave apart, so listeners may only perceive a single very low fundamental and a single melody above it.

Again, F1 is mainly determined by jaw aperture [aka vowel "height", in phonological terms] and F2 by tongue advancement [aka "backness"] ranging from a high back vowel articulation [u] for lower notes to a low front vowel [æ] for the highest notes. When kargyraa is done skillfully, of course, the articulations are more precise than speech, the formant bands are narrower, and it's easier to perceive the harmonics as a separate melody rather than merely changes in tone color or vowel quality.

Sygyt, although it works according to the same essential principles, is very much unlike articulate speech in some ways. You start with basically the same vocal source (with a lot of laryngeal tension), but the filtering is all done from an extremely narrow range of un-vowel-like tongue positions, most of which could be described phonetically as a palatal lateral (e.g. Italian "gl"). And the result is a single, very loud, formant. I'm afraid I really can't tell you much else about the specific articulation; it mainly involves very subtle movements of the back of the tongue and the lips, while the tongue tip remains stuck to the palate. All I can suggest is that you give it a try!

Spectrogram of 'sygyt' singing


One of the areas where throat-singing really casts things in a new light, I think, is the perceptual boundary between "timbre" and "pitch" -- yes, there are some psychoacoustic thresholds, but it's clear that these can be influenced by "training" and by context (cultural, linguistic, and/or musical). We can, for example, learn to listen better to xöömei; with practice, it becomes easier and easier to perceptually separate the overtone melody from the fundamental drone.

One of Ted Levin's key points is that Tuvan music (or rather Tuvan "sound culture") is much more concerned with timbre than Western music, and consequently less focused on melody and harmony. Ironically, Tuvan singers have developed timbral manipulation to such a virtuosic level that it's led to new ways of making melodies!

To be fair, though, the Tuvans are not really alone in their manipulation of overtones. Skilled singers (and mimics) always manipulate the filtering of overtones, whether consciously or not, since this is the primary means of controlling vocal quality. Opera singers can project their voices over an orchestra through selective filtering of overtones, and barbershop quartets "ring" their chords by carefully matching the overtones from four different voices.

To this, Tuvan throat-singers add a distinctive laryngeal tension, which helps emphasize higher frequencies at the expense of the fundamental, and kargyraa singers supplement the fundamental vibration of the vocal folds with "sub-harmonic" vibrations. I mentioned earlier that timbre is the result of combinations of frequencies or "pitches" that are not perceived as such. Both of these special throat-singing techniques help facilitate the transformation of timbre back into audible pitch.

OK, here's the big philosophical leap! Ultimately, all conceptual categories, no matter how grounded in nature they seem to be, are conditioned and altered by language and culture and context -- to grossly oversimplify, our experience always makes us "hear things that aren't there, and fail to hear things that are"! In linguistics, we know this is true of textbook phenomena like phonological tone and intonation, as well as the categorical perception of consonants and vowels, and many kinds of unconscious grammatical "rules". And as we've seen today, it's also true of musical notions like timbre and pitch.

So I think the most important lesson of xöömei is a pretty universal one: that learning to listen with a new set of ears can help us hear more of the things that were really always there.


Appendix: Hollywood Xöömei

Throat-singing (or evocatively similar sounds) can be heard in

Disclaimers and Acknowledgements continued

I might compare my approach to that of my daughter's Suzuki violin recitals: The very youngest and newest students demonstrate how to hold the violin properly and how to bow politely, while slightly more experienced students might demonstrate simple rhythms or play variations on "Twinkle, Twinkle, Little Star". The literature becomes progressively more difficult as the students become more proficient, and the structure of the group recital reflects this. One of the possible outcomes, both for members of the audience and for the students themselves, can be to make musicianship seem more like an ordinary developmental process, and to debunk the notion that musicianship fundamentally requires a special "gift" that most of us lack.

When it comes to xöömei, I am like the child who has just learned how to hold the violin. By consulting some of the resources mentioned below, you can encounter more advanced students, but I will do my best to seem reasonably knowledgeable, entertaining, and enlightening, and I hope this brief presentation will help to demystify diphonic singing by demonstrating that "ordinary people" can, in fact, begin to learn how to do it.

On the other hand, I did say a few clever and insightful things in my talk that haven't yet been (and may never be!) transcribed here, so if you're interested in hearing an audio recording of my talk (minus the copyrighted excerpts from other people's DVDs), let me know via the e-mail address above.

For valuable insights, inspiration, and last-minute copyright permissions for my talk at BGSU, many thanks to the following ...
