[TES 2012 Keynote]
Encounters in the Republic of Heaven
As a composer of electroacoustic music, I have a particular interest in the human voice. When performing as a solo free improviser, I explore the outer reaches of my own voice, using only amplification. The electroacoustic work Tongues of Fire takes as its source material a short fragment of such an improvisation. I’ve also developed new methods of notating extended vocal sounds and used these to make fully notated scores for performance, such as Anticredos and the VOX cycle of vocal works for professional performers. In the studio I’ve concentrated on developing software tools and musical approaches for organising sounds collected from the real world — traffic, birdsong and, in particular, human speech and other utterances. The signal processing software is written in “C” and forms the core of the Composers Desktop Project (CDP) suite of programs, while the Sound Loom, its independent graphic interface, is written in Tcl/Tk. More details about my approaches to both sound processing and musical organization can be found in the books On Sonic Art (1996), Audible Design: A Plain and Easy Introduction to Sound Composition (1994) and the recently published Sound Composition (2012).
Electroacoustics and the Voice
The first software tools I wrote were concerned with morphing one recognizable sound into another. Developed at IRCAM in the 1980s, these tools took recordings of time-extended speech syllables (e.g., “zz”) and morphed them into similar environmental sounds (in this case, a swarm of bees), interpolating between the amplitudes and frequencies of the data in the Phase Vocoder representations of the sound spectra.
Whilst exploring these possibilities, I also discovered alternative ways to achieve convincing sound morphs from vocal sounds. For example, the vocal syllable “ko→u” can be morphed gradually into a bell sound through a series of intermediate steps. At each stage the spectrum of the source is stretched further in frequency in a non-linear way so that the simple relationships between the partial frequencies of the original sound (they are multiples of the fundamental frequency) are gradually negated and the spectrum becomes increasingly inharmonic. The bell-like result also depends on the specific morphology of the vocal syllable beginning with a brief broad-band attack (“k”) which will eventually morph into the clang of the bell, settling on a steady pitch (“o”) from which the high frequencies are gradually filtered out by narrowing the vocal cavity (“o→u”). This mimics the way in which the high frequencies of a bell’s resonance decay more rapidly than the low frequencies, as they are more quickly absorbed by the environment.
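The gradual move from a harmonic to an inharmonic spectrum can be pictured as a warp applied to the partial numbers. The following is only an illustrative sketch, not the CDP process: it assumes a hypothetical power-law warp, f_k = f0 * k**stretch, in which a stretch of 1 leaves the partials at whole multiples of the fundamental and larger values spread the upper partials progressively further apart. The function names and the choice of warp are assumptions made for the example.

```python
def stretched_partials(f0, n_partials, stretch):
    """Partial frequencies under an assumed power-law warp.

    stretch == 1.0 gives the harmonic series (k * f0); larger values
    push the upper partials away from whole multiples of f0.
    """
    return [f0 * (k ** stretch) for k in range(1, n_partials + 1)]


def morph_stages(f0, n_partials, final_stretch, n_stages):
    """Spectra for each step from harmonic (stage 0) to fully stretched."""
    stages = []
    for i in range(n_stages + 1):
        # Interpolate the stretch exponent linearly across the stages.
        stretch = 1.0 + (final_stretch - 1.0) * i / n_stages
        stages.append(stretched_partials(f0, n_partials, stretch))
    return stages


# Five partials of a 110 Hz voice, stretched in four steps.
stages = morph_stages(f0=110.0, n_partials=5, final_stretch=1.2, n_stages=4)
```

In this sketch the fundamental stays fixed at every stage, while each higher partial drifts upwards by an amount that grows with its partial number, so the simple whole-number ratios are lost step by step.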
However, such processes take a certain perceptual time to evolve (so that we have sufficient time to recognize both the source and the goal, as well as the morph between them) and are not directly appropriate to dealing with the rapidly changing spectra of normal speech.
Two particular aspects are worth emphasizing about working with speech recordings. In general, creating electroacoustic music for spoken voices differs in kind from traditional scored vocal music. For notated musics (whether it’s Renaissance vocal music or opera, for example) there’s a performance tradition that determines the kind of vocal sonority or articulation that you can expect. Individual performers will bring slightly different qualities to the interpretation, but there’s a consistency of vocal quality that you can rely upon. In this situation the voice aspires to the condition of an “instrument”, something with a reliable, reproducible timbre. In contrast, amplified popular vocalists — Elvis, Mick Jagger, Björk, Janis Joplin, Lily Allen — all trade upon the unique quality of their vocal production to market their particular brand of vocality. Once we begin to deal with recorded speech in the electroacoustic domain, we are much closer to the popular music model than the classical notated tradition of composing for the voice; we are faced with the sound of a unique human being and the particular quirks of the recorded materials we have collected.
Secondly, when words and music meet in the traditional context, we are usually concerned with “setting” the words to the music — providing some appropriate sequence of pitches to complement the sequence of timbres (and meanings) provided by the text. In contrast, what interests me is uncovering the musical features inherent in spoken language and using these as the basis of sonic organization.
Previous Approaches to Working with the Voice
In a previous work, Globalalia (2004), I decided to organize the materials at the level of the syllable. Poetically speaking, the piece is concerned with what we have in common as human speech communicators. Although there are many millions of words in all the world’s languages, and one language may be incomprehensible to the speakers of a different language, all these languages are built from a much smaller set of sounds, the syllables. So Globalalia is a celebration of human speech through this shared vocabulary of sound objects.
I began by asking colleagues around the world to collect speech from local radio stations. In addition, a friend who is a language teacher had access to a worldwide array of broadcast material via her two satellite TV dishes. After this collection process, I had accumulated sources in 26 different languages and proceeded to cut these into their constituent syllables. The editing process is not as straightforward as it appears because, in real speech, syllables are part of a continuous speech stream and flow into one another. Eventually I devised a program which took into account the slight overlap between syllables when editing them apart but (as is often the case) only perfected this program when I had almost finished the task of dissecting my material. This resulted in a set of over 8300 source sounds.
To help organize this material I created a musical database for the Sound Loom, in which arbitrary properties (anything from the generally agreed “pitch” or “duration” to the personally defined “fuzzy” or “I like this”) can be assigned user-defined values of any type (numeric, verbal, codings, etc.), together with tools to search the sources for specific materials. For Globalalia, I used the properties:
- original language;
- start consonant or consonant cluster (e.g., “skr”), if any;
- vowel or vowel glide (e.g., a→oo);
- end consonant or consonant cluster, if any;
- pitch or pitch glide;
- vocal quality (e.g., shouted, raspy, breathy, etc.).
I was then able to interrogate the database to gather together sounds with specific properties, e.g.: all syllables beginning with “m” and the vowel “a”, in a particular pitch range, with gliding pitch.
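A query of this kind can be sketched in a few lines. This is not the Sound Loom database itself (which is written in Tcl/Tk); the records, the field names and the use of MIDI note numbers for the start and end of each pitch glide are all assumptions invented for the illustration.

```python
# Hypothetical records; field names mirror the properties listed above.
# "pitch" holds assumed (start, end) MIDI note numbers for the syllable.
SYLLABLES = [
    {"file": "syl_0001", "language": "Dutch", "start": "m", "vowel": "a",
     "end": "", "pitch": (57, 60), "quality": "breathy"},
    {"file": "syl_0002", "language": "Japanese", "start": "m", "vowel": "a",
     "end": "", "pitch": (62, 62), "quality": "shouted"},
    {"file": "syl_0003", "language": "English", "start": "skr", "vowel": "i",
     "end": "m", "pitch": (55, 58), "quality": "raspy"},
    {"file": "syl_0004", "language": "Hindi", "start": "m", "vowel": "a",
     "end": "n", "pitch": (70, 74), "quality": "breathy"},
]


def search(entries, start, vowel, pitch_range, gliding):
    """Return the files whose properties match every requested value."""
    low, high = pitch_range
    hits = []
    for entry in entries:
        first, last = entry["pitch"]
        in_range = low <= min(first, last) and max(first, last) <= high
        glides = first != last
        if (entry["start"] == start and entry["vowel"] == vowel
                and in_range and glides == gliding):
            hits.append(entry["file"])
    return hits


# All syllables beginning with "m" and the vowel "a", in a particular
# pitch range, with gliding pitch.
matches = search(SYLLABLES, "m", "a", pitch_range=(50, 65), gliding=True)
```

Of the four invented records, only the first satisfies all four conditions; the second has a steady pitch, the third a different onset and vowel, and the fourth lies outside the pitch range.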
The syllables have differing musical properties. For example, the syllable “ma” can be time-stretched as a whole to “mmmmmmaaaaaaa” without losing its perceived recognizability, whereas the syllable “ka” cannot be similarly time-stretched as the “k” is a transient sound (a rapid change of state) and is destroyed by the time-stretching process, and the iterative (rolled) “rr” is even more problematic. This suggests different musical processes might be appropriate to composing with the different sound objects. Hence Globalalia is essentially a set of studies worked on different sets of syllables (e.g., “ma” and Dutch “rrr”, or the attacked sibilants “ts”, “ks” and “ps”) and is bound together in a rondo-like structure by a thematic utterance constructed from a wide variety of syllables taken from many languages and individual speakers, which recurs, with variation, at key points in the piece. This may sound a little dull, but this is not the case — listen to the following example (the “pi”, “pa”, “bo” etc. study).
The slightly humorous plucked-string-like sounds towards the end result from time-stretching the very short vowels of these syllables, which are unstable in pitch, so that we hear gliding tones once they are time-stretched.
With each musical project I like to tackle some new technical challenge, as I enjoy this aspect of composing. In Globalalia, the main technical innovation was a set of programs to time-stretch vocal iteratives, like rolled “rr” sounds or vocal grit. The technical problem arises from the fact that iteratives are sequences of attacked events. As we’ve discussed, the “k” in “ka” or the “p” in “pa” cannot simply be time-stretched if we are to preserve their “k”‑ness or “p”‑ness, but we could use a time-stretching function that preserved the initial “k” or “p” part of the sound but then time-stretched the “a” tail. This is what is happening in the plucked-string-like sounds above.
Unfortunately, with iteratives we have a whole set of attacks and would need to define a stretching function which varied with every attack in the source. Also, we would then end up with a sound like “r‑‑‑r‑‑‑r‑‑‑r‑‑‑r‑‑‑”, a series of widely time-separated tongue flaps, which is not really what we want. Rather we want to convert “rrrr” into “rrrrrrrrrrrrrrrr.” Superficially it would seem obvious merely to form a loop of our short set of tongue flaps and repeat the material. Unfortunately, perceptually speaking, this does not work. In fact, when I first did this and played back the resulting sound, I thought I had played the wrong sound as the result was utterly unlike the source — it sounded self-evidently synthetic. It seems that the human brain has a very good exact-repetition detector which causes it to categorize sounds, quite spontaneously, as “unnatural” or “implausible”, so our new sound is not heard as related to our natural source. The tongue flaps in a natural rolled “r” are all of slightly different loudness, slightly irregularly timed and slightly different in sound quality. To create a plausible extension of such a sound we need to preserve this subtle randomness.
So the new process first searches for the attacks in the sound (which repeat approximately every 50 milliseconds) with an appropriately sized envelope-tracking window (about three times smaller than the typical flap duration). This gives us an indication of where the tongue flaps are. The process then searches away from the peaks of the flaps to find the minimum energy troughs between the events and finally cuts the flaps apart at (identically oriented) zero crossings. (The zero cuts avoid introducing splices into the sounds, which might subtly alter the sonority at this time scale). We then reconstruct our source, using random permutations of this edited set of flaps. For example, with flap sequence abcd, we generate random permutations of the order — e.g., dbac, cadb, etc. — and then join these together avoiding repetitions between permutations — e.g., abcd-cadb-dbac is fine, but abcd-dbac-cabd is not, as it repeats flap d at the first join and flap c at the second. In this way, we preserve the randomness of amplitude, timing and sonority of the original and the resulting sounds are completely plausible — they appear to be source recordings themselves. In the following example, the first sound of each set is the source and the ensuing sounds are derived via this special time-stretching process.
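The permutation step can be sketched as follows. This is a minimal illustration rather than the CDP code: it assumes the flaps have already been cut apart, works on their index numbers rather than on audio, and the function name and the seed parameter are hypothetical.

```python
import random


def extend_flaps(n_flaps, n_permutations, seed=None):
    """Return an extended ordering of flap indices.

    The source order comes first, followed by random permutations of
    the same flaps. A permutation is redrawn whenever its first flap
    would repeat the flap that ended the previous group.
    """
    rng = random.Random(seed)
    sequence = list(range(n_flaps))  # the original order, e.g. a b c d
    for _ in range(n_permutations):
        candidate = list(range(n_flaps))
        rng.shuffle(candidate)
        # Avoid an exact repetition across the join between groups.
        while n_flaps > 1 and candidate[0] == sequence[-1]:
            rng.shuffle(candidate)
        sequence.extend(candidate)
    return sequence


# Four flaps (a, b, c, d as 0..3) extended with five permuted groups.
order = extend_flaps(4, 5, seed=1)
names = "".join("abcd"[i] for i in order)
```

Each group still contains every flap exactly once, so the slight differences of loudness, timing and quality between the original flaps are carried into the extension, and no flap is ever heard twice in succession.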
Once, however, we have achieved this plausible extension we can go on to develop the material in more radical ways — slowing the tempo, synchronising the events in different streams, focusing the pitch quality of the material through filtering, moving the streams in space — creating an apparently surreal sound-landscape, where the listener might imagine events recorded in an unfamiliar world rather than the output of a synthetic mechanism.
There are two important things to note about working with speech at the syllable level. First of all, we have eliminated the meaning or narrative content of the material — we don’t need to deal with this at all, although certain expressive aspects of speech utterance persist in the syllables themselves and contribute to the way the music is organized. Even more significantly, we have eliminated the perception of individual speakers, dissolving language in a kind of universal Ur-speech where individual human utterances are subsumed in a music of sonority.
A different approach to spoken language is used in The Division of Labour, where the text plays the sonic equivalent of a melodic subject in traditional music. Instead of a series of pitches which we can then transform — by transposition, time expansion or contraction, modal change from major to minor and so on, broadly conserving the original sequence — we have a series of sound objects, the syllables, which can form a similar template for variation. The words here are taken from Adam Smith’s The Wealth of Nations, one of the sacred texts of our materialist culture (the full title is An Inquiry into the Nature and Causes of the Wealth of Nations). They describe the division of labour in an Edinburgh pin factory and are sufficiently significant to appear on the back of the British £20 note.
Adam Smith is concerned to show that this process leads to an enormous increase in industrial productivity. However, a side effect of the process is that work becomes monotonous and unsatisfying. The piece therefore takes the original text, spoken in a Scottish accent by Alex Gordon, an old friend of mine, and then “divides the labour,” generating a diverse set of variations on the original recorded material, in a sense musically demonstrating the efficacy of the division of labour. Each variation preserves the order of phrases in the original text, even where the transformations are extremely radical. Towards the end of the piece, however, the original text recurs in its original setting but the syllables within each phrase have been scrambled so the text loses all meaning.
“Encounters” and the Music of Speech Phrases
For the piece Encounters in the Republic of Heaven, I wanted to explore musical features of natural speech which only become apparent on a larger time scale, at the level of the spoken phrase — pitch contour (or melody) and the implied harmonic field; tempo, meter and rhythm and, especially, the sonority of individual voices. We are all able to instantly recognize a large number of individual speakers from our friends and acquaintances; would it be possible, by some clever technical process, to extract and musically distil the essence of an individual human voice? A second aim was to somehow represent the musical diversity of human speech across an entire community, and for this reason I decided, for the first time, to work in 8‑channel surround sound so that the community of speakers would surround the audience. As a result, as the project developed, I spent a good deal of time extending almost all the CDP software to work in a multi-channel context and developing new sound spatialization tools (for example, to rotate the entire frame of an 8‑channel scene), new approaches to reverberating or texturing over surround sound and an environment to allow files with from one to sixteen channels to be mixed together in any conceivable spatial distribution.
The idea for this project had been in mind for a long time, but collecting recordings of natural speech is not straightforward. You can’t simply wander up to someone in a pub, tell them they have a very interesting voice and switch on a recorder! I imagined I would make such recordings in my home region of Yorkshire, where I would more easily be accepted as a regular, if slightly mad, member of the local community, but finding the situations where recording might be appropriate and establishing sufficient rapport with the recordees needs a good deal of organization and local knowledge, plus high quality portable recording equipment and transport (I don’t drive) and of course, the finance to proceed with the project. I could not see a way forward with this idea until the post of Composer Fellow was advertised at the University of Durham, in the North East of England. This was not my home region, but sufficiently close and sufficiently similar in industrial culture, to attract my interest. In addition, the Durham post required input to the local community and experience in electroacoustic music. This seemed an ideal opportunity so I applied for and was appointed to the three-year post and proceeded with the project, with Durham as my base.
The first year of the project was largely taken up with establishing contacts in the community through existing local organizations — local government leisure services, schools, dialect societies, community arts projects, old people’s centres, local poets and so on — arranging meetings and making recordings. I wanted to capture a cross-section of human voices in the community, men and women of all ages; the youngest person I recorded was 4 years old and the oldest 93. In schools, I ran composing workshops for the children, where they composed their own pieces based on speech rhythms, as a quid pro quo for my recording work. In the recording sessions the youngest children talked freely, but teenagers were more problematic. One approach was to ask the headmaster to send me those children who talked too much in lessons (!). Faced with a microphone, however, many teenagers were suddenly tongue-tied. Only by getting together a group of at least three kids could I guarantee animated speech. However, the problem now arose that they interrupted one another’s utterances. Hence, separating the end-overlapped phrases became a technical challenge and I developed new spectral cleaning processes, integrating them with existing CDP programs, to allow me to do this in a sufficiently detailed way. Recording older people proved easier as there were many reminiscence groups or discussion gatherings where people were happy to talk to others about their life experiences and to share their opinions. In other situations, providing a relaxed atmosphere for natural conversation to occur (rather than an interview situation or a studio setting) meant that the incidental noises of a house, a crackling fire or a pub had to be removed from the recordings.
In this kind of project it’s not possible to choose in advance the exact vocal characteristics of the people who will be recorded, so many more voices were recorded than were finally used. Only after the recordings had been gathered could choices be made about the materials. Firstly, I wanted, in some way, to represent the diversity of human beings through the diversity of the sounds of speech. So as well as representing each gender and age band, I needed voices with contrasting sonic qualities to present a diversity of sonic substance. Secondly, I had to decide which recorded phrases I wanted to use in the piece — typically I would have one to two hours of recordings of a voice but would end up using no more than two minutes of material. Finally there was the “radio programme editing” of the materials, removing hesitations, word or phrase repetition and various glossolalia (but in some cases collecting these together for musical use).
Cataloguing the Materials
In order to keep track of the material, I extended and developed the tools associated with the source database, allowing me to enter melodic contours and rhythmic shapes (graphically), and texts spoken as properties of the sounds. Associated programs allowed me to statistically analyse the melodic / harmonic, textual and rhythmic content of the materials, and search for common melodic shapes or harmonic content, common tempi, or common words or sub-words (e.g., “mem” in “remember” and “memory”).
This was not entirely straightforward. For example, I developed ways to search texts for similar sounding word starts and ends, but the bizarre nature of English spelling made word endings particularly problematic. I also developed a tool to graphically enter motivic (pitch timing) information for each vocal phrase. I had already written software to directly track the pitch of speech (allowing for interpolation over silences and pitch-free sibilants). In general, the pitch contours of speech are free of lattice restraints (the pitches don’t lie on some pre-existing lattice like the tempered scale) and research suggests that when we first hear speech we are not (consciously) aware of its melodic shape. However, if a speech recording is immediately repeated we do tend to pick up a melodic contour (and with a third repetition the melodic effect is even more pronounced). My own experience suggests that we approximate these melodic patterns to the lattices with which we are familiar (they appear to fall somewhere close to the tempered scale, at least to listeners immersed in tempered-scale musics). But because the speech lines are not strings of steady pitches, there is a certain fluidity to how we assign scale pitches to the speech syllables. Finally, for certain parts of the piece I wanted to be able to bring into tune the speech materials of different speakers, so I needed these tempered approximations to fall on the same tuning grid (e.g., it would be possible in principle for two melodies each to lie on a tempered scale, but for the two scales to be a quarter-tone apart, so they would not harmonically gel). Thus I tracked the pitch contour of each speech phrase to a concert-pitch tempered-scale approximation.
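Snapping a tracked pitch to a shared concert-pitch grid can be sketched as below. This is an assumption-laden illustration, not the tracking software described here: it takes A4 = 440 Hz as concert pitch, uses MIDI note numbers as the grid, and the function names are hypothetical.

```python
import math

A4_HZ = 440.0   # assumed concert-pitch reference
A4_MIDI = 69    # MIDI note number conventionally given to A4


def nearest_tempered(freq_hz):
    """Return (midi_note, snapped_hz) for the closest tempered pitch."""
    semitones_from_a4 = 12 * math.log2(freq_hz / A4_HZ)
    midi = round(A4_MIDI + semitones_from_a4)
    snapped = A4_HZ * 2 ** ((midi - A4_MIDI) / 12)
    return midi, snapped


def deviation_cents(freq_hz):
    """How far the tracked pitch lies from its tempered approximation."""
    _, snapped = nearest_tempered(freq_hz)
    return 1200 * math.log2(freq_hz / snapped)


# Two speakers whose tracked pitches are a little under a quarter-tone
# apart both resolve to the same grid note (A4), so they can be tuned
# together.
speaker_a = nearest_tempered(440.0)
speaker_b = nearest_tempered(452.0)
```

Because every phrase is referred to the same reference frequency, the approximations for different speakers fall on one grid rather than on separate grids that might sit a fraction of a semitone apart.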
Statistical analysis of the tempi of phrases (not of complete sentences) threw up an interesting pattern. One can plot the tempo against the number of phrases that have that tempo (the phrase population) and one might expect a graph that starts at zero (no phrases are spoken) where the tempo is too slow for normal speech, rises gradually towards some average value (typical speech speed) and then falls away gradually to zero (where speech phrases would be too rapid to understand or even to articulate). Surprisingly, however, the graph turned out to have prominent peaks (lots of phrases at these tempi) at crotchet = 120 (dance music tempo), crotchet = 180 (triplets in the same tempo) and a lesser peak at crotchet = 144 (symphonic Allegro)!! This proved useful when choosing phrases to synchronize in the rhythmic section of Act 1 (see below).
My original idea to extract the essence of the sonority of an individual voice proved impossible in practice — what characterizes an individual voice is too complex a confection of tempi, hesitation types, melodic tics and glossolalia to capture easily with any technical procedure. But some voices had such strong characteristics (nasality, grittiness or cross-register breaks in the pitch contour) that I was able to develop these further.
The selection, sifting, cleaning and cataloguing of these recordings took up a further 18 months of the project.
Aesthetic and other Constraints
Working at the level of the phrase, two new issues arose. First of all, it was no longer possible to ignore the narrative content of the speech. In most of the scored vocal music I have composed, I have used invented languages (with no explicit semantic content) to allow me complete control of the sonic content. Here, however, there was no way to avoid dealing with the narrative, so the piece combines story-telling and sound art, reshaping the telling to musical demands.
Thus, in addition to the radio-style editing (see above) the stories are slightly reshaped, e.g., key phrases are repeated as refrains, in the manner of simple poetry, without altering the substance of what is being said.
Secondly, it was no longer possible to ignore the presence of real individuals in the recordings. It’s often said that Native Americans originally objected to being photographed because the process was thought to capture something of a person’s soul. If this is true of photography, it is even more true when recording a person’s voice. If I then take these recordings and develop and manipulate them, ethical issues are involved. If this were my own voice, the voice of a willing friend or a professional musician, I would be happy to stretch and warp it in any way that seemed musically necessary. However, most of the people I recorded were non-professionals who would not necessarily have any knowledge of or interest in the esoteric aspects of contemporary music. I therefore felt that I could not treat their voices with complete abandon — there would need to be constraints on the processes I applied to the voices, not unduly warping or disfiguring the original speech. However, all composing involves working within restraints, so this was not a major æsthetic problem.
Over the long preparatory period, the final form of the piece gradually took shape. It is divided into four Acts of approximately 20 minutes each. Acts 1 to 3 have four portrait sections based on an individual speaker or a group of children, telling stories. These are presented mainly in wide stereo. Each Act also has a central interlude in 8‑channel surround sound, using a multitude of speaking voices but organized differently in each case: in Act 1, in terms of their tempo and rhythm; in Act 2 in terms of sonorities (syllable or phoneme qualities) in the text; and in Act 3 in terms of the harmonic field of groups of spoken phrases. And each Act also has a surround sound finale, in which materials previously derived from the voices (in earlier sections of the Act) are developed more freely (they are no longer so strongly tied to the narrative). Act 4 has just two portraits and then reworks the materials from previous Interludes and Finales, culminating in the transformation of the speaking voices into song (see below).
The whole work is bracketed by the “voicewind” sound, an extremely dense texture of voices where all vocal detail is lost; what remains is a band of noise which judders over the eight loudspeakers like the sound of strong wind blowing around one’s ears. At the opening of Act 1 (and Act 3) the texture gradually thins to reveal the voices, while at the end of the piece, the texture of speaking voices gets denser and denser, returning to the sound of wind with which the piece begins.
Having extracted and refined the text materials, I had to decide how to create each portrait. Perhaps the most obvious way to track the melodic contour of speech is to use other musical instruments to imitate the melodic line, and in the first example of the portraits, this is what I’m doing. However, this is the only place in the piece where I use sources not derived from the recorded speech. Here, the male narrator recalls going to a beer festival dressed as a belly dancer and dancing with various men who mistake him for a woman. Three brass players perform a rhythmic figure behind the speech, picking up prominent melodic motifs from the speech line such as “you turn up in fancy dress,” “and as the night went on,” “had to take the yashmak off” and particularly “It’s a bloke!”
In another portrait (The Dancer’s Tale) the pitch contour of the spoken voice is tracked using a filter (approximated to the tempered scale) which is applied to the voice line at the original moment-to-moment pitch and all of its harmonics. With a low Q, this merely adds a warm resonance to the speech line and nudges it towards its tempered-scale approximation. With high Q, the strong speech markers (e.g., the sibilants) are obliterated and we are left with the pure pitch contour of the speech line. Other filters are used in banks which reproduce the entire harmonic field of a particular phrase, filtering the whole phrase. In fact, all portraits use a wide variety of approaches to their material. Here, for example, you will hear vocal hesitations and glossolalia, or vocal sibilants, gathered together in textural groupings.
I’ve previously mentioned the various difficulties encountered in recording teenage kids. One further problem arises in that, once they are persuaded to chat they tend to talk about various personal or embarrassing things which they probably would not want to be broadcast to the world (which might include their peers, or their parents!). I therefore needed a way to capture the excitement of “gossiping” without revealing the slightly risqué content. To do this I used an envelope follower with a large window set to recognize individual vocal syllables, then retained the centre of each syllable, discarding the onset and tail. The syllable cores were then rejoined in a rapid fixed-tempo stream. This reconstruction maintained the pitch contour, the vowel stream and the expressiveness of the speech line (e.g., laughter is perceptible) while completely disguising the semantic content. This material is then juxtaposed with clear text utterances (e.g., “ginger hair!!”) with different processes applied.
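The core-extraction idea can be sketched on plain sample lists. The sketch assumes that the envelope follower has already produced a (start, end) sample range for each syllable; the proportion of each syllable that is kept, the grid step and the function names are hypothetical choices, and a real implementation would also need short fades at the cut points.

```python
def syllable_cores(samples, boundaries, keep=0.5):
    """Keep the central `keep` fraction of each syllable.

    `boundaries` is a list of (start, end) sample indices, assumed to
    come from an envelope follower; onsets and tails are discarded.
    """
    cores = []
    for start, end in boundaries:
        trim = int((end - start) * (1.0 - keep) / 2)
        cores.append(samples[start + trim:end - trim])
    return cores


def fixed_tempo_stream(cores, step):
    """Place successive cores on a regular grid of `step` samples."""
    length = max((i * step + len(core) for i, core in enumerate(cores)),
                 default=0)
    out = [0.0] * length
    for i, core in enumerate(cores):
        for j, value in enumerate(core):
            out[i * step + j] += value  # overlapping cores are summed
    return out


# A toy "recording" of 100 samples containing two syllables.
recording = [float(i) for i in range(100)]
cores = syllable_cores(recording, [(0, 40), (40, 100)], keep=0.5)
stream = fixed_tempo_stream(cores, step=25)
```

Since only the middle of each syllable survives, the consonantal onsets that carry most of the word identity are lost, while the vowel colour and pitch of each core are retained.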
The next example illustrates one of the more successful attempts to work with the actual sonority of an individual speaker. The speaker is a 93‑year-old woman who lived on a remote farm in Upper Teesdale. Her voice has distinct cross-register breaks, both up and down and often by the interval of a fifth, particularly when her speech becomes animated. I extracted many examples of the cross-break articulations and used the pitch-tracking filters, time-stretching and other approaches to develop articulated events like the individual notes of a bagpipe-like musical instrument. This voice-derived instrument is then used to accompany the voice.
So far, the examples preserve quite closely the authentic voice and narrative of the speaker. The next storyteller is an experimental poet and because of this I felt I could take more liberties with the treatment of her voice. She had lived in both Liverpool and Newcastle and had a striking accent and a strongly nasal intonation. Various techniques are used to extend and develop the speech. Vowels in the phrase “Heathcliffe come here!!” are extended in time by a new process which recognizes the individual wave packets in the vocal stream (more details below), while the syllables of the word “democracy” are repeated, permuted (in fact, using patterning from English bell-ringing practice) and simultaneously gradually spectrally morphed, becoming more bell-like with time.
The Multi-Channel Sections
Materials derived from the two voices we have just heard are worked together in the finale of Act 3, illustrating the more abstracted, less narrative-based character of these finale sections. Syllables of the older woman are time-extended and float over the texture sometimes sounding like horns, and vocal flutters in her articulation are extended into fluttering events. The spectral morph of the democracy syllables from the poet’s voice is developed by both time extension and stacking (copies of the source, resampled at different rates and therefore different durations and pitches — in this case, at octaves — are superimposed in such a way that their attacks synchronize precisely) to produce giant bell-like attacks.
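Stacking can be sketched as follows. The sketch makes several assumptions that are not taken from the CDP programs: resampling is done by simple linear interpolation, the attack of each copy is taken to be its largest absolute sample, and the function names are hypothetical.

```python
def resample(samples, rate):
    """Read through `samples` at `rate` (2.0 = octave up, half as long)."""
    out = []
    for i in range(int(len(samples) / rate)):
        pos = i * rate
        k = int(pos)
        frac = pos - k
        following = samples[min(k + 1, len(samples) - 1)]
        out.append(samples[k] * (1 - frac) + following * frac)
    return out


def attack_index(samples):
    """Assumed attack position: the first sample of greatest magnitude."""
    return max(range(len(samples)), key=lambda i: abs(samples[i]))


def stack(source, rates):
    """Superimpose resampled copies so that their attacks coincide."""
    layers = [resample(source, rate) for rate in rates]
    attacks = [attack_index(layer) for layer in layers]
    latest = max(attacks)
    length = max(latest - a + len(layer)
                 for a, layer in zip(attacks, layers))
    out = [0.0] * length
    for a, layer in zip(attacks, layers):
        offset = latest - a  # delay each copy until its attack lines up
        for i, value in enumerate(layer):
            out[offset + i] += value
    return out


# A toy source: silence, a sharp attack, then a quieter tail.
source = [0.0] * 10 + [1.0] + [0.5] * 9
stacked = stack(source, rates=[1.0, 0.5, 2.0])  # unison, octave down, octave up
```

The slower copies reach their attack later, so the faster copies are delayed; the three attacks then land on the same sample and reinforce one another, while the tails decay over different durations.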
The central sections of each Act take materials from all the speaking voices and present them in surround sound so that the audience is enveloped in the community of speakers. Each act treats this collection differently. In Act 1 the tempi and rhythms of spoken phrases are coordinated. Using the statistical information from the database, I was able to select vocal phrases of the same tempo and carefully synchronize these in the 8‑channel mix. However, no matter how carefully this was done, the result sounded simply like a crowd. Only by making very subtle changes to the timing within phrases — changes so small that, in most cases it is not possible to tell the difference between the original and the time-modified phrase when played back-to-back — was I able to achieve the rhythmic locking of the various voices. I tried various approaches to time modification but in the end the simplest — deleting tiny slivers of sound at the lowest energy points between syllables or inserting tiny slivers of silence between syllables — proved the most effective.
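The sliver-based timing adjustment can be sketched as a calculation on syllable onset times. This is an illustration under stated assumptions, not the procedure actually used: the grid is a subdivision of the chosen tempo, each correction is capped at a small hypothetical maximum (30 ms here), and the function name is invented.

```python
def gap_adjustments(onsets, tempo_bpm, subdivisions=2, max_shift=0.03):
    """Seconds to insert (+) or delete (-) in the gap before each syllable.

    Every adjustment also moves all later syllables, so the running
    total is applied before the next onset is compared with the grid.
    """
    grid = 60.0 / tempo_bpm / subdivisions
    adjustments = []
    shift_so_far = 0.0
    for onset in onsets:
        moved = onset + shift_so_far
        target = round(moved / grid) * grid
        # Keep each change small enough to pass unnoticed.
        delta = max(-max_shift, min(max_shift, target - moved))
        adjustments.append(delta)
        shift_so_far += delta
    return adjustments


# Syllable onsets (seconds) of a phrase spoken at roughly crotchet = 120.
onsets = [0.0, 0.26, 0.49, 0.77]
adjustments = gap_adjustments(onsets, tempo_bpm=120)

# Onset times after the slivers have been deleted or inserted.
locked = []
total = 0.0
for onset, delta in zip(onsets, adjustments):
    total += delta
    locked.append(onset + total)
```

In this example no gap changes by more than 30 milliseconds, yet every syllable ends up on a quaver of the common pulse, which is the kind of locking that exact tempo matching alone did not provide.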
Various rhythmic and spatial approaches are used. The initial phrases put successive syllables on different, adjacent channels of the 8‑channel ring, so the speech phrase jump-circles around the audience. But for the most part, complete short phrases are placed on single loudspeakers, with some later use of echoes falling in the driving tempo. Tutti accents occur on several or all channels simultaneously, whilst some sustained sounds or textures pan in a circular fashion around the space. Near the end, clipped syllable fragments are worked in double tempo.
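The jump-circling placement of syllables amounts to stepping round the ring modulo the number of loudspeakers. A minimal sketch, with a hypothetical function name and channels numbered from 0:

```python
def ring_assignment(syllables, n_channels=8, first_channel=0, step=1):
    """Pair each successive syllable with the next channel round the ring."""
    return [
        (syl, (first_channel + i * step) % n_channels)
        for i, syl in enumerate(syllables)
    ]


# Ten placeholder syllables starting on channel 6: the phrase wraps past channel 7.
placement = ring_assignment(list("abcdefghij"), first_channel=6)
print(placement)
```

A step of -1 would send the phrase round the ring in the opposite direction.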
In Act 3, the speech is coordinated in terms of its melody and implied harmonic field. As you might imagine, much speech lies in a narrow pitch band, generating chromatic clusters of notes as its associated harmonic fields. This may be interesting to hear once or twice but could become musically tedious. I therefore used the database search facility to find spoken phrases containing larger pitch intervals (a minor third or greater) and to correlate these with the same pitches and pitch intervals in other phrases, allowing me to gather these materials together and generate interesting harmonic progressions between the groups of speech phrases themselves. Filters resonating at the pitches (and harmonics) of these harmonic fields amplify the harmonic resonance, and sometimes these resonances float away from the vocal sources in the surround space.
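The kind of database search described here might look something like the following sketch. It assumes, purely for illustration, that each phrase is stored as a sequence of MIDI note numbers, and it simplifies the correlation step to grouping phrases that share an interval between the same two pitches. The phrase names and pitch data are invented, and the real database format is not described in the text.

```python
def large_intervals(pitches, min_semitones=3):
    """Successive pitch pairs whose interval is at least `min_semitones` (a minor third)."""
    return [
        (a, b)
        for a, b in zip(pitches, pitches[1:])
        if abs(b - a) >= min_semitones
    ]


def group_by_shared_interval(phrases, min_semitones=3):
    """Map each large interval, as an unordered pitch pair, to the phrases sharing it."""
    groups = {}
    for name, pitches in phrases.items():
        for a, b in large_intervals(pitches, min_semitones):
            groups.setdefault((min(a, b), max(a, b)), set()).add(name)
    # Keep only intervals found in more than one phrase.
    return {pair: names for pair, names in groups.items() if len(names) > 1}


# Invented pitch data (MIDI note numbers) for four hypothetical phrases.
phrases = {
    "p1": [57, 58, 57, 62],
    "p2": [60, 59, 60, 61],  # narrow, chromatic: no large interval
    "p3": [62, 57, 55],
    "p4": [64, 60, 62, 57],
}
shared = group_by_shared_interval(phrases)
print(shared)
```

Here three of the phrases contain the fourth between A (57) and D (62), so they could be gathered into one harmonic group, while the chromatic phrase p2 is passed over.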
Speech into Song
In Act 4, the various threads of the previous acts are drawn together, leading to the finale in which the speaking voices burst into song. In order to achieve this I developed ideas from the iteration extension described earlier. Spoken (voiced) vowels consist of small wave packets. The speed at which the wave packets come past determines the pitch of the voice, while the form of the wave packet determines the vowel we hear. Using a very tiny envelope window we can detect these wave packets, in a parallel fashion to the detection of the tongue flaps in an iterative (rolled) “rr” sound, but on a much smaller time scale. There are several difficulties to overcome. The first is that the packet size changes with pitch, so we can no longer use a fixed-size envelope window to track the wave packets. We must start with a pitch-detection procedure (I use harmonic peak detection on the phase vocoder data) and then generate an envelope window about one third of the wavelength of the perceived pitch. We also have to deal with all the unpitched (and silent) events in the speech stream in some rational way. Once we have detected the packets, a new problem arises if we want to change the pitch. In real speech, if the pitch of a given vowel (let’s say “aa”) goes down by one octave, the packet becomes twice as long, so to transpose the original voice down by the same octave we have to devise some means of extending the packet so it is “the same shape” yet longer. It wasn’t clear to me how this could be achieved in a perceptually plausible way. However, experimentation revealed that simply by inserting silence between the packets, and thus changing their timing, not only did the pitch of the voice fall but the vowel quality was perfectly preserved. Even transposed two octaves down, where the new signal was three-quarters silence, a plausible (though at this transposition slightly gritty) vocal “aa” was produced.
Similarly, by overlapping the packets and thus shortening the time between them, the voice could be made to rise in pitch. Due to the signal overlap involved here, beyond around one octave up the signal began to become implausibly resonant with the internal echoes involved. However, this was enough pitch play to allow me to develop spoken vowels into sung lines, adding randomly varied vibrato of the packet repeat rate for a more expressive cantato (the sound without vibrato was heard earlier in the “Heathcliffe” examples).
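Both directions of transposition can be sketched with one small routine. This is only an illustration under simplifying assumptions: the wave packets have already been detected and cut out (each is a list of samples), the source pitch is steady so that each packet is one period long, and the function name respace_packets is hypothetical. Packet detection and the handling of unpitched material are not shown.

```python
def respace_packets(packets, ratio):
    """Overlap-add `packets` with onsets spaced at (packet length / ratio).

    ratio < 1 leaves silence between packets, so the pitch falls;
    ratio > 1 overlaps them, so the pitch rises.
    The packet shapes themselves are left untouched.
    """
    onsets, t = [], 0.0
    for p in packets:
        onsets.append(round(t))
        t += len(p) / ratio
    length = max(o + len(p) for o, p in zip(onsets, packets))
    out = [0.0] * length
    for o, p in zip(onsets, packets):
        for n, v in enumerate(p):
            out[o + n] += v
    return out, onsets


# Four identical toy packets, each one period (4 samples) long.
packet = [1.0, 0.5, -0.5, -1.0]
packets = [packet] * 4

down, down_onsets = respace_packets(packets, 0.25)  # two octaves down
up, up_onsets = respace_packets(packets, 2.0)       # one octave up

# Fraction of silence in one new period of the downward transposition.
silent = sum(1 for v in down[:16] if v == 0.0) / 16
print(down_onsets, silent)
print(up_onsets)
```

At two octaves down the packet onsets are four times further apart, so each new period is one quarter packet and three quarters silence, which matches the figure given above. Vibrato of the packet repeat rate would correspond to letting the ratio vary slightly from packet to packet rather than keeping it fixed.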