A version of this blog first appeared as an article in the Australian Audiology Today Christmas edition.
One problem with Christmas parties is that there are so many of them, and picking which ones to go to can be difficult. Something to influence your decision (other than the quality of the wine on offer) might be where the party is being held. The downtown club with disco music pounding away might be great if you want to dance the night away, but that type of venue is not going to help you develop your network with witty conversation and one-liners. Of course, the real Christmas party challenge, even in less busy environments, is hearing and understanding what others are saying at such gatherings, a problem that is virtually insurmountable for those with even a moderate hearing loss.
The Original “Cocktail Party”
Colin Cherry was the first to coin the phrase “the cocktail party problem,” and it seems appropriate to paraphrase that term for this Christmas issue. While most people reading this article have probably come across the term, not many will have had the opportunity to read Cherry’s original paper – and what an interesting read it is! His brief but very influential paper, “Some experiments on the recognition of speech with one and with two ears,” first appeared in the Journal of the Acoustical Society of America in 1953 and is remarkable for a number of reasons.
First, in coining the term the “cocktail party problem,” the question for Cherry was “How do we recognize what one person is saying when others are speaking at the same time?” Two important ideas can be drawn from this, both of which relate to the fact that the conversational environment of the cocktail party involves multiple talkers rather than a single talker against background noise. The first idea is that some talkers will be conveying information of interest while others will not; that is, conversation is a multisource listening challenge where focus must switch quickly between sources. The second idea is that what constitutes the noise is largely other talkers’ voices. This matters because the nature of the background sounds determines the type of masking that must be overcome to focus on the sound of interest, and the sorts of processing available to the auditory system to ameliorate that masking (see “A primer on masking” below).
Second, Cherry’s paper is mostly about selective attention in speech understanding, the role of the “statistics of language,” voice characteristics, and the costs and time course of switching attention. In the Introduction he makes a very clear distinction between the kinds of perceptions that are studied using simple stimuli, such as clicks or pure tones, and the “acts of recognition and discrimination” that underlie understanding speech in the “cocktail party” environment. Cherry’s paper has been cited nearly 1,200 times, but interestingly enough, the greater proportion of those citations come from studies of detecting sounds against a background of other sounds using simple stimuli, such as tones against broadband noise or other tones – hardly the rich and complex stimuli that Cherry was talking about. Of course, this was very much the bottom-up, reductionist approach of the physicists and engineers at Bell Labs and elsewhere who had had an immense influence on the development of our thinking about auditory perception, energetic masking in particular (see Box – “A primer on masking” – and the discussion of the development of the Articulation Index).
An excellent and almost definitive review of this literature is provided by Adelbert Bronkhorst in 2000: “The Cocktail Party Phenomenon: A Review of Research on Speech Intelligibility in Multiple-Talker Conditions.” The research over that period focused on energetic unmasking: for instance, the head shadow producing a “better ear advantage” by reducing the masker level at the ear farther from the masker, the effects of binaural processing, and the effects of the modulation characteristics of speech and other maskers. So, on the one hand, the high citation rate for Cherry’s paper is very surprising, because there is very little in the original paper that relates to energetic masking. On the other hand, the appropriation of the term “the cocktail party problem” and the reconfiguring of the research question demonstrate the powerful influence of the bottom-up, physics-engineering approach to thinking about auditory perception, which had become the lens through which much thinking and research was viewed. To be fair, though, Bronkhorst does point out in his review that there were some data in the literature involving speech-on-speech masking that were not well explained by energetic masking, but that this had not been a particular focus of the research.
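The arithmetic behind the “better ear advantage” can be sketched in a few lines. The sketch below is illustrative only, not a model from the literature: the function names are invented, and the 6 dB head-shadow attenuation is an assumed round number (real attenuation varies with frequency, roughly 3–10 dB broadband).

```python
def ear_snrs(target_db, masker_db, shadow_db=6.0):
    """Signal-to-noise ratio (dB) at each ear when a target sits on one
    side of the head and a masker on the other. shadow_db is an assumed
    head-shadow attenuation applied to whichever source is on the far side."""
    # Ear nearer the target: target arrives unattenuated, masker is shadowed.
    snr_near_target = target_db - (masker_db - shadow_db)
    # Ear nearer the masker: target is shadowed, masker arrives unattenuated.
    snr_near_masker = (target_db - shadow_db) - masker_db
    return snr_near_target, snr_near_masker

def better_ear_advantage(target_db, masker_db, shadow_db=6.0):
    """dB gain from listening with the better ear, relative to the
    baseline case where target and masker are co-located."""
    co_located_snr = target_db - masker_db
    return max(ear_snrs(target_db, masker_db, shadow_db)) - co_located_snr
```

With target and masker both at 60 dB and the assumed 6 dB shadow, the ear nearer the target enjoys a +6 dB SNR while the other ear sits at −6 dB, a 6 dB better-ear advantage over the co-located case.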
The turn of the century was propitious for hearing science, as it marked another turning point in our thinking about this “cocktail party” problem. In 1998, Richard Freyman and colleagues reported that differences in the perceived locations of a target and maskers (as opposed to actual physical differences in location) produced significant unmasking for speech maskers but not for noise. Such a result was not amenable to a simple bottom-up explanation based on energetic masking. Thus, Freyman appropriated the term “informational masking,” which had previously been used in experiments involving relatively simple stimuli; this was the first time it had been applied to something as complex and rich as speech. As we shall see in more detail later, the unmasking produced in this experiment depended on the active, top-down focus of attention. As previously mentioned, Bronkhorst had pointed out that others had noted that the interference of speech with speech understanding seemed to amount to more than the algebraic sum of the spectral energy. Indeed, as early as 1969, Carhart and colleagues had referred to this as “perceptual masking” or “cognitive interference.” Along those lines, informational masking, in the context of the perceptual unmasking in Freyman’s and later similar experiments, came to stand for everything that wasn’t energetic masking.
Over the ensuing 15 years, many studies were carried out examining the nature of informational masking. A number of general observations can be made, and some of these are drawn out in the “Primer” below. One very important shift, however, was that the “cocktail party problem” became increasingly seen as a particular case of the general problem of auditory scene analysis (ASA). This is the problem of “acoustic superposition,” where the energy from multiple concurrent sounds converges on a single encoder – in this case, the cochlea of the inner ear. The first task of the auditory system, then, is to work out which spectral components belong to which sound sources and to group them together in some way. The second task is to join these now-segregated components up over time to provide a stream of information associated with a specific sound.
Auditory Scene Analysis
Albert Bregman did much to promote thinking in this area with the publication of Auditory Scene Analysis in 1990, marking a significant return of Gestalt thinking to the study of auditory perception. Although this part of the story is still being worked out, it is clear that much of the grouping and streaming processing underlying ASA is largely automatic, that is, bottom-up, and capitalizes on the physical acoustics of sounding bodies – probably not surprising, given that the auditory system evolved in a world of physically sounding bodies and “the cocktail party problem” is a common evolutionary challenge for nearly all terrestrial animals. The perceptual outcome of this process is the emergence of auditory objects that usually correspond to the individual physical sources. Indeed, many of the experimental approaches to understanding ASA have involved stimuli that create perceptual objects that are in some way ambiguous, and have examined the illusions and/or confusions that such manipulations create.
In the case of “the cocktail party problem,” the speech from each talker forms a specific stream, and the problem becomes more about how we are able to select between the streams. In practical terms, the greater the differences between the talkers on some dimension (pitch, timbre, accent, rhythm, location, etc.), the less likely we are to confuse the streams. That is, the greater the variety between streams, the more informational unmasking we can expect.
This brings us to the key role of attention in understanding listening in a “cocktail party” scenario. Attention has been thought of as a type of filter that can be focused on a feature of interest, allowing an up-regulation of the processing of information within that filter and a potential down-regulation of information outside it. A physical difference in some aspect of the auditory stream provides the hook onto which the listener can focus their attention. Recognizing the critical role that attention plays in understanding what is happening in a cocktail party scenario moves the discussion from “hearing” to “listening,” and closer to Cherry’s goal of understanding the “acts of recognition and discrimination” that underlie the understanding of speech.
The neuroscience of auditory attention is in its infancy compared with what we know about visual attention, although some tentative generalizations can be made:
Attention is a process of biased competition. The moment-to-moment focus of attention depends on competition between (1) top-down, voluntary or endogenous attentional control and (2) bottom-up, saliency-driven or exogenous attention. The cognitive capacity to focus attention plays a key role in the sustained attention necessary to process the stream of information from a particular talker. There is evidence that we listen to only one auditory object at a time, and selective attention is critical in enabling this. The exogenous competition introduced by concurrent sounds, particularly other talkers (the distractors), means more cognitive effort is required to sustain attention on a particular target of interest. The implication for an ageing population is that any reduction in the cognitive capacity to sustain attention will increase the difficulty of understanding the stream of information from a single talker in the presence of other talkers.
Selective attention works at the level of perceptual objects, as opposed to a particular physical dimension such as loudness or pitch. That is, attention focuses on the voice or the location of a particular talker (or both simultaneously – see below). While the attentional hook might be a difference on a particular perceptual dimension, the sum total of characteristics that make up the perceptual object is what becomes enhanced. Models of attention suggest that the competition for attention is played out in working memory, and the players are the sensory objects contained in working memory at any particular point in time. Indeed, our conscious perception of the world relies on this process.
What this means is that when auditory objects are not well defined, the application of selective attention can be degraded. There are a number of circumstances in which this can happen. For instance, the stimuli themselves may be ambiguous and lack the acoustical elements needed to support good grouping and streaming. Alternatively, the stimuli may possess the necessary physical characteristics, but poor encoding at the sensory epithelia and/or degraded neural transmission of the perceptual signal can reduce the fidelity of, or eliminate, the encoded features necessary for grouping or streaming. The implication for hearing impairment is that degradation of sensory encoding, such as that produced by broader auditory filters (critical bands) or poor temporal resolution, will weaken object formation and make the task of selective attention that much harder.
Attention acts as both a gain control and a gate. There is a growing body of evidence indicating that attention modulates the activity of neurones in the auditory system, not only at a cortical level but even earlier in the signal chain, possibly even at the level of the hair cells of the cochlea. In a number of recent and ground-breaking experiments, this process of up-regulation of the attended talker and down-regulation of the maskers has been convincingly demonstrated in the auditory cortex of people dynamically switching their attention between competing talkers (Mesgarani & Chang, 2012; Ding & Simon, 2012). Importantly, the strength of the selective cortical representation of the attended-to talker correlated with the perceptual performance of the listener in understanding the targeted talker over the competing talker.
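The logic of those decoding experiments can be caricatured in a few lines of code. The sketch below is emphatically not the published methods, just a toy model under assumed gains: a simulated cortical response represents the attended talker’s envelope with a higher gain than the unattended talker’s, and correlating the response with each candidate envelope reveals which talker was attended.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

def envelope():
    """Stand-in for a speech envelope: smoothed random noise."""
    return np.convolve(rng.standard_normal(n), np.ones(50) / 50, mode="same")

env_a, env_b = envelope(), envelope()  # two competing talkers

# Toy cortical response: the attended talker (A) is represented with a
# higher gain than the unattended talker (both gains are assumed values),
# plus additive neural noise.
gain_attended, gain_unattended = 1.0, 0.2
response = (gain_attended * env_a + gain_unattended * env_b
            + 0.3 * rng.standard_normal(n))

# Decode attention by correlating the response with each talker's envelope.
r_a = np.corrcoef(response, env_a)[0, 1]
r_b = np.corrcoef(response, env_b)[0, 1]
decoded = "A" if r_a > r_b else "B"
```

With these gains, the correlation with talker A’s envelope is clearly the larger of the two, so the decoder recovers the attended talker; shrinking the gap between the gains makes the decoding, like the listener’s performance, correspondingly less reliable.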
The auditory system engages two different attentional systems – one focused on the spatial location of a source and one focused on non-spatial characteristics of the source – which have two different cortical control systems. In a 2012 study, Adrian “KC” Lee and colleagues (Lee et al., 2012) had listeners change their attentional focus while their brains were imaged. They found that the left frontal eye fields (FEF) became active before the onset of a stimulus when subjects were asked to attend to the location of a to-be-heard sound. This is part of the so-called dorsal attention pathway, thought generally to support goal-directed attention. On the other hand, when subjects were asked to attend to a non-spatial attribute of the stimulus, such as its pitch, a different pattern of pre-stimulus activation was observed in the left posterior central sulcus, an area also associated with auditory pitch categorization. This suggests that for the hearing impaired, a loss of the ability to localize the source of a sound disables or degrades a significant component of the auditory attention system, resulting in an increased reliance on the non-spatial attention system.
Returning to Colin Cherry’s paper, it appears that we have — to paraphrase T.S. Eliot —“arrived where we started and know the place for the first time.”
So much of what Cherry discussed in his seminal paper is where we now find our neuroscientific focus, including the statistics of language in terms of its phonetic and semantic characteristics; the focus of attention and how that is mediated by spatial location and/or vocal or other characteristics; the transitional probabilities of what is being said; and so on. The difference now is that we have both the technical and analytical tools to get a handle on how these processes are represented in the brain. With an increasing understanding of the functional plasticity of the brain, we are at a point where we are making advances in the understanding of human perception and cognition that will have significant ramifications for how we intervene in, support, and rehabilitate many of the disorders that manifest as hearing impairment.
Cherry, E. C. (1953). “Some experiments on the recognition of speech with one and with two ears.” Journal of the Acoustical Society of America 25: 975.
Bronkhorst, A. (2000). “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions.” Acustica 86: 117-128.
Lee, A. K. C., et al. (2012). “Auditory selective attention reveals preparatory activity in different cortical regions for selection based on source location and source pitch.” Frontiers in Neuroscience 6: 190.
Mesgarani, N. and Chang, E. F. (2012). “Selective cortical representation of attended speaker in multi-talker speech perception.” Nature 485: 233-236.
Ding, N. and Simon, J. Z. (2012). “Emergence of neural encoding of auditory objects while listening to competing speakers.” Proceedings of the National Academy of Sciences of the United States of America 109: 11854-11859.