
3. What is Binaural Audio?

Updated: Jun 23, 2022

What is Binaural?

‘Binaural’ is a technical term in psychoacoustics that describes how we, as humans, hear. To break it down, ‘bi’ means two and ‘aural’ means hearing; put together, the two terms describe the function of the two ears on either side of our head.


As with eyesight, having two ears gives us a greater perception of space and distance, but unlike eyesight our sense of hearing is fully three-dimensional. This means that we can perceive sounds from in front, behind, above, or below us.


The benefit of binaural hearing is that we are able to determine where a sound is coming from with a high level of accuracy (±1° for sources in front of us) (Howard & Angus, 2009). This is because our brains are excellent at detecting slight differences between the sound arriving at the left and right ears. Subconsciously, our brains continuously use these binaural ‘cues’ (differences between the sound at each ear) to scan the surrounding environment.




Over the past decade, the term ‘binaural’ has appeared more and more in technology and media. This is because media such as music, TV, and gaming are utilising binaural audio to offer a more immersive experience. This is known as ‘binaural synthesis’, a specific processing technique that makes recorded sounds appear as though they are coming from a particular point in space.


Unlike traditional two-channel stereo, binaural synthesis is able to create a virtual ‘sound field’, where sound sources can be made to sound like they are coming from any point in space, outside of the listener's head.


This experience is best suited to headphone listening, and with applications such as VR becoming more popular, alongside dramatic growth in mobile computing power, binaural synthesis is a rapidly expanding area of audio technology.


The rest of this blog will cover the theory behind binaural audio, as well as explore some of the perceptual limitations of binaural synthesis.


Binaural Cues

There are three main binaural cues that our brains use to locate sounds: the Interaural Level Difference (ILD), the Interaural Time Difference (ITD), and spectral filtering from the ear pinnae, torso, and head (Shimoda, et al., 2006).


ILD (Interaural Level Difference)

The Interaural Level Difference describes the difference in amplitude of a sound source arriving at each ear. This difference is created by the absorption and diffraction of the head, also known as the ‘head shadowing’ effect (Rumsey & McCormick, 2014). When a source is directly in front of you there is no ILD; as the source moves around the head, the ILD increases to around ±20dB at ±90° (Howard & Angus, 2009).
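The exact ILD curve depends on frequency and is measured empirically, but the trend described above can be sketched with a crude sinusoidal approximation (the sine shape is an assumption for illustration only, not a published model):

```python
import math

def approx_ild_db(azimuth_deg, max_ild_db=20.0):
    """Crude illustrative ILD model: 0 dB for a source straight
    ahead, rising to +/-20 dB at +/-90 degrees (sinusoidal assumption)."""
    return max_ild_db * math.sin(math.radians(azimuth_deg))

print(approx_ild_db(0))    # 0.0  (no level difference dead ahead)
print(approx_ild_db(90))   # 20.0 (maximum difference at the side)
```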


ITD (Interaural Time Difference)

The Interaural Time Difference describes the difference in the arrival time of a sound source at each ear. It arises from the distance between the ears and, like the ILD, varies with the direction of the sound source. The ITD ranges between 0 and 0.673ms, given by Woodworth's formula, which approximates the head as an 18cm wide sphere (Howard & Angus, 2009). Woodworth's formula below is similar to the ITD formula seen in the previous blog on ORTF Recording.


ITD = r(θ + sin(θ))/c


Where ‘r’ is half the width of the head (0.09m), ‘θ’ is the angle of the sound source, and ‘c’ is the speed of sound (344ms⁻¹).
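Woodworth's formula is straightforward to evaluate directly; a quick Python sketch using the values above reproduces the 0.673ms maximum:

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.09, c=344.0):
    """Woodworth's spherical-head ITD estimate: r(theta + sin(theta)) / c."""
    theta = math.radians(azimuth_deg)
    return head_radius_m * (theta + math.sin(theta)) / c

print(f"{woodworth_itd(0) * 1000:.3f} ms")   # 0.000 ms (straight ahead)
print(f"{woodworth_itd(90) * 1000:.3f} ms")  # 0.673 ms (fully to the side)
```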



Spectral Cues

Spectral cues are provided by the filtering and reflections caused by the outer ear and upper body, which make the source sound slightly different in each ear. The amount of filtering changes depending on where the sound is coming from, and our brains are tuned to these filtering profiles, using them to determine whether a sound is coming from in front or behind, and above or below us (Macpherson & Middlebrooks, 2002; Shimoda, et al., 2006).



The Cone of Confusion

The ‘cone of confusion’ describes a cone around the listener on which all ILD and ITD values are equal. In these situations the perceived direction of the source may be confused with any other location around the cone (Queiroz & Sousa, 2011).
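Under a spherical-head model, a source in front and its mirror image behind produce identical ITDs, which is exactly the ambiguity the cone of confusion describes. A small sketch (folding azimuths onto their lateral angle is my own simplification for illustration):

```python
import math

def lateral_angle_deg(azimuth_deg):
    """Fold an azimuth onto its lateral angle: a source at 120 deg
    (behind, left) shares a lateral angle with one at 60 deg (front, left)."""
    a = azimuth_deg % 360
    if a > 180:
        a -= 360                      # map into (-180, 180]
    if a > 90:
        return 180 - a
    if a < -90:
        return -180 - a
    return a

def woodworth_itd(azimuth_deg, r=0.09, c=344.0):
    theta = math.radians(lateral_angle_deg(azimuth_deg))
    return r * (theta + math.sin(theta)) / c

# Mirrored front/back positions share an ITD -- the cone of confusion.
print(woodworth_itd(60) == woodworth_itd(120))   # True
```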


Spectral cues are therefore very important in reducing localisation ambiguities, as the filtering profiles around the cone of confusion are not symmetrical. In real life we also rotate our heads slightly to improve the accuracy of the spectral cues. You may have noticed that dogs do this a lot too.




Binaural Synthesis and Object-Based Audio

Binaural recording is one method used to create binaural audio. The most popular technique is the ‘dummy head’, in which two microphones are positioned inside the ears of a model head.

This technique has existed for almost 100 years, and it creates binaural audio by hard-coding the binaural cues into the recording itself (Roginska & Geluso, 2018).


There are several cases in which this technique is useful, such as live recording and binaural room impulse response (BRIR) measurement; however, it does not allow you to adjust the direction of the sound after recording (3D panning).


In interactive media, it is important to be able to adjust the direction of a sound source in real time. For instance, in video games, the sound source direction is determined by the movement of both the player and the sound source. Thankfully, there is an alternative method of binaural synthesis that allows you to control the direction of a sound source in real time. This is commonly known as ‘Object-Based Audio’, and it uses ‘Head-Related Transfer Functions’ (HRTFs) (Tsingos, 2017).


In object-based audio, a sound source is represented as a single sound ‘object’. In essence, this just means that the sound is a high-quality monophonic (single-channel) recording. Monophonic sound objects can then be panned in space virtually using HRTFs (Queiroz & Sousa, 2011). This topic will be covered further in the upcoming post on Object-Based Audio.
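A minimal sketch of that panning step, using invented two-sample ‘HRIRs’ rather than real measurements: each ear's signal is simply the mono object convolved with that ear's impulse response.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Pan a mono 'object' to a fixed direction by convolving it
    with the HRIR pair measured for that direction."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])       # shape: (2, n_samples)

# Toy HRIRs: the right ear is quieter and one sample later,
# crudely standing in for ILD and ITD (illustrative values only).
mono = np.array([1.0, 0.5])
hrir_l = np.array([1.0, 0.0])
hrir_r = np.array([0.0, 0.5])
out = render_binaural(mono, hrir_l, hrir_r)
print(out.shape)   # (2, 3)
```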


What is an HRTF (Head-Related Transfer Function)?

Measuring HRTFs (Armstrong et al., 2018)

HRTFs, and their time-domain equivalent HRIRs (Head-Related Impulse Responses), are sets of ‘impulse responses’ recorded inside each ear of a human head.


What is an Impulse Response?

An impulse response is a recording of how an environment responds to a very short burst of sound (around 0.01 seconds). If you have ever used a convolution reverb, room impulse responses are what make a sound appear as though it is playing in that room.
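The convolution idea can be shown in a few lines; the impulse response here is a made-up three-sample decay, purely for illustration:

```python
import numpy as np

# A dry signal and a toy impulse response (a few decaying echoes).
dry = np.array([1.0, 0.0, 0.0, 0.0])
ir = np.array([1.0, 0.5, 0.25])          # invented IR, not a real room

# Convolution "stamps" the IR onto every sample of the dry signal,
# which is exactly what a convolution reverb does with a room IR.
wet = np.convolve(dry, ir)
print(wet)   # [1.   0.5  0.25 0.   0.   0.  ]
```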


In object-based audio, an HRTF is able to make a sound appear to come from a specific direction, as the HRTF contains all of the binaural cues.


HRTFs are obtained by placing small microphones inside a person's ears, and then recording noise bursts played from different directions using a dome of loudspeakers. A single HRTF contains the response from a single source direction; the full set contains responses from all around the head.
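One way to picture such a set is as a lookup table keyed by direction. This sketch (with placeholder labels standing in for real impulse responses) picks the nearest measured direction; real renderers typically interpolate between neighbouring measurements instead:

```python
import math

# A miniature HRTF set: direction -> (left HRIR, right HRIR).
# Real sets hold hundreds of measured directions; these are placeholders.
hrtf_set = {
    (0, 0):   ("L_front", "R_front"),
    (90, 0):  ("L_left",  "R_left"),
    (180, 0): ("L_back",  "R_back"),
}

def nearest_hrtf(azimuth, elevation):
    """Pick the measured direction closest to the requested one."""
    def dist(key):
        az, el = key
        daz = min(abs(az - azimuth), 360 - abs(az - azimuth))  # wrap azimuth
        return math.hypot(daz, el - elevation)
    return hrtf_set[min(hrtf_set, key=dist)]

print(nearest_hrtf(80, 5))   # ('L_left', 'R_left')
```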


HRTFs in Gaming

The latest generation of PlayStation and Xbox consoles both have custom hardware processors purpose-built to render 3D audio. Whilst Xbox offers some third-party audio formats, including Dolby Atmos, PS5 has focused on its own system, Tempest 3D AudioTech. These are the first examples of modern gaming consoles integrating purpose-built processors for 3D audio.


In 'The Road to PS5' presentation below, System Architect Mark Cerny explains how Sony measured HRTFs for use in the Tempest Audio Engine.

Spider-Man: Miles Morales was one of the first PS5 console titles, and unsurprisingly PlayStation took the opportunity to show off the full extent of the Tempest Engine. The use of 3D audio in this game massively heightens the sense of presence, whether in busy city ambiences or enemies appearing in combat. Even the more subtle scenes, such as Miles walking around his home, are brought to life with binaural audio.

In Resident Evil Village, the suspense is given greater impact by the implementation of binaural audio. As you navigate the game you use sound localisation to discover new items, determine the direction of enemies, and feel a sense of threat and foreboding as the environment around you creaks and shudders.


The use of binaural audio in this game is perhaps one of the strongest examples I could think of, mainly because of the scarcity of sound. Although this may not sound particularly complimentary, it has been found that localisation performance decreases as the number of competing sound sources increases (Brungart, Simpson & Kordik, 2005; Roginska & Geluso, 2018). Binaural audio therefore thrives in situations where there are fewer sound sources.


Issues with HRTFs

The problem with HRTFs is that everyone has a different ear and head shape, which means that everyone has their own unique way of hearing. This presents one major problem in making binaural audio a viable and accessible commercial product: there is no one-size-fits-all approach to HRTFs.


The KEMAR microphone (shown above) is a popular dummy head used for creating generic HRTFs, as it is modelled on average ear dimensions. However, these are still far from perfect. Poorly matched HRTFs often lead to issues in the perception of sound sources, including hearing the direction of a sound back to front (‘front-back confusion’) and feeling like the sound is coming from within the head rather than outside it (‘poor externalisation’) (Guezenoc & Séguier, 2018).


The most obvious approach to this problem is to measure a personal HRTF set for each listener. This, however, is clearly not practical, as gathering HRTFs is a time-consuming process that requires specialist equipment. Therefore, major companies including Sony, Dolby, and Steinberg have been exploring alternative ways of generating personalised HRTFs for consumers.


Some methods involve matching HRTFs from large datasets to an individual using photographs of their ears; numerically generating HRTFs from a 3D scan of the head; and allowing the individual to choose appropriate HRTFs based on perceptual feedback (Guezenoc & Séguier, 2018).
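As a sketch of the dataset-matching idea, here is a nearest-neighbour match over invented ear measurements (the choice of features and all the values are assumptions for illustration, not any company's published method):

```python
import math

# Hypothetical anthropometric entries: (pinna height mm, pinna width mm)
# mapped to an HRTF set name. All values are invented for illustration.
database = {
    "set_A": (62.0, 33.0),
    "set_B": (58.0, 29.0),
    "set_C": (66.0, 36.0),
}

def match_hrtf_set(pinna_height_mm, pinna_width_mm):
    """Nearest-neighbour match of a listener's ear measurements
    (e.g. estimated from a photograph) to a database entry."""
    return min(
        database,
        key=lambda name: math.dist(database[name],
                                   (pinna_height_mm, pinna_width_mm)),
    )

print(match_hrtf_set(59.0, 30.0))   # set_B
```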


Although the PS5 engineers recorded hundreds of HRTFs when developing the Tempest Engine, on the console players can only choose from five different sets when adjusting their audio settings. You may be thinking that surely five sets of HRTFs are too few to suit the 20 million players (give or take) that own a PS5. In truth, you would be right. However, as Mark Cerny mentions at the end of his presentation:

"What HRTF you are using is key...that means HRTF selection and synthesis are big research topics going forward as the Tempest technology matures" - Mark Cerny

Reflective Summary - Driscoll's What Model (Driscoll, 2007)

What?

This week I have explored the theory of binaural audio and binaural synthesis, as well as its use cases in interactive media. Following on from the last post on ORTF, the theory of the ITD and ILD cues ties the ORTF and binaural techniques together. I also discovered some exciting ways that companies are addressing individualised HRTFs; however, as Mark Cerny discussed, these methods are still very much in their infancy.


So What?

Based on the perceptual issues that still exist within commercial binaural audio, I feel there is still a great opportunity to develop a 3D renderer that generates a close-to-binaural spatial experience without compromising the spectral quality of the sound. Also, now that I have learnt more about the development of HRTFs, perhaps the first step in developing the virtual ORTF renderer could be to develop a set of transfer functions for ORTF. An ORTFTF?


This would be practical for several reasons:

  1. I could use the ORTFTF measurements to accurately calculate ITD and ILD around both the horizontal and vertical axes

  2. I could use open source software such as 'Resonance Audio' to test the ORTFTF inside of game engines.

  3. I could physically adjust the ORTF microphone configuration to trial different stereo microphone configurations based on the Williams Curves discussed in the ORTF post.

What Next?

ORTF and binaural recording appear to share similar ITD and ILD behaviour; however, I am not yet aware of how the ITD and ILD of ORTF behave between 0 and 90 degrees. The next post will investigate how sources at arbitrary angles recorded by an ORTF microphone are perceived through headphones and loudspeaker arrangements.


Conclusion

Binaural hearing is a psychoacoustic phenomenon that allows us to hear the natural world in 3D. This is due to binaural cues that exist as a result of the shape and spacing of our two ears. In modern audio technology and media, binaural processing can be applied to audio to make sounds appear as though they are coming from a specific point in space.


Traditionally, binaural audio was achieved by recording sounds with a dummy head; however, object-based audio allows sounds to be positioned anywhere in space and rendered using HRTFs (Head-Related Transfer Functions).


Although HRTFs allow the position of a sound to change in real time, they can cause poor externalisation and localisation accuracy if they are not well matched to an individual. Therefore, obtaining individualised HRTFs is a major area of ongoing research.


References

Armstrong, C. et al. (2018) ‘A Perceptual Evaluation of Individual and Non-Individual HRTFs: A Case Study of the SADIE II Database’, Applied Sciences, 8(11), p. 2029. doi:10.3390/app8112029.

Brungart, D. S., Simpson, B. D., & Kordik, A. J. (2005). Localization in the presence of multiple simultaneous sounds. Acta Acustica United with Acustica, 91(9), 471–479.


Driscoll, J. (ed.) (2007) Practicing Clinical Supervision: A Reflective Approach for Healthcare Professionals. Edinburgh: Elsevier.


Freeland, F. P., Biscainho, L. W. P. & Diniz, P. S. R., 2002. Efficient HRTF Interpolation in 3D Moving Sound. Brazil, s.n.


Howard, D. & Angus, J., 2009. 2.6 Perception of Sound Source Direction. In: Acoustics and Psychoacoustics: Fourth Edition. Oxford: Focal Press, pp. 107-119.


IRCAM, 2005. IRCAM HRTF Database. [Online] Available at: http://recherche.ircam.fr/equipes/salles/listen/download.html [Accessed 30 October 2021].


GRAS Sound & Vibration (2022). Available at: http://kemar.us/ (Accessed: 23 May 2022).

Macpherson, E. A. & Middlebrooks, J. C., 2002. Listener weighting of cues for lateral angle: The duplex theory of sound localization revisited. The Journal of the Acoustical Society of America, 111(5), p. 2219–2236.


Queiroz, M. & Sousa, G. H. M. d., 2010. Structured IIR Models for HRTF Interpolation. São Paulo: Computer Science Department, University of São Paulo.


Mattes, S., Nelson, P., Fazi, F. M. & Capp, M., 2012. Towards a human perceptual model for 3D sound localization. Southampton, s.n.


Noisternig, M., Musil, T., Sontacchi, A. & Holdrich, R., 2003. 3D binaural sound reproduction using a virtual ambisonic approach. IEEE International Symposium on Virtual Environments, Human-Computer Interfaces and Measurement Systems, Volume 3, pp. 174-178.


PlayStation (2020) The Road to PS5. Available at: https://www.youtube.com/watch?v=ph8LyNIT9sg (Accessed: 25 May 2022).

Pulkki, V., 1997. Virtual Sound Source Positioning Using Vector Base Amplitude Panning. JAES, 45(6), pp. 456-466.


Queiroz, M. & Sousa, G. H. M. d., 2011. Efficient Binaural Rendering of Moving Sound Sources Using HRTF Interpolation. Journal of New Music Research, 40(3), pp. 239-252.


Roginska, A. and Geluso, P. (eds) (2018) Immersive sound: the art and science of binaural and multi-channel audio. New York ; London: Routledge, Taylor & Francis Group.


Rumsey, F., 2001. Spatial Audio. 1st ed. Oxford: Taylor & Francis Group.


Rumsey, F. & McCormick, T., 2014. Sound and Recording : Applications and Theory. 7th ed. Burlington MA: Routledge.


Shimoda, T. et al., 2006. Spectral Cues for Robust Sound Localization with Pinnae. Beijing, s.n.


Tsingos, N., 2017. Object-Based Audio. In: Immersive Sound. The Art and Science of Binaural and Multi-Channel Audio. Oxfordshire: Routledge, pp. 244-273.


