The GOOOOL? Efficient, Immersive and Customizable Audio and Speech

If you’ve been watching the World Cup, you know that the “roar of the crowd” and the energetic commentators are as integral to the experience as the action on the field.

Video may be the main attraction of a broadcast or streamed match, but the fidelity and character of the audio are every bit as important. Try watching a World Cup match without it.

Many of the latest developments aren't widely available for this round of the World Cup. Some tools are still in development, and although FIFA is reportedly capturing the action with immersive sound technology, most viewers won't hear it. Immersive, object-oriented, personalized, high-efficiency: these are the words used to describe the latest technologies, but for the most part their evolution can only be fully communicated by experiencing them.

New audio codecs from standards bodies such as MPEG and ETSI, and proprietary technologies such as those from Dolby and DTS, are changing how audio is experienced in broadcasts, video streaming services and voice applications.

Here are highlights of audio improvements:

Immersive and Object-Oriented: Immersive means that high-quality audio can be rendered with proper spatial imaging on any loudspeaker configuration. In other words, an at-home listening environment doesn't require a separate speaker for each audio channel; immersive sound can be achieved with fewer speakers or with one or more sound bars. An additional height or overhead channel delivers a 3D effect.

The new systems come from the MPEG-H standard (standardized by MPEG) and the Dolby AC-4 standard (standardized by ETSI). The primary proprietary systems are Dolby Atmos and DTS:X. MPEG-H and AC-4 have both been adopted by the new ATSC 3.0 standard, which mandates AC-4 for North America and allows a choice between AC-4 and MPEG-H in other regions. The first ATSC 3.0 broadcasts are in South Korea, where broadcasters have adopted MPEG-H for their UHD broadcasts. New systems being built in the U.S. will use Dolby AC-4.

A new twist on improved multichannel sound delivery is object-oriented sound. Audio engineers take components of a piece of content (such as music or dialog) and convert them into "objects." Metadata directs where each object should be rendered, and the system can let a user control the individual elements.
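The idea can be sketched in code. The toy illustration below is not any real renderer's API; it simply shows the principle: each object carries an audio stem plus metadata (position, gain), and a renderer places it into whatever speaker layout is available, here a stereo pair.

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    # One "object": an audio stem plus metadata telling the
    # renderer where (and how loudly) to place it.
    name: str
    samples: list            # mono PCM stem (toy data)
    position: tuple          # (azimuth degrees, elevation degrees)
    gain: float = 1.0        # linear gain, adjustable downstream

def render_stereo(objects):
    """Toy renderer: pan each object by azimuth into a stereo bed."""
    n = max(len(o.samples) for o in objects)
    left, right = [0.0] * n, [0.0] * n
    for o in objects:
        # Constant-power pan from azimuth in [-90, 90] degrees.
        pan = max(-1.0, min(1.0, o.position[0] / 90.0))
        lg = o.gain * ((1 - pan) / 2) ** 0.5
        rg = o.gain * ((1 + pan) / 2) ** 0.5
        for i, s in enumerate(o.samples):
            left[i] += lg * s
            right[i] += rg * s
    return left, right

commentary = AudioObject("commentary", [0.5, 0.5], (0.0, 0.0))
crowd = AudioObject("crowd", [0.2, 0.2], (60.0, 30.0), gain=0.8)
L, R = render_stereo([commentary, crowd])
```

Because placement lives in metadata rather than in a fixed channel mix, the same objects could be re-rendered for a 5.1 layout, a sound bar or headphones without re-authoring the content.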

Personalization: Object-oriented control lets listeners turn individual elements up or down to match their preferences. Sports programming is particularly well-suited to this technology: a viewer could adjust the balance among the commentator, the crowd and on-field sounds. The systems also let consumers choose from multiple languages or additional commentary tracks.
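A minimal sketch of how that control might work under the hood. The metadata fields and object names here are hypothetical, assuming the content creator authorizes a gain range per element that the receiver must respect:

```python
# Hypothetical object metadata: each element carries the gain range
# (in dB) the content creator allows listeners to apply.
MIX = {
    "commentary": {"default_db": 0.0, "min_db": -12.0, "max_db": 6.0},
    "crowd":      {"default_db": 0.0, "min_db": -24.0, "max_db": 3.0},
    "field":      {"default_db": 0.0, "min_db": -6.0,  "max_db": 3.0},
}

def apply_preferences(mix, prefs):
    """Clamp each requested gain to the creator-authorized range."""
    out = {}
    for name, meta in mix.items():
        wanted = prefs.get(name, meta["default_db"])
        out[name] = max(meta["min_db"], min(meta["max_db"], wanted))
    return out

# A viewer who wants less commentary and more crowd noise:
gains = apply_preferences(MIX, {"commentary": -20.0, "crowd": 3.0})
# The -20 dB commentary request is clamped to the allowed -12 dB floor.
```

The clamping step reflects a real design concern: personalization is bounded by the creator's metadata so a viewer can rebalance the mix without breaking it entirely.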

Applications of these new technologies come mostly from services and media providers that offer Ultra High Definition (UHD) content. Apple TV 4K recently added support for Dolby Atmos, joining Netflix and other streaming providers that encode select 4K/UHD programs with Atmos. UHD Blu-ray discs are being encoded with Atmos and DTS:X. To experience these audio improvements, consumers need a complete ecosystem: an A/V receiver, a compatible videogame console or Blu-ray player, speakers and encoded content.

High Efficiency and Scalability: At the core of these new developments is more efficient compression. AAC and Dolby codecs have delivered a long string of efficiency gains over the years. Combining pure compression gains with other advances, such as multichannel systems and metadata, has created the new formats now being adopted for select UHD programming.

But not all improvements target high-end output. With speech and audio converging on more devices, much of the new development focuses on a single codec that scales from speech to music and back. Examples include Opus and xHE-AAC, scalable codecs that can handle both audio and speech at varying bit rates. Voice applications require high-quality output at low delay, while music playback can tolerate more latency.
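A rough illustration of that scalability. Opus, for instance, combines a linear-prediction speech coder (SILK) with a transform coder (CELT) and can switch or blend between them. The function below is a hypothetical mode selector in that spirit, not a real codec API; the thresholds are made up for the example.

```python
# Illustrative sketch only: how a speech-to-music scalable codec
# might pick a coding mode from content type, bitrate and latency.
def pick_mode(content, bitrate_kbps, latency_ms):
    if latency_ms < 20 or content == "speech":
        # Conversational path: linear-predictive speech coding is
        # efficient at low rates; blend in transform coding as the
        # bitrate budget grows.
        return "speech-lp" if bitrate_kbps < 32 else "hybrid"
    # Streaming music can trade latency for transform-coding quality.
    return "transform"

mode = pick_mode("speech", 16, 25)   # low-rate voice call
```

The point is that one codec spans the whole range, so a device doesn't need separate speech and music codecs for calls and playback.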

Delivering more realistic listening experiences with greater efficiency isn't just for UHD programming. The ultimate goal is realistic, efficient listening and speech experiences on every device: the smartphone, the headphone, the smart speaker, and whatever else comes along before the next World Cup in 2022.