Patent Translate
Powered by EPO and Google
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
Abstract: The described embodiments include an array of individually addressable audio drivers, at least one of which is configured to emit sound waves toward one or more surfaces of the listening environment for reflection to a listening area within the listening environment; a renderer configured to receive and process an audio stream and one or more metadata sets that are associated with the audio stream and specify a playback position for it; and a playback system configured to render the audio stream into a plurality of audio feeds corresponding to the array of audio drivers in accordance with the plurality of metadata sets, thereby providing a system for rendering object-based audio content through the system.
System for object-based audio rendering and playback in different listening environments
One or more embodiments relate generally to audio signal processing, and more particularly to
systems for rendering adaptive audio content through individually addressable drivers.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, problems mentioned in the background section, or associated with the subject matter of the background section, should not be assumed to have been previously recognized in the art. The subject matter of the background section merely represents different approaches, which may themselves be inventions.
Movie soundtracks usually comprise many different sound elements corresponding to the images on screen: dialog, noises, and sound effects that emanate from different places on the screen and combine with background music and ambient effects to create the overall viewing experience. Accurate playback requires that sounds be reproduced in a way that corresponds as closely as possible to what is shown on screen with respect to source position, intensity, movement, and depth. Traditional channel-based audio systems send audio content to individual speakers in a playback environment in the form of speaker feeds.
The introduction of digital cinema has created new standards for cinema sound, such as the incorporation of multiple channels of audio to allow greater creativity for content creators and a more enveloping, realistic auditory experience for audiences. Expanding beyond traditional speaker feeds and channel-based audio as a means of distributing spatial sound is critical. There is
significant interest in model-based audio descriptions that allow the listener to select the desired
playback configuration with the audio being rendered specifically for the listener's selected
configuration. In order to further enhance the listener experience, sound reproduction in a real
three-dimensional ("3D") or virtual 3D environment is becoming an increasing research and
development area. The spatial presentation of sound utilizes audio objects. An audio object is an audio signal with associated parametric source descriptions, such as apparent source position (e.g., 3D coordinates), apparent source width, and other parameters. Object-based audio may be used for many multimedia applications, such as digital movies, video games, and simulators, and is of particular importance in the home environment, where the relatively small size of the listening environment usually limits or restricts the number of speakers and their placement.
Various techniques have been developed to improve the sound system in a cinematic
environment and to more accurately capture and reproduce the creative intent of the creator in a
movie soundtrack. For example, next-generation spatial audio (also denoted as "adaptive audio")
formats have been developed. The format comprises a mix of audio objects and traditional
channel-based speaker feeds, along with location metadata of the audio objects. In a spatial audio
decoder, channels are sent directly to their associated speakers (if appropriate speakers are
present) or downmixed to existing speaker sets, and audio objects are rendered by the decoder in
a flexible manner. A parametric source description associated with each object, such as a
position trajectory in 3D space, is taken as an input along with the number and position of
speakers coupled to the decoder. The renderer then distributes the audio associated with each
object across the attached speaker sets using a specific algorithm such as the panning law. In this
way, the generated spatial intent of each object is best presented across the specific speaker
configurations present in the viewing room.
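The "specific algorithm such as the panning law" mentioned above can be illustrated with a generic constant-power (sine/cosine) pan between two speakers. This is a minimal sketch of the standard technique under simple assumptions, not the renderer's actual distribution algorithm:

```python
import math

def constant_power_pan(sample: float, pan: float) -> tuple[float, float]:
    """Distribute one sample across a left/right speaker pair.

    pan: 0.0 = hard left, 1.0 = hard right. The sine/cosine law keeps
    the total radiated power (L^2 + R^2) constant as the source moves.
    """
    theta = pan * math.pi / 2          # map the pan position to 0..90 degrees
    left = math.cos(theta) * sample    # left gain falls as the source moves right
    right = math.sin(theta) * sample   # right gain rises correspondingly
    return left, right

# A centered source feeds both speakers at equal gain, so power sums to 1.
l, r = constant_power_pan(1.0, 0.5)
```

The same idea generalizes to more speakers (e.g., vector-base amplitude panning), with the renderer choosing gains so that the perceived position matches the object's metadata.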
Current spatial audio systems have generally been developed for cinema use, and therefore involve deployment in large rooms and the use of relatively expensive equipment, including arrays of multiple speakers distributed around the viewing room. An increasing amount of the movie content now being produced is made available for playback in the home environment through streaming technology and advanced media technologies such as Blu-ray®. In addition, emerging technologies such as 3D television and advanced computer games and simulators encourage the use of relatively sophisticated equipment, such as large-screen monitors, surround-sound receivers, and speaker arrays, in homes and other consumer (non-cinema/theater) environments. However, equipment cost, installation complexity, and room size are practical constraints that prevent the full exploitation of spatial audio in many home environments.
For example, sophisticated object-based audio systems typically employ overhead or height speakers to play back sound that is intended to originate above the listener's head. In many cases, and especially in the home environment, such height speakers are not available. In this case, the height information is lost if such sound objects are played back only through floor- or wall-mounted speakers. Thus, there is a need for a system that can reproduce the complete spatial information of an adaptive audio system in a variety of different listening environments, such as speaker arrays with limited or no overhead speakers, headphones, and other listening environments that include only a portion of the full speaker array intended for playback.
The described systems and methods include a spatial audio format and system comprising updated content-creation tools, distribution methods, and an enhanced user experience, based on an adaptive audio system that includes new speaker and channel configurations and a new spatial description format enabled by a suite of advanced content-creation tools developed for cinema sound mixers.
Embodiments include systems that extend the cinema-based adaptive audio concept to other audio playback ecosystems, including home theater (e.g., A/V receivers, sound bars, and Blu-ray® players), electronic media (e.g., PCs, tablets, and mobile devices, including headphone playback), broadcast (e.g., TV and set-top boxes), music, gaming, live sound, user-generated content ("UGC"), and so on. The home-environment system includes components that provide compatibility with theatrical content, and features metadata definitions that include content-creation information to convey creative intent, information about audio objects, audio feeds, and spatial rendering, and media intelligence information regarding content-dependent metadata that indicates a content type such as dialog, music, or ambience. An adaptive audio definition may include standard speaker feeds with audio channels
and audio objects associated with spatial rendering information (such as size, velocity and
position in three-dimensional space). Also described are novel speaker layouts (or channel
configurations) that support multiple rendering techniques and an accompanying new spatial
description format. An audio stream (usually with channels and objects) is sent along with
metadata that describes the intent of the content creator or sound mixer, including the desired
position of the audio stream. Location can be expressed as a designated channel (from a given
channel configuration) or as 3D spatial location information. This channel and object format
provides the best of both channel-based and model-based audio scene description methods.
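The dual expression of position described here, as either a designated channel or 3D spatial location information, can be sketched as a small data structure. The class and field names below are illustrative only and are not part of the adaptive audio format itself:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AudioStreamMetadata:
    """Position metadata for one audio stream: either a designated channel
    (channel-based) or an explicit 3D location (object-based)."""
    stream_id: int
    channel: Optional[str] = None                             # e.g. "L", "Rs"
    position_3d: Optional[Tuple[float, float, float]] = None  # (x, y, z) in room coords

    def is_object(self) -> bool:
        # A stream carrying explicit 3D coordinates is rendered as an object;
        # otherwise it is routed like a traditional speaker feed.
        return self.position_3d is not None

bed = AudioStreamMetadata(stream_id=0, channel="L")
obj = AudioStreamMetadata(stream_id=1, position_3d=(0.2, 0.8, 1.0))
```

A renderer can then dispatch each stream on `is_object()`: channel streams go to their named speakers (or a downmix), while object streams are panned from their coordinates.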
Embodiments particularly relate to systems for rendering adaptive audio content that includes overhead sound intended to be played through speakers mounted overhead or in the ceiling. In a home or other small-scale listening environment that does not have overhead speakers available, the overhead sound is reproduced by speaker drivers configured to reflect the sound off the ceiling or one or more other surfaces of the listening environment.
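The geometry of such a reflected path can be illustrated with the standard mirror-image source model from room acoustics: the ceiling bounce behaves like a virtual speaker mirrored above the ceiling plane. This is a generic acoustics sketch with assumed room dimensions, not the patented rendering method:

```python
import math

def reflected_path_length(driver_h, listener_h, ceiling_h, horizontal_dist):
    """Length of the driver -> ceiling -> listener path, via the
    mirror-image source model: reflect the driver across the ceiling
    plane, then take the straight-line distance to the listener."""
    virtual_h = 2 * ceiling_h - driver_h          # driver mirrored above the ceiling
    return math.hypot(horizontal_dist, virtual_h - listener_h)

# Extra path length of the reflection vs. the direct path, converted to
# milliseconds of delay at roughly 343 m/s (speed of sound in air).
direct = math.hypot(3.0, 1.0 - 1.2)
reflected = reflected_path_length(driver_h=1.0, listener_h=1.2,
                                  ceiling_h=2.4, horizontal_dist=3.0)
extra_delay_ms = (reflected - direct) / 343.0 * 1000.0
```

A renderer that knows this geometry can pre-compensate delay and level so the reflected sound is perceived as coming from the virtual overhead position.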
Like reference numerals are used to indicate similar elements in the following figures. Although
the following drawings show various examples, one or more implementations are not limited to
the examples shown in the figures. FIG. 16 illustrates an exemplary speaker arrangement in a
surround system (e.g., 9.1 surround) provided with height speakers for reproduction of height
channels. FIG. 10 illustrates a combination of channels and object-based data to generate an
adaptive audio mix in one embodiment. FIG. 1 is a block diagram of a playback architecture for
use in an adaptive audio system in one embodiment. FIG. 6 is a block diagram illustrating
functional components for adapting a movie based on audio content for use in a listening
environment in one embodiment. FIG. 3C is a detailed block diagram of the components of FIG.
3A in one embodiment. FIG. 1 is a block diagram of functional components of an adaptive audio
environment in one embodiment. FIG. 7 illustrates a distributed rendering system in which a
portion of the rendering functionality is implemented in a speaker unit, according to one
embodiment. Fig. 6 illustrates the deployment of an adaptive audio system in an exemplary home
theater environment. FIG. 7 shows the use of an upward-firing driver that uses reflected sound to
simulate an overhead speaker in a home theater. FIG. 14 illustrates a speaker with multiple
drivers of a first configuration for use in an adaptive audio system with a reflected sound
renderer in one embodiment. FIG. 16 illustrates a speaker system with drivers distributed in
multiple enclosures for use in an adaptive audio system with a reflected sound renderer in one
embodiment. FIG. 7 illustrates an exemplary configuration of sound bars used in an adaptive
audio system using a reflected sound renderer in one embodiment. Fig. 6 shows an exemplary
arrangement of speakers with individually addressable drivers including an upward-firing
driver located in a viewing room. FIG. 7 shows a loudspeaker configuration of an adaptive audio
5.1 system using multiple addressable drivers for reflected audio in one embodiment. FIG. 16
shows a speaker configuration of an adaptive audio 7.1 system using multiple addressable
drivers for reflected audio in one embodiment. FIG. 2 illustrates the composition of a bidirectional interconnect in one embodiment. Fig. 6 illustrates, in one embodiment, an automatic
configuration and system calibration process for use in an adaptive audio system.
FIG. 5 is a flow diagram illustrating the process steps of a calibration method used in an adaptive
audio system in one embodiment. Fig. 7 illustrates the use of an adaptive audio system in an
exemplary television and sound bar use case. FIG. 10 shows a simplified representation of three-dimensional binaural headphone virtualization in an adaptive audio system, in one embodiment.
FIG. 1 is a block diagram of a headphone rendering system in one embodiment. FIG. 7 illustrates the
configuration of a BRIR filter for use in a headphone rendering system in one embodiment. Fig. 6
shows a basic head and torso model of an incident plane wave in free space that can be used in
an embodiment of a headphone rendering system. Fig. 6 shows a structural model of pinna features for use in the HRTF filter in one embodiment. FIG. 6 is a table illustrating specific
metadata definitions for use in an adaptive audio system using a reflected sound renderer for a
specific listening environment in one embodiment. FIG. 5 is a graph illustrating the frequency
response of a combined filter in one embodiment. FIG. 6 is a flow chart illustrating a process of
separating an input channel into subchannels in one embodiment. FIG. 7 illustrates an up-mixer
processing multiple audio channels into multiple reflected and direct sub-channels in one
embodiment. FIG. 6 is a flow chart illustrating a process of decomposing an input channel into
subchannels in one embodiment. FIG. 16 illustrates a speaker configuration for virtual rendering
of object-based audio using reflective height speakers in one embodiment.
Systems and methods are described for an adaptive audio system that renders reflected sound when the adaptive audio system lacks overhead speakers. Aspects of the one or more embodiments described herein may be implemented in an audio or audio-visual system that processes source audio information in a mixing, rendering, and playback system including one or more computers or processing devices executing software instructions. Any of the described embodiments may be used alone or in any combination with one another. Although various embodiments may be motivated by various deficiencies of the prior art, which may be discussed or alluded to in one or more places herein, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed herein. Some embodiments may only partially address some of the deficiencies discussed herein, or just one, and some embodiments may not address any of these deficiencies.
For purposes of the present description, the following terms have the associated meanings. The term "channel" means an audio signal plus metadata in which the position is coded as a channel identifier, e.g., left-front or right-top surround. "Channel-based audio" is audio formatted for playback through a predefined set of speaker zones with associated nominal positions, e.g., 5.1, 7.1, and so on. The term "object" or "object-based audio" means one or more audio channels with a parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width, etc. "Adaptive audio" means channel-based and/or object-based audio signals plus metadata that renders the audio signals based on the playback environment, using an audio stream plus metadata in which the position is coded as a 3D position in space. The term "listening environment" means any open, partially enclosed, or fully enclosed area, such as a room, that can be used for playback of audio content alone or with video or other content, and that can be embodied in a home, cinema, theater, auditorium, studio, gaming console, and the like. Such an area may have one or more surfaces disposed therein, such as walls or baffles, that can directly or diffusely reflect sound waves.
Adaptive Audio Formats and Systems: In an embodiment, the reflected-sound rendering system is configured to operate with a sound format and processing system that may be referred to as a "spatial audio system" or "adaptive audio system," which is based on audio formats and rendering technologies that allow enhanced audience immersion, greater artistic control, and system flexibility and scalability. An overall adaptive audio system generally comprises an audio encoding, distribution, and decoding system configured to generate one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements. Such a combined approach provides greater coding efficiency and rendering flexibility than either channel-based or object-based approaches taken separately. An example of an adaptive audio system that may be used in conjunction with the present embodiments is described in pending International Publication WO 2013/00638, published Jan. 10, 2013, which is incorporated herein by reference.
An exemplary implementation of the adaptive audio system and associated audio formats is the
Dolby® Atmos™ platform. Such a system incorporates a height (up/down) dimension that may be implemented as a 9.1 surround system or a similar surround sound configuration. FIG. 1 shows a speaker arrangement in a surround system (e.g., 9.1 surround) that provides height speakers for playback of height channels. The speaker configuration of the 9.1 system 100 has five speakers in the floor plane and four speakers in the height plane. In general, these speakers may be used to produce sound designed to emanate from more or less any position within the room.
However, any predefined speaker configuration, such as that shown in FIG. 1, necessarily limits the ability to accurately represent the position of a given sound source. For example, a sound source cannot be panned further left than the left speaker itself; the same applies to every speaker, whose positions thus form a one-dimensional (e.g., left-right), two-dimensional (e.g., front-back), or three-dimensional (e.g., left-right, front-back, up-down) geometric shape within which the downmix is constrained. A variety of different speaker types can be used in such
configurations. For example, certain enhanced audio systems may use speakers in 9.1, 11.1, 13.1,
19.4 or other configurations. Speaker types may include full range direct speakers, speaker
arrays, surround speakers, subwoofers, tweeters, and other types of speakers.
Audio objects can be thought of as sound elements that can be perceived as originating from a
particular physical location or location within the listening environment. Such objects may be
static (ie, stationary) or dynamic (ie, moving). Audio objects, along with other functions, are
controlled by metadata that defines the position of the sound at a given point in time. When
objects are played back, they are not necessarily output on a given physical channel, but are
rendered according to their position metadata using whatever speakers are present. The tracks in a session may be audio objects, and standard panning data is analogous to position metadata. In this way, content placed on the screen may pan in effectively the same way as channel-based content, but content placed in the surrounds may be rendered to an individual speaker if desired. While
the use of audio objects provides the desired control of discrete effects, other features of the
soundtrack may function effectively in a channel based environment. For example, many
environmental effects or echoes actually benefit from being provided to the array of speakers.
Although they can be treated as objects with sufficient width to fill the array, it is advantageous
to retain the functionality based on a particular channel.
The adaptive audio system is configured to support "beds" in addition to audio objects, where beds are effectively channel-based submixes or stems. These may be delivered for final playback (rendering) individually, or combined into a single bed, according to the intent of the content creator.
These beds may be created in different channel-based configurations, such as 5.1, 7.1, and 9.1, and in arrays that include overhead speakers, as shown in FIG. FIG. 2 illustrates the
combination of channels and object-based data to generate an adaptive audio mix in one
embodiment. As shown in process 200, channel-based data 202, which may for example be 5.1 or 7.1 surround sound data provided in the form of pulse-code modulated (PCM) data, is combined with audio object data 204 to produce an adaptive audio mix 208. The audio object data 204 is produced by combining elements of the original channel-based data with associated metadata that specifies certain parameters pertaining to the location of the audio objects. As conceptually shown
in FIG. 2, the authoring tool provides the ability to simultaneously generate an audio program
that includes a combination of speaker channel groups and object channels. For example, the
audio program may optionally include one or more speaker channels organized in groups (or
tracks, eg stereo or 5.1 tracks), descriptive metadata for one or more speaker channels, one or
more object channels, And one or more object channel description metadata.
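The combination described above, speaker-channel groups plus object channels with their respective metadata, can be sketched as a simple program container. All names and the dictionary layout are illustrative assumptions, not the actual bitstream format:

```python
def build_program(beds, objects):
    """Assemble an audio program from channel beds and audio objects.

    beds:    list of (track_name, channel_layout) pairs, e.g. ("score", "5.1")
    objects: list of (track_name, metadata) pairs, where the metadata dict
             carries at least a 3D position and optional descriptors.
    """
    return {
        "speaker_channel_groups": [
            {"track": name, "layout": layout} for name, layout in beds
        ],
        "object_channels": [
            {"track": name, "metadata": md} for name, md in objects
        ],
    }

program = build_program(
    beds=[("score", "5.1"), ("ambience", "7.1")],
    objects=[("helicopter", {"position": (0.5, 0.5, 1.0), "size": 0.2})],
)
```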
Adaptive audio systems effectively move beyond simple "speaker feeds" as a means of
distributing spatial audio. A sophisticated model-based audio description is then developed to
give listeners the freedom to choose a playback configuration that suits their individual needs or
budget, and to render audio dedicated to their individual selected configuration. At the upper
level, there are four main spatial audio description formats. (1) Speaker feed: audio is described as signals intended for loudspeakers located at nominal speaker positions. (2) Microphone feed: audio is described as signals captured by real or virtual microphones in a predefined configuration (the number of microphones and their relative positions). (3) Model-based description: audio is described in terms of a sequence of audio events at described times and positions. (4) Binaural: audio is described by the signals that arrive at the listener's two ears.
The four description formats are commonly associated with the following general rendering technologies, where "rendering" means conversion to electrical signals used as speaker feeds. (1) Panning: the audio stream is converted to speaker feeds using a set of panning laws and known or assumed speaker positions (typically rendered prior to distribution). (2) Ambisonics: the microphone signals are converted to feeds for a scalable array of loudspeakers (typically rendered after distribution). (3) WFS (wave field synthesis): sound events are converted to the appropriate speaker signals to synthesize the sound field (typically rendered after distribution). (4) Binaural: the L/R binaural signals are delivered to the L/R ears, typically through headphones, but also through speakers in conjunction with crosstalk cancellation.
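Binaural rendering of the kind described in (4) is commonly implemented by convolving a mono source with a left/right pair of head-related impulse responses (HRIRs). The sketch below uses made-up three-tap filters rather than measured HRIRs, purely to show the signal flow:

```python
import numpy as np

def binaural_render(mono: np.ndarray, hrir_l: np.ndarray, hrir_r: np.ndarray):
    """Convolve a mono source with left/right head-related impulse
    responses to produce the two ear signals (the L/R binaural feed)."""
    left = np.convolve(mono, hrir_l)
    right = np.convolve(mono, hrir_r)
    return left, right

# Toy HRIRs: the right ear receives an attenuated, slightly delayed copy,
# crudely mimicking a source located to the listener's left.
hrir_l = np.array([1.0, 0.0, 0.0])
hrir_r = np.array([0.0, 0.0, 0.6])
source = np.random.default_rng(0).standard_normal(1024)
ear_l, ear_r = binaural_render(source, hrir_l, hrir_r)
```

Real systems use long measured or modeled HRIR pairs indexed by source direction, and update them as the object position metadata changes.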
In general, any format can be converted to another format (which may require blind source
separation or similar techniques) and rendered using any of the aforementioned techniques.
However, not every transformation yields good results in practice. The speaker-feed format is the most common because it is simple and effective. The best (most accurate and reliable) sonic results are achieved by mixing/monitoring to the speaker feeds and then distributing the speaker feeds directly, because no processing is required between the content creator and the listener. The speaker-feed description provides the highest fidelity when the playback system is known in advance; however, the playback system and its configuration are often not known in advance. By contrast, the model-based description is the most adaptable because it makes no assumptions about the playback system and is therefore most easily applied to multiple rendering technologies. The model-based description captures spatial information efficiently, but becomes quite inefficient as the number of audio sources increases.
An adaptive audio system couples the benefits of both the channel- and model-based systems, with specific advantages including high sound quality, optimal reproduction of artistic intent when mixing and rendering with the same channel configuration, a single inventory with downward adaptation to the rendering configuration, relatively low impact on the system pipeline, and increased immersion through finer horizontal speaker spatial resolution and new height channels. The adaptive audio system provides several new features, including: a single inventory with downward and upward adaptation to a specific cinema rendering configuration, i.e., delayed rendering and optimal use of the speakers available in the playback environment; improved envelopment, including optimized downmixing to avoid inter-channel correlation (ICC) artifacts; increased spatial resolution through steer-thru arrays (e.g., the ability to dynamically assign an audio object to one or more loudspeakers within a surround array); and improved front-channel resolution through high-resolution center or similar speaker configurations.
The spatial effects of the audio signal are important in providing the listener with an immersive
experience. Sounds intended to be emitted from a particular area of the viewing screen or
viewing room should be reproduced through speakers placed at the same relative position. Thus,
the primary audio metadata of the sound event in the model-based description is position, but
other parameters such as magnitude, orientation, velocity and acoustic dispersion can also be
described. In order to convey position, model-based 3D audio space description requires a 3D
coordinate system. The coordinate system (e.g., Euclidean, spherical, cylindrical) used for transmission
is usually chosen for convenience or brevity. However, other coordinate systems may be used for
the rendering process. In addition to the coordinate system, a frame of reference is needed to
represent the position of the object in space. In systems that accurately reproduce position-based
sound in a variety of different environments, it may be important to select the correct reference
frame. In an allocentric frame of reference, audio source positions are defined relative to features of the rendering environment, such as the room walls and corners, standard speaker locations, and the screen location. In an egocentric frame of reference, positions are expressed relative to the listener's point of view, such as "in front of me," "slightly to the left," and so on. Scientific studies of spatial perception (audio and otherwise) show that the egocentric perspective is used almost universally.
However, for cinema, an allocentric frame of reference is generally more appropriate, for several reasons. For example, the precise location of an audio object is most important when there is an associated object on screen. With an allocentric reference, for every listening position and for any screen size, the sound localizes at the same relative position on the screen, e.g., "the middle third of the screen." Another reason is that mixers tend to think and mix in allocentric terms, and panning tools are laid out with an allocentric frame (i.e., the room walls), so mixers expect them to be rendered that way: "this sound should be on screen," "this sound should be off screen," "from the left wall," and so on.
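The earlier remark that the coordinate system used for transmission may differ from the one used for rendering can be illustrated with a routine spherical-to-Cartesian conversion. The axis and angle conventions below are assumptions for the sketch; actual systems define their own:

```python
import math

def spherical_to_cartesian(r, azimuth_deg, elevation_deg):
    """Convert a (radius, azimuth, elevation) source position to
    Cartesian (x, y, z). Azimuth is measured in the horizontal plane,
    elevation upward from it; conventions vary between systems."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = r * math.cos(el) * math.cos(az)
    y = r * math.cos(el) * math.sin(az)
    z = r * math.sin(el)
    return x, y, z

# A source 30 degrees around to the side at ear height, 2 m away.
x, y, z = spherical_to_cartesian(2.0, 30.0, 0.0)
```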
Despite the use of other-centred reference frames in a cinematic environment, there are several
instances in which an egocentric frame of reference may be useful and more appropriate. These include non-diegetic sounds, i.e., sounds that are not present in the "story space," such as mood music, for which an egocentrically uniform presentation may be desirable. Another case is near-field effects (e.g., a mosquito buzzing in the listener's left ear) that require an egocentric presentation. In addition, infinitely distant sound sources (and the resulting plane waves) may appear to come from a constant egocentric position (e.g., 30 degrees to the left); such sounds are more easily described egocentrically than allocentrically. In some cases an allocentric frame of reference can be used as long as a nominal listening position is defined, while some examples require egocentric representations that cannot yet be rendered. Although an allocentric reference may be more useful and appropriate, the audio representation should be extensible, since many new features, including egocentric representation, may be more desirable in certain applications and listening environments.
Embodiments of the adaptive audio system include a hybrid spatial description approach that includes a recommended channel configuration for optimal fidelity and for rendering diffuse or complex multi-point sources (e.g., stadium crowd, ambience) using an egocentric reference, plus an allocentric, model-based sound description that efficiently enables increased spatial resolution and scalability. FIG. 3 is a block diagram of a playback architecture
for use in an adaptive audio system in one embodiment. The system of FIG. 3 has processing
blocks that perform conventional object and channel audio decoding, object rendering, channel
remapping, and signal processing before audio is sent to the post processing and / or
amplification and speaker stages.
The playback system 300 is configured to render and play audio content generated through one
or more capture, preprocessing, authoring and encoding components. The adaptive audio preprocessor may have source separation and content type detection capabilities that automatically
generate the appropriate metadata through analysis of the input audio. For example, location
metadata may be derived from a multi-channel recording through an analysis of the relative levels of correlated inputs between channel pairs. The detection of content type, such as speech or
music, may be achieved, for example, by feature extraction and classification. Certain authoring tools enable the authoring of audio programs by optimizing the input and codification of the sound engineer's creative intent, allowing the engineer to create the final audio mix once, optimized for playback in virtually any playback environment. This is accomplished through the use of audio objects and positional data that are associated with and encoded with the original audio content. To place sound accurately around an auditorium, the sound engineer needs control over how the sound will ultimately be rendered given the actual constraints and features of the playback environment. The adaptive audio system provides this control by letting the sound engineer change how the audio content is designed and mixed through the use of audio objects and positional data. Once the adaptive audio content has been authored and encoded in the appropriate codec devices, it is decoded and rendered in the various components of playback system 300.
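The earlier point about deriving location metadata from the relative levels of channel pairs can be illustrated with a toy pan-position estimate that inverts the constant-power panning law. This is a simplified illustration under that assumption, not the pre-processor's actual analysis:

```python
import math

def estimate_pan(left, right, eps=1e-12):
    """Estimate a pan position in [0, 1] (0 = hard left, 1 = hard right)
    from the RMS levels of a stereo channel pair, by inverting the
    constant-power sine/cosine panning law."""
    rms_l = math.sqrt(sum(s * s for s in left) / len(left)) + eps
    rms_r = math.sqrt(sum(s * s for s in right) / len(right)) + eps
    # For a constant-power pan, R/L = tan(theta); recover theta from the levels.
    theta = math.atan2(rms_r, rms_l)
    return theta / (math.pi / 2)

# A source panned hard left yields an estimate near 0.
pan = estimate_pan([0.5, -0.5, 0.4], [0.0, 0.0, 0.0])
```

A real pre-processor would combine level cues with inter-channel correlation over short time frames, per frequency band, to track moving sources.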
As shown in FIG. 3, (1) legacy surround sound audio 302, (2) object audio 304 including object metadata, and (3) channel audio 306 including channel metadata are input to decoder stages 308 and 309 within processing block 310. The object metadata is rendered in object renderer 312, while the channel metadata may be remapped as necessary. Room configuration information 307 is provided to the object renderer and the channel
remapping component. The hybrid audio data is then processed through one or more signal
processing stages, such as equalizer and limiter 314, prior to output to B-chain processing stage
316 and playback through speaker 318. System 300 represents an example of a playback system
for adaptive audio. Other configurations, components, and interconnections are also possible.
<Playback Application> As noted above, an initial implementation of the adaptive audio format and system is in the digital cinema (D-cinema) context, which includes content capture (objects and channels) that is authored using novel authoring tools, packaged using an adaptive audio cinema encoder, and distributed using PCM or a proprietary lossless codec via the existing Digital Cinema Initiative (DCI) distribution mechanism. In this case, the audio content is intended to be decoded and rendered in a digital cinema to create an immersive spatial audio cinema experience.
However, with the advancement of previous cinemas such as analog surround sound, digital
multi-channel audio, etc., there is a need to provide listeners at home with an improved user
experience provided directly by the adaptive audio format. This requires that the format and
specific features of the system be adapted for use in a more limited listening environment. For
example, homes, rooms, small public halls, or similar locations may have reduced space,
acoustical characteristics, and equipment capabilities as compared to a cinema or theater
environment. For the purpose of explanation, the term "consumer-based environment" is
intended to include any non-cinema listening environment for consumer or professional use,
such as a house, studio, room, operating area, hall, and so on.
Audio content may be sourced and rendered independently, or may be associated with graphic
content, such as still images, light displays, video, etc.
FIG. 4A is a block diagram illustrating functional components for adapting movie-based audio
content for use in a listening environment in one embodiment. As shown in FIG. 4A, at
block 402, movie content that typically has a video soundtrack is captured and / or authored
using appropriate equipment and tools. In an adaptive audio system, at block 404, this content is
processed through encoding / decoding and rendering components and interfaces. The resulting
object and channel audio feed is then transmitted at 406 to the appropriate speakers in the
cinema or theater. In system 400, at 416, movie content is also processed for playback in a
listening environment, such as a home theater system. Due to limited space, a reduced number
of speakers, and so on, the listening environment may not be able to reproduce all of the sound
content intended by the content creator. However, embodiments are directed to systems and
methods that allow the original audio content to be rendered in a manner that minimizes the
constraints imposed by the reduced capability of the listening environment, and that allow the
position cues to be processed in a way that maximizes the available equipment. As shown in
FIG. 4A, movie audio content is
processed through a movie-to-consumer converter component 408. Here, movie audio content is
processed within the consumer content encoding and rendering chain 414. This chain also
processes the original consumer audio content captured and / or authored at block 412. Next at
416, the original consumer content and / or converted movie content is played back in the
listening environment. Thus, the spatial information encoded within the audio content can be
used at 416 to render the sound in a more immersive manner, even with the possibly limited
speaker configurations of a home or other consumer listening environment.
FIG. 4B shows the components of FIG. 4A in more detail. FIG. 4B illustrates an exemplary
distribution mechanism for adaptive audio movie content throughout the consumer ecosystem.
As shown in diagram 420, the original movie and TV content is captured at 422 and authored at
423 to provide the movie experience at 427 or the consumer environment experience at 434,
for playback in a variety of different environments. Similarly, user generated content (UGC) or
consumer-specific content is captured at 423 and authored at 425 for playback in a listening
environment at 434. For example, movie content for playback in the movie environment 427 is
processed through known movie processing 426. However, at system 420, the output of the
movie authoring toolbox 423 also includes audio objects, audio channels, and metadata that
convey the artistic intent of the sound mixer. This can be thought of as a mezzanine-style audio
package that can be used to generate multiple versions of movie content for playback. In one
embodiment, this functionality is provided by a movie-to-consumer adaptive audio converter
430. The converter takes the adaptive audio content as its input and extracts the audio and
metadata content appropriate for the desired consumer endpoint 434. The converter produces
separate and possibly different audio and metadata outputs depending on the consumer
distribution mechanism and the endpoint.
As shown in the example of system 420, the movie-to-consumer converter 430 feeds a
sound-for-picture (e.g., broadcast, disc, OTT, etc.) and game audio bitstream generation module
428. These two modules can adapt the movie content for delivery through multiple distribution
pipelines 432, all of which may distribute to consumer endpoints. For example, adaptive audio
movie content may be encoded using a codec
suitable for broadcast purposes, such as Dolby Digital Plus, which may be modified to convey
channels, objects, and associated metadata; the content is transmitted through the broadcast
chain via cable or satellite, and then decoded and rendered in the home for home theater or
television playback. Similarly, the same content may be encoded using a codec suitable for
bandwidth-limited online distribution, transmitted over a 3G or 4G mobile network, and then
decoded and rendered for playback by a mobile device using headphones. Other content
sources such as TVs, live
broadcasts, games and music may also use adaptive audio formats to generate and provide
content in next-generation spatial audio formats.
The system of FIG. 4B provides an enhanced user experience throughout the audio ecosystem,
which may include home theater (e.g., A/V receiver, sound bar, and BluRay®), electronic media
(e.g., PC, tablet, mobile device including headphone playback), broadcast (TV and set-top box),
music, games, live sound, user generated content, and so on. Such a system offers: immersive
enhancement for the audience of all endpoint devices; expanded artistic control for audio
content creators; improved content-dependent (descriptive) metadata for improved rendering;
expanded flexibility and scalability of playback systems; sound quality maintenance and
matching; and the opportunity for dynamic rendering of content based on user position and
interaction. The system itself has several components, including new mixing tools for content
creators; updated new packaging and coding tools for distribution and playback; in-home
dynamic mixing and rendering (appropriate for different listening environment configurations);
and additional speaker locations and designs.
The adaptive audio ecosystem is configured to be a fully comprehensive, end-to-end,
next-generation audio system using the adaptive audio format, encompassing content creation,
packaging, distribution, and playback/rendering across a wide number of endpoint devices and
use cases. As shown in FIG. 4B, the system originates with content captured from and for a
number of different use cases. These capture points include all relevant content formats,
including movies, TV, live broadcast (and sound), UGC, games, and music. The content passes
through the ecosystem in several key stages: pre-processing and authoring tools; conversion
tools (i.e., conversion of adaptive audio content for movie-to-consumer content distribution
applications); specific adaptive audio packaging/bitstream coding (which captures the audio
essence data as well as additional metadata and audio reproduction information); distribution
coding using existing or new codecs (e.g., DD+, TrueHD, Dolby Pulse) for efficient distribution
through various audio channels; transmission through the relevant distribution channels (e.g.,
broadcast, disc, mobile, Internet, etc.); and dynamic, endpoint-aware rendering to reproduce
and convey the adaptive audio user experience defined by the content creator, which provides
the benefits of the spatial audio experience. An adaptive audio
system can be used during rendering for a widely varying number of consumer endpoints, and
the rendering techniques applied can be optimized depending on the endpoint device. For
example, the home theater system and sound bar may have 2, 3, 5, 7 or 9 speakers at various
positions. Many other types of systems have only two speakers (eg TV, laptop, music dock) and
almost all commonly used devices have a headphone output (e.g., PC, laptop, tablet, mobile
phone, music player, etc.).
Current authoring and distribution systems for non-movie audio create and deliver audio
intended for reproduction at fixed speaker locations, with only limited knowledge of the type of
content conveyed in the audio essence (i.e., the actual audio reproduced by the playback
system). The adaptive audio system, however, provides a new, hybrid approach to audio
creation that includes options for both fixed-speaker-location-specific audio (left channel, right
channel, etc.) and object-based audio elements with generalized 3D spatial information
including position, size, and velocity. This hybrid approach provides a balance between fidelity
(provided by fixed speaker locations) and flexibility in rendering (generalized audio objects).
The system also
provides additional useful information about the audio content with the new metadata that is
paired with the audio essence by the content creator at content generation / authoring. This
information provides detailed information on the audio attributes that can be used during
rendering. Such attributes may include content type (e.g., dialog, music, effects, Foley,
background/ambience, etc.) as well as audio object information such as spatial attributes (e.g.,
3D position, object size, velocity, etc.) and useful rendering information (e.g., snap to speaker
location, channel weights, gain, bass management information, etc.). The audio content and
reproduction intent metadata can be created manually by the content creator or through the
use of automatic media intelligence algorithms that run in the background during the authoring
process and, if desired, can be reviewed by the content creator during a final quality control
phase.
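The per-object metadata attributes listed above can be pictured as a simple record. The following sketch is illustrative only; the field names and default values are assumptions made for the example, not the actual bitstream syntax.

```python
from dataclasses import dataclass, field

# Illustrative record of the per-object metadata attributes described in
# the text (content type, spatial attributes, rendering hints). Field
# names and defaults are assumptions, not the actual bitstream syntax.

@dataclass
class ObjectMetadata:
    content_type: str                   # "dialog", "music", "effects", "Foley", "ambience", ...
    position: tuple = (0.0, 0.0, 0.0)   # normalized (x, y, z); z carries height
    size: float = 0.0                   # 0 = point source, 1 = fills the room
    velocity: tuple = (0.0, 0.0, 0.0)   # for dynamic (moving) objects
    snap_to_speaker: bool = False       # render at the nearest speaker if True
    channel_weights: dict = field(default_factory=dict)
    gain_db: float = 0.0                # overall object gain
```

A renderer could inspect such a record at playback time, e.g. routing a "dialog" object with a raised `z` position to an upward-firing or height driver.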
FIG. 4C is a block diagram of functional components of an adaptive audio environment in one
embodiment. As shown in diagram 450, the system processes the coded bitstream 452
conveying both hybrid object and channel-based audio streams. The bitstream is processed by
the rendering/signal processing block 454. In one embodiment, at least a portion of this
functional block may be implemented within the rendering block 312 shown in FIG. 3. The
rendering function
454 implements various rendering algorithms for adaptive audio, as well as specific postprocessing algorithms such as up-mixing, reflection-oriented processing, and so on. The output
from the renderer is provided to the speaker 458 through a bi-directional interconnect 456. In
one embodiment, the speaker 458 has a number of individual drivers that can be arranged in
surround sound or similar configuration. The drivers are individually addressable and may be
embodied in individual enclosures or multiple driver cabinets or arrays. System 450 may have a
microphone 460 that provides a measurement of room characteristics that can be used to
calibrate the rendering process. System configuration and calibration functions are provided at
block 462. These functions may be included as part of the rendering component. Alternatively,
these functions may be implemented as separate components that are functionally coupled to the
renderer. The bi-directional interconnect 456 provides a feedback signal path from the speaker
environment (view room) to the calibration component 462.
Distributed/Centralized Rendering In one embodiment, the renderer 454 has functional
processing embedded in a central processor associated with the network. Alternatively, the
renderer may have functional processing performed at least in part by circuitry in or coupled to
each of the drivers of the array of individually addressable audio drivers. In the case of
centralized processing, rendering data is transmitted to the individual drivers in the form of
audio signals transmitted by the individual audio channels. In the distributed processing
embodiment, the central processor may perform anywhere from no rendering up to partial
rendering of the audio data, with the final rendering performed in the drivers. In this example,
powered speakers/drivers are needed to enable the on-board processing functions. One
exemplary
implementation is the use of a speaker with an integrated microphone. Here, the rendering is
adapted based on the microphone data and the adjustment is done in the loudspeaker itself. This
eliminates the need to transmit the microphone signal back to the central renderer for calibration
and / or configuration purposes.
FIG. 4D shows, in one embodiment, a distributed rendering system in which some of the
rendering functionality is implemented in the speaker. As shown in diagram 470, the encoded
bitstream 471 is input to a signal processing stage 472 that includes partial rendering
components.
The partial renderer may perform any suitable proportion of the rendering functions, from no
rendering at all up to, for example, 50% or 75%. The original encoded bitstream or partially
rendered bit
stream is then sent to the speaker 472 via interconnect 476. In this embodiment, the speaker is a
self-powered unit having a driver and a DC power supply or an onboard battery. The speaker unit
472 also includes one or more integrated microphones. The renderer and optional calibration
function 474 are integrated into the speaker unit 472. The renderer 474 performs final or
complete rendering operations on the coded bitstream, as necessary, depending on how much
rendering is performed by the partial renderer 472. In a fully distributed implementation, the
speaker calibration unit 474 may use sound information generated by the microphone to
perform calibration directly to the speaker driver 472. In this example, interconnect 476 may be
only a unidirectional interconnect. In an alternative or partially decentralized implementation, an
integrated or other microphone may be associated with signal processing stage 472 to provide
sound information back to optional calibration unit 473. In this example, interconnect 476 is a
bi-directional interconnect.
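The centralized/distributed split described above can be sketched in miniature. The pan law and calibration model below are illustrative assumptions, not the actual partitioning of stages 472 and 474.

```python
# Hedged sketch of splitting rendering between the central processor and
# a self-powered speaker, per the partial-rendering idea in the text. The
# pan law and the calibration model are illustrative assumptions.

def central_partial_render(obj_samples, position):
    """Central stage (partial render): pan one object into L/R feeds from
    its normalized x position (0.0 = full left, 1.0 = full right)."""
    x = position[0]
    return {"L": [(1.0 - x) * s for s in obj_samples],
            "R": [x * s for s in obj_samples]}

def speaker_final_render(feed, mic_gain_correction):
    """On-board stage: apply the calibration gain measured with the
    speaker's own integrated microphone, so no microphone signal needs
    to travel back to the central renderer."""
    return [mic_gain_correction * s for s in feed]
```

In a fully distributed variant, `central_partial_render` would pass the coded stream through untouched and the speaker would perform both steps.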
Listening Environment The implementation of the adaptive audio system is intended to be
deployed in a variety of different listening environments. These include three main areas of
consumer application: home theater systems, televisions and sound bars, and headphones, and
may also include cinemas, theatres, studios, and other large or professional environments. FIG. 5
illustrates the deployment of an adaptive audio system in an exemplary home theater
environment. The system of FIG. 5 illustrates a superset of the components and functions that
may be provided by an adaptive audio system. Certain features may be reduced or eliminated
based on the needs of the user while providing an enhanced experience. System 500 includes
various different speakers and drivers in various different cabinets or arrays 504. The speakers
include individual drivers that provide front-firing, side-firing, and upward-firing capability, as
well as dynamic virtualization of the audio using certain audio processing techniques. Diagram
500 shows a number of speakers deployed in a standard 9.1 speaker configuration. These
include left and right height speakers (LH, RH), left and right speakers (L, R), a center speaker
(shown as a modified center speaker), and left and right surround and back speakers (LS, RS,
LB, and RB); the low-frequency element (LFE) is not shown.
FIG. 5 illustrates the use of a center channel speaker 510 in a central location of the room or
theater. In one embodiment, the speaker is implemented with a modified center channel or
high-resolution center channel 510. Such a speaker may be a front-firing center channel array
with individually addressable speakers that allow discrete panning of audio objects through the
array to match the movement of video objects on the screen. It may be embodied as a
high-resolution center channel (HRC) speaker, such as that described in International Patent
Publication WO 2011/119401, published on September 29, 2011, which is incorporated herein
by reference. The HRC speaker 510 may also include side-firing speakers, as shown. These can
be activated
and used when the HRC speaker is used not only as a central speaker but also as a speaker with
sound bar capability. HRC speakers may be incorporated on and / or laterally of screen 502 to
provide a two-dimensional high resolution panning option for audio objects. The center speaker
510 may implement a steerable sound beam with additional drivers and having separately
controlled sound zones.
The system 500 also includes a near field effect (NFE) speaker 512 that may be placed at or near
the front of the listener, such as on a table in front of the seating position. With adaptive audio, it
is possible to bring an audio object into the room rather than having it locked to the perimeter
of the room; therefore, there is the option of having the object traverse the three-dimensional
space. An example is where
an object originates at the L speaker, travels through the room through the NFE speaker and
ends at the RS speaker. A variety of different speakers may be suitable for use as NFE speakers,
such as wireless battery powered speakers.
FIG. 5 illustrates the use of dynamic speaker virtualization to provide an immersive user
experience in a home theater environment. Dynamic speaker virtualization is realized through
dynamic control of speaker virtualization algorithm parameters based on object space
information provided by adaptive audio content. This dynamic virtualization is shown in FIG. 5
for the L and R speakers. It is natural to think of it as creating the perception of an object
moving along the sides of the room; a separate virtualizer may be used for each relevant object,
and the combined signals can be sent to the L and R speakers to create a multiple-object
virtualization effect. The dynamic virtualization effect is also shown for the NFE speaker, which
is intended to be a stereo speaker (with two independent inputs). This
speaker may be used to generate a diffuse or point source near-field audio experience, along with
audio objects and position information. Similar virtualization effects can be applied to any or all
other speakers in the system. In one embodiment, the camera may provide additional listener
position and identification information that may be used by the adaptive audio renderer to
provide a more inspiring experience more faithful to the artistic intent of the mixer.
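The per-object virtualizer summing described above can be sketched as follows. A real virtualizer would apply HRTF-derived filters; here each object simply receives a constant-power pan from an assumed azimuth metadata value, which is a simplification made for illustration.

```python
import math

# Minimal sketch of the per-object virtualizer summing described in the
# text. A real virtualizer would use HRTF-based filters; here each object
# just gets a constant-power pan derived from its azimuth metadata.

def virtualize(obj_samples, azimuth_deg):
    """Return (left, right) signals for one object; -90 = hard left,
    +90 = hard right."""
    theta = math.radians((azimuth_deg + 90.0) / 2.0)   # map to 0..90 deg
    gl, gr = math.cos(theta), math.sin(theta)
    return [gl * s for s in obj_samples], [gr * s for s in obj_samples]

def mix_objects(objects):
    """Sum the virtualized signals of all (samples, azimuth) objects
    into the combined L and R speaker feeds."""
    n = len(objects[0][0])
    left, right = [0.0] * n, [0.0] * n
    for samples, az in objects:
        l, r = virtualize(samples, az)
        left = [a + b for a, b in zip(left, l)]
        right = [a + b for a, b in zip(right, r)]
    return left, right
```

Each object is virtualized independently and the results are summed, mirroring the "separate virtualizer per object, combined signal to the L and R speakers" arrangement above.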
The adaptive audio renderer understands the spatial relationship between the mix and the
playback system. In some instances of the playback environment, discrete speakers may be
available in all relevant areas of the room, including overhead locations as shown in FIG. In these
instances where discrete speakers are available at a particular location, the renderer can be
configured to "snap" objects to the closest speaker instead of creating a phantom image
between two or more speakers through the use of panning or a speaker virtualization algorithm.
While this slightly distorts the spatial representation of the mix, it allows the
renderer to avoid unintended phantom images. For example, if the angular position of the left
speaker in the mixing stage does not correspond to the angular position of the left speaker in the
reproduction system, enabling this function avoids having a constant phantom image of the
original left channel.
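The snap decision described above can be sketched as a simple nearest-speaker test. The angular threshold below is an illustrative assumption; a real renderer would weigh spatial distortion against phantom-image quality.

```python
# Hedged sketch of the "snap" decision described in the text: if a
# discrete speaker sits close enough to the object's intended angular
# position, send the object there; otherwise phantom-image (pan) between
# the two nearest speakers. The threshold is an illustrative assumption.

def snap_or_pan(object_angle, speaker_angles, snap_threshold_deg=10.0):
    """Return ("snap", speaker_angle) or ("pan", (spk_a, spk_b)) for an
    object at object_angle, all in degrees of azimuth."""
    by_distance = sorted(speaker_angles, key=lambda a: abs(a - object_angle))
    nearest = by_distance[0]
    if abs(nearest - object_angle) <= snap_threshold_deg:
        return ("snap", nearest)
    # Otherwise create a phantom image between the two nearest speakers.
    return ("pan", tuple(sorted(by_distance[:2])))
```

For a 5.1-like layout, an object at 32 degrees snaps to the 30-degree speaker, while an object at 70 degrees is panned between the 30- and 110-degree speakers.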
However, in many cases, and especially in the home environment, certain speakers such as
overhead speakers mounted on the ceiling are not available. In this example, certain
virtualization techniques are implemented by the renderer to reproduce overhead audio content
through existing floor or wall mounted speakers. In one embodiment, the adaptive audio system
includes changes to the standard configuration through the inclusion of both front firing
capability and top (or "upward") firing capability of each speaker. In traditional home
applications, speaker manufacturers attempting to introduce new driver configurations other
than front-firing transducers have been faced with the problem of trying to identify which of the
original audio signals (or modifications of them) should be sent to these new drivers. With the
adaptive audio system, there is very specific information as to which audio objects should be
rendered above the standard horizontal plane. In one embodiment, height information present
in the adaptive audio
system is rendered using an upward firing driver. Similarly, side-firing speakers can be used to
render certain other content such as atmosphere effects.
One advantage of the upward-firing drivers is that they can be used to reflect sound from a
rigid ceiling surface to simulate the presence of overhead/height speakers positioned in the
ceiling. A compelling attribute of adaptive audio content is that spatially diverse audio is
reproduced using an array of overhead speakers. As stated above, however, in many instances
the installation of overhead speakers is too expensive or impractical in a home environment. By
simulating height speakers using speakers nominally positioned in the horizontal plane, a
compelling 3D experience can be created with speakers that are easy to position. In this
example, the adaptive audio system uses the upward-firing/height-simulating drivers in a new
way, in which the audio objects and their spatial reproduction information are used to create
the audio being reproduced by the upward-firing drivers.
FIG. 6 illustrates the use of an upward-firing driver that uses reflected sound to simulate a
single overhead speaker in a home theater. It should be noted that any number of upward-firing
drivers may be used in combination to create multiple simulated height speakers. Alternatively,
a number of upward-firing drivers may be configured to transmit sound to substantially the
same point on the ceiling to achieve a particular sound intensity or effect.
The diagram 600 shows an example in which the normal listening position 602 is located at a
specific place in the room. The system does not have any height speakers transmitting audio
content including height cues. Instead, the speaker cabinet or speaker array 604 has an
upstream firing driver along with a front firing driver. The upward firing driver is configured to
transmit its sound wave 606 (with respect to position and tilt angle) to a particular point 608 on
the ceiling. The sound wave is reflected back to the listening position 602. The ceiling is assumed
to have the proper materials and configuration to properly reflect the sound into the room. The
associated characteristics (eg, size, power, position, etc.) of the upward firing driver may be
selected based on the configuration of the ceiling, the size of the room and other associated
characteristics of the listening environment. Although only one upward-firing driver is shown in
FIG. 6, in some embodiments multiple upward-firing drivers may be incorporated into a
reproduction system.
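The geometry of FIG. 6 can be worked out with the mirror-image method: reflecting the listener across the ceiling plane gives a virtual image to aim the driver at. The helper below is a hypothetical sketch with illustrative dimensions in meters, not part of any described system.

```python
import math

# Hypothetical geometry helper for FIG. 6: choose the tilt angle that
# makes a ray from the upward-firing driver bounce off a flat ceiling
# (point 608) and arrive at the listening position 602. Uses the
# mirror-image method; all dimensions are illustrative, in meters.

def tilt_angle_deg(driver_h, listener_h, ceiling_h, horiz_dist):
    """Angle above horizontal at which to aim the driver so the ceiling
    reflection lands at the listener's ear height."""
    # Reflecting across the ceiling plane maps the listener to a virtual
    # image at height 2*ceiling_h - listener_h; aim straight at it.
    image_h = 2.0 * ceiling_h - listener_h
    return math.degrees(math.atan2(image_h - driver_h, horiz_dist))
```

For a 2.5 m ceiling, driver and listener ear height of 1.0 m, and 4.0 m of horizontal separation, this gives roughly 37 degrees, consistent with the 30- to 60-degree tilt range mentioned later for upward-firing drivers.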
In one embodiment, the adaptive audio system uses an upward firing driver to provide height
elements. In general, incorporating signal processing to introduce perceptual height cues into the
audio signal provided to the upstream firing driver improves the positioning and perceived
quality of the virtual height signal. For example, parametric perceptual binaural auditory models
have been developed to generate height cue filters. The model improves the perceptual quality of
reproduction when used to process the audio being reproduced by the upward firing driver. In
one embodiment, the height cue filter is derived from both the physical speaker position
(approximately, the same height as the listener) and the reflective speaker position (above the
listener). For physical speaker position, the directional filter is determined based on the model of
the outer ear (or pinna). The inverse transform of this filter is then determined and used to
remove height cues from the physical speakers. Next, for the reflected speaker position, a second
directional filter is determined using the same model of the outer ear. This filter is applied
directly, essentially reproducing the cues that the ear would receive if the sound were above the
listener.
In practice, these filters may be combined such that a single filter can (1) remove height cues
from physical speaker locations and (2) insert height cues from reflective speaker locations. FIG.
16 is a graph showing the frequency response of the combined filter. A combined filter may be
used to allow for certain adjustment capabilities in terms of the aggressiveness or quantity of the
applied filter. For example, in some instances it may be beneficial not to remove the physical
speaker height cues completely, or not to apply the reflected speaker height cues at full
strength, since only part of the sound from the physical speaker arrives directly at the listener
(the remainder being reflected off the ceiling).
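The combined remove-and-insert filter with an adjustable application amount can be sketched numerically. The per-band gains below are invented for illustration; real filters would come from the parametric pinna (outer-ear) model described above.

```python
# Sketch of the combined height-cue filter described in the text, modeled
# as per-frequency-band magnitude gains. The band gains in the example
# are invented for illustration; real filters would come from the
# parametric pinna (outer-ear) model.

def combined_filter(physical_cue, reflected_cue, amount=1.0):
    """Per-band gains that (1) remove the physical-speaker height cue
    (division = inverse filter) and (2) insert the reflected-speaker cue
    (multiplication). 'amount' in [0, 1] scales how aggressively the
    combined filter is applied, since some sound still reaches the
    listener directly from the physical speaker."""
    full = [r / p for p, r in zip(physical_cue, reflected_cue)]
    # Blend linearly toward a flat (unity) response when amount < 1.
    return [1.0 + amount * (g - 1.0) for g in full]
```

With `amount=0.0` the filter is flat (no processing); with `amount=1.0` the full cue substitution is applied; intermediate values give the partial application discussed above.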
Speaker Configuration The primary consideration of the adaptive audio system for personal use
or similar applications is the speaker configuration. In one embodiment, the system uses
individually addressable drivers. An array of such drivers is configured to provide a combination
of both direct and reflected sound sources. A bi-directional link to the system control (e.g., A/V
receiver, set-top box) allows audio and configuration data to be sent to the speaker, and
speaker and sensor information to be sent back to the control, creating an active closed-loop
system. For purposes of explanation, the term "driver" means a single electroacoustic
transducer that
produces sound in response to an electrical audio input signal. The driver may be implemented in
any suitable type, geometry, and size, and may include horns, cones, ribbon transducers, etc. The
term "speaker" means one or more drivers in a single housing. FIG. 7A shows a speaker having a
plurality of drivers of a first configuration in one embodiment. As shown in FIG. 7A, the speaker
housing 700 has a number of individual drivers mounted within the housing. Typically, the
housing includes one or more front firing drivers 702, such as a low range speaker, a mid range
speaker or a high range speaker, or any combination thereof. One or more side firing drivers 704
may also be included. The front-firing and side-firing drivers are usually mounted flush against
the sides of the housing so that they project sound perpendicularly outward from the vertical
plane defined by the speaker, and these drivers are usually permanently fixed within the cabinet
700. In an adaptive
audio system featuring reflected sound rendering, one or more upward-firing drivers 706 are
also provided. These drivers are positioned so that they project sound at an angle up to the
ceiling, as shown in FIG. 6, where the sound then bounces back down to a listener. The degree
of tilt may be set depending on room characteristics and system requirements. For example, the
upward-firing driver 706 may be tilted up between 30 and 60 degrees and may be positioned
above the front-firing driver 702 in the speaker enclosure 700 so as to minimize interference
with the sound waves produced by the front-firing driver 702. The upward-firing driver 706 may
be installed at a fixed angle, or it may be installed such that the tilt angle can be adjusted
manually. Alternatively, a servo mechanism may be used to allow automatic or electrical control
of the tilt angle and projection direction of the upward-firing driver. For certain sounds, such as
ambient sound, the upward-firing driver may point straight up out of the top surface of the
speaker enclosure 700 to create what might be referred to as a "top-firing" driver.
In this case, depending on the acoustic characteristics of the ceiling, a large component of the
sound may reflect straight back down onto the speaker. In most cases, therefore, some tilt
angle is typically used, as shown in FIG. 6, to help project the sound through reflection off the
ceiling to a different, more central location within the room.
FIG. 7A is intended to show an example of a speaker and driver configuration. Many other
configurations are also possible. For example, the upward-firing driver may be provided in its
own housing for use with existing speakers. FIG. 7B shows a speaker system with drivers
distributed among multiple enclosures in one embodiment. As shown in FIG. 7B, the
upward-firing driver 712 is provided in a separate housing 710, which can be placed close to or
on top of the housing 714 containing front and/or side firing drivers 716 and 718. As used in
many
home theater environments, the driver may be contained within a speaker sound bar where
multiple small or medium sized drivers are arranged along an axis within a single horizontal or
vertical enclosure. FIG. 7C shows the placement of drivers in the sound bar in one embodiment.
In the present example, the sound bar housing 730 is a horizontal sound bar having a side firing
driver 734, an upward firing driver 736, and a front firing driver 732. FIG. 7C shows only an
exemplary configuration; virtually any number of drivers may be used for each of the functions:
front, side, and upward firing.
For the embodiments of FIGS. 7A-7C, it should be noted that the drivers may be of any suitable
shape, size, and type, depending on the required frequency response characteristics and any
other relevant constraints, such as size, power rating, component cost, and so on.
In a standard adaptive audio environment, multiple speaker enclosures are included in the room.
FIG. 8 shows an exemplary placement of speakers having individually addressable drivers,
including upward-firing drivers, positioned within a room. As shown in FIG. 8, room 800
includes four separate speakers 806, each having at least one front-firing, side-firing, and
upward-firing driver. The room may also contain fixed drivers used for surround sound
applications, such as a center speaker 802 and a subwoofer or LFE 804. As can be seen in FIG.
8, depending on the size of the room and the individual speaker units, proper placement of the
speakers 806 within the room can provide a rich audio environment resulting from the
reflection of sound off the ceiling from the multiple upward-firing drivers. The speakers may be
aimed to provide reflection from one or more points on the ceiling surface depending on the
content, room size, listener position, acoustic characteristics, and other relevant parameters.
Speakers used in an adaptive audio system for home theater or similar environments may use a
configuration based on existing surround sound configurations (e.g., 5.1, 7.1, 9.1, etc.). In this
case, a number of drivers are provided and defined in accordance with known surround sound
conventions, with additional drivers and definitions provided for the upward-firing sound
components.
FIG. 9A shows the speaker configuration of an adaptive audio 5.1 system using multiple
addressable drivers for reflected audio in one embodiment. In configuration 900, a standard 5.1
loudspeaker footprint with LFE 901, center speaker 902, L/R front speakers 904/906, and L/R
rear speakers 908/910 is provided with eight additional drivers, giving a total of 14 individually
addressable drivers. These eight additional drivers are denoted "upward" and "sideward", in
addition to "forward" (or "front"), in each speaker unit 902-910. The direct forward drivers can
be driven by subchannels containing adaptive audio objects and any other components
designed to have a high degree of directivity. The upward-firing (reflected) drivers may carry
subchannel content that is more omnidirectional or directionless in nature, but are not limited
to such content; examples would be background music or environmental sounds. If the input to
the system consists of legacy surround sound content, this content may be intelligently factored
into direct and reflected subchannels and fed to the appropriate drivers.
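The factoring of a legacy channel into direct and reflected subchannels can be sketched as follows. The per-channel diffuseness weight is an invented stand-in for the "intelligent" analysis; a real system would estimate it from the content itself.

```python
import math

# Hedged sketch of factoring a legacy surround channel into direct and
# reflected subchannels, as described in the text. The per-channel
# diffuseness weight is an invented stand-in for content analysis; a
# real system would estimate it from the audio itself.

def split_legacy_channel(samples, diffuseness):
    """diffuseness in [0, 1]: 0 = fully direct (e.g., dialog),
    1 = fully reflected (e.g., ambience). Equal-power split."""
    g_direct = math.cos(diffuseness * math.pi / 2.0)
    g_reflect = math.sin(diffuseness * math.pi / 2.0)
    return ([g_direct * s for s in samples],
            [g_reflect * s for s in samples])
```

The direct output would feed a forward driver and the reflected output an upward-firing driver; the equal-power split keeps the total energy of the channel constant.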
For direct subchannels, the loudspeaker enclosure may have a driver whose central axis bisects the "sweet spot," or acoustic center, of the room. The upward firing drivers may be positioned such that the angle between the driver's central axis and the acoustic center lies somewhere in the range of 45 to 180 degrees. In the case of positioning a driver at 180 degrees, the rear-facing driver can provide sound diffusion by reflection from the back wall. This configuration uses the acoustic principle that, after time-aligning the upward firing driver with the direct driver, the early-arriving signal components are coherent, while the late-arriving components benefit from the natural diffusion provided by the room.
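The geometry of the upward firing reflection discussed above can be made concrete with a small sketch. This is purely illustrative: it assumes a single specular bounce off a flat ceiling and a listening height equal to the driver height, neither of which is stated in the text, and the function name is invented.

```python
import math

def reflection_landing_point(driver_height_m, ceiling_height_m, tilt_deg):
    """Distance along the floor from the speaker at which sound aimed
    upward at tilt_deg above horizontal returns to listening height
    after a single specular bounce off a flat ceiling.
    Listening height is assumed equal to the driver height."""
    rise = ceiling_height_m - driver_height_m          # vertical travel to ceiling
    run_up = rise / math.tan(math.radians(tilt_deg))   # horizontal travel to bounce point
    # Specular reflection: the ray descends at the same angle it rose.
    return 2.0 * run_up

# A driver at 1.0 m tilted 45 degrees under a 2.5 m ceiling:
# it rises 1.5 m over 1.5 m horizontally, then descends symmetrically.
print(reflection_landing_point(1.0, 2.5, 45.0))  # 3.0
```

Shallower tilt angles place the reflection farther from the speaker, which is one way to visualize why the driver angle and ceiling height together determine where the reflected height cue arrives in the listening area.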
In order to achieve the height cues provided by the adaptive audio system, the upward firing driver is tilted upward from the horizontal plane; in the extreme case, it may be positioned to emit straight up and reflect from a reflective surface such as a flat ceiling, or from an acoustic diffuser placed directly above the enclosure. To provide additional directivity, the center speaker can use a sound bar configuration (as shown in FIG. 7C) with the ability to steer sound across the screen, providing a high-resolution center channel.
The 5.1 configuration of FIG. 9A can be expanded by adding two additional back enclosures
similar to the standard 7.1 configuration. FIG. 9B shows the speaker configuration of an adaptive
audio 7.1 system using multiple addressable drivers for reflected audio in one such embodiment.
As shown in configuration 920, two additional enclosures 922 and 924 are placed in the "left side surround" and "right side surround" positions, facing the side walls. The two additional enclosures are similar to the front enclosures, with upward firing drivers set to bounce sound off the ceiling halfway between the existing front and rear pairs, and side firing drivers facing the side walls. Such incremental additions can be made as many times as desired, with each additional pair filling gaps along the side or back walls.
FIGS. 9A and 9B show only some examples of possible configurations of an enhanced surround sound speaker layout that can be used with upward and side firing speakers in an adaptive audio system for a listening environment. Many other configurations are also possible.
As an alternative to the n.1 configurations, a more flexible pod-based system may be used, in which each driver is housed in its own enclosure and mounted at any convenient location. This may use a driver configuration as shown in FIG. 7B. These individual units may be clustered in a manner similar to the n.1 configurations, or they may be spread individually around the room. The pods are not necessarily limited to placement at the edges of the room; they may also be placed on any surface (e.g., coffee table, bookshelf, etc.) within the listening environment. Such a system is easy to expand, allowing the user to add more speakers over time to create a more immersive experience. If the speakers are wireless, the pod system may include the ability to dock the speakers for recharging. In this design, the pods may be docked together so that they act like a single speaker while recharging, perhaps to listen to stereo music, and then undocked and positioned around the room for adaptive audio content.
In order to extend the configurability and accuracy of the adaptive audio system, a number of sensors and feedback devices may be added to the enclosures of the upward firing addressable drivers to inform the renderer of characteristics that can be used by the rendering algorithm. For example, a microphone installed in each enclosure allows the system to measure the phase, frequency, and reverberation characteristics of the room, together with the positions of the speakers relative to one another, using triangulation and the HRTF-like functions of the enclosures themselves. Inertial sensors (e.g., gyroscopes, compasses, etc.) may be used to detect the direction and angle of each enclosure, and optical and visual sensors (e.g., laser-based infrared rangefinders) can be used to provide position information relative to the room itself. These represent just a few of the additional sensors that can be used in the system; others are also possible.
Such sensor systems can be further extended by allowing the position of the drivers and/or the acoustic modifiers of the enclosures to be adjusted automatically by electromechanical servos. This allows the orientation of the drivers to be changed at runtime ("active steering") to suit their positioning in the room relative to the walls and other drivers. Similarly, any acoustic modifiers (such as baffles, horns, or waveguides) may be tuned to provide the correct frequency and phase response for optimal playback in any room configuration ("active tuning"). Both active steering and active tuning may be performed during initial room configuration (e.g., in conjunction with an automatic EQ/automatic room configuration system) or during playback in response to the content being rendered.
Bidirectional Interconnection Once configured, the speakers must be connected to the rendering system. Traditional interconnections are typically of two types: speaker-level inputs for passive speakers and line-level inputs for active speakers. As shown in FIG. 4C, the adaptive audio system 450 has bidirectional interconnection capability. This interconnection is implemented as a set of physical and logical connections between the rendering stage 454, the amplifier/speaker 458, and the microphone stage 460. The ability to address multiple drivers in each speaker cabinet is supported by these intelligent interconnections between the sound source and the speakers. The bidirectional interconnection allows signals transmitted from the sound source (renderer) to the speaker to include both control signals and audio signals. The signal from the loudspeaker to the sound source likewise comprises both control signals and audio signals, where the audio signal in this case is audio sourced from an optional built-in microphone. Power may also be supplied as part of the bidirectional interconnection, at least in cases where the speakers/drivers are not separately powered.
FIG. 10 is a diagram 1000 illustrating the composition of a bi-directional interconnect in one
embodiment. The sound source 1002 may represent a renderer and amplifier / sound processor
chain, and is logically and physically coupled to the speaker cabinet 1004 through a pair of
interconnecting links 1006 and 1008. The interconnect 1006 from the sound source 1002 to the drivers 1005 in the speaker cabinet 1004 carries the electroacoustic signal for each driver, one or more control signals, and optional power. The interconnect 1008 from the speaker cabinet 1004 back to the sound source 1002 carries sound signals from the microphone 1007 or other sensors, used for calibration of the renderer or other similar sound processing functions. The feedback interconnect 1008 also carries certain driver definitions and parameters that the renderer uses to modify or process the sound signals sent to the drivers via interconnect 1006.
In one embodiment, each driver in each of the system's cabinets is assigned an identifier (e.g.,
numerical assignment) during system setup. Each speaker cabinet can be uniquely identified.
This numerical assignment is used by the speaker cabinet to determine which audio signal should
be sent to which driver in the cabinet. The assignments are stored in an appropriate memory
device in the speaker cabinet. Alternatively, each driver may be configured to store its identifier
in local memory. In a further alternative, as in the example where the driver / speaker does not
have local storage capability, the identifier may be stored in the rendering stage or other
components in the sound source 1002. During the speaker discovery process, each speaker (or
central database) is queried by the sound source for its profile. The profile defines specific driver definitions, including the number of drivers in a speaker cabinet or other defined array, the acoustic characteristics of each driver (e.g., driver type, frequency response, etc.), the x, y, z position of the center of each driver relative to the front center of the speaker cabinet, the angle of each driver with respect to a defined plane (e.g., ceiling, floor, cabinet vertical axis, etc.), and the number of microphones and their characteristics. Other relevant driver and microphone/sensor parameters may also be defined. In one embodiment, the driver definitions and speaker cabinet profile may be expressed as one or more XML documents used by the renderer.
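As a rough illustration of how such an XML profile might look and be consumed by a renderer, the sketch below uses a hypothetical schema: the element names, attribute names, and values are invented for illustration, since the text does not specify the actual document format.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile document -- element and attribute names are assumed.
profile_xml = """
<SpeakerProfile cabinet_id="front_left">
  <Driver id="1" type="forward" x="0.0" y="0.0" z="0.1" angle="0"/>
  <Driver id="2" type="upward" x="0.0" y="0.0" z="0.25" angle="20"/>
  <Microphone count="1"/>
</SpeakerProfile>
"""

def parse_profile(xml_text):
    """Extract the cabinet identifier and per-driver definitions
    (type, x/y/z position relative to the cabinet front center,
    and tilt angle) from a profile document."""
    root = ET.fromstring(xml_text)
    drivers = [
        {"id": int(d.get("id")),
         "type": d.get("type"),
         "position": tuple(float(d.get(k)) for k in ("x", "y", "z")),
         "angle_deg": float(d.get("angle"))}
        for d in root.findall("Driver")
    ]
    return root.get("cabinet_id"), drivers

cabinet, drivers = parse_profile(profile_xml)
print(cabinet, len(drivers))  # front_left 2
```

A renderer-side component could load one such document per discovered cabinet and use the driver positions and angles when deriving per-driver feeds.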
In one possible implementation, an internet protocol (IP) control network is generated between
the sound source 1002 and the speaker cabinet 1004. Each speaker cabinet and sound source
operates as a single network endpoint and is given a link local address upon initialization or
power up. An auto-discovery mechanism such as zero configuration networking (zeroconf) may be used to allow the sound source to locate each speaker on the network. Zero configuration networking is an example of a process that automatically creates a usable IP network without manual operator intervention or dedicated configuration servers; other similar techniques may also be used. Given the intelligent network system, multiple sound sources may reside on the IP network along with the speakers. This allows multiple sources to drive the speakers directly, rather than routing sound through a "master" audio source (e.g., a traditional A/V receiver). If another source attempts to address a speaker, communication is performed among all sources to determine which source is currently "active", whether it needs to remain active, and whether control can be transferred to the new source. Sources may be pre-assigned priorities during manufacturing based on their classification; for example, a communication source may have a higher priority than an entertainment source. In a multi-room environment, such as a typical home
environment, all speakers in the entire environment may be on a single network, but do not have
to be addressed simultaneously. During setup and automatic configuration, the sound level
returned via interconnect 1008 can be used to determine which speakers are located in the same
physical space. Once this information is determined, the speakers may be grouped into clusters.
In this example, cluster IDs are assigned and form part of the driver definition. The cluster ID is
sent to each speaker. Each cluster can be addressed simultaneously by the sound source 1002.
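The grouping of speakers into per-room clusters from the returned sound levels might be sketched as below. The level matrix, the audibility threshold, and the union-find style grouping are all illustrative assumptions; the text only says that returned sound levels are used to determine which speakers share a physical space.

```python
def cluster_speakers(levels_db, threshold_db=-40.0):
    """Group speakers into per-room clusters from a matrix of measured
    sound levels: levels_db[i][j] is the level of speaker i's test tone
    as heard at speaker j's microphone. Speakers that hear each other
    above threshold_db are assumed to share a physical space."""
    n = len(levels_db)
    cluster = list(range(n))  # each speaker starts in its own cluster

    def find(i):
        while cluster[i] != i:
            i = cluster[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if levels_db[i][j] > threshold_db and levels_db[j][i] > threshold_db:
                cluster[find(j)] = find(i)  # merge the two clusters

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

# Speakers 0 and 1 hear each other loudly; speaker 2 is in another room.
levels = [[0, -20, -70],
          [-20, 0, -75],
          [-70, -75, 0]]
print(cluster_speakers(levels))  # [0, 0, 1]
```

The resulting cluster IDs could then be stored as part of each driver definition and used to address a whole room's speakers at once, as described above.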
As shown in FIG. 10, an optional power signal can be transmitted over the bidirectional interconnect. The speakers may be passive (requiring external power from the sound source) or active (requiring power from an electrical outlet). If the speaker system consists of active speakers without wireless support, the input to each speaker is a wired Ethernet input compliant with IEEE 802.3. If the speaker system consists of active speakers with wireless support, the input to each speaker is a wireless input compliant with IEEE 802.11, or alternatively with a wireless standard formulated by the WISA organization. Passive speakers may be supplied with a suitable power signal provided directly by the sound source.
System Configuration and Calibration As shown in FIG. 4C, the functionality of the adaptive audio system includes a calibration function 462. This function is enabled by the microphone 1007 and interconnect 1008 links shown in FIG. 10. The function of the microphone components in system 100 is to measure the responses of the individual drivers in the room in order to obtain an overall
system response. Multiple microphone topologies can be used for this purpose, including a single
microphone or an array of microphones. The simplest example is when a single omnidirectional
measurement microphone located at the center of a room is used to measure the response of
each driver. If the room and playback conditions warrant a more detailed analysis, multiple
microphones can be used instead. The most convenient location for multiple microphones is in
the physical speaker cabinet of the particular speaker configuration used in the room.
Microphones installed in each enclosure allow the system to measure the response of each driver
at multiple locations in the room. An alternative to this topology is to use multiple
omnidirectional measurement microphones generally positioned at the location of the listener in
the room.
The microphones are used to enable automatic configuration and calibration of the renderer and post-processing algorithms. In an adaptive audio system, the renderer converts a hybrid object- and channel-based audio stream into individual audio signals designed for specific addressable drivers within one or more physical speakers. The post-processing components may include delay, equalization, gain, speaker virtualization, and upmixing. The speaker configuration often represents important information that the renderer components can use to convert the hybrid object- and channel-based audio stream into individual per-driver audio signals that provide optimal playback of the audio content. The system configuration information includes: (1) the number of physical speakers in the system, (2) the number of individually addressable drivers in each speaker, and (3) the position and orientation of each individually addressable driver relative to the room geometry. Other characteristics are also possible. FIG. 11 illustrates the
functionality of the automatic configuration and system calibration components in one embodiment. As shown in diagram 1100, an array of one or more microphones 1102 provides acoustic information to the configuration and calibration component 1104. This acoustic information captures certain relevant characteristics of the listening environment. The configuration and calibration component 1104 then provides this information to the renderer 1106 and any associated post-processing components 1108 so that the audio signals ultimately sent to the speakers are adjusted and optimized for the listening environment.
The number of physical speakers in the system and the number of individually addressable
drivers in each speaker are physical speaker characteristics. These characteristics are sent
directly from the speakers to the renderer 454 via the bi-directional interconnect 456. The
renderer and speaker use a common discovery protocol. Thus, when the speakers are connected
or disconnected from the system, the renderer is notified of the changes and can accordingly
reconfigure the system.
The geometry (size and shape) of the listening room is an item of information required in the configuration and calibration process. The geometry can be determined in a number of different ways. In a manual calibration mode, the width, length, and height of the room's minimum bounding cube are entered into the system by the listener or technician through a user interface that provides input to the renderer or other processing unit in the adaptive audio system. Various different user interface techniques and tools may be used for this purpose. For example, the room geometry may be sent to the renderer by a program that automatically maps or traces the room geometry. Such a system may use a combination of computer vision, sonar, and 3D laser-based physical mapping.
The renderer uses the position of the speaker in the room geometry to derive the audio signal of
each individually addressable driver, including both direct and reflective (upward firing) drivers.
Direct drivers are those whose diffusion pattern is intended to intersect the listening position before the majority of it is diffused by one or more reflective surfaces (such as the floor, walls, or ceiling). Reflective drivers are those whose diffusion pattern is intended to be reflected before the majority of it intersects the listening position, as shown in FIG. If the system is in a manual configuration mode, the 3D coordinates of each direct driver may be entered into the system through the UI. For reflective drivers, the 3D coordinates of the primary reflection are entered into the UI. A laser or similar technique may be used to visualize the diffusion pattern of a reflective driver on the surfaces of the room, so that the 3D coordinates can be measured and entered into the system manually.
Driver position and aiming can be performed using manual or automatic techniques. In some cases, an inertial sensor may be incorporated into each speaker. In this mode, the center speaker is designated as the "master" and its compass measurement is treated as the reference. The other speakers then transmit the diffusion pattern and compass position of each of their individually addressable drivers. Coupled with the room geometry, the difference between the reference angle of the center speaker and each additional driver provides sufficient information for the system to automatically determine whether a driver is direct or reflective.
The speaker position configuration may be fully automated if a 3D positional (i.e., Ambisonic) microphone is used. In this mode, the system sends a test signal to each driver and records the response. Depending on the microphone type, these signals may need to be transformed into an x, y, z representation. The signals are analyzed to find the x, y, z components of the dominant first arrival. Coupled with the room geometry, this usually provides sufficient information for the system to automatically set the 3D coordinates of all speaker positions, whether direct or reflective. Depending on the room geometry, a hybrid combination of the three described methods for determining speaker coordinates may be more effective than using any one technique alone.
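A deliberately simplified sketch of the first-arrival analysis follows, assuming B-format (W, X, Y, Z) impulse responses are already available as sample lists. A real implementation would use onset detection and windowing rather than a single peak sample; the function name and the peak-picking rule are assumptions.

```python
import math

def first_arrival_direction(w, x, y, z):
    """Estimate the direction of the dominant first arrival from B-format
    (W, X, Y, Z) impulse responses: take the sample where the
    omnidirectional W channel peaks, then normalize the figure-of-eight
    components at that instant into a unit direction vector."""
    k = max(range(len(w)), key=lambda i: abs(w[i]))  # index of W peak
    vx, vy, vz = x[k], y[k], z[k]
    norm = math.sqrt(vx * vx + vy * vy + vz * vz) or 1.0
    return (vx / norm, vy / norm, vz / norm)

# Toy impulse arriving straight from above: energy only in W and Z.
w = [0.0, 0.1, 1.0, 0.3]
x = [0.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 0.0]
z = [0.0, 0.05, 0.9, 0.2]
print(first_arrival_direction(w, x, y, z))  # (0.0, 0.0, 1.0)
```

Combined with the room geometry, such a direction estimate per driver is the kind of information the system could use to classify a driver as direct or reflective and to set its 3D coordinates.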
The speaker configuration information is one component necessary to configure the renderer.
Speaker calibration information is also needed to configure post-processing chains, ie, delay,
equalization, and gain. FIG. 12 is a flow chart illustrating the process steps for performing
automatic speaker calibration with a single microphone in one embodiment. In this mode, delay,
equalization and gain are calculated automatically by the system using a single omnidirectional
measuring microphone centered in the listening environment. As shown in diagram 1200, the
process begins at block 1202 by measuring the room impulse response of each single driver
alone. Next, at block 1204, the delay for each driver is calculated by finding the offset of the
cross-correlation peak of the acoustic impulse response (captured by the microphone) with the
directly captured electrical impulse response. At block 1206, the calculated delay is applied to
the directly captured (reference) impulse response. Next, at block 1208, the process determines
wideband and per-band gain values. The wideband and per-band gain values are those that, when applied to the measured impulse response, yield the smallest difference between it and the directly captured (reference) impulse response. This can be done by performing a windowed FFT of the measured and reference impulse responses, calculating the bin-by-bin magnitude ratio between the two signals, applying a median filter to the bin-by-bin magnitude ratio, calculating the per-band gain values by averaging the gains of all the bins completely contained within a band, calculating the wideband gain by averaging all of the per-band gains, subtracting the wideband gain from the per-band gains, and applying a small-room X curve (-2 dB/octave above 2 kHz). Once the gain values are determined at block 1208, the process at block 1210 determines the final delay values by subtracting the minimum delay from the others, ensuring that at least one driver in the system always has zero additional delay.
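Blocks 1204 and 1210 of this calibration flow can be sketched in a few lines. This is a naive direct-form cross-correlation for illustration only; a production system would use FFT-based correlation on real captured responses, and the toy signals below are invented.

```python
def cross_correlation_delay(measured, reference):
    """Block 1204: the delay (in samples) that best aligns the measured
    acoustic impulse response with the directly captured electrical
    reference, found as the lag of the cross-correlation peak."""
    n = len(reference)
    best_lag, best_val = 0, float("-inf")
    for lag in range(len(measured) - n + 1):
        val = sum(measured[lag + i] * reference[i] for i in range(n))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def normalize_delays(delays):
    """Block 1210: subtract the minimum delay so that at least one
    driver in the system has zero additional delay."""
    m = min(delays)
    return [d - m for d in delays]

ref = [1.0, 0.5, 0.25]
measured = [0.0, 0.0, 0.0, 1.0, 0.5, 0.25, 0.0]  # reference delayed 3 samples
print(cross_correlation_delay(measured, ref))     # 3
print(normalize_delays([5, 3, 7]))                # [2, 0, 4]
```

The per-band gain computation (windowed FFT, bin ratios, median filter, band averaging) would follow the same pattern but is omitted here for brevity.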
In the case of automatic calibration with multiple microphones, the delay, equalization, and gain are calculated automatically by the system using multiple omnidirectional measurement microphones. The process is substantially the same as the single-microphone technique, except that it is repeated for each of the microphones and the results are averaged.
Alternative Playback System Instead of implementing the adaptive audio system throughout an entire room or theater, it is also possible to implement aspects of the adaptive audio system in a more localized application, such as a television, computer, gaming console, or similar device. This case effectively relies on loudspeakers arranged in a flat plane corresponding to the viewing screen or monitor surface. FIG. 13 illustrates the use of an adaptive audio system in an exemplary television and sound bar use case. In general, the television use case presents the challenge of generating an immersive listening experience given limited spatial resolution (i.e., no surround or back speakers), often reduced equipment quality (TV speakers, sound bar speakers, etc.), and constrained speaker positions. The system 1300 of FIG. 13 has speakers located at
standard television left and right positions (TV-L and TV-R) and left and right upward firing
drivers (TV-LH and TV-RH). The television 1302 may also include a sound bar 1304 or speakers
of some height array. Typically, the size and quality of television speakers are reduced due to
cost constraints and design choices as compared to single or home theater speakers. However,
the use of dynamic virtualization helps to overcome these drawbacks. In FIG. 13, dynamic
virtualization effects are shown for TV-L and TV-R speakers. Thus, people within a particular
listening environment 1308 will hear the horizontal elements associated with the appropriate
audio objects rendered individually in the horizontal plane. In addition, height elements
associated with the appropriate audio object are correctly rendered through the reflected audio
sent by the LH and RH drivers. The use of stereo virtualization in the television L and R speakers, similar to the home theater L and R speakers, allows a potentially immersive dynamic speaker virtualization user experience through dynamic control of the speaker virtualization algorithm parameters based on the object spatial information provided by the adaptive audio content. This dynamic virtualization may be used to generate the perception of objects moving along the sides of the room.
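As a minimal illustration of driving a virtualization parameter from object spatial metadata, the sketch below maps an object's azimuth to constant-power stereo gains. This is an invented stand-in for the "dynamic control of speaker virtualization algorithm parameters" described above; the real algorithm also involves HRTF-style filtering, which is omitted.

```python
import math

def virtualization_pan_gains(azimuth_deg):
    """Constant-power stereo pan gains driven by an object's azimuth
    metadata (-90 = hard left, +90 = hard right). Returns
    (left_gain, right_gain) with left^2 + right^2 == 1."""
    # Map azimuth onto a pan angle in [0, pi/2].
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)

l, r = virtualization_pan_gains(0.0)  # centered object
print(round(l, 4), round(r, 4))       # 0.7071 0.7071
```

As an object's position metadata moves it from side to side, recomputing these gains per frame produces the perception of motion, which is the kind of parameter update the dynamic virtualization performs.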
The television environment may also have HRC speakers as shown in the sound bar 1304. Such
HRC speakers may be steerable units that allow panning through the HRC array. It is advantageous (especially with larger screens) to have a front firing center channel array with individually addressable speakers that allow discrete panning of audio objects through the array, matching the motion of video objects on the screen. These speakers are also shown with side firing speakers, which can be activated and used when the speaker is used as a sound bar; the side firing drivers then provide greater immersion despite the lack of surround or back speakers. The concept of dynamic virtualization is also shown for the HRC/sound bar speakers: dynamic virtualization is shown for the L and R speakers at the far ends of the front firing speaker array. Again, this may be used to generate the perception of objects moving along the sides of the room. This modified center speaker may have more speakers, implementing a steerable sound beam with separately controlled sound zones. As also shown in the example implementation of
FIG. 13, the NFE speaker 1306 is placed in front of the main listening position 1308. The
inclusion of the NFE speaker can increase the immersion provided by the adaptive audio system by moving sound away from the front of the room and closer to the listener.
For headphone rendering, the adaptive audio system preserves the creator's original intent by
adapting the HRTFs to spatial locations. When audio is reproduced by headphones, binaural
space virtualization can be achieved by applying a Head Related Transfer Function (HRTF).
HRTFs process audio and add perceptual cues that generate the perception of audio being played
back in three-dimensional space rather than through standard stereo headphones. The accuracy
of the spatial reproduction depends on the choice of appropriate HRTFs that may change based
on several factors including the audio channel being rendered or the spatial position of the
object. The use of the spatial information provided by the adaptive audio system can result in the selection of one HRTF, or a continuously changing number of HRTFs, representing 3D space, greatly enhancing the reproduction experience.
The system also facilitates adding guided three-dimensional binaural rendering and virtualization. As with spatial rendering, the new and modified speaker types and positions make it possible to generate cues that simulate sound coming from both the horizontal and the vertical axes through the use of three-dimensional HRTFs. Prior audio formats, which provide only channel and fixed speaker position information for rendering, are more limited in this respect.
Headphone Rendering System With adaptive audio format information, a binaural, three-dimensional rendering headphone system can use detailed and helpful information indicating which elements of the audio are suitable for rendering in the horizontal and vertical planes. Some content may rely on the use of overhead speakers to provide a greater sense of envelopment. These audio objects and this information can be used for binaural rendering that is perceived as being above the listener's head when headphones are used. FIG. 14A shows, in one embodiment, a
simplified representation of a three-dimensional binaural headphone virtualization experience for
use in an adaptive audio system. As shown in FIG. 14A, a headphone set 1402 used to reproduce
audio from an adaptive audio system carries audio signals 1404 in the standard x, y plane as well as in the z plane. Thus, height information associated with particular audio objects or sounds is played back as if it originated above or below the sounds at the x, y origin.
FIG. 14B is a block diagram of a headphone rendering system in one embodiment. As shown in diagram 1410, the headphone rendering system takes an input signal that is a combination of an N-channel bed 1412 and M objects 1414 that include position and/or trajectory metadata. For each channel of the N-channel bed, the rendering system calculates left and right headphone channel signals 1420. A time-invariant binaural room impulse response (BRIR) filter 1413 is applied to each of the N bed signals, and a time-varying BRIR filter 1415 is applied to the M object signals. The BRIR filters 1413 and 1415 serve to give the listener the impression of being in a room with specific audio characteristics (e.g., a small theater, a large concert hall, an arena, etc.), and include the effects of the sound source as well as of the listener's head and ears. The outputs of the BRIR filters are input to the left and right channel mixers 1416 and 1417. The mixed signals are then equalized through individual headphone equalization processes 1418 and 1419 to generate the left and right headphone channel signals, Lh and Rh 1420.
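The signal flow of FIG. 14B for one ear can be sketched as follows, treating each BRIR as a short FIR filter. The filter taps and signal lengths are toy examples, and the headphone equalization stage is omitted; this is an illustration of the mixing topology, not a production renderer.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution (stand-in for a BRIR filter)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def render_headphone_channel(beds, bed_brirs, objects, object_brirs):
    """Mix one ear's signal as in FIG. 14B: each bed channel through its
    fixed BRIR, each object through its (here momentarily fixed) BRIR,
    then sum everything in the channel mixer."""
    parts = [convolve(sig, brir)
             for sig, brir in list(zip(beds, bed_brirs)) + list(zip(objects, object_brirs))]
    length = max(len(p) for p in parts)
    mixed = [0.0] * length
    for p in parts:
        for i, v in enumerate(p):
            mixed[i] += v
    return mixed

bed = [[1.0, 0.0]]   # one bed channel: unit impulse
obj = [[0.0, 1.0]]   # one object: delayed impulse
print(render_headphone_channel(bed, [[0.5, 0.25]], obj, [[0.5]]))  # [0.5, 0.75, 0.0]
```

Running the same mix with the right-ear filter set, then equalizing each ear's sum, corresponds to mixers 1416/1417 and equalizers 1418/1419 in the figure.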
FIG. 14C shows the composition of the BRIR filter used in the headphone rendering system in one embodiment. As shown in diagram 1430, a BRIR is essentially the sum 1438 of the direct path response 1432 and the room reflections, including specular reflections 1434 and diffuse reflections 1436. Each path in the sum includes a source transfer function, a room surface response (except for the direct path 1432), a distance response, and an HRTF. Each HRTF is designed to generate the correct response at the entrances of the listener's left and right ear canals for the source azimuth and elevation specified relative to the listener under anechoic conditions. The BRIRs are designed to generate the correct responses at the entrances of the listener's left and right ear canals, given the listener's location in the room with respect to the source location, the source directivity, and its orientation in the room.
The BRIR filter applied to each of the N bed signals is fixed to the specific position associated with the corresponding channel of the audio system. For example, the BRIR filter applied to the center channel signal may correspond to a source located at 0 degrees azimuth and 0 degrees elevation, so that the listener gets the impression that the sound corresponding to the center channel comes from a source immediately in front of them. Similarly, the BRIR filters applied to the left and right channels may correspond to sources located at +/-30 degrees azimuth. The BRIR filters applied to each of the M object signals vary with time, adapting based on the position and/or trajectory data associated with each object. For example, the position data of object 1 may indicate that the object is immediately behind the listener at time t0; in that case, a BRIR filter corresponding to the position immediately behind the listener is applied to object 1. The position data of object 1 may further indicate that the object is immediately above the listener at time t1; in that case, a BRIR filter corresponding to the position immediately above the listener is applied to object 1. Similarly, for each of the remaining objects 2 through M, a BRIR filter corresponding to that object's time-varying position data is applied.
Referring again to FIG. 14B, after the left ear signals corresponding to each of the N bed channels and the M objects are generated, they are mixed together in mixer 1416 to form an overall left ear signal. Similarly, after the right ear signals corresponding to each of the N bed channels and the M objects are generated, they are mixed together in mixer 1417 to form an overall right ear signal. The overall left ear signal is equalized 1418 to compensate for the acoustic transfer function from the left headphone transducer to the entrance of the listener's left ear canal, and the result is played through the left headphone transducer. Similarly, the overall right ear signal is equalized 1419 to compensate for the acoustic transfer function from the right headphone transducer to the entrance of the listener's right ear canal, and the result is played through the right headphone transducer. The end result is a 3D audio sound scene that envelops the listener.
<HRTF Filter Set> With regard to an actual listener in a listening environment, the human torso, head, and pinna (outer ear) form a set of boundaries that can be modeled using ray tracing and other techniques to simulate the head-related transfer function (HRTF, in the frequency domain) or the head-related impulse response (HRIR, in the time domain). These elements (torso, head, and pinna) can be modeled individually, with the individual models later combined structurally into a single HRIR. Such models allow for advanced customization based on anthropometric measurements (head radius, neck height, etc.); they provide the binaural cues needed for localization in the horizontal (azimuth) plane, as well as weak low-frequency cues in the vertical (elevation) plane. FIG. 14D shows a basic head and torso model 1440 with an incident plane wave 1442 in free space that can be used with an embodiment of the headphone rendering system.
The pinnae are known to provide strong elevation cues, as well as front-back cues. These are usually described as spectral features in the frequency domain, often a set of frequency-dependent notches that move as the source elevation changes. These features are also present in the time domain of the HRIR, where they appear as a set of peaks and dips in the impulse response that move in a strong, systematic manner as the elevation changes (weaker movements corresponding to azimuthal changes also exist).
In one embodiment, HRTF filter sets for use with the headphone rendering system are
constructed using a commonly available HRTF database to gather data regarding features of the
pinna. The database is transformed to a common coordinate system and outlier subjects are
eliminated. The chosen coordinate system is along an "interaural axis" to allow elevation features
to be tracked independently for any given azimuth. The impulse response is extracted, time
aligned and oversampled for each spatial location. The effects of head shadow and body
reflection are eliminated as much as possible. A weighted average of the features is computed across all subjects for each spatial location, with the weighting chosen so that features that change with elevation are given greater weight. The results are then averaged, filtered, and downsampled back to the common sample rate. Average human anthropometric measurements are used for the head and torso model, which is combined with the averaged pinna data. FIG. 14E shows a structural model of pinna features for use with the HRTF filter in one embodiment. In one embodiment, the structural model 1450 can be used with room modeling software to optimize the configuration of drivers in the listening environment, and can be exported to a format for rendering objects for playback using speakers or headphones.
In one embodiment, the headphone rendering system includes a method of compensating for the HETF to improve binaural rendering. This method models and derives the HETF compensation filter in the Z domain. The HETF is affected by the reflections between the inner surface of the headphone and the surface of the associated outer ear. If a binaural recording is generated at the blocked entrance of the ear canal, for example from a B&K 4100 dummy head, the HETF is defined as the transfer function from the headphone input to the sound pressure signal at the blocked entrance of the ear canal. If binaural recordings are generated at the tympanic membrane, for example from a "HATS acoustic" dummy head, the HETF is defined as the transfer function from the headphone input to the sound pressure signal at the tympanic membrane.
Considering that the reflection coefficient (R1) of the inner surface of the headphone depends on frequency, and that the reflection coefficient (R2) of the outer-ear surface or tympanic membrane also depends on frequency, the product of the reflection coefficient of the headphone and the reflection coefficient of the outer-ear surface in the Z domain (i.e., R1 * R2) can be modeled as a first-order infinite impulse response (IIR) filter. Furthermore, given the time delay between reflections from the inner surface of the headphone and the surface of the outer ear, and the presence of second- and higher-order reflections between them, the HETF in the Z domain is modeled as a higher-order IIR filter H(z) formed by the sum of products of reflection coefficients with different time delays and orders. The HETF inverse filter is then modeled using an IIR filter E(z), which is the reciprocal of H(z).
From the measured impulse response of the HETF, the process obtains e(n), the time-domain impulse response of the inverse filter of the HETF, so that both the phase and amplitude spectral responses of the HETF are equalized. It then derives the parameters of the inverse filter E(z) from the e(n) sequence using, as one example, Prony's method. To obtain a stable E(z), the order of E(z) is set to an appropriate number and only the first M samples of e(n) are selected to derive the parameters of E(z).
This headphone compensation method equalizes both the phase and amplitude spectrum of the HETF. Furthermore, using the described IIR filter E(z) as the compensation filter, instead of an FIR filter, imposes less computational cost and a shorter time delay compared to other methods that achieve equivalent compensation.
Metadata Definition In one embodiment, an adaptive audio system has a component that
generates metadata from an original spatial audio format. The methods and components of
system 300 include an audio rendering system configured to process one or more bitstreams
that include both conventional channel-based audio elements and audio object coding elements.
A new enhancement layer containing audio object coding elements is defined and added to one
of the channel based audio codec bitstream or audio object bitstream. This approach allows the
renderer to process a bitstream containing an enhancement layer for use with existing
loudspeakers and driver designs or next generation loudspeakers that utilize individually
addressable drivers and driver definitions. Spatial audio content from the spatial audio processor
comprises audio objects, channels, and location metadata. Objects are assigned one or more
speakers according to position metadata and the position of the playback speaker when
rendered. Additional metadata may be associated with the object to change the playback position
or to limit the speakers to be used for playback. Metadata provides rendering cues that control spatial parameters (e.g., position, velocity, intensity, sound quality, etc.) and specify which drivers or speakers in the listening environment play individual sounds during playback. The metadata is generated in the audio workstation in response to the engineer's mixing inputs, and is associated with the individual audio data at the workstation for packaging and transfer by the spatial audio processor.
FIG. 15 is a table that illustrates specific metadata definitions for use in an adaptive audio system
for a listening environment in one embodiment. As shown in table 1500, the metadata definitions include audio content type, driver definitions (number, characteristics, position, launch angle), control signals for active steering/tuning, and calibration information including room and speaker information.
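One way to picture the metadata categories of table 1500 is as a simple data structure. The field and class names below are illustrative assumptions; the patent defines metadata categories, not a serialization format:

```python
from dataclasses import dataclass, field

@dataclass
class DriverDefinition:
    count: int
    characteristics: str   # e.g. "full-range", "tweeter"
    position: tuple        # (x, y, z) in the listening environment
    launch_angle: float    # degrees from horizontal

@dataclass
class AdaptiveAudioMetadata:
    content_type: str                      # e.g. "dialog", "music", "effects"
    drivers: list = field(default_factory=list)
    steering_control: dict = field(default_factory=dict)  # active steering/tuning
    calibration: dict = field(default_factory=dict)       # room and speaker info

md = AdaptiveAudioMetadata(
    content_type="effects",
    drivers=[DriverDefinition(2, "full-range", (0.0, 0.0, 1.0), 20.0)],
)
print(md.content_type, md.drivers[0].launch_angle)  # effects 20.0
```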
Upmixing An embodiment of the adaptive audio rendering system comprises an upmixer based on factoring audio channels into reflected and direct subchannels. The direct subchannel is the portion of the input channel that is routed to drivers that deliver the initial, non-reflected acoustic wavefront to the listener. A reflected, or diffuse, subchannel is the portion of the original audio channel whose energy is intended to reach the listener predominantly after reflection from the ceiling or walls. Thus, the reflected subchannel represents the part of the original channel that preferably arrives at the listener after diffusion into the local acoustic environment or, in particular, after being reflected from a point on a surface (e.g., the ceiling) to another location in the room. Each subchannel is routed to an independent speaker driver, so that the physical orientation of one subchannel's driver relative to the orientation of the other subchannel's driver adds acoustic spatial diversity to each signal. In one embodiment, the reflected subchannels are sent to one or more upward-firing speakers directed at a surface for indirect transmission of sound to the desired location.
It should be noted that, in the context of the upmixed signal, the reflected acoustic waveform does not necessarily distinguish between reflection from a particular surface and reflection from any surface that results in diffusion of energy from a nominally non-directional driver. In the latter case, the sound waves associated with the driver are ideally non-directional (i.e., the waveform is diffuse, such that the sound does not appear to come from a single direction).
FIG. 17 is a flowchart illustrating the process of decomposing an input channel into subchannels
in one embodiment. The entire system is designed to operate on multiple input channels. The
input channel comprises a hybrid audio stream of spatially based audio content. As shown in
process 1700, the steps include breaking down or separating input channels into sub-channels
sequentially in order of operation. At block 1702, the input channel is divided into a first
separation between the reflective subchannel and the direct subchannel in a coarse
decomposition step. Next, the original decomposition is refined in a subsequent step, block 1704.
At block 1706, processing determines whether the resulting separation between the reflective
subchannel and the direct subchannel is optimal. If the separation is not yet optimal, an
additional decomposition step 1704 is performed. If it is determined at block 1706 that the separation between the reflected and direct subchannels is optimal, appropriate speaker feeds are generated and transmitted for the final mix of reflected and direct subchannels.
With respect to the decomposition process 1700, it is important to note that at each stage of the process, energy is conserved between the reflected subchannel and the direct subchannel. For this calculation, the variable a is defined as the portion of the input channel associated with the direct subchannel, and ã is defined as the portion associated with the diffuse subchannel. The energy conservation relationship can then be expressed according to the following equation:

|a(k) x(k)|² + |ã(k) x(k)|² = |x(k)|², i.e., a(k)² + ã(k)² = 1

where x is the input channel and k is the transform index.
In one embodiment, the calculation operates on frequency-domain quantities in the form of complex discrete Fourier transform coefficients, real-based MDCT transform coefficients, or quadrature mirror filter (QMF) subband coefficients (real or complex). In this processing, a forward transform is applied to the input channel and a corresponding inverse transform is applied to the output subchannels.
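The energy-conserving split can be sketched per transform bin as follows, assuming scalar gains per bin that satisfy a² + ã² = 1, consistent with the conservation relationship described above (the function name and example values are illustrative):

```python
import math

def split_bin(x_k, a):
    """Split one frequency-domain coefficient into direct and diffuse
    parts so that |direct|^2 + |diffuse|^2 == |x_k|^2."""
    a_t = math.sqrt(max(0.0, 1.0 - a * a))   # diffuse gain: a^2 + a_t^2 = 1
    return a * x_k, a_t * x_k

# Apply the split across a few example transform coefficients.
X = [complex(1, 2), complex(-0.5, 0.25), complex(3, 0)]
a = 0.6
direct, diffuse = zip(*(split_bin(xk, a) for xk in X))
e_in = sum(abs(xk) ** 2 for xk in X)
e_out = sum(abs(d) ** 2 for d in direct) + sum(abs(d) ** 2 for d in diffuse)
print(round(e_in, 9) == round(e_out, 9))  # True: energy is conserved
```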
FIG. 19 is a flowchart 1900 illustrating the process of decomposing an input channel into
subchannels in one embodiment. For each input channel, the system calculates, in step 1902,
Inter-Channel Correlation (ICC) between the two nearest neighbor channels. The ICC is generally
calculated according to the following equation:

ICC = Re{ E{ SDi SDj* } } / sqrt( E{ |SDi|² } E{ |SDj|² } )

Here, SDi denotes the frequency-domain coefficients of the input channel at index i, and SDj denotes the coefficients of the spatially adjacent input audio channel at index j. The E{} operator is an
expectation operator, which can be implemented using a fixed average over a set number of blocks of audio, or as a smoothing algorithm in which smoothing is performed on each frequency-domain coefficient across blocks. This smoothing can be implemented as an exponential smoothing using an infinite impulse response (IIR) filter technique.
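A minimal sketch of this estimate, with a one-pole exponential smoother standing in for the expectation operator E{} (the class name and smoothing coefficient are illustrative assumptions):

```python
class ICCEstimator:
    """Per-bin inter-channel correlation with exponential (one-pole IIR)
    smoothing standing in for the expectation operator E{}."""
    def __init__(self, alpha=0.9):
        self.alpha = alpha   # smoothing coefficient
        self.cross = 0.0     # smoothed Re{E{SDi * conj(SDj)}}
        self.pow_i = 0.0
        self.pow_j = 0.0

    def update(self, s_i, s_j):
        a = self.alpha
        self.cross = a * self.cross + (1 - a) * (s_i * s_j.conjugate()).real
        self.pow_i = a * self.pow_i + (1 - a) * abs(s_i) ** 2
        self.pow_j = a * self.pow_j + (1 - a) * abs(s_j) ** 2
        denom = (self.pow_i * self.pow_j) ** 0.5
        return self.cross / denom if denom > 1e-12 else 0.0

est = ICCEstimator()
for n in range(50):                       # identical channels
    icc = est.update(complex(1, n * 0.1), complex(1, n * 0.1))
print(round(icc, 6))   # 1.0

est2 = ICCEstimator()
for n in range(50):                       # polarity-inverted channels
    icc2 = est2.update(complex(1, 0.3), complex(-1, -0.3))
print(round(icc2, 6))  # -1.0
```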
The geometric mean of the ICCs of these two adjacent channels is calculated, yielding a value between -1 and 1. The value of a is then set to the difference between 1.0 and this mean. The ICC generally describes how much signal content is common between the two channels. Signals with high inter-channel correlation are routed to the reflected subchannel, while signals that are unique relative to their neighboring channels are routed to the direct subchannel. This operation can be described according to the following exemplary pseudo code:
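A minimal Python sketch of this operation, consistent with the pICC/nICC description that follows; the sign handling of the geometric mean and the clamping of a to [0, 1] are assumptions, since the original pseudo code is not reproduced here:

```python
import math

def direct_fraction(pICC, nICC):
    """a = 1 - geometric mean of the ICCs with the two spatially
    adjacent channels; high common content -> small direct fraction."""
    prod = pICC * nICC
    # Sign-preserving geometric mean (assumption: sign taken from the sum,
    # since a geometric mean of negative ICCs is otherwise undefined).
    mean = math.copysign(math.sqrt(abs(prod)), pICC + nICC)
    return min(1.0, max(0.0, 1.0 - mean))

print(direct_fraction(1.0, 1.0))  # 0.0 -> fully common, route to reflected
print(direct_fraction(0.0, 0.0))  # 1.0 -> unique content, route to direct
```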
Here, pICC represents the ICC of input channel i-1, spatially adjacent to the current input channel i, and nICC represents the ICC of input channel i+1, spatially adjacent to the current input channel i. At step 1904, the system calculates transient scaling terms for each
input channel. These scaling factors contribute to reflection versus direct mix calculations. Here,
the amount of scaling is proportional to the energy in the transient. In general, it is desirable for transient signals to be routed to the direct subchannel. Thus, on a positive transient detection event, the scaling factor sf for that channel is set to 1.0 (or near 1.0 for weak transients). Here, the index i corresponds to input channel i. Each transient scaling factor sf has a hold parameter and a decay parameter that control how the scaling factor evolves over time after a transient. These hold and decay parameters are usually on the order of milliseconds, but the decay back to the nominal value can extend to more than one second.
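The hold/decay behavior above can be sketched as a small state machine. The nominal value, hold length, and decay constant below are illustrative assumptions; the text only says hold and decay are on the order of milliseconds, with decay possibly extending beyond a second:

```python
class TransientScaler:
    """Per-channel transient scaling factor sf: jumps to 1.0 on a
    detected transient, holds, then decays back toward a nominal value."""
    def __init__(self, nominal=0.2, hold_blocks=3, decay=0.9):
        self.nominal = nominal
        self.hold_blocks = hold_blocks
        self.decay = decay
        self.sf = nominal
        self._hold = 0

    def update(self, transient_detected, strength=1.0):
        if transient_detected:
            self.sf = max(self.sf, strength)   # 1.0, or <1.0 for weak transients
            self._hold = self.hold_blocks
        elif self._hold > 0:
            self._hold -= 1                    # hold phase: keep sf
        else:
            # decay phase: relax exponentially back toward the nominal value
            self.sf = self.nominal + (self.sf - self.nominal) * self.decay
        return self.sf

ts = TransientScaler()
print(ts.update(True))    # 1.0 at the transient
for _ in range(3):
    ts.update(False)      # hold phase
out = [round(ts.update(False), 3) for _ in range(3)]
print(out[0] < 1.0 and out[2] < out[0])  # True: decaying toward nominal
```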
Using the value of a calculated at block 1902 and the transient scaling factor calculated at block 1904, the system, at step 1906, separates each input channel into reflected and direct subchannels such that the total energy between the subchannels is conserved.
As an optional step, at step 1908, the reflected channel can be further decomposed into echo and non-echo components. The non-echo component may be summed back into the direct subchannel or sent to a dedicated driver at the output. Since it is not known which linear transformation was applied to produce the echo in the input signal, blind deconvolution or related algorithms (such as blind source separation) are applied.
The second optional step is to decorrelate the reflected channel from the direct channel at step 1910, using a decorrelator operating on each frequency-domain transform across blocks. In one embodiment, the decorrelator comprises a number of delay elements (with the delay in milliseconds corresponding to an integer block delay multiplied by the length of the underlying time-frequency transform) and an all-pass IIR (infinite impulse response) filter with coefficients whose poles can be placed arbitrarily within the unit circle of the Z domain. At
step 1912, the system performs equalization and delay functions on the reflection and direct
channels. In the usual case, the direct subchannel is delayed by an amount that makes the
acoustic wavefront from the direct driver at the listening position phase coherent with the
fundamental reflected energy wavefront (in the mean-square energy error sense). In addition, equalization is applied to the reflected channel to compensate for the expected (or measured) diffusion of the room, in order to optimally match the sound quality between the reflected and direct subchannels.
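A minimal sketch of such a decorrelator: an integer delay followed by a first-order all-pass IIR section with its pole inside the unit circle. The coefficient and delay values are illustrative assumptions, not values from the text:

```python
def allpass_decorrelate(x, c=0.5, delay=2):
    """Integer delay followed by a first-order all-pass section,
    H(z) = (c + z^-1) / (1 + c z^-1), which alters phase but not
    magnitude (so subchannel energy is preserved)."""
    assert abs(c) < 1.0, "pole must stay inside the unit circle for stability"
    d = [0.0] * delay + list(x)             # delay line
    y, x1, y1 = [], 0.0, 0.0
    for xn in d:
        yn = c * xn + x1 - c * y1           # all-pass difference equation
        y.append(yn)
        x1, y1 = xn, yn
    return y

x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = allpass_decorrelate(x)
# All-pass: total energy close to 1.0 (up to truncation of the tail)
print(round(sum(v * v for v in y), 3))
```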
FIG. 18 illustrates an upmixer system that processes audio channels into multiple reflections and
direct subchannels in one embodiment. As shown in system 1800, K subchannels are generated
for N input channels 1802. For each input channel, the system produces reflections (also
referred to as "diffuse") and direct subchannels, outputting a total of K * N subchannels 1820. In the standard example, K = 2, yielding one reflected subchannel and one direct subchannel per input channel. The N input channels are input to the ICC calculation component 1806 as well as to
the transient scaling term information computer 1804. The a coefficients are calculated in
computer 1808 and combined with the transient scaling term for input to separation process
1810. This process 1810 separates the N input channels into reflections and direct outputs,
resulting in N reflection channels and N direct channels. The system performs blind
deconvolution processing 1812 on the N reflected channels and then performs decorrelation
operation 1816 on these channels. Acoustic channel pre-processor 1818 accepts the N direct
channels and the uncorrelated N reflected channels to generate K * N sub-channels 1820.
Another option is to control the algorithm through the use of environment-sensing microphones that may be present in the room. These allow calculation of the direct-to-reverberant (DR) ratio of the room. The DR ratio provides a first control for determining the optimal separation between the diffuse subchannel and the direct subchannel. For example, in reverberant rooms more diffusion is already applied to the diffuse subchannel at the listener position, so it is reasonable to assume that the mix between the diffuse subchannel and the direct subchannel can be adjusted accordingly in the blind deconvolution and decorrelation steps. In particular, in rooms with very little reflected acoustic energy, the amount
of signal routed to the diffuse subchannel may be increased. In addition, microphone sensors in
the acoustic environment may determine the optimal equalization to be applied to the diffuse
subchannels. The adaptive equalizer may ensure that the diffuse subchannels are optimally delayed and equalized such that wavefronts from both subchannels combine phase-coherently at the listening position.
Virtualizer In one embodiment, the adaptive audio processing system comprises a component that virtually renders object-based audio via a plurality of loudspeaker pairs, which may have one or more individually addressable drivers configured to reflect sound. This component performs virtual rendering of object-based audio through binaural rendering of each object, followed by panning of the resulting stereo binaural signal between multiple crosstalk cancellation circuits that feed the corresponding multiple pairs of speakers.
This improves the spatial impression of both the listener inside and outside the sweet spot of the
crosstalk canceller, as compared to a conventional virtualizer using just a single speaker pair.
In other words, this overcomes the drawback of crosstalk cancellation: that it relies heavily on the listener sitting in the position relative to the loudspeakers assumed in the crosstalk canceller design. If the listener does not sit in this so-called "sweet spot", the crosstalk cancellation effect is partially or completely lost, and the spatial impression intended by the binaural signal is not perceived by the listener. This is a problem especially with multiple listeners, in which case only one of the listeners can substantially occupy the sweet spot.
In spatial audio reproduction systems, sweet spots can be extended to more than one listener by
utilizing more than two speakers. This is most often achieved by surrounding a large sweet spot
with more than two speakers, as in a 5.1 surround system. In such systems, for example, the
sound intended to be heard from the rear is generated by speakers physically located behind all
the listeners. Thus, all listeners perceive these sounds as coming from behind. On the other hand,
in virtual space rendering via stereo loudspeakers, the perception of the audio from behind is
controlled by the HRTFs used to generate the binaural signal and is only correctly perceived by
the listener at the sweet spot. Listeners outside the sweet spot are likely to perceive the sound
coming out of the stereo speakers in front of them. However, as mentioned above, the installation of such surround systems is not practical for many consumers, or they may simply prefer to keep the speakers at the front of the listening environment, where they are often co-located with the television display. By using multiple speaker pairs in conjunction with virtual spatial rendering, the virtualizer in one embodiment combines the benefits of more than two speakers for listeners outside the sweet spot, while placing all utilized speaker pairs in substantially the same place, and maintains or improves the experience for listeners inside the sweet spot.
In one embodiment, virtual space rendering is extended to multiple loudspeaker pairs by panning
the binaural signal generated from each audio object among the multiple crosstalk cancellers.
Panning between the crosstalk cancellers is controlled by the position associated with each audio
object, which is the same position used to select the binaural filter pair associated with each
object. A plurality of crosstalk cancellers are designed for and supplied to a corresponding
plurality of loudspeaker pairs having different physical positions and / or orientations
respectively with respect to the intended listening position. Multiple objects at different locations
in space may be rendered simultaneously. In this example, binaural signals may be represented
by the sum of object signals and their associated HRTFs applied. With a multi-object binaural
signal, the complete rendering chain for generating the loudspeaker signal in a system with M
loudspeaker pairs can be expressed as:
s_j = C_j · Σ_{i=1..N} α_ij (B_i o_i)

where:
o_i = the audio signal of the i-th of N objects
B_i = HRTF{pos(o_i)} = the binaural filter pair of the i-th object
C_j = the crosstalk canceller matrix of the j-th speaker pair
s_j = the stereo speaker signal sent to the j-th speaker pair
α_ij = the M panning coefficients associated with each object i, which vary with time and are calculated using a panning function that takes the object position as input.
In one embodiment, for each of the N object signals Oi, a binaural filter pair Bi selected according
to the object position pos (oi) is first applied to generate a binaural signal.
At the same time, the panning function calculates the M panning coefficients αi1, . . ., αiM.
Each panning coefficient is separately multiplied by the binaural signal to generate M scaled
binaural signals. For each of the M crosstalk cancellers Cj, the j-th scaled binaural signal from all
N objects is added. This summed signal is then processed by the crosstalk canceller to generate
the j-th loudspeaker signal pair sj reproduced through the j-th loudspeaker pair.
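The rendering chain above (binaural filtering, per-canceller panning, summation over objects, crosstalk cancellation) can be sketched with scalar gains standing in for the actual filter convolutions. The function name, gain-pair simplification, and example values are illustrative assumptions:

```python
def render(objects, binaural, panning, cancellers):
    """objects: list of sample lists; binaural: (bL, bR) gain pair per
    object; panning: alpha[i][j]; cancellers: 2x2 gain matrix per pair."""
    out = []
    for j in range(len(cancellers)):
        L = [0.0] * len(objects[0])
        R = [0.0] * len(objects[0])
        for i, o in enumerate(objects):          # sum of scaled binaural signals
            bL, bR = binaural[i]
            a = panning[i][j]
            for n, smp in enumerate(o):
                L[n] += a * bL * smp
                R[n] += a * bR * smp
        C = cancellers[j]                        # 2x2 crosstalk canceller matrix
        sj = ([C[0][0] * l + C[0][1] * r for l, r in zip(L, R)],
              [C[1][0] * l + C[1][1] * r for l, r in zip(L, R)])
        out.append(sj)
    return out

objs = [[1.0, 0.5], [0.25, 0.0]]
binaural = [(1.0, 0.5), (0.5, 1.0)]
panning = [[1.0, 0.0], [0.0, 1.0]]      # object 0 -> pair 0, object 1 -> pair 1
identity = [[1.0, 0.0], [0.0, 1.0]]     # trivial canceller for illustration
s = render(objs, binaural, panning, [identity, identity])
print(s[0][0])   # [1.0, 0.5] -- object 0's left binaural signal on pair 0
```

A real implementation would replace the gain pairs with HRTF filter convolutions and the identity matrices with designed crosstalk canceller matrices.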
In order to extend the benefits of multiple loudspeaker pairs to listeners outside the sweet spot, the panning function is configured to distribute the binaural signals among the speaker pairs in a way that helps communicate the desired physical position of each object to these listeners. For example, if the
object is to be heard from overhead, the panning device should pan the object to a speaker pair
that most effectively reproduces the sense of height for all listeners. If the object is to be audible
from the side, the panning device should pan the object to a speaker pair that most effectively
reproduces the sense of breadth for all listeners. More generally, the panning function should
compare the desired spatial position of each object with the spatial reproduction capabilities of
each loudspeaker pair in order to calculate the optimal set of panning functions.
In one embodiment, three speaker pairs are used, all in the same place in front of the listener.
FIG. 20 illustrates a loudspeaker configuration for virtual rendering of audio based on an object
using reflective height loudspeakers in one embodiment. The speaker array or sound bar 2002
has a number of co-located drivers. As shown in configuration 2000 of FIG. 20, the first driver pair 2008 points forward toward the listener 2001, the second driver pair 2006 points to the side, and the third driver pair 2004 points upward, either straight up or at an angle. These pairs are labeled front, side, and height and
are associated with crosstalk cancelers CF, CS, and CH, respectively.
A parametric spherical head model HRTF is used for both the generation of the crosstalk
canceller associated with each of the speaker pairs as well as the binaural filter of each audio
object. These HRTFs depend only on the angle of the object to the median plane of the listener.
As shown in FIG. 20, the angle in this median plane is defined as zero degrees, the angle to the
left is defined as negative, and the angle to the right is defined as positive. In driver layout 2000,
driver angle θ c is the same for all three driver pairs. Thus, the crosstalk canceller matrix C is
the same for all three pairs. If each pair is not at approximately the same position, the angles may
be set to be different for each pair.
Each audio object signal oi is associated with a possibly time-varying position given by Cartesian
coordinates {xi, yi, zi}. Since the parameters HRTF used in the preferred embodiment do not
include elevation cues, only the x and y coordinates of the object position are used in calculating
the binaural filter pair from the HRTF function. These {xi, yi} coordinates are converted to
equivalent radius and angle {ri, θi}, where the radius is normalized to lie between 0 and 1. The parametric HRTFs do not depend on the distance from the listener, so the radius is instead incorporated into the calculation of the left and right binaural filters as follows:

Bi = sqrt(ri) · HRTF{θi} + (1 − sqrt(ri))
When the radius is zero, the binaural filter is simply one across all frequencies, and the listener
can hear the object signal equally in both ears. This corresponds to the case where the object
position is exactly in the head of the listener. When the radius is one, the filter is equal to the
parameter HRTF defined by the angle θi. Taking the square root of the radius term biases this
interpolation of the filter towards the HRTF. This preserves spatial information well. It should be
noted that this calculation is necessary because parametric HRTF models do not incorporate
distance cues. Different HRTF sets may incorporate such cues. In this case, the interpolation
described by the above equation is not necessary.
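This interpolation can be sketched per frequency bin as follows; the scalar `hrtf_gain` stands in for one bin of the parametric HRTF pair, and the clamping of the radius is an assumption:

```python
import math

def binaural_filter(hrtf_gain, radius):
    """Blend between unity (in-head, radius 0) and the parametric HRTF
    (radius 1); sqrt(radius) biases the blend toward the HRTF."""
    w = math.sqrt(max(0.0, min(1.0, radius)))
    return w * hrtf_gain + (1.0 - w)

print(binaural_filter(0.3, 0.0))   # 1.0  (object exactly in the head)
print(binaural_filter(0.3, 1.0))   # 0.3  (full HRTF)
print(binaural_filter(0.3, 0.25) > binaural_filter(0.3, 0.5))  # True
```

The sqrt bias means that at radius 0.25 the filter is already halfway to the HRTF, which, as the text notes, preserves spatial information well.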
For each object, the panning coefficients for each of the three crosstalk cancellers are calculated
from the object position {xi, yi, zi} for each canceller direction. The upward firing driver pair
2004 is configured to transmit the sound from above by reflecting the sound from the ceiling.
Thus, its associated panning factor is proportional to the height coordinate zi. The panning
coefficients of the front and side firing driver pairs 2006, 2008 are governed by the object angle
θi derived from the {xi, yi} coordinates. If the absolute value of θi is less than 30 degrees, the
object is panned completely to the front pair 2008. If the absolute value of θi is between 30 and
90 degrees, the object is panned between the front and side pairs. When the absolute value of θi
is greater than 90 degrees, the object is fully panned to side pair 2006. With this panning
algorithm, listeners in the sweet spot benefit from all three crosstalk cancellers. In addition, a sense of height can be added by the upward-firing pair, and the side-firing pair can add an element of diffusion for objects mixed to the side and rear, improving the perceived envelopment. For listeners outside of the sweet spot, the cancellers lose much of their effect, but those listeners still perceive height from the upward-firing driver pair 2004, as well as the change between direct and diffuse sound that results from panning between the front and side pairs.
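The panning rule above can be sketched as follows. The height coefficient proportional to z, and the 30/90-degree thresholds, come from the text; the coordinate convention (y pointing forward) and the constant-power sine/cosine crossfade law are assumptions:

```python
import math

def panning_coeffs(x, y, z):
    """Return (front, side, height) panning coefficients for an object at
    {x, y, z}: fully front below 30 degrees, crossfaded up to 90 degrees,
    fully side beyond; height proportional to z."""
    theta = abs(math.degrees(math.atan2(x, y)))   # 0 degrees = straight ahead
    if theta <= 30.0:
        front, side = 1.0, 0.0
    elif theta >= 90.0:
        front, side = 0.0, 1.0
    else:
        t = (theta - 30.0) / 60.0                 # 0..1 across the crossfade
        front, side = math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)
    height = max(0.0, min(1.0, z))
    return front, side, height

print(panning_coeffs(0.0, 1.0, 0.0))   # (1.0, 0.0, 0.0): straight ahead
print(panning_coeffs(0.0, -1.0, 0.5))  # behind listener -> (0.0, 1.0, 0.5)
```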
In one embodiment, the above described virtualization techniques are applied to adaptive audio
formats that involve mixing of dynamic object signals with fixed channel signals, as described
above. Fixed channel signals may be processed by assigning a fixed spatial position to each channel signal.
As shown in FIG. 20, the preferred driver layout may also have a single separate central speaker.
In this example, the central channel may be routed directly to the central speaker rather than
being processed separately. In the example where a pure channel based legacy signal is rendered
in the system, all processing elements are constant over time, as each object position is static. In
this example, all three elements may be pre-computed once at system startup. Furthermore, the
binaural filter, the panning coefficients, and the crosstalk canceller may be pre-combined into M
fixed filter pairs for each fixed object.
FIG. 20 shows just one possible driver layout used in conjunction with a system for virtual
rendering of object based audio. Many other configurations are also possible. For example, the
side pairs of speakers may be eliminated, leaving only the front facing speaker and the top facing
speaker. Also, the upward facing pair may be replaced by a loudspeaker pair located near the
ceiling above the forward facing pair and pointing directly to the listener. This configuration may
be extended, for example, to a plurality of speaker pairs spaced from bottom to top along the side
of the television screen.
Features and Capabilities As noted above, the adaptive audio ecosystem allows content creators
to embed the spatial intent (position, size, velocity, etc.) of the mix into the bitstream with
metadata. This allows for an amazing amount of flexibility in spatial reproduction of audio. From
a spatial rendering point of view, the adaptive audio format allows content creators to adapt the mix to the exact location of the speakers in the room, avoiding the spatial distortion caused by a playback system geometry that is not identical to the authoring system. In current consumer
audio reproductions where only audio in the speaker channel is sent, the content creator's intent
is unknown as to locations in the room other than fixed speaker locations. In the current channel
/ speaker framework, the only information known is that a particular audio channel should be
sent to a particular speaker having a predetermined position in the room. In an adaptive audio
system, using metadata conveyed through the generation and distribution pipeline, the rendering
system can use this information to render the content to match the content creator's original
intent. For example, for different audio objects, the relationship between speakers is known. By
providing the spatial position of the audio object, the intent of the content creator is known,
which can be "mapped" to the user's speaker configuration, including the speaker positions. With a dynamic audio rendering system, this rendering can be updated and enhanced by adding additional speakers.
The system also makes it possible to add guided three-dimensional binaural rendering. Many
attempts have been made to create a more immersive audio rendering experience through the
use of new speaker designs and configurations. These include the use of bi-pole and dipole
speakers, side firing, rear firing and upward firing drivers. In the previous channel and fixed
speaker position systems, the determination of which elements of the audio should be sent to
these modified speakers was at best a guesswork. With an adaptive audio format, the rendering
system has detailed and useful information on which elements of the audio are suitable to be sent
to the new speaker configurations. That is, the system allows control over which audio signals should be sent to the front-firing drivers and which should be sent to the upward-firing drivers. For example, adaptive audio movie content relies heavily on the use of overhead speakers to provide a greater sense of envelopment. These audio objects and this information may be sent to the upward-firing drivers to provide reflected audio in the listening environment and produce a similar effect.
The system also makes it possible to adapt the mix to the exact hardware configuration of the
reproduction system. There are many different possible speaker types and configurations in
consumer render devices such as televisions, home theaters, sound bars, portable music player
docks, etc. When these systems transmit channel-specific audio information (ie left and right
channels or standard multi-channel audio), the system must process the audio to properly fit the
capabilities of the rendering device. A standard example is when standard stereo (left, right)
audio is sent to a sound bar with more than two speakers. In current systems where only speaker-channel audio is transmitted, the intent of the content creator is unknown, and in some cases the more immersive audio experience made possible by the enhanced device must be generated by an algorithm that infers how the audio should be modified for reproduction on that hardware. An example of this is using PLII, PLII-z, or next-generation surround to "upmix" channel-based audio to more speakers than the original number of channel feeds. In an adaptive
audio system, with metadata conveyed throughout the generation and distribution pipeline, the
reproduction system can use this information to reproduce the content to more closely match the
content creator's original intent. For example, some sound bars have side-firing speakers to create a sense of envelopment. With adaptive audio, spatial information and content-type information (i.e., dialog, music, environmental effects, etc.) can be used by the sound bar, when controlled by a rendering system such as a TV or A/V receiver, to send only the appropriate audio to these side-firing speakers.
The spatial information conveyed by adaptive audio enables dynamic rendering of content that is aware of the location and type of the speakers present. Furthermore, information about
the relationship of one or more listeners to the audio rendering device is now potentially
available and can be used in rendering. Most game consoles have camera accessories and intelligent image processing that can determine the position and identity of people in the room.
This information may be used by the adaptive audio system to modify the rendering to more
accurately convey the content creator's creative intent based on the location of the listener. For example, in almost all cases, the audio rendered for playback assumes that the listener is located at an ideal "sweet spot", often equidistant from each speaker and in the same position the sound mixer occupied during content creation. However, in many cases people are not in this ideal position, and their experience does not match the creative intent of the mixer. A standard example is when the listener is sitting on a chair or sofa on the left side of the living room. In this example, the sound reproduced from the closer speaker on the left
will be louder and the spatial perception of the audio mix will be perceived as distorted to the
left. By understanding the location of the listener, the system can adjust the rendering of the
audio to lower the sound level at the left speaker and raise the sound level at the right speaker to
rebalance the audio mix and make it perceptually correct. It is also possible to delay the audio to
compensate for the distance of the listener from the sweet spot. The listener position may be detected through the use of a camera, or by a modified remote control with embedded signaling that indicates the listener position to the rendering system.
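The rebalancing described above can be sketched as a simple per-speaker gain-and-delay compensation. The inverse-distance gain model and the function below are illustrative assumptions; a real rendering system would use a more refined psychoacoustic model.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def compensate(listener_pos, speaker_positions):
    """Per-speaker gains and delays to rebalance the mix at the listener.

    Gains follow a simple inverse-distance model (the farthest speaker
    keeps unity gain, closer speakers are attenuated); delays time-align
    all arrivals with the farthest speaker.
    """
    dists = [math.dist(listener_pos, s) for s in speaker_positions]
    d_max = max(dists)
    gains = [d / d_max for d in dists]  # attenuate closer speakers
    delays_ms = [(d_max - d) / SPEED_OF_SOUND * 1000.0 for d in dists]
    return gains, delays_ms

# Listener on the left: 1 m from the left speaker, 3 m from the right.
gains, delays = compensate((0.0, 0.0), [(-1.0, 0.0), (3.0, 0.0)])
```

Here the left speaker is attenuated to one third of its level and delayed by roughly 5.8 ms, so both speakers arrive balanced and time-aligned at the off-center listener.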
In addition to the use of standard loudspeakers and loudspeaker positions, it is also possible to use beam steering techniques to generate sound field "zones" that vary with the listener position and the content. Audio beamforming uses an array of
speakers (usually 8 to 16 horizontally spaced speakers) and uses phase manipulation and
processing to generate a steerable sound beam. The beamforming speaker array enables the
generation of audio zones in which the audio is primarily audible, which can be used to direct
specific sounds or objects to specific spatial locations by selective processing. An obvious use case is to use a speech enhancement post-processing algorithm to process the speech in the soundtrack and beam that audio object directly at a hearing-impaired user.
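A minimal sketch of the delay-and-sum steering that underlies such a beamforming array, under the usual far-field, uniform-linear-array assumptions (the function and parameter choices are illustrative):

```python
import math

def steering_delays(num_speakers, spacing_m, angle_deg, c=343.0):
    """Per-element delays (seconds) for a linear array to steer a beam.

    Classic delay-and-sum beamforming: element n is delayed by
    n * spacing * sin(theta) / c relative to element 0, then all delays
    are shifted to be non-negative (causal).
    """
    theta = math.radians(angle_deg)
    delays = [n * spacing_m * math.sin(theta) / c for n in range(num_speakers)]
    d0 = min(delays)
    return [d - d0 for d in delays]

# A typical sound bar array: 8 speakers, 5 cm apart, beam steered 30 degrees.
d = steering_delays(8, 0.05, 30.0)
```

Feeding each speaker a copy of the signal with its computed delay makes the wavefronts add constructively along the steering angle, creating the audio "zone" described above.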
Matrix Coding
In some examples, an audio object may be a desired component of adaptive audio
content. However, due to bandwidth limitations, it may not be possible to transmit both channel /
speaker audio and audio objects. In the past, matrix coding has been used to convey more audio
information than is possible in a given distribution system. For example, this is the case at the
beginning of a movie where multi-channel audio was generated by the sound mixer but the film
format only provided stereo audio. Matrix coding was used to intelligently downmix the multi-channel audio to two stereo channels. This stereo audio was then processed by a specific algorithm to regenerate a close approximation of the multi-channel mix. Similarly, it is possible to intelligently downmix audio objects into the base speaker channels and, through object extraction in the adaptive audio rendering system using adaptive audio metadata and sophisticated time- and frequency-sensitive next-generation surround algorithms, to render the objects correctly in space.
Furthermore, when there are bandwidth limitations in the audio transmission system (e.g., 3G and 4G wireless applications), there is a benefit in transmitting spatially diverse multi-channel beds that are matrix encoded together with individual audio objects. One use of such a transmission method is the transmission of a sports broadcast with two distinct audio beds and multiple audio objects. The audio beds may represent the multi-channel audio captured in the bleacher sections of two different teams, and the audio objects may represent different announcers who sympathize with one team or the other. Using standard coding, a 5.1 representation of each bed together with two or more objects may exceed the bandwidth constraints of the transmission system. In this example, if each of the 5.1 beds were matrix encoded into a stereo signal, the two beds originally captured as 5.1 channels could be transmitted as two-channel bed 1 and two-channel bed 2, only four channels of bed audio, which together with object 1 and object 2 replace the original 5.1 + 5.1 + 2, or 12.1, channels.
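As a toy illustration of the channel-count reduction (not a real matrix encoder: a true Lt/Rt encoder applies 90-degree phase shifts to the surround channels so a decoder can re-extract them, and the mixing coefficients here are only the common -3 dB convention):

```python
def matrix_encode_51(bed):
    """Simplified Lt/Rt-style downmix of one 5.1 bed sample frame to stereo.

    `bed` is a dict of per-channel sample values for one frame; this
    real-valued sketch only demonstrates the channel-count reduction.
    LFE handling is omitted for brevity.
    """
    g = 0.7071  # -3 dB mixing coefficient
    lt = bed["L"] + g * bed["C"] + g * bed["Ls"]
    rt = bed["R"] + g * bed["C"] + g * bed["Rs"]
    return lt, rt

# Two beds plus two objects: 2 + 2 + 1 + 1 transmitted channels instead
# of 5.1 + 5.1 + 2 (12.1 channels).
bed1 = {"L": 1.0, "R": 0.0, "C": 0.5, "Ls": 0.2, "Rs": 0.0, "LFE": 0.0}
lt, rt = matrix_encode_51(bed1)
```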
Position- and Content-Dependent Processing
The adaptive audio ecosystem allows content creators to create individual audio objects and add information about the content that can be conveyed to the rendering system. This allows a great deal of flexibility in the processing of audio prior to reproduction. Processing may be adapted to the position and type of object through
prior to reproduction. Processing may be adapted to the position and type of object through
dynamic control of speaker virtualization based on object position and size. Speaker virtualization
represents a method of processing audio such that virtual speakers are perceived by the listener. This
method is often used for stereo speaker reproduction when the source audio is multi-channel
audio with surround speaker channel feed. Virtual speaker processing changes the surround
speaker channel audio so that when played on a stereo speaker, the surround audio elements are
virtualized to the sides and back of the listener, as if virtual speakers were placed there. Currently, the position attribute of the virtual speakers is static, since the intended position of the surround speakers is fixed. However, in adaptive audio content, the spatial position of each audio object is dynamic and distinct (i.e., unique for each object). Post-processing
such as virtual speaker virtualization can now be controlled in a more informed manner by dynamically setting parameters, such as the speaker position angle, for each object and then combining the rendered outputs of several virtualized objects to create a more immersive audio experience that more closely represents the sound mixer's intent.
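A stripped-down sketch of how an object's dynamic position metadata could drive virtualization parameters per frame; constant-power panning between two virtual positions stands in here for full HRTF-based binaural processing:

```python
import math

def virtualize(azimuth_deg):
    """Constant-power stereo gains for a virtual source at azimuth_deg.

    The object's dynamic position metadata drives the virtual speaker
    angle every frame, instead of assuming a fixed surround-speaker
    angle as in traditional static virtualization.
    """
    # Map [-90, +90] degrees to a pan angle in [0, pi/2].
    pan = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(pan), math.sin(pan)  # (left_gain, right_gain)

gl, gr = virtualize(0.0)     # object centered in front
gl2, gr2 = virtualize(-90.0) # object hard left
```

Each rendered object's output would then be summed with the others, per the combining step described above.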
Perceptual Processing of Fixed Channel and Dynamic Object Audio
In addition to standard horizontal virtualization of audio objects, height virtualization can be used to create the perception of audio height reproduction from a standard pair of stereo speakers in the normal, horizontal position. Specific effects or extensions can be carefully applied to the appropriate type of audio content.
For example, dialog enhancement may be applied only to dialog objects. Speech
enhancement refers to a method of processing audio that includes speech such that the audibility
and / or intelligibility of speech is increased and / or enhanced. In many instances, audio
processing applied to speech is inappropriate for non-speech audio content (i.e., music,
environmental effects, etc.) and can result in undesirable audible artifacts. With adaptive audio,
an audio object can contain only speech in a piece of content and can be labeled accordingly.
Thus, the rendering solution may selectively apply speech enhancement only to speech content.
Furthermore, if the audio object contains only dialog (and not a mixture of dialog and other content, as is often the case), the dialog enhancement process can process the dialog exclusively, thereby limiting any processing applied to other content.
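A minimal sketch of this metadata-gated enhancement, with an invented object layout and a toy gain stage standing in for a real speech enhancement algorithm:

```python
def render_with_enhancement(objects, enhance):
    """Apply a speech-enhancement stage only to objects labeled 'dialog'.

    `enhance` is any per-sample processing function; non-dialog objects
    pass through untouched, avoiding artifacts on music or effects.
    """
    out = []
    for obj in objects:
        samples = obj["samples"]
        if obj.get("content_type") == "dialog":
            samples = [enhance(x) for x in samples]
        out.append({**obj, "samples": samples})
    return out

boost = lambda x: 1.5 * x  # toy stand-in for a real enhancement algorithm
mixed = render_with_enhancement(
    [{"content_type": "dialog", "samples": [0.1, 0.2]},
     {"content_type": "music", "samples": [0.1, 0.2]}],
    boost)
```

The dialog object is processed while the music object is bit-identical to its input, which is exactly the selectivity the metadata makes possible.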
Similarly, audio response or equalization management can be tailored to particular audio
characteristics. For example, bass management (filtering, attenuation, gain) can be targeted at specific objects based on their type. Bass management refers to selectively isolating and processing only
bass (or low) frequencies in a particular piece of content. In current audio systems and
distribution mechanisms, this is a "blind" process applied to all of the audio. With adaptive audio, metadata can identify the specific audio objects for which bass management is appropriate, and the rendering process can apply it only to those objects.
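The metadata-selective bass management described above might be sketched as follows, using a crude one-pole low-pass as the crossover and an invented `bass_manage` metadata flag:

```python
import math

def lowpass_coeff(fc_hz, fs_hz):
    """One-pole low-pass smoothing coefficient (RC-style approximation)."""
    return math.exp(-2.0 * math.pi * fc_hz / fs_hz)

def extract_bass(samples, fc_hz=80.0, fs_hz=48000.0):
    """Crude subwoofer feed: keep only content below the crossover."""
    a = lowpass_coeff(fc_hz, fs_hz)
    y, out = 0.0, []
    for x in samples:
        y = a * y + (1.0 - a) * x
        out.append(y)
    return out

def bass_manage(objects):
    """Sum the low-passed output of objects whose metadata opts them in."""
    sub = None
    for obj in objects:
        if not obj.get("bass_manage", False):
            continue  # metadata says leave this object alone
        lows = extract_bass(obj["samples"])
        sub = lows if sub is None else [s + l for s, l in zip(sub, lows)]
    return sub or []

objects = [
    {"samples": [1.0, 1.0, 1.0], "bass_manage": True},  # flagged by metadata
    {"samples": [1.0, 1.0, 1.0]},                       # left untouched
]
sub_feed = bass_manage(objects)
```

A production system would use a proper crossover (e.g., Linkwitz-Riley) and also high-pass the managed objects; the point here is only the metadata gating.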
Adaptive audio systems also provide object-based dynamic range compression. Traditional audio
tracks have the same duration as the content itself. Audio objects, on the other hand, may occur
for a limited amount of time in the content. Metadata associated with an object may have level
related information regarding its average and peak signal amplitudes as well as its onset or rise
time (especially for transient material). This information may allow the compressor to better adapt its compression and time constants (attack, release, etc.) to better fit the content.
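A sketch of a feed-forward compressor whose time constants are supplied by object metadata; the envelope follower, coefficient values, and metadata fields are illustrative assumptions:

```python
def compress(samples, threshold, ratio, attack_coeff, release_coeff):
    """Feed-forward compressor with metadata-supplied time constants.

    A transient object can specify a fast attack; sustained material can
    specify slower, less audible gain changes.
    """
    env, out = 0.0, []
    for x in samples:
        level = abs(x)
        coeff = attack_coeff if level > env else release_coeff
        env = coeff * env + (1.0 - coeff) * level  # smoothed level envelope
        gain = 1.0
        if env > threshold:
            # Reduce the overshoot above threshold by the ratio.
            target = threshold + (env - threshold) / ratio
            gain = target / env
        out.append(x * gain)
    return out

# Object metadata for transient material might specify a fast attack:
meta = {"attack_coeff": 0.1, "release_coeff": 0.99}
y = compress([0.0, 1.0, 1.0, 0.0], threshold=0.5, ratio=4.0, **meta)
```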
The system implements automatic loudspeaker-room equalization. Loudspeakers and room
acoustics play an important role in introducing audible coloring into the sound and thereby
affecting the sound quality of the reproduced sound. Furthermore, room acoustics are position dependent due to room reflections and variations in loudspeaker directivity, and because of this variation the perceived sound quality changes significantly at different listening positions. The AutoEQ (automatic room equalization) function provided in the system helps to alleviate some of these problems through automatic loudspeaker-room spectrum measurement and equalization, automated time-delay compensation (which provides proper imaging and, possibly, least-squares-based detection of relative speaker positions), level setting, bass redirection based on loudspeaker headroom capability, and optimal splicing of the main loudspeakers with the subwoofer(s). In a
home theater or other listening environment, the adaptive audio system has certain additional
features as follows. (1) automatic target curve computation based on the acoustics of the playback room (this is considered an open problem in research on equalization in home listening rooms), (2) the influence of modal decay control using time-frequency analysis, (3) understanding the parameters derived from measurements that govern envelopment/size/source-width/intelligibility, and controlling them to provide the best possible listening experience, (4) directional filtering incorporating head models to match the sound quality between the front and "other" loudspeakers, and (5) detection of the spatial positions of the loudspeakers relative to the listener in a discrete setup, and spatial remapping. The mismatch in sound quality between loudspeakers is especially evident for particular content that is panned between the front-anchor loudspeakers (e.g., center) and the surround/back/wide/height loudspeakers.
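As a sketch of the spectrum-equalization step only, the following designs a regularized inverse filter for a measured (here, synthetic) loudspeaker-room impulse response; a real AutoEQ would equalize toward a psychoacoustic target curve rather than flat, and would combine measurements from multiple positions:

```python
import numpy as np

def design_eq(room_ir, n_fft=256, reg=1e-3):
    """Regularized inverse filter for a measured loudspeaker-room response.

    Targets a flat magnitude response; the Tikhonov regularization term
    `reg` prevents unbounded boost at deep response nulls.
    """
    H = np.fft.rfft(room_ir, n_fft)
    G = np.conj(H) / (np.abs(H) ** 2 + reg)  # regularized inverse
    return np.fft.irfft(G, n_fft)

# Toy "room": a direct path plus one strong reflection.
room = np.zeros(64)
room[0] = 1.0
room[20] = 0.5
eq = design_eq(room)
equalized = np.convolve(room, eq)[:64]  # should be close to a unit impulse
```

Passing the playback signal through `eq` before the loudspeaker approximately cancels the room coloration, which is the essence of the spectrum measurement-and-equalization stage listed above.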
In general, the adaptive audio system enables a compelling audio/video reproduction experience, especially at the larger screen sizes of a home environment, when the reproduced spatial position of some audio elements matches the corresponding image elements on the screen. An example is spatially
matching a conversation in a movie or television show with the person or character speaking on
the screen. In audio based on normal speaker channels, there is no easy way to determine where
the speech should be spatially positioned to match the position of the person or character on the
screen. Due to the audio information available in the adaptive audio system, this type of audio /
visual registration can be easily achieved even in home theater systems that feature larger screen
sizes. Visual position and audio spatial alignment may also be used for non-character/dialog objects, such as cars, trucks, animation, etc.
The adaptive audio ecosystem also enables extended content management by allowing the content creator to create individual audio objects and add information about the content that can be delivered to the rendering system. This allows a great deal of flexibility in audio content management. From a content management point of view, adaptive audio enables a variety of things, such as changing the language of audio content simply by replacing dialog objects, which reduces content file size and/or download time. Films, television and
other entertainment programs are usually distributed internationally. This requires that the
language in the piece of content be changed depending on where the content is to be reproduced
(French for films seen in France, German for TV shows seen in Germany, etc.). Today, this often
requires that for each language, a completely independent audio sound track be generated,
packaged and distributed. With the adaptive audio system and the inherent concept of audio objects, the dialog for a piece of content can be an independent audio object. This allows the language of
the content to be easily changed without updating or changing other elements of the audio sound
track, such as music, effects, etc. This applies not only to foreign languages but also to language inappropriate for certain audiences, targeted advertising, and so on.
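Conceptually, the language change amounts to swapping one object in the content package while every other object is reused as-is; the package layout below is purely hypothetical:

```python
def swap_dialog_language(content, language):
    """Replace only the dialog object; music and effects objects are reused.

    The `content` structure is an invented stand-in for an adaptive audio
    content package with per-object tracks.
    """
    new_objects = []
    for obj in content["objects"]:
        if obj["type"] == "dialog":
            obj = {**obj, "track": obj["tracks_by_language"][language]}
        new_objects.append(obj)
    return {**content, "objects": new_objects}

package = {"objects": [
    {"type": "music", "track": "score.wav"},
    {"type": "dialog", "track": "dialog_en.wav",
     "tracks_by_language": {"en": "dialog_en.wav", "fr": "dialog_fr.wav"}},
]}
french = swap_dialog_language(package, "fr")
```

Only the dialog track differs between the two packages, which is why the file-size and download-time savings described above are possible.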
Embodiments are also directed to systems for rendering object-based sound through a pair of headphones. The system comprises an input stage for receiving an input signal comprising a first plurality of input channels and a second plurality of audio objects, and a first processor that calculates left and right headphone channel signals by applying a time-invariant binaural room impulse response (BRIR) filter to each signal of the first plurality of input channels and a time-varying BRIR filter to each object of the second plurality of audio objects, to generate a set of left-ear and right-ear signals. The system further comprises a left channel mixer that mixes the left-ear signals together to form an overall left-ear signal; a right channel mixer that mixes the right-ear signals together to form an overall right-ear signal; a left-side equalizer that equalizes the overall left-ear signal to compensate for the acoustic transfer function from the left transducer of the headphones to the entrance of the listener's left ear; and a right-side equalizer that equalizes the overall right-ear signal to compensate for the acoustic transfer function from the right transducer of the headphones to the entrance of the listener's right ear. In such
systems, the BRIR filter may include an adder circuit configured to add together the direct path
response and the one or more reflected path responses. Here, the one or more reflection path
responses model the specular reflection and diffusion effects of the listening environment in which the listener is located. The direct path and the one or more reflection paths may each have
a source transfer function, a distance response, and a head related transfer function (HRTF). The
one or more reflection paths each further comprise a surface response of one or more surfaces
disposed within the listening environment. The BRIR filter may be configured to generate the correct responses at the left and right ears of the listener for the source position, source directivity, and source orientation, for a listener at a specific position in the listening environment.
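The per-source BRIR convolution, mixing, and headphone equalization stages can be sketched as follows (pure-Python FIR convolution for clarity; the time-varying update of object BRIRs over frames is not shown):

```python
def convolve(x, h):
    """Direct-form FIR convolution (len(x) + len(h) - 1 output samples)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def render_binaural(sources, headphone_eq_l, headphone_eq_r):
    """Mix per-source left/right BRIR outputs and equalize each ear.

    Each source carries its own (brir_l, brir_r) pair: static for bed
    channels, updated over time for moving objects. The EQ stage
    compensates the headphone-to-ear-canal transfer function.
    """
    n = max(len(s["samples"]) + len(s["brir_l"]) - 1 for s in sources)
    left, right = [0.0] * n, [0.0] * n
    for s in sources:
        for ear, buf in (("brir_l", left), ("brir_r", right)):
            for i, v in enumerate(convolve(s["samples"], s[ear])):
                buf[i] += v
    return convolve(left, headphone_eq_l), convolve(right, headphone_eq_r)

# One source with trivially short (single-tap) BRIRs and flat headphone EQ.
src = {"samples": [1.0, 0.0], "brir_l": [0.5], "brir_r": [0.25]}
L, R = render_binaural([src], headphone_eq_l=[1.0], headphone_eq_r=[1.0])
```

Real BRIRs run to thousands of taps and would be applied with FFT-based fast convolution; the structure (per-source filtering, per-ear mixing, then equalization) is what the text describes.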
Aspects of the audio environment described herein involve the playback of audio or audio/visual content through appropriate speakers and playback devices, and may represent any environment in which a listener experiences playback of the captured content, such as a movie theater, concert hall, outdoor theater, home or room, listening booth, car, game console, headphone or headset system, public address (PA) system, or any other playback environment. Although the embodiments have mainly been described with respect to examples
and implementations in a home theater environment where spatial audio content is associated
with television content, it should be noted that the embodiments may also be implemented in other environments. Spatial audio content comprising object-based audio and channel-based audio may be used in conjunction with any related content (associated audio, video, graphics, etc.) or may constitute stand-alone audio content. The playback environment may be any suitable listening environment, from headphones or near-field monitors to small or large rooms, cars, outdoor
arenas, concert halls, etc.
Aspects of the systems described herein may be implemented in a suitable computer-based
sound processing network environment that processes digital or digitized audio files. Portions of the adaptive audio system may include one or more networks comprising any desired number of individual machines, including one or more routers (not shown) that provide buffering and routing of the data transmitted among the computers. Such networks may be built
based on a variety of different network protocols, and may be the Internet, a Wide Area Network
(WAN), a Local Area Network (LAN), or any combination thereof. In one embodiment where the
network comprises the Internet, one or more machines may be configured to access the Internet
through a web browser program.
One or more of the components, processes, or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should be understood that the various functions described herein may be implemented using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, and may be described in terms of their behavioral, register-transfer, logic-component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, various forms of physical (non-transitory), non-volatile storage media, such as optical, magnetic, or semiconductor storage media.
Unless the context clearly requires otherwise, throughout the description and claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is, in the sense of "including, but not limited to". Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words "herein", "hereunder", "above", "below", and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word "or" is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
It should be understood that although one or more implementations have been described by way
of example and in the context of particular embodiments, implementations of one or more
embodiments are not limited to the disclosed embodiments. Rather, it is intended to cover
various modifications and similar arrangements as would be apparent to those skilled in the art.
Accordingly, the scope of the appended claims should be construed broadly to encompass all
such modifications and similar arrangements.
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority to US Provisional Patent Application No. 61/696,056, filed August 31, 2012, the contents of which are hereby incorporated by reference in their entirety.
Related Applications
Each of the publications, patents, and/or patent applications referred to in this specification is incorporated herein by reference in its entirety, to the same extent as if each individual publication and/or patent application were specifically and individually indicated to be incorporated by reference.