APPARATUS TO GENERATE A SPEECH SIGNAL, AND METHOD TO GENERATE A SPEECH SIGNAL
Patent abstract:
The present invention relates to an apparatus comprising microphone receivers (101) that receive microphone signals from a plurality of microphones (103). A comparator (105) determines, for each microphone signal, a speech similarity indication indicative of a similarity between the microphone signal and non-reverberant speech. The determination is made in response to a comparison between a property derived from the microphone signal and a reference property for non-reverberant speech. In some embodiments, the comparator (105) determines the similarity indication by comparison with reference properties of speech samples from a set of non-reverberant speech samples. A generator (107) generates a speech signal by combining the microphone signals in response to the similarity indications. In many embodiments, the apparatus can be distributed across a plurality of devices, each containing a microphone, and the approach can determine the most suitable microphone for generating the speech signal. Publication number: BR112015020150B1 Application number: R112015020150-4 Filing date: 2014-02-18 Publication date: 2021-08-17 Inventor: Sriram Srinivasan Applicant: Mediatek Inc.; IPC main class:
Patent description:
FIELD OF THE INVENTION [001] The invention relates to a method and apparatus for generating a speech signal, and, in particular, for generating a speech signal from a plurality of microphone signals, such as, for example, microphones in different devices. BACKGROUND OF THE INVENTION [002] Traditionally, speech communication between remote users has been provided through direct bidirectional communication using dedicated devices at each end. Specifically, traditional communication between two users has been through a wired telephone connection or a wireless radio connection between two radio transceivers. However, in recent decades, the variety and possibilities of speech capture and communication have increased substantially and several new speech services and applications have been developed, including more flexible speech communication applications. [003] For example, the wide acceptance of broadband Internet connectivity has led to new forms of communication. Internet telephony has significantly reduced the cost of communication. This, combined with the tendency for families and friends to spread across the world, has resulted in longer phone conversations. VoIP (Voice over Internet Protocol) calls lasting more than an hour are not uncommon, and user comfort during those long calls is more important today than ever. [004] Furthermore, the range of devices owned and used by a user has increased considerably. Specifically, devices equipped with audio capture and typically wireless transmission are becoming more and more common, such as mobile phones, tablet computers, notebooks, etc. [005] The quality of most speech applications is highly dependent on the quality of the captured speech. Consequently, most practical applications are based on placing a microphone close to the speaker's mouth. For example, cell phones include a microphone that, when in use, is placed close to the user's mouth by the user. 
However, this approach can be impractical in many scenarios and can provide a less-than-optimal user experience. For example, it can be impractical for a user to hold a tablet computer close to their head. [006] To provide a freer and more flexible user experience, several hands-free solutions have been proposed. They include wireless microphones housed in very small enclosures that can be worn and, for example, attached to the wearer's clothing. However, this is still seen as an inconvenience in many scenarios. In fact, enabling hands-free communication with the freedom to move around and multitask while on a call, but without having to be near a device or wearing a headset, is an important step towards an improved user experience. [007] Another approach is to use hands-free communication based on a microphone positioned further away from the user. For example, conference systems have been developed that, when positioned, for example, on a table, will capture speakers located around the room. However, these systems do not always provide optimal speech quality, and in particular, the speech of more distant users tends to be faint and noisy. Furthermore, the captured speech, in these scenarios, tends to have a high degree of reverberation that can considerably reduce speech intelligibility. [008] It has been proposed to use more than one microphone for, for example, such teleconferencing systems. However, a problem in such cases is how to combine the plurality of microphone signals. A conventional approach is to simply add the signals together. However, this tends to provide suboptimal speech quality. Several more complex approaches have been proposed, such as performing a weighted sum based on the relative signal levels of the microphone signals.
However, these approaches tend to provide suboptimal performance in many scenarios, e.g., they may still include a high degree of reverberation, be sensitive to absolute levels, be complex, require centralized access to all microphone signals, be relatively impractical, require dedicated devices, etc. [009] Thus, an improved approach to capturing speech signals would be advantageous and, in particular, an approach allowing increased flexibility, improved speech quality, reduced reverberation, reduced complexity, reduced communication requirements, increased adaptability to different devices (including multi-function devices), reduced resource demand and/or improved performance would be beneficial. SUMMARY OF THE INVENTION [010] Consequently, the invention preferably seeks to mitigate, alleviate or eliminate one or more of the above mentioned disadvantages, individually or in any combination. [011] According to one aspect of the invention there is provided an apparatus according to claim 1. [012] The invention can allow an enhanced speech signal to be generated in many embodiments. In particular, it can, in many embodiments, allow a speech signal to be generated with less reverberation and/or often less noise. The approach can enable improved performance of speech applications, and can, in particular, in many scenarios and embodiments, provide improved speech communication. [013] The comparison between at least one property derived from the microphone signals and a non-reverberant speech reference property provides a particularly efficient and accurate way of identifying the relative importance of the individual microphone signals to the speech signal and can, in particular, provide a better assessment than approaches based on, for example, measurements of signal level or signal-to-noise ratio.
In fact, matching captured audio to non-reverberant speech signals can provide a strong indication of how much speech reaches the microphone via a direct path and how much reaches it via reverberant paths. [014] The at least one reference property can be one or more properties/values that are associated with non-reverberant speech. In some embodiments, the at least one reference property can be a set of properties corresponding to different non-reverberant speech samples. The similarity indication can be determined to reflect a difference between the value of at least one property derived from the microphone signal and at least one non-reverberant speech reference property, and specifically at least one reference property of a non-reverberant speech sample. In some embodiments, the at least one property derived from the microphone signal may be the microphone signal itself. In some embodiments, the at least one non-reverberant speech reference property can be a non-reverberant speech signal. Alternatively, the property can be a suitable feature, such as gain-normalized spectral envelopes. [015] The microphones that supply the microphone signals can, in many embodiments, be microphones distributed in an area and can be remote from each other. The approach can, in particular, make improved use of audio captured at different positions without requiring those positions to be known or provided by the user or the device/system. For example, microphones can be randomly distributed as needed around a room, and the system can automatically adapt to provide an enhanced speech signal for the specific environment. [016] Non-reverberant speech samples may specifically be substantially dry or anechoic speech samples. [017] The speech similarity indication can be any indication of a degree of difference or similarity between the individual microphone signal (or part of it) and non-reverberant speech, such as, for example, a non-reverberant speech sample.
The similarity indication can be an indication of perceptual similarity. [018] According to an optional feature of the invention, the apparatus comprises a plurality of separate devices, wherein each device comprises a microphone receiver for receiving at least one microphone signal from the plurality of microphone signals. [019] This can provide a particularly efficient approach to generating a speech signal. In many embodiments, each device can comprise the microphone that provides the microphone signal. The invention may allow for improved and/or new user experiences with improved performance. [020] For example, several devices may be positioned around a room. When running a speech application, such as a speech communication, the individual devices can each provide a microphone signal, which can be evaluated to find the devices/microphones best suited for generating the speech signal. [021] According to an optional feature of the invention, at least a first device among the plurality of separate devices comprises a local comparator for determining a first speech similarity indication for the at least one microphone signal of the first device. [022] This can provide optimized operation in many scenarios, and can, in particular, allow distributed processing that can reduce, for example, communication resource usage and/or spread the computational resource demands. [023] Specifically, in many embodiments, the separate devices can determine a similarity indication locally and can transmit the microphone signal only if the similarity indication satisfies a criterion. [024] According to an optional feature of the invention, the generator is implemented in a generator device separate from at least the first device and wherein the first device comprises a transmitter to transmit the first speech similarity indication to the generator device. [025] This can advantageously allow for implementation and operation in many embodiments.
In particular, it can allow, in many embodiments, one device to assess the speech quality of all other devices without needing to communicate any audio or speech signal. The transmitter can be arranged to transmit the first speech similarity indication over a wireless communication link, such as a Bluetooth™ or Wi-Fi communication link. [026] According to an optional feature of the invention, the generator device is arranged to receive speech similarity indications from each of the plurality of separate devices and wherein the generator is arranged to generate the speech signal using a subset of microphone signals from the plurality of separate devices, the subset being determined in response to the speech similarity indications received from the plurality of separate devices. [027] This can allow a highly efficient system in many scenarios where a speech signal can be generated from microphone signals being captured by different devices, with only the best subset of devices being used to generate the speech signal. In this way, communication resource usage is considerably reduced, typically with no significant impact on the quality of the resulting speech signal. [028] In many embodiments, the subset can include only a single microphone. In some embodiments, the generator may be arranged to generate the speech signal from a single microphone signal selected from the plurality of microphone signals based on the similarity indications. [029] According to an optional feature of the invention, at least one device among the plurality of separate devices is arranged to transmit the at least one microphone signal from the at least one device to the generator device only if the at least one microphone signal of the at least one device is comprised in the subset of microphone signals. [030] This can reduce communication resource usage and can reduce computational resource usage for devices whose microphone signal is not included in the subset.
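The device-side/generator-side split described in paragraphs [026] to [030] can be sketched in a few lines. The following is an illustrative Python simulation under assumed names (nothing here comes from the patent text itself): each device reports only a scalar similarity indication, the generator device picks the best subset, and only the selected devices then transmit audio.

```python
def select_device_subset(similarity_reports, subset_size=1):
    """Pick the devices whose microphone signals should be transmitted.

    similarity_reports: dict mapping device id -> speech similarity
    indication (higher means closer to non-reverberant speech).
    Returns the set of device ids in the chosen subset.
    """
    ranked = sorted(similarity_reports, key=similarity_reports.get, reverse=True)
    return set(ranked[:subset_size])


# Example: three hypothetical devices report locally computed indications;
# only the best one is then asked to stream its microphone signal.
reports = {"phone": 0.91, "tablet": 0.40, "laptop": 0.62}
subset = select_device_subset(reports, subset_size=1)
transmitting = [dev for dev in reports if dev in subset]
```

Because only one scalar per device crosses the network before selection, the audio itself is transmitted by at most `subset_size` devices, which matches the bandwidth saving described in [027] and [030].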
The transmitter can be arranged to transmit the at least one microphone signal over a wireless communication link, such as a Bluetooth™ or Wi-Fi communication link. [031] According to an optional feature of the invention, the generator device comprises a selector arranged to determine the subset of microphone signals and a transmitter to transmit an indication of the subset to at least one of the plurality of separate devices. [032] This can provide beneficial operation in many scenarios. [033] In some embodiments, the generator may determine the subset and may be arranged to transmit an indication of the subset to at least one device among the plurality of devices. For example, to the device or devices whose microphone signal is comprised in the subset, the generator may transmit an indication that the device is to transmit the microphone signal to the generator. [034] The transmitter can be arranged to transmit the indication via a wireless communication link, such as a Bluetooth™ or Wi-Fi communication link. [035] According to an optional feature of the invention, the comparator is arranged to determine the similarity indication of a first microphone signal in response to a comparison between at least one property derived from the microphone signal and reference properties of speech samples of a set of non-reverberant speech samples. [036] Comparing microphone signals with a broad set of non-reverberant speech samples (e.g., in a suitable feature domain) provides a particularly efficient and accurate way of identifying the relative importance of the individual microphone signals to the speech signal, and may, in particular, provide a better assessment than approaches based on, for example, measurements of signal level or signal-to-noise ratio. In fact, matching captured audio to non-reverberant speech signals can provide a strong indication of how much speech reaches the microphone via a direct path and how much reaches it via reverberant/reflected paths.
In fact, the comparison with non-reverberant speech samples can be considered to include a consideration of the shape of the impulse response of the acoustic paths rather than just a consideration of energy or level. [037] The approach can be independent of the speaker and, in some embodiments, the set of non-reverberant speech samples can include samples corresponding to different speaker characteristics (such as a high or low voice). In many embodiments, the processing can be segmented, and the set of non-reverberant speech samples can, for example, comprise samples corresponding to human speech phonemes. [038] The comparator can, for each microphone signal, determine an individual similarity indication for each speech sample of the set of non-reverberant speech samples. The similarity indication of the microphone signal can then be determined from the individual similarity indications, for example by selecting the individual similarity indication which is indicative of the greatest degree of similarity. In many scenarios, the best matching speech sample can be identified and the microphone signal similarity indication can be determined in relation to that speech sample. The similarity indication can provide an indication of a similarity of the microphone signal (or part thereof) to the non-reverberant speech sample of the non-reverberant speech sample set for which the greatest similarity is found. [039] The similarity indication for a given speech signal sample may reflect the probability that the microphone signal resulted from uttered speech corresponding to that speech sample. [040] According to an optional feature of the invention, the speech samples of the set of non-reverberant speech samples are represented by parameters of a non-reverberant speech model. [041] This can provide efficient as well as reliable and/or accurate operation. The approach can, in many embodiments, reduce computational resource and/or memory requirements.
[042] The comparator can, in some embodiments, evaluate the model for the different sets of parameters and compare the resulting signals with the microphone signals. For example, frequency representations of the microphone signals and the speech samples can be compared. [043] In some embodiments, speech model parameters can be generated from the microphone signal, that is, the model parameters that would result in a speech sample corresponding to the microphone signal can be determined. These model parameters can then be compared to the parameters of the non-reverberant speech sample set. [044] The non-reverberant speech model can specifically be a Linear Prediction model, such as a CELP (Code-Excited Linear Prediction) model. [045] According to an optional feature of the invention, the comparator is arranged to determine a first reference property of a first speech sample of the set of non-reverberant speech samples from a speech sample signal generated by evaluating the non-reverberant speech model using the parameters of the first speech sample, and to determine the similarity indication of a first microphone signal of the plurality of microphone signals in response to a comparison between the derived property of the first microphone signal and the first reference property. [046] This can provide beneficial operation in many scenarios. The similarity indication of the first microphone signal can be determined by comparing a property determined for the first microphone signal with the reference properties determined for each of the non-reverberant speech samples, the reference properties being determined from a representation generated by evaluating the model. In this way, the comparator can compare a property of the microphone signal with a property of the signal samples resulting from evaluating the non-reverberant speech model using the stored parameters of the non-reverberant speech samples.
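As a concrete illustration of the model-parameter comparison in [043] and [044], the sketch below estimates linear-prediction coefficients from a microphone frame (the standard autocorrelation method with the Levinson-Durbin recursion) and scores the frame by its distance to the closest stored non-reverberant parameter set. The Euclidean distance is a placeholder of my own choosing (a real system would more likely use a spectral distortion measure), and none of the function names come from the patent.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Linear-prediction coefficients [1, a1, ..., ap] via the
    autocorrelation method and Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

def model_distance(mic_frame, stored_parameter_sets, order=10):
    """Distance from the frame's LP parameters to the closest stored
    non-reverberant speech parameter set (smaller = more speech-like)."""
    a_mic = lpc_coefficients(mic_frame, order)
    return min(float(np.linalg.norm(a_mic - ref)) for ref in stored_parameter_sets)
```

A comparator along the lines of [043] could then map a small `model_distance` to a high similarity indication.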
[047] According to an optional feature of the invention, the comparator is arranged to decompose a first microphone signal from the plurality of microphone signals into a set of base signal vectors and to determine the similarity indication in response to a property of the base signal vector set. [048] This can provide beneficial operation in many scenarios. The approach can allow reduced complexity and/or resource usage in many scenarios. The reference property can be related to a set of base vectors in a suitable feature domain, from which a non-reverberant feature vector can be generated as a weighted sum of the base vectors. This set can be designed so that a weighted sum with just a few base vectors is sufficient to accurately describe the non-reverberant feature vector, that is, the base vector set provides a sparse representation of non-reverberant speech. The reference property can be the number of base vectors that appear in the weighted sum. Using a base vector set that is designed for non-reverberant speech to describe a reverberant speech feature vector will result in a less sparse decomposition. The property can be the number of base vectors that receive a non-zero weight (or a weight above a certain threshold) when used to describe a feature vector extracted from the microphone signal. The similarity indication can indicate an increased similarity to non-reverberant speech for a reduced number of base signal vectors. [049] According to an optional feature of the invention, the comparator is arranged to determine the speech similarity indications for each segment of a plurality of segments of the speech signal, and the generator is arranged to determine the combination parameters for each segment. [050] The device can use segmented processing. The combination can be constant within each segment, but it can vary from one segment to the next. For example, the speech signal can be generated by selecting a microphone signal in each segment.
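The sparsity test of [047] and [048] can be illustrated with a greedy matching-pursuit decomposition: count how many dictionary atoms (base vectors representing non-reverberant speech) are needed to capture most of a feature vector's energy, where a low count suggests the frame resembles non-reverberant speech. This is a minimal sketch with hypothetical names and an illustrative energy threshold, not the patent's implementation.

```python
import numpy as np

def sparsity_count(feature_vec, base_vectors, energy_frac=0.95, max_atoms=None):
    """Number of base vectors (columns of `base_vectors`) a greedy
    matching pursuit needs to capture `energy_frac` of the energy of
    `feature_vec`. Fewer atoms => sparser decomposition => more like the
    speech the dictionary represents."""
    residual = np.asarray(feature_vec, dtype=float).copy()
    target = (1.0 - energy_frac) * float(residual @ residual)
    atoms = base_vectors / np.linalg.norm(base_vectors, axis=0)  # unit columns
    limit = max_atoms if max_atoms is not None else atoms.shape[1]
    count = 0
    while float(residual @ residual) > target and count < limit:
        proj = atoms.T @ residual            # correlation with each atom
        best = int(np.argmax(np.abs(proj)))  # best matching atom
        residual = residual - proj[best] * atoms[:, best]
        count += 1
    return count
```

A comparator in the spirit of [048] would map a low `sparsity_count` to a high similarity indication and a high count (reverberant smearing of the features) to a low one.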
The combination parameters can, for example, be microphone signal combination weights or can, for example, be a selection of a subset of microphone signals to include in the combination. The approach can provide improved performance and/or facilitated operation. [051] According to an optional feature of the invention, the generator is arranged to determine the combination parameters of a segment in response to similarity indications from at least one previous segment. [052] This can provide improved performance in many cases. For example, it can provide better adaptation to slow changes and can reduce interruptions in the generated speech signal. [053] In some embodiments, the combination parameters can be determined only based on segments containing speech and not on segments during periods of silence or pauses. [054] In some embodiments, the generator is arranged to determine the combination parameters of a first segment in response to a model of the user's movement. [055] According to an optional feature of the invention, the generator is arranged to select a subset of microphone signals to combine in response to the similarity indications. [056] This can provide improved and/or facilitated operation in many embodiments. The combination may specifically be a selection combination. The generator can specifically select only microphone signals for which the similarity indication satisfies an absolute or relative criterion. [057] In some embodiments, the microphone signal subset comprises only one microphone signal. [058] According to an optional feature of the invention, the generator is arranged to generate the speech signal as a weighted combination of the microphone signals, a weight of a first of the microphone signals depending on the similarity indication of that microphone signal. [059] This can provide improved and/or facilitated operation in many embodiments.
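Paragraphs [055] to [058] describe both selection combining and weighted combining. Below is a minimal sketch of both variants; the mapping from similarity indication to weight (simple normalization) is an assumption of this sketch, since the patent does not fix one.

```python
import numpy as np

def combine_microphones(mic_signals, similarities):
    """Weighted combination of time-aligned microphone signals ([058]):
    weights are the similarity indications, normalized to sum to one.
    mic_signals: array-like of shape (n_mics, n_samples)."""
    sims = np.asarray(similarities, dtype=float)
    weights = sims / sims.sum()
    return weights @ np.asarray(mic_signals, dtype=float)

def select_microphone(mic_signals, similarities):
    """Selection combining ([055]-[057]): keep only the signal of the
    most speech-like microphone."""
    best = int(np.argmax(similarities))
    return np.asarray(mic_signals, dtype=float)[best]
```

Selection combining is the special case of the weighted combination in which one weight is one and the rest are zero.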
[060] According to an aspect of the invention, there is provided a method for generating a speech signal, the method comprising: receiving microphone signals from a plurality of microphones; for each microphone signal, determining a speech similarity indication indicative of a similarity between the microphone signal and non-reverberant speech, the similarity indication being determined in response to a comparison between at least one property derived from the microphone signal and at least one reference property for non-reverberant speech; and generating the speech signal by combining the microphone signals in response to the similarity indications. [061] These and other aspects, characteristics and advantages of the invention will be evident from and elucidated with reference to the embodiment(s) described later in this document. BRIEF DESCRIPTION OF THE DRAWINGS [062] The embodiments of the invention will be described, by way of example only, with reference to the drawings, in which Figure 1 is an illustration of a speech capture apparatus according to some embodiments of the invention; Figure 2 is an illustration of a speech capture system in accordance with some embodiments of the invention; Figure 3 illustrates an example of spectral envelopes corresponding to a speech segment recorded at three different distances in a reverberant room; and Figure 4 illustrates an example of a probability of a microphone being the closest microphone to a particular speaker according to some embodiments of the invention. DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION [063] The following description focuses on embodiments of the invention applicable to speech capture to generate a speech signal for telecommunication. However, it will be understood that the invention is not limited to this application, but can be applied to many other services and applications.
[064] Figure 1 illustrates an example of elements of a speech capture apparatus according to some embodiments of the invention. [065] In the example, the speech capture apparatus comprises a plurality of microphone receivers 101 which are coupled to a plurality of microphones 103 (which may be part of the apparatus or may be external to the apparatus). [066] The microphone receiver array 101 thus receives a set of microphone signals from the microphones 103. In the example, the microphones 103 are distributed around a room at various unknown positions. In this way, different microphones can capture sound from different areas, can capture the same sound with different characteristics, or can indeed capture the same sound with similar characteristics if they are close to each other. The relationships between the microphones 103, and between the microphones 103 and different sound sources, are typically not known to the system. [067] The speech capture apparatus is arranged to generate a speech signal from the microphone signals. Specifically, the system is arranged to process the microphone signals to extract a speech signal from the audio captured by the microphones 103. The system is arranged to combine the microphone signals depending on how closely they correspond to a non-reverberant speech signal, thus providing a combined signal that is most likely to match that signal. The combination may specifically be a selection combination in which the device selects the microphone signal that most closely resembles a non-reverberant speech signal. The speech signal generation can be independent of the specific position of the individual microphones and does not depend on any knowledge of the position of the microphones 103 or of any speaker. Instead, the microphones 103 can, for example, be randomly distributed around the room, and the system can automatically adapt to, for example, predominantly use the signal from the microphone closest to any speaker.
This adaptation can happen automatically, and the specific approach to identifying this closest microphone 103 (as described below) will result in a particularly suitable speech signal in most cases. [068] In the speech capture apparatus of Figure 1, the microphone receivers 101 are coupled to a comparator or similarity processor 105 that receives the microphone signals. [069] For each microphone signal, the similarity processor 105 determines a speech similarity indication (hereafter referred to simply as the similarity indication) that is indicative of a similarity between the microphone signal and non-reverberant speech. The similarity processor 105 specifically determines the similarity indication in response to a comparison between at least one property derived from the microphone signal and at least one reference property of non-reverberant speech. The reference property can, in some embodiments, be a single scalar value, and in other embodiments, it can be a complex set of values or functions. The reference property can, in some embodiments, be derived from specific non-reverberant speech signals, and can, in other embodiments, be a generic characteristic associated with non-reverberant speech. The reference property and/or the property derived from the microphone signal can be, for example, a spectrum, a spectral power density characteristic, a number of non-zero base vectors, etc. In some embodiments, the properties can be signals, and specifically, the property derived from the microphone signal can be the microphone signal itself. Similarly, the reference property can be a non-reverberant speech signal. [070] Specifically, the similarity processor 105 may be arranged to generate a similarity indication for each of the microphone signals, where the similarity indication is indicative of a similarity of the microphone signal to a speech sample from a set of non-reverberant speech samples.
Thus, in the example, the similarity processor 105 comprises a memory storing a (typically large) number of speech samples, where each speech sample corresponds to speech in a non-reverberant, and specifically substantially anechoic, room. As an example, the similarity processor 105 can compare each microphone signal with each of the speech samples and, for each speech sample, determine a difference measure between the stored speech sample and the microphone signal. The difference measures for the speech samples can then be compared and the measure indicative of the smallest difference can be selected. This measure can then be used to generate (or be used as) the similarity indication for the specific microphone signal. The process is repeated for all microphone signals, resulting in a set of similarity indications. In this way, the set of similarity indications can indicate how closely each of the microphone signals resembles non-reverberant speech. [071] In many embodiments and scenarios, this signal sample domain comparison may not be reliable enough due to uncertainties related to variations in microphone levels, noise, etc. Therefore, in many embodiments, the comparator can be arranged to determine the similarity indication in response to a comparison performed in the feature domain. Thus, in many embodiments, the comparator can be arranged to determine some features/parameters of the microphone signal and compare them to stored features/parameters of non-reverberant speech. For example, as will be described in detail later, the comparison can be based on parameters from a speech model, such as coefficients from a linear prediction model. Corresponding parameters can then be determined for the microphone signal and compared to stored parameters corresponding to various speech uttered in an anechoic environment.
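A feature-domain comparison of the kind outlined in [070] and [071] can be sketched as follows: each frame is reduced to a level-normalized log spectral envelope, and each microphone is scored by the distance from its envelope to the best matching stored non-reverberant envelope. The envelope computation and the Euclidean distance are illustrative choices of this sketch, not details taken from the patent.

```python
import numpy as np

def log_envelope(frame, n_bands=16):
    """Band-averaged log magnitude spectrum, mean-normalized so that the
    absolute microphone level drops out of the comparison."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    bands = np.array_split(spec, n_bands)
    env = np.log(np.array([b.mean() for b in bands]) + 1e-12)
    return env - env.mean()

def best_microphone(mic_frames, reference_envelopes, n_bands=16):
    """Index of the microphone frame whose envelope is closest to any
    stored non-reverberant speech envelope."""
    scores = []
    for frame in mic_frames:
        env = log_envelope(frame, n_bands)
        scores.append(min(float(np.linalg.norm(env - ref))
                          for ref in reference_envelopes))
    return int(np.argmin(scores))
```

The per-microphone score here plays the role of the difference measure described above: the smallest distance over the stored samples becomes (the inverse of) that microphone's similarity indication.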
[072] Non-reverberant speech is typically achieved when the acoustic transfer function from a speaker is dominated by the direct path, with the reflected and reverberant parts being considerably attenuated. This typically also matches situations where the speaker is relatively close to the microphone, and may more closely match a traditional arrangement where the microphone is positioned close to the speaker's mouth. Non-reverberant speech can also often be considered the most intelligible and is, in fact, the closest match to the original speech. [073] The apparatus of Figure 1 uses an approach that allows the speech reverberation characteristics of the individual microphone signals to be evaluated so that they can be taken into account. In fact, the inventor has realized not only that consideration of the speech reverberation characteristics of the individual microphone signals in generating a speech signal can considerably improve quality, but also how this can be made possible without the need for dedicated test signals and measurements. Indeed, the inventor has observed that by comparing a property of the individual microphone signals with a reference property associated with non-reverberant speech, and specifically with sets of non-reverberant speech samples, it is possible to determine suitable parameters for combining the microphone signals to generate an enhanced speech signal. In particular, the approach allows the speech signal to be generated without requiring any dedicated test signals, test measurements or indeed any prior knowledge of the speech. In fact, the system can be designed to work with any speech and does not require, for example, specific test words or phrases to be spoken by the speaker. [074] In the system of Figure 1, the similarity processor 105 is coupled to a generator 107 that receives the similarity indications. The generator 107 is further coupled to the microphone receivers 101 from which it receives the microphone signals.
Generator 107 is arranged to generate a speech output signal by combining the microphone signals in response to the similarity indications. [075] As a low-complexity example, generator 107 can implement a selection combiner, whereby, for example, a single microphone signal is selected from the plurality of microphone signals. Specifically, generator 107 can select the microphone signal that best matches the non-reverberant speech samples. The speech signal is then generated from the microphone signal that is typically most likely to be the cleanest and clearest speech capture. Specifically, it is probably the one that most closely corresponds to the speech uttered by the speaker. Typically, it will also correspond to the microphone that is closest to the speaker. [076] In some embodiments, the speech signal can be communicated to a remote user, for example, through a telephone network, a wireless connection, the Internet or any other communication network or link. Speech signal communication may typically include speech encoding as well as possibly other processing. [077] The apparatus of Figure 1 can, in this way, automatically adapt to the positions of the speaker and microphones, as well as to the acoustic characteristics of the environment, to generate a speech signal that more closely matches the original speech signal. Specifically, the generated speech signal will tend to have reduced reverberation and noise, and will consequently sound less distorted, cleaner and more intelligible. [078] It will be understood that the processing may include various other operations, typically including amplification, filtering, conversion between time domain and frequency domain, etc., as is common in audio and speech processing. For example, the microphone signals can often be amplified and filtered before being combined and/or used to generate the similarity indications. Similarly, generator 107 may include filtering, amplification, etc., as part of combining and/or generating the speech signal.
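The selection combining of paragraph [075], together with the weighted-sum alternative described later in paragraph [0102], can be sketched as follows. This is a hypothetical minimal sketch; the function names are illustrative, and normalizing the weights to sum to one is one possible choice for keeping the accumulated weight constant.

```python
import numpy as np

def select_combine(mic_segments, similarities):
    """Selection combining: pass through the microphone segment whose
    similarity indication is highest."""
    best = int(np.argmax(similarities))
    return best, mic_segments[best]

def weighted_combine(mic_segments, similarities):
    """Weighted-sum combining: weights proportional to the similarity
    indications, normalized so the accumulated weight equals one."""
    w = np.maximum(np.asarray(similarities, dtype=float), 0.0)
    w = w / w.sum() if w.sum() > 0.0 else np.full(len(w), 1.0 / len(w))
    return sum(wi * np.asarray(seg, dtype=float)
               for wi, seg in zip(w, mic_segments))
```

The weighted variant trades some of the selection combiner's simplicity for robustness against erroneous selection in reverberant or noisy conditions.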
[079] In many embodiments, the speech capture apparatus can use segmented processing. In this way, processing can be done in short time intervals, such as in segments of less than 100 ms duration, and often in segments of around 20 ms. [080] Thus, in some embodiments, a similarity indication can be generated for each microphone signal in a given segment. For example, a microphone signal segment of, say, 50 ms duration can be generated for each of the microphone signals. Each segment can then be compared to the set of non-reverberant speech samples, which may be comprised of speech segment samples. Similarity indications can be determined for that 50 ms segment, and generator 107 can proceed to generate a speech signal segment for the 50 ms interval based on the microphone signal segments and the segment similarity indications. In this way, the combination can be updated for each segment, for example, by selecting, in each segment, the microphone signal that has the greatest similarity to a speech segment sample from the non-reverberant speech samples. This can provide particularly efficient processing and operation and can allow continuous and dynamic adaptation to the specific environment. In fact, adaptation to dynamic movement of the speaker and/or of the microphone sound source positions can be achieved with low complexity. For example, if speech switches between two sources (speakers), the system can adapt by switching correspondingly between two microphones. [081] In some embodiments, the non-reverberant speech segment samples may have a duration that corresponds to the microphone signal segments. However, in some embodiments, they can be longer. For example, each non-reverberant speech segment sample can correspond to a specific phoneme or speech sound that has a longer duration. In these embodiments, determining a similarity measure for each non-reverberant speech segment sample may include an alignment of the microphone signal segment with the speech segment samples.
For example, a correlation value can be determined for different time offsets and the highest value can be selected as the similarity indication. This allows a reduced number of speech segment samples to be stored. [082] In some examples, combination parameters, such as a selection of a subset of microphone signals to be used, or weights for a linear sum, can be determined for a time interval of the speech signal. In this way, the speech signal can be determined in segments from a combination that is based on parameters that are constant within the segment, but that can vary between segments. [083] In some embodiments, the determination of the combination parameters is independent for each time segment, that is, the combination parameters for a time segment can be calculated based only on similarity indications determined for that time segment. [084] However, in other embodiments, the combination parameters may alternatively or additionally be determined in response to similarity indications from at least one previous segment. For example, the similarity indications can be filtered using a low-pass filter that spans multiple segments. This can ensure a slower adaptation which can, for example, reduce fluctuations and variations in the generated speech signal. As another example, a hysteresis effect can be applied that prevents, for example, rapid ping-pong switching between two microphones positioned at approximately the same distance from the speaker. [085] In some embodiments, the generator 107 can be arranged to determine the combination parameters of a first segment in response to a model of user movement. This approach can be used to track the user's position relative to the microphone devices 201, 203, 205. The user model does not need to explicitly track the positions of the user or of the microphone devices 201, 203, 205, but can directly track the variations of the similarity indications.
For example, a state-space representation can be employed to describe a human motion model, and a Kalman filter can be applied to the similarity indications of individual segments of a microphone signal to track variations in the similarity indications due to motion. The resulting output of the Kalman filter can then be used as the similarity indication for the current segment. [086] In many embodiments, the functionality of Figure 1 can be implemented in a distributed manner, and, in particular, the system can be spread over a plurality of devices. Specifically, each of the microphones 103 can be part of, or connected to, a different device, and thus the microphone receivers 101 may be comprised in different devices. [087] In some embodiments, the similarity processor 105 and the generator 107 are implemented in a single device. For example, several different remote devices can each transmit a microphone signal to a generator device that is arranged to generate a speech signal from the received microphone signals. Such a generator device may implement the functionality of the similarity processor 105 and the generator 107 as described above. [088] However, in many embodiments, the functionality of the similarity processor 105 is distributed across a plurality of separate devices. Specifically, each of the devices may comprise a (sub)similarity processor 105 which is arranged to determine a similarity indication for the microphone signal of that device. The similarity indications can then be transmitted to the generator device, which can determine parameters for the combination based on the received similarity indications. For example, it can simply select the signal/microphone device that has the highest similarity indication. In some embodiments, the devices may not transmit microphone signals to the generator device unless the generator device requests it.
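The low-pass filtering and hysteresis of similarity indications described in paragraph [084] might be sketched as follows. The exponential smoothing factor and switching margin are illustrative assumptions, and a Kalman filter driven by a motion model, as in paragraph [085], could replace the simple first-order smoother.

```python
class SmoothedSelector:
    """Segment-by-segment microphone selection with low-pass filtered
    similarity indications and a hysteresis margin. The values of alpha
    (smoothing) and margin (switching threshold) are illustrative only."""

    def __init__(self, n_mics, alpha=0.8, margin=0.1):
        self.smoothed = [0.0] * n_mics
        self.alpha = alpha
        self.margin = margin
        self.current = 0  # index of the currently selected microphone

    def update(self, similarities):
        # first-order IIR low-pass filter spanning multiple segments
        self.smoothed = [self.alpha * old + (1.0 - self.alpha) * new
                         for old, new in zip(self.smoothed, similarities)]
        best = max(range(len(self.smoothed)), key=lambda k: self.smoothed[k])
        # hysteresis: only switch if the candidate clearly beats the current mic
        if self.smoothed[best] > self.smoothed[self.current] + self.margin:
            self.current = best
        return self.current
```

Calling `update` once per segment yields a selection that follows the speaker but does not ping-pong between two microphones with near-equal indications.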
Consequently, the generator device can transmit a request for the microphone signal to the selected device, which in turn supplies that signal to the generator device. The generator device then proceeds to generate the output signal based on the received microphone signal. In fact, in this example, the generator 107 can be considered to be distributed across the devices, with the combination being achieved by the process of selecting and selectively transmitting the microphone signal. An advantage of this approach is that only one (or at least a subset) of the microphone signals needs to be transmitted to the generator device, and in this way a considerably reduced communication resource usage can be obtained. [089] As an example, the approach can use device microphones distributed in an area of interest to capture a user's speech. A typical modern living room typically contains several devices equipped with one or more microphones and wireless transmission capabilities. Examples include cordless landline phones, mobile phones, video-chat-enabled televisions, PCs, tablets, laptops, etc. These devices can, in some embodiments, be used to generate a speech signal, for example, by automatically and adaptively selecting the speech captured by the microphone closest to the speaker. This can provide captured speech that will typically be of high quality and largely free of reverberation. [090] In fact, in general, the signal captured by a microphone will tend to be affected by reverberation, ambient noise and microphone noise, with the impact depending on its location in relation to the source of the sound, for example, the mouth of the user. The system may seek to select the microphone signal that is closest to what would be recorded by a microphone near the user's mouth. The generated speech signal can be applied wherever hands-free speech capture is desirable, e.g. home/office telephony, teleconferencing systems, front-ends for voice control systems, etc.
[091] In more detail, Figure 2 illustrates an example of a distributed speech generation/capture apparatus/system. The example includes a plurality of microphone devices 201, 203, 205, as well as a generator device 207. [092] Each of the microphone devices 201, 203, 205 comprises a microphone receiver 101 that receives a microphone signal from a microphone 103 which, in the example, is part of the microphone device 201, 203, 205, but which in other cases may be separate from it (e.g. one or more of the microphone devices 201, 203, 205 may comprise a microphone input for attaching an external microphone). The microphone receiver 101 in each microphone device 201, 203, 205 is coupled to a similarity processor 105 which determines a similarity indication for the microphone signal. [093] The similarity processor 105 of each microphone device 201, 203, 205 specifically performs the operation of the similarity processor 105 of Figure 1 for the specific microphone signal of the individual microphone device 201, 203, 205. The similarity processor 105 of each of the microphone devices 201, 203, 205 specifically proceeds to compare the microphone signal to a set of non-reverberant speech samples that is locally stored in each of the devices. The similarity processor 105 can specifically compare the microphone signal to each of the non-reverberant speech samples and, for each speech sample, determine an indication of how similar the signals are. For example, if the similarity processor 105 includes memory storing a local database comprising a representation of each of the phonemes of human speech, the similarity processor 105 may proceed to compare the microphone signal to each phoneme. In this way, a set of indications is determined that indicates how similar the microphone signal is to each of the phonemes free of any reverberation or noise.
The indication corresponding to the closest match is therefore likely to correspond to an indication of how closely the captured audio matches the sound generated by a speaker pronouncing that phoneme. Accordingly, the closest similarity indication is chosen as the similarity indication for the microphone signal. The similarity indication therefore reflects how closely the captured audio corresponds to noise-free and reverberation-free speech. For a microphone (and thus typically a device) positioned far from the speaker, the captured audio is likely to include only relatively low levels of the originally projected speech compared to the contribution of various reflections, reverberation and noise. However, for a microphone (and thus device) positioned close to the speaker, the captured sound is likely to comprise a significantly greater contribution from the direct acoustic path and relatively lower contributions from reflections and noise. Consequently, the similarity indication provides a good indication of the clarity and speech intelligibility of the audio captured by the individual device. [094] Each of the microphone devices 201, 203, 205 further comprises a wireless transceiver 209 that is coupled to the similarity processor 105 and the microphone receiver 101 of each device. The wireless transceiver 209 is specifically arranged to communicate with the generator device 207 via a wireless connection. [095] The generator device 207 also comprises a wireless transceiver 211 that can communicate with the microphone devices 201, 203, 205 through the wireless connection. [096] In many embodiments, the microphone devices 201, 203, 205 and the generator device 207 can be arranged to communicate data in both directions. However, it will be understood that, in some embodiments, only one-way communication from the microphone devices 201, 203, 205 to the generator device 207 may be applied.
[097] In many embodiments, the devices can communicate over a wireless communication network, such as a local Wi-Fi communication network. Thus, the wireless transceiver 209 of the microphone devices 201, 203, 205 may specifically be arranged to communicate with other devices (and specifically the generator device 207) via Wi-Fi communications. However, it will be understood that, in other embodiments, other methods of communication may be used, including, for example, communication over wired or wireless Local Area Network, Wide Area Network, Internet or Bluetooth™ communication links, etc. [098] In some embodiments, each of the microphone devices 201, 203, 205 can continuously transmit the similarity indications and the microphone signals to the generator device 207. It will be understood that the person skilled in the art knows how data, such as parameter data and audio data, can be communicated between devices. Specifically, the skilled person will understand that audio signal transmission can include encoding, compression, error correction, etc. [099] In these embodiments, the generator device 207 can receive the microphone signals and the similarity indications from all the microphone devices 201, 203, 205. It can then proceed to combine the microphone signals based on the similarity indications to generate the speech signal. [0100] Specifically, the wireless transceiver 211 of the generator device 207 is coupled to a controller 213 and a speech signal generator 215. The controller 213 receives the similarity indications from the wireless transceiver 211 and, in response to them, determines a set of combination parameters that control how the speech signal is generated from the microphone signals. The controller 213 is coupled to the speech signal generator 215, which receives the combination parameters.
Furthermore, the speech signal generator 215 receives the microphone signals from the wireless transceiver 211, and can therefore proceed to generate the speech signal based on the combination parameters. [0101] As a specific example, the controller 213 can compare the received similarity indications and identify the one indicating the greatest degree of similarity. An indication of the corresponding device/microphone signal can then be passed to the speech signal generator 215, which can proceed to select the microphone signal from that device. The speech signal is then generated from that microphone signal. [0102] As another example, in some embodiments, the speech signal generator 215 may proceed to generate the output speech signal as a weighted combination of the received microphone signals. For example, a weighted sum of the received microphone signals can be applied where the weight of each individual signal is generated from the similarity indications. For example, the similarity indications can be provided directly as a scalar value within a given range, and the individual weights can be directly proportional to the scalar value (with, for example, a proportionality factor ensuring that the signal level or the accumulated weight value is constant). [0103] Such an approach can be particularly attractive in scenarios where the available communication bandwidth is not a restriction. In this way, rather than selecting a single device close to the speaker, a weight can be assigned to each device/microphone signal, and the microphone signals from multiple microphones can be combined as a weighted sum. This approach can provide robustness and reduce the impact of erroneous selection in highly reverberant or noisy environments. [0104] It will also be understood that combination approaches can be combined.
For example, instead of using pure selection combining, the controller 213 can select a subset of microphone signals (such as the microphone signals for which the similarity indication exceeds a threshold) and then combine the subset of microphone signals using weights that are dependent on the similarity indications. [0105] It will also be understood that, in some embodiments, the combination may include an alignment of the different signals. For example, time delays can be introduced to ensure that the received speech signals are coherently added for a given speaker. [0106] In many embodiments, microphone signals are not transmitted to the generator device 207 from all microphone devices 201, 203, 205, but only from the microphone devices 201, 203, 205 from which the speech signal will be generated. [0107] For example, the microphone devices 201, 203, 205 may first transmit the similarity indications to the generator device 207, with the controller 213 evaluating the similarity indications to select a subset of microphone signals. For example, the controller 213 may select the microphone signal of the microphone device 201, 203, 205 that sent the similarity indication indicating the greatest similarity. The controller 213 may then transmit a request message to the selected microphone device 201, 203, 205 using the wireless transceiver 211. The microphone devices 201, 203, 205 may be arranged to only transmit microphone signal data to the generator device 207 when a request message is received, i.e. the microphone signal is only transmitted to the generator device 207 when it is included in the selected subset. Thus, in the example where only a single microphone signal is selected, only one of the microphone devices 201, 203, 205 transmits a microphone signal. Such an approach can considerably reduce the use of the communication resource, in addition to reducing, for example, the energy consumption of the individual devices.
And it can also considerably reduce the complexity of the generator device 207, as it only needs to handle, for example, one microphone signal at a time. In this example, the selection combining functionality used to generate the speech signal is thus distributed across the devices. [0108] Different approaches to determining the similarity indications can be used in different embodiments; specifically, the stored representations of the non-reverberant speech samples can differ between embodiments and can be used differently in different embodiments. [0109] In some embodiments, the stored non-reverberant speech samples are represented by parameters of a non-reverberant speech model. In this way, rather than storing, for example, a time-domain or frequency-domain sample representation of the signal, the non-reverberant speech sample set can comprise a set of parameters for each sample from which the sample can be generated. [0110] For example, the non-reverberant speech model can specifically be a linear prediction model, such as a CELP (Code-Excited Linear Prediction) model. In this scenario, each speech sample of the non-reverberant speech samples can be represented by a codebook entry that specifies an excitation signal that can be used to drive a synthesis filter (which may also be represented by stored parameters). [0111] Such an approach can considerably reduce the storage requirements for the set of non-reverberant speech samples, and this can be particularly important for distributed implementations where the determination of the similarity indications is performed locally on the individual devices. Furthermore, by using a speech model that directly synthesizes speech as produced at the source (without taking the acoustic environment into account), a good representation of non-reverberant, anechoic speech is obtained.
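To illustrate the storage saving of such a parametric representation, the sketch below stores each non-reverberant reference as a short LP coefficient vector and evaluates the spectral envelope the model implies. The codebook entries here are made-up toy values, not trained data; for scale, an order-10 LP vector occupies about ten floats, whereas even a 20 ms waveform sample at 8 kHz occupies 160 samples.

```python
import numpy as np

# Hypothetical miniature codebook: each entry is an LP coefficient vector
# [1, a1, ..., aM] standing in for one non-reverberant spectral shape.
# These are illustrative toy values, not trained entries.
codebook = np.array([
    [1.0, -1.2, 0.7],
    [1.0, -0.4, 0.1],
    [1.0,  0.3, 0.4],
])

def model_psd(a, gain, n_freq=128):
    """PSD implied by a stored LP vector: gain / |A(w)|^2 sampled on [0, pi)."""
    w = np.linspace(0.0, np.pi, n_freq, endpoint=False)
    A = np.exp(-1j * np.outer(w, np.arange(len(a)))) @ a
    return gain / np.abs(A) ** 2
```

Each entry thus regenerates a full spectral envelope on demand, which is what makes the parametric codebook so much cheaper to store than raw waveform samples.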
[0112] In some embodiments, the comparison between a microphone signal and a specific speech sample can be performed by evaluating the speech model for the specific set of speech model parameters stored for that sample. In this way, a representation of the speech signal that would be synthesized by the speech model from this set of parameters can be derived. The resulting representation can then be compared to the microphone signal and a measure of the difference between them can be calculated. The comparison can, for example, be performed in the time domain or the frequency domain, and it can be a probabilistic comparison. For example, the similarity indication for a microphone signal and a speech sample can be determined to reflect the probability that the captured microphone signal resulted from a sound source radiating the speech signal resulting from the speech model synthesis. The speech sample resulting in the highest probability can then be selected, and the microphone signal similarity indication can be determined as that highest probability. [0113] Below, a detailed example of a possible approach for determining the similarity indications based on an LP speech model is provided. [0114] In the example, K microphones can be distributed in an area. The observed microphone signals can be modeled as

xk(n) = s(n) * hk(n) + wk(n),

where s(n) is the speech signal at the user's mouth, hk(n) is the acoustic transfer function (impulse response) between the location corresponding to the user's mouth and the location of the kth microphone, wk(n) is the noise signal, including ambient noise and microphone self-noise, and * denotes convolution. Assuming that the speech and noise signals are independent, an equivalent representation in the frequency domain in terms of the power spectral densities (PSDs) of the corresponding signals is given by:

Pxk(ω) = |Hk(ω)|² Ps(ω) + Pwk(ω). [0115] In an anechoic environment, the impulse response hk(n) corresponds to a pure delay, corresponding to the time it takes the signal to propagate from the point of generation to the microphone at the speed of sound.
Consequently, the PSD of the signal xk(n) is identical to that of s(n). In a reverberant environment, hk(n) models not only the direct path of the signal from the sound source to the microphone, but also the signal components arriving at the microphone as a result of reflections off walls, ceiling, furniture, etc. Each reflection delays and attenuates the signal. [0116] The PSD of xk(n) in this case can differ significantly from that of s(n), depending on the level of reverberation. Figure 3 illustrates an example of spectral envelopes corresponding to a 32 ms segment of speech recorded at three different distances in a reverberant room with a T60 of 0.8 seconds. Clearly, the spectral envelopes of speech recorded at distances of 5 cm and 50 cm from the speaker are relatively close, while the envelope at 350 cm is significantly different. [0117] When the signal of interest is speech, as in hands-free communication applications, its PSD can be modeled using a codebook trained offline on a wide range of data. For example, the codebook may contain linear prediction (LP) coefficients, which model the spectral envelope. [0118] The training set typically consists of LP vectors extracted from short (20-30 ms) segments of a large, phonetically balanced speech dataset. Such codebooks have been used successfully in speech coding and enhancement. A codebook trained on speech recorded using a microphone located close to the user's mouth can then be used as a reference measure of how reverberant the signal received at a particular microphone is. [0119] The spectral envelope corresponding to a short time segment of a microphone signal captured by a microphone close to the speaker will typically find a better match in the codebook than one captured by a microphone farther away (and thus relatively more affected by reverberation and noise). This observation can then be used, for example, to select a suitable microphone signal in a given scenario.
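The idea that the quality of the best codebook match serves as a reference measure of reverberance might be sketched as follows, using the Itakura-Saito distortion that is mentioned later in connection with codebook training. The gain compensation by ratio of spectral means is an illustrative simplification, not the patent's formulation, and the function names are hypothetical.

```python
import numpy as np

def itakura_saito(p_obs, p_model):
    """Itakura-Saito distortion between two sampled power spectra."""
    r = p_obs / p_model
    return float(np.mean(r - np.log(r) - 1.0))

def reverberance_score(p_obs, codebook_psds):
    """Distortion of the best codebook match; smaller values mean the
    observed envelope is closer to (some) non-reverberant speech shape."""
    best = np.inf
    for p_cb in codebook_psds:
        g = np.mean(p_obs) / np.mean(p_cb)  # crude gain compensation
        best = min(best, itakura_saito(p_obs, g * p_cb))
    return best
```

Comparing this score across microphones would then favor the capture whose envelope still resembles a trained non-reverberant envelope, i.e. typically the microphone closest to the speaker.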
[0120] Assuming that the noise is Gaussian, and given a vector of LP coefficients a, the likelihood of the observed signal at the kth microphone is given by (see, for example, S. Srinivasan, J. Samuelsson, and W. B. Kleijn, “Codebook driven short-term predictor parameter estimation for speech enhancement”, IEEE Trans. Speech, Audio and Language Processing, vol. 14, no. 1, pages 163-176, Jan. 2006):

p(yk | a) = ((2π)^(N/2) |Rx + Rwk|^(1/2))⁻¹ exp(-½ ykᵀ (Rx + Rwk)⁻¹ yk),

where yk = [yk(0), yk(1), ..., yk(N-1)]ᵀ, a = [1, a1, ..., aM]ᵀ is the given vector of LP coefficients, M is the order of the LP model, N is the number of samples in a short time segment, Rwk is the autocorrelation matrix of the noise signal at the kth microphone, and Rx = g(AᵀA)⁻¹, where A is the N×N lower triangular Toeplitz matrix with [1, a1, a2, ..., aM, 0, ..., 0]ᵀ as its first column, and g is a gain term to compensate for the level difference between the normalized codebook spectra and the observed spectra. [0121] If we let the frame length approach infinity, the covariance matrices can be approximated as circulant and are diagonalized by the Fourier transform. The log-likelihood in the above equation, corresponding to the ith vector ai of the speech codebook, can then be written using frequency-domain quantities as (see, for example, U. Grenander and G. Szego, “Toeplitz forms and their applications”, 2nd ed. New York, USA: Chelsea, 1984):

log p(yk | ai) = C - (N/4π) ∫ [ log(gi/|Ai(ω)|² + Pwk(ω)) + Pyk(ω) / (gi/|Ai(ω)|² + Pwk(ω)) ] dω,

where the integral runs over -π ≤ ω ≤ π, C captures the signal-independent constant terms, and Ai(ω) is the spectrum of the ith codebook vector, given by

Ai(ω) = 1 + ai,1 e^(-jω) + ... + ai,M e^(-jωM). [0122] For a given codebook vector ai, the gain compensation term can be obtained as:

gi = (1/2π) ∫ (Pyk(ω) - Pwk(ω)) |Ai(ω)|² dω, [0123] where negative values of Pyk(ω) - Pwk(ω), which may arise due to erroneous estimates of the noise PSD Pwk(ω), are set to zero. Note that all quantities in this equation are available: the noisy-signal PSD Pyk(ω) and the noise PSD Pwk(ω) can be estimated from the microphone signal, and Ai(ω) is specified by the ith codebook vector.
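A sketch of the frequency-domain log-likelihood of [0121] and the gain term of [0122]-[0123] follows, with the integrals approximated by averages over a uniform frequency grid on [0, π) (valid for even spectra) and the constant C dropped, since only relative comparisons matter. The grid size and function names are illustrative assumptions.

```python
import numpy as np

def log_likelihood(p_y, p_w, a_i, n_samples):
    """Log-likelihood of codebook vector a_i for one microphone, per [0121],
    up to the signal-independent constant C."""
    w = np.linspace(0.0, np.pi, len(p_y), endpoint=False)
    A = np.exp(-1j * np.outer(w, np.arange(len(a_i)))) @ a_i
    A2 = np.abs(A) ** 2
    # gain term of [0122]-[0123]: negative differences clipped to zero
    g = np.mean(np.maximum(p_y - p_w, 0.0) * A2)
    p_model = g / A2 + p_w  # modeled noisy-signal PSD
    return float(-0.5 * n_samples * np.mean(np.log(p_model) + p_y / p_model))

def closest_microphone(psds_y, psds_w, codebook, n_samples=256):
    """[0124]-[0125]: maximize over the codebook per microphone, then pick
    the microphone with the highest maximum likelihood value."""
    scores = [max(log_likelihood(p_y, p_w, a_i, n_samples) for a_i in codebook)
              for p_y, p_w in zip(psds_y, psds_w)]
    return int(np.argmax(scores)), scores
```

A microphone whose observed PSD is well explained by some codebook envelope plus the local noise PSD receives a high score, mirroring the selection rule described in the following paragraphs.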
[0124] For each microphone, a maximum likelihood value is calculated over all vectors in the codebook, i.e.,

Lk = max over 1 ≤ i ≤ I of log p(yk | ai),

where I is the number of vectors in the speech codebook. This maximum likelihood value is then used as the similarity indication for the specific microphone signal. [0125] Finally, the microphone with the highest maximum likelihood value is determined to be the microphone closest to the speaker, that is, the microphone signal resulting in the highest maximum likelihood value is determined:

k* = arg max over k of Lk. [0126] Experiments were performed for this specific example. A codebook of speech LP coefficients was generated using training data from the Wall Street Journal (WSJ) speech database (“CSR-II (WSJ1) Complete”, Linguistic Data Consortium, Philadelphia, 1994). 180 different training utterances, lasting about 5 s each, from 50 different speakers (25 male and 25 female), were used as training data. Using the training utterances, about 55,000 LP coefficient vectors were extracted from 256-sample Hann-windowed segments with 50 percent overlap, at a sampling frequency of 8 kHz. The codebook was trained using the LBG algorithm (Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design”, IEEE Trans. Communications, vol. COM-28, no. 1, pages 84-95, Jan. 1980) with the Itakura-Saito distortion (S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, “Objective Measures of Speech Quality”, New Jersey, USA: Prentice-Hall, 1988) as the error criterion. The codebook size was set to 256 entries. A three-microphone configuration was considered, with the microphones located 50 cm, 150 cm and 350 cm from the speaker in a reverberant room (T60 = 800 ms). The impulse response between the speaker location and each of the three microphones was recorded and then convolved with a dry speech signal to obtain the microphone data. The microphone noise at each microphone was 40 dB below the speech level. [0127] Figure 4 shows the probability p(y1) for the microphone located 50 cm from the speaker.
In regions dominated by speech, this microphone (which is situated closest to the speaker) receives a probability value close to unity, and the probability values for the other two microphones are close to zero. The closest microphone is thus correctly identified. [0128] A specific advantage of the approach is that it inherently compensates for signal level differences between the different microphones. [0129] It should be noted that the approach selects the appropriate microphone during speech activity. However, during non-speech segments (such as pauses in speech or when the speaker changes), it will not allow this selection to be determined. However, this can simply be addressed by the system including a speech activity detector (such as a simple level detector) to identify periods of non-speech. During these periods, the system can simply proceed to use the combination parameters determined for the last segment that included a speech component. [0130] In the previous embodiments, the similarity indications were generated by comparing properties of the microphone signals with properties of non-reverberant speech samples, and specifically by comparing properties of the microphone signals with properties of speech signals that result from evaluating a speech model using stored parameters. [0131] However, in other embodiments, a set of properties can be derived by analyzing the microphone signals, and these properties can then be compared to the expected values for non-reverberant speech. Thus, the comparison can be made in the parameter or property domain without considering specific samples of non-reverberant speech. [0132] Specifically, the similarity processor 105 can be arranged to decompose the microphone signals using a set of base signal vectors. This decomposition can specifically use a sparse overcomplete dictionary that contains signal prototypes, also called atoms. A signal is then described as a linear combination of a subset of the dictionary.
In this way, each atom can, in this case, correspond to a base signal vector. [0133] In such embodiments, the property derived from the microphone signals and used in the comparison may be the number of base signal vectors, and specifically the number of dictionary atoms, needed to represent the signal in a suitable feature domain. [0134] The property can then be compared to one or more expected properties of non-reverberant speech. For example, in many embodiments, the values for the base vector set can be compared to sample values for base vector sets corresponding to specific non-reverberant speech samples. [0135] However, in many embodiments, a simpler approach can be used. Specifically, if the dictionary is trained on non-reverberant speech, then a microphone signal that contains less reverberant speech can be described using a relatively low number of dictionary atoms. The more the signal is affected by reverberation and noise, the more atoms will be needed; that is, the energy tends to be spread more evenly over more base vectors. [0136] Consequently, in many embodiments, the energy distribution over the base vectors can be evaluated and used to determine the similarity indication. The more spread out the distribution, the lower the similarity indication. [0137] As a specific example, when comparing the signals from two microphones, the one that can be described using fewer dictionary atoms is more similar to non-reverberant speech (where the dictionary has been trained on non-reverberant speech). [0138] As a specific example, the number of base vectors for which the value (specifically the weight of the base vector in a combination of base vectors approximating the signal) exceeds a given threshold can be used to determine the similarity indication.
In fact, the number of base vectors that exceed the threshold can simply be counted and used directly as the similarity indication for a given microphone signal, with an increasing number of base vectors indicating reduced similarity. Thus, the property derived from the microphone signal can be the number of base vector values that exceed a threshold, and this can be compared to a reference property for non-reverberant speech of zero or one base vectors having values above the threshold. Thus, the greater the number of base vectors, the lower the similarity indication.

[0139] It should be understood that, for clarity, the above description has described embodiments of the invention with reference to different functional circuits, units and processors. However, it will be evident that any suitable distribution of functionality between different functional circuits, units or processors can be used without departing from the invention. For example, functionality illustrated as being performed by separate processors or controllers may be performed by the same processor or controller. Therefore, references to specific functional units or circuits are to be considered only as references to suitable means of providing the described functionality, and not as indicative of a strict logical or physical structure or organization.

[0140] The invention can be implemented in any suitable form, including hardware, software, firmware or any combination of these. The invention may optionally be implemented at least partially as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. In fact, the functionality can be implemented in a single unit, in a plurality of units, or as part of other functional units.
As such, the invention can be implemented in a single unit, or it can be physically and functionally distributed among different units, circuits and processors.

[0141] Although the present invention has been described in conjunction with some embodiments, it is not intended to be limited to the specific form set forth here. Rather, the scope of the present invention is limited only by the appended claims. Additionally, although a feature may appear to be described in conjunction with specific embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term "comprising" does not exclude the presence of other elements or steps.

[0142] Furthermore, although individually mentioned, a plurality of means, elements, circuits or method steps may be implemented by, for example, a single circuit, unit or processor. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to that category, but rather indicates that the feature is equally applicable to other claim categories, as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked, and in particular the order of individual steps in a method claim does not imply that the steps must be performed in that order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus, references to "a", "an", "first", "second", etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.
Claims (15) [0001] 1. APPARATUS FOR GENERATING A SPEECH SIGNAL, the apparatus being characterized by comprising: microphone receivers (101) for receiving a plurality of microphone signals from a plurality of microphones (103); a processor (105) configured to select a microphone receiver from the microphone receivers (101) based on how much of a microphone signal from the microphone signals arrives at the selected microphone receiver via a direct path and how much arrives at the microphone receiver via reverberant paths, by determining, for each microphone signal, a speech similarity indication indicative of a similarity between the microphone signal and a non-reverberant speech signal, the processor (105) being configured to determine the speech similarity indication in response to a comparison between at least one property derived from the microphone signal and at least one reference property of the non-reverberant speech signal; and a generator (107) configured to generate the speech signal by combining the microphone signals in response to the speech similarity indications, the processor (105) being further configured to determine the speech similarity indication for a first microphone signal in response to a comparison between at least one property derived from the first microphone signal and reference properties of speech samples from a set of non-reverberant speech samples, and wherein the non-reverberant speech signal is a speech signal from someone other than a user of the device. [0002] 2. APPARATUS according to claim 1, characterized in that it comprises a plurality of separate devices (201, 203, 205), each device comprising a microphone receiver for receiving at least one microphone signal from the plurality of microphone signals. [0003] 3.
APPARATUS according to claim 2, characterized in that at least a first device of the plurality of separate devices (201, 203, 205) comprises a local processor (105) for determining a first speech similarity indication for the at least one microphone signal of the first device. [0004] 4. APPARATUS according to claim 3, characterized in that the generator (107) is implemented in a generator device (207) separate from at least the first device; and the first device comprises a transmitter (209) for transmitting the first speech similarity indication to the generator device (207). [0005] 5. APPARATUS according to claim 4, characterized in that the generator device (207) is configured to receive speech similarity indications from each of the plurality of separate devices (201, 203, 205), and the generator (107, 207) is configured to generate the speech signal using a subset of the microphone signals from the plurality of separate devices (201, 203, 205), the subset being determined in response to the speech similarity indications received from the plurality of separate devices (201, 203, 205). [0006] 6. APPARATUS according to claim 5, characterized in that at least one device of the plurality of separate devices (201, 203, 205) is configured to transmit the at least one microphone signal of the at least one device to the generator device (207) only if the at least one microphone signal of the at least one device is comprised in the subset of microphone signals. [0007] 7. APPARATUS according to claim 5, characterized in that the generator device (207) comprises a selector (213) configured to determine the subset of microphone signals, and a transmitter (211) for transmitting an indication of the subset to at least one of the plurality of separate devices (201, 203, 205). [0008] 8. APPARATUS according to claim 1, characterized in that the speech samples from the set of non-reverberant speech samples are represented by parameters of a non-reverberant speech model.
[0009] 9. APPARATUS according to claim 8, characterized in that the processor (105) is configured to determine a first reference property for a first speech sample of the set of non-reverberant speech samples from a speech sample signal generated by evaluating the non-reverberant speech model using the parameters of the first speech sample, and to determine the speech similarity indication for a first microphone signal of the plurality of microphone signals in response to a comparison between the derived property of the first microphone signal and the first reference property. [0010] 10. APPARATUS according to claim 1, characterized in that the processor (105) is configured to decompose a first microphone signal of the plurality of microphone signals into a set of base signal vectors; and to determine the speech similarity indication for the first microphone signal in response to a property of the set of base signal vectors. [0011] 11. APPARATUS according to claim 1, characterized in that the processor (105) is configured to determine the speech similarity indications for each segment of a plurality of segments of the speech signal, and the generator is configured to determine combination parameters for each segment to control how the speech signal is generated from the microphone signals. [0012] 12. APPARATUS according to claim 9, characterized in that the generator (107) is configured to determine the combination parameters for a segment in response to the similarity indications of at least one previous segment. [0013] 13. APPARATUS according to claim 1, characterized in that the generator (107) is configured to select a subset of the microphone signals to be combined in response to the similarity indications. [0014] 14.
METHOD FOR GENERATING A SPEECH SIGNAL, the method being characterized by comprising: receiving microphone signals from a plurality of microphones (103); selecting a microphone from the plurality of microphones based on how much of a microphone signal from the microphone signals arrives at the selected microphone via a direct path and how much arrives at the microphone via reverberant paths, by determining, for each microphone signal, a speech similarity indication indicative of a similarity between the microphone signal and a non-reverberant speech signal, the speech similarity indication being determined in response to a comparison between at least one property derived from the microphone signal and at least one reference property for the non-reverberant speech signal; and generating the speech signal by combining the microphone signals in response to the speech similarity indications, the speech similarity indication for a first microphone signal being determined in response to a comparison between at least one property derived from the first microphone signal and reference properties of speech samples from a set of non-reverberant speech samples, and wherein the non-reverberant speech signal is a speech signal from someone other than a user of the device. [0015] 15. METHOD according to claim 14, characterized in that the act of selecting includes the acts of: decomposing a first microphone signal of the plurality of microphone signals into a set of base signal vectors; and determining the speech similarity indication for the first microphone signal in response to a property of the set of base signal vectors.
Patent family:
Publication number | Publication date
WO2014132167A1 |
EP2962300A1 | 2016-01-06
CN105308681B | 2019-02-12
JP2016511594A | 2016-04-14
CN105308681A | 2016-02-03
EP2962300B1 | 2017-01-25
JP6519877B2 | 2019-05-29
US10032461B2 | 2018-07-24
BR112015020150A2 | 2017-07-18
US20150380010A1 | 2015-12-31
RU2648604C2 | 2018-03-26
Legal status:
2018-11-13 | B06F | Objections, documents and/or translations needed after an examination request [chapter 6.6 patent gazette]
2019-08-13 | B25A | Requested transfer of rights approved. Owner name: MEDIATEK INC. (TW)
2020-06-02 | B06U | Preliminary requirement: requests with searches performed by other patent offices; procedure suspended [chapter 6.21 patent gazette]
2021-07-06 | B09A | Decision: intention to grant [chapter 9.1 patent gazette]
2021-08-17 | B16A | Patent or certificate of addition of invention granted [chapter 16.1 patent gazette]. Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 18/02/2014, SUBJECT TO THE LEGAL CONDITIONS.
Priority:
Application number | Filing date | Patent title
US201361769236P (US 61/769,236) | 2013-02-26 |
PCT/IB2014/059057 (WO2014132167A1) | 2014-02-18 | Method and apparatus for generating a speech signal