Patent abstract:
A voice detection method for detecting the presence of speech signals in a noisy acoustic signal x(t) from a microphone, comprising the following successive steps: calculating a detection function FD(τ) based on the calculation of a difference function D(τ) varying as a function of the offset τ over an integration window of length W beginning at the time t0; a step of adapting the threshold in the current interval, as a function of the maximum values of the acoustic signal x(t) established in said current interval; searching for the minimum of the detection function FD(τ) and comparing this minimum with a threshold, for τ varying in a determined time interval, called the current interval, to detect the presence or absence of a fundamental frequency F0 characteristic of a speech signal in said current interval.
Publication number: FR3014237A1
Application number: FR1361922
Filing date: 2013-12-02
Publication date: 2015-06-05
Inventor: Karim Maouche
Applicant: ADEUNIS RF
Main IPC class:
Patent description:

[0001] The present invention relates to a voice detection method for detecting the presence of speech signals in a noisy acoustic signal from a microphone. It relates more particularly to a voice detection method used in a single-sensor wireless audio communication system. The invention lies in the specific field of voice activity detection, generally referred to as "VAD" (Voice Activity Detection), which consists in detecting speech, in other words speech signals, in an acoustic signal derived from a microphone. The invention finds a preferred, but not limiting, application in a multi-user wireless audio communication system, of the time-division or full-duplex type, between several autonomous communication terminals, that is to say without connection to a transmission base or network, and simple to use, that is to say requiring no intervention of a technician to establish communication. Such a communication system, known in particular from the documents WO10149864 A1, WO10149875 A1 and EP1843326 A1, is conventionally used in a noisy or very noisy environment, for example in a marine environment, during a show or a sporting event (indoor or outdoor), on a construction site, etc. Voice activity detection generally consists in delimiting, by means of quantifiable criteria, the beginning and end of words and/or sentences in a noisy acoustic signal, in other words in a given audio stream. Such detection finds applications in areas such as speech coding, noise reduction and speech recognition. The implementation of a voice detection method in the processing chain of an audio communication system makes it possible, in particular, not to transmit an acoustic or audio signal during periods of silence. The surrounding noise is therefore not transmitted during these periods, which improves the audio rendering of the communication and reduces the transmission rate.
For example, in speech coding, it is known to employ voice activity detection so as to fully encode the audio signal only when the "VAD" method indicates activity. When there is no speech, during a period of silence, the coding rate therefore drops significantly, which on average, over the entire signal, achieves lower rates. There are thus many methods for detecting voice activity, but these perform poorly, or do not work at all, in a noisy or very noisy environment, such as a sports event (outdoors or indoors) where referees need to communicate by wireless audio. Indeed, known voice activity detection methods give poor results when the speech signal is tainted with noise.
[0002] Among the known voice activity detection methods, some implement a detection of the fundamental frequency characteristic of a speech signal. A speech signal, or at least its voiced sounds, indeed has a so-called fundamental frequency, generally called "pitch", which corresponds to the vibration frequency of the vocal cords of the speaker and typically lies between 70 and 400 Hertz. The evolution of this fundamental frequency determines the melody of speech, and its range depends on the speaker, his habits, but also his physical and mental state. Thus, to detect a speech signal, it is known to assume that such a signal is quasi-periodic and that, as a result, a correlation or a difference between the signal and a time-shifted copy of itself will present maxima or minima in the vicinity of the fundamental period and its multiples. The paper "YIN, a fundamental frequency estimator for speech and music," by Alain de Cheveigne and Hideki Kawahara, Journal of the Acoustical Society of America, Vol. 111, No. 4, pp. 1917-1930, April 2002, proposes and develops a method based on the difference between the signal and the same time-shifted signal. Several methods described below are based on the detection of the fundamental frequency (or pitch) of the speech signal in a noisy acoustic signal x(t). A first method of detecting the fundamental frequency implements the search for the maximum of the autocorrelation function R(τ) defined by the following relation: R(τ) = (1/N) Σ_{n=0}^{N-1-τ} x(n)·x(n+τ), with 0 ≤ τ ≤ max(τ). This first method, employing the autocorrelation function, is however not satisfactory in the presence of relatively strong noise. Moreover, the autocorrelation function suffers from the presence of maxima that correspond not to the fundamental frequency or its multiples, but to sub-multiples of it.
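As a minimal sketch of this first method, the autocorrelation maximum can be searched over a plausible lag range; the synthetic signal, the lag bounds and the frame length below are illustrative assumptions, not values from the patent.

```python
import math

def autocorr(x, tau, n):
    # R(tau) = (1/N) * sum_{k=0}^{N-1-tau} x[k] * x[k+tau]
    return sum(x[k] * x[k + tau] for k in range(n - tau)) / n

def estimate_lag(x, n, tau_min, tau_max):
    # Pitch-period candidate = lag maximising the autocorrelation function.
    return max(range(tau_min, tau_max + 1), key=lambda t: autocorr(x, t, n))

# Synthetic voiced signal: period of 50 samples (160 Hz at Fe = 8 kHz).
N = 400
x = [math.sin(2 * math.pi * k / 50) for k in range(N)]
lag = estimate_lag(x, N, 20, 120)
```

On this clean periodic signal the maximum is found at the true period of 50 samples; as the text notes, with strong noise or with sub-multiple maxima this simple search becomes unreliable.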
A second method of detecting the fundamental frequency implements the search for the minimum of the difference function D(τ) defined by the following relation: D(τ) = (1/N) Σ_{n=0}^{N-1-τ} |x(n) − x(n+τ)|, with 0 ≤ τ ≤ max(τ), where |·| is the absolute value operator, this difference function being minimal in the vicinity of the fundamental period and its multiples; the minimum is then compared with a threshold to deduce the decision of presence of voice or not. Compared to the autocorrelation function R(τ), the difference function D(τ) has the advantage of a lower computing load, making this second method more interesting for real-time applications. However, this second method is also not fully satisfactory in the presence of noise. A third method of detecting the fundamental frequency implements the calculation, considering a processing window of length H where H < N, of the squared difference function dt(τ) defined by the relation: dt(τ) = Σ_{j=t+1}^{t+H} (x_j − x_{j+τ})². One then continues with the search for the minimum of the squared difference function dt(τ), this function being minimal in the vicinity of the fundamental period and its multiples, and finally the comparison of this minimum with a threshold to deduce the decision of presence of voice or not. A known improvement of this third method consists in normalizing the squared difference function dt(τ) by calculating a normalized squared difference function d't(τ) corresponding to the following relation: d't(τ) = 1 if τ = 0, and d't(τ) = dt(τ) / [(1/τ) Σ_{j=1}^{τ} dt(j)] otherwise. Although exhibiting better noise immunity and giving better detection results in this context, this third method has limitations in terms of voice detection, particularly in the low-SNR (Signal-to-Noise Ratio) areas characteristic of a very noisy environment.
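The second and third methods can be sketched together: the code below computes the absolute difference function and a cumulative-mean normalization in the spirit of the YIN estimator; the test signal and the lag range are assumptions chosen for illustration.

```python
import math

def diff_fn(x, tau, n):
    # D(tau) = (1/N) * sum |x[k] - x[k+tau]|  (second method)
    return sum(abs(x[k] - x[k + tau]) for k in range(n - tau)) / n

def normalized_diff(x, n, tau_max):
    # Cumulative-mean normalisation in the spirit of the YIN estimator:
    # d'(0) = 1, d'(tau) = d(tau) / ((1/tau) * sum_{j=1}^{tau} d(j)).
    d = [diff_fn(x, t, n) for t in range(tau_max + 1)]
    dn = [1.0]
    cum = 0.0
    for t in range(1, tau_max + 1):
        cum += d[t]
        dn.append(d[t] * t / cum if cum > 0 else 1.0)
    return dn

N = 400
x = [math.sin(2 * math.pi * k / 50) for k in range(N)]
dn = normalized_diff(x, N, 80)
best = min(range(1, 81), key=lambda t: dn[t])
```

The normalization keeps the function near 1 for small lags (avoiding spurious minima at τ close to 0) while leaving a deep minimum at the period, here 50 samples.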
The state of the art can also be illustrated by the teaching of patent application FR 2,825,505, which implements the third method of detecting the aforementioned fundamental frequency, for the extraction of this fundamental frequency. In this patent application, the normalized squared difference function d't(τ) can be compared with a threshold to determine this fundamental frequency (this threshold can be fixed or vary as a function of the time offset τ), and this method has the aforementioned drawbacks associated with the third method. It is an object of the present invention to provide a voice detection method which provides detection of the speech signals contained in a noisy acoustic signal, particularly in noisy or even very noisy environments. It more particularly proposes a voice detection method well suited to communication (in particular between referees) within a stadium, where the noise is of relatively very high level and strongly non-stationary, with detection steps that especially avoid the bad or false detections (generally called "touches") due to the songs of the spectators, drums, music and whistles.
For this purpose, it proposes a voice detection method for detecting the presence of speech signals in a noisy acoustic signal x(t) from a microphone, comprising the following successive steps: - calculation of a detection function FD(τ) based on the calculation of a difference function D(τ) varying as a function of the offset τ over an integration window of length W beginning at the time t0, with: D(τ) = Σ_{n=t0}^{t0+W-1} |x(n) − x(n+τ)|, where 0 ≤ τ ≤ max(τ); - search for the minimum of the detection function FD(τ) and comparison of this minimum with a threshold, for τ varying in a determined time interval, called the current interval, to detect the presence or absence of a fundamental frequency F0 characteristic of a speech signal in said current interval; said method being remarkable in that it comprises, before the step of searching and comparing, a step of adaptation of the threshold in said current interval, as a function of values calculated from the acoustic signal x(t) established in at least one time interval preceding said current interval, and in particular maximum values of said acoustic signal x(t). Thus, this method is based on the principle of an adaptive threshold, which will be relatively low during periods of noise or silence and relatively high during speech periods. As a result, false detections are minimized and speech is detected correctly with a minimum of cuts at the beginning and end of words. According to a first possibility, the detection function FD(τ) corresponds to the difference function D(τ). According to a second possibility, the detection function FD(τ) corresponds to the normalized difference function DN(τ) calculated from the difference function D(τ) as follows: DN(τ) = 1 if τ = 0, and DN(τ) = D(τ) / [(1/τ) Σ_{j=1}^{τ} D(j)] if τ ≠ 0.
It is of course advantageous to carry out the method on a sampled acoustic signal, i.e. the method incorporates a preliminary sampling step comprising a cutting of the acoustic signal x(t) into a discrete acoustic signal {xi} composed of a sequence of vectors associated with time frames i of length N, N corresponding to the number of sampling points, where each vector translates the acoustic content of the associated frame i and is composed of the N samples x_{(i-1)N+1}, x_{(i-1)N+2}, ..., x_{iN-1}, x_{iN}, i being a positive integer, so that: - the calculation of the detection function FD(τ) consists of a calculation of a discrete detection function FDi(τ) associated with the frames i; - the adaptation of the threshold consists of, for each frame i, adapting a threshold Ωi specific to the frame i as a function of reference values calculated from the values of the samples of the discrete acoustic signal {xi} in said frame i; - the search for the minimum of the detection function FD(τ) and the comparison of this minimum with a threshold are carried out by searching, on each frame i, the minimum r(i) of the discrete detection function FDi(τ) and comparing this minimum r(i) with the threshold Ωi specific to the frame i.
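The claimed succession of steps (difference function over an integration window, minimum search, threshold comparison) can be sketched as follows. The fixed value 0.3 is a purely illustrative stand-in for the adaptive threshold whose construction the method details, and the two synthetic signals are assumptions for the demonstration.

```python
import math, random

def D(x, t0, W, tau):
    # D(tau) = sum_{n=t0}^{t0+W-1} |x[n] - x[n+tau]|
    return sum(abs(x[n] - x[n + tau]) for n in range(t0, t0 + W))

def DN(x, t0, W, tau_max):
    # Normalized difference function DN(tau) of the second possibility.
    d = [D(x, t0, W, t) for t in range(tau_max + 1)]
    out, cum = [1.0], 0.0
    for t in range(1, tau_max + 1):
        cum += d[t]
        out.append(d[t] * t / cum if cum > 0 else 1.0)
    return out

def detect(x, t0, W, tau_max, threshold):
    # Presence of a fundamental frequency if the minimum of the detection
    # function falls below the threshold (fixed here for illustration;
    # the patented method adapts it per interval).
    fd = DN(x, t0, W, tau_max)
    return min(fd[1:]) < threshold

random.seed(0)
voiced = [math.sin(2 * math.pi * k / 50) + 0.05 * random.gauss(0, 1)
          for k in range(512)]
noise = [0.3 * random.gauss(0, 1) for _ in range(512)]
```

For the quasi-periodic signal the minimum of DN is close to 0, while for white noise it stays close to 1, which is what makes the threshold comparison workable.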
In a particular embodiment, the discrete difference function Di(τ) relative to the frame i is calculated as follows: - the frame i is subdivided into K subframes of length H, with for example K = ⌊(N − max(τ))/H⌋, where ⌊·⌋ represents the integer rounding operator, so that the samples of the discrete acoustic signal {xi} in a subframe of index p of the frame i comprise the H samples: x_{(i-1)N+(p-1)H+1}, x_{(i-1)N+(p-1)H+2}, ..., x_{(i-1)N+pH}, p being a positive integer between 1 and K; - for each subframe of index p, the following difference function ddp(τ) is calculated: ddp(τ) = Σ_{j=(i-1)N+(p-1)H+1}^{(i-1)N+pH} |x_j − x_{j+τ}|; - the discrete difference function Di(τ) relative to the frame i is calculated as the sum of the difference functions ddp(τ) of the subframes of index p of the frame i, namely: Di(τ) = Σ_{p=1}^{K} ddp(τ). In the case of the second possibility mentioned above, the calculation of the normalized difference function DN(τ) consists of a calculation of a discrete normalized difference function DNi(τ) associated with the frames i, where: DNi(τ) = 1 if τ = 0, and DNi(τ) = Di(τ) / [(1/τ) Σ_{j=1}^{τ} Di(j)] if τ ≠ 0.
Advantageously, the step of adaptation of the thresholds Ωi for each frame i comprises the following steps: a) - the frame i comprising N sample points is subdivided into T subframes of length L, where N is a multiple of T so that the length L = N/T is an integer, and so that the samples of the discrete acoustic signal {xi} in a subframe of index j of the frame i comprise the following L samples: x_{(i-1)N+(j-1)L+1}, x_{(i-1)N+(j-1)L+2}, ..., x_{(i-1)N+jL}, j being a positive integer between 1 and T; b) - the maximum values m_{i,j} of the discrete acoustic signal {xi} are calculated in each subframe of index j of the frame i, with: m_{i,j} = max{x_{(i-1)N+(j-1)L+1}, x_{(i-1)N+(j-1)L+2}, ..., x_{(i-1)N+jL}}; c) - at least one reference value Ref_{i,j}, MRef_{i,j} specific to the subframe j of the frame i is calculated, the or each reference value Ref_{i,j}, MRef_{i,j} per subframe j being calculated from the maximum value m_{i,j} in the subframe j of the frame i; d) - the value of the threshold Ωi specific to the frame i is established as a function of all the reference values Ref_{i,j}, MRef_{i,j} calculated in the subframes j of the frame i. Thus, the maximum values m_{i,j} established in the subframes j are considered to make the decision (voice or absence of voice) on the entire frame i.
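Steps a) and b) above can be sketched as follows; taking the maximum of the absolute amplitude (rather than the raw sample value) is an assumption, as are the example values of N and T.

```python
def subframe_maxima(frame, T):
    # Steps a)-b): split a frame of N samples into T subframes of
    # length L = N/T and take the maximum absolute amplitude
    # (an assumption) in each subframe j.
    N = len(frame)
    assert N % T == 0, "N must be a multiple of T"
    L = N // T
    return [max(abs(v) for v in frame[j * L:(j + 1) * L]) for j in range(T)]

frame = [0.1] * 60 + [0.5] * 60 + [-0.9] * 60 + [0.2] * 60   # N = 240, T = 4
m = subframe_maxima(frame, 4)
```

Each entry of `m` is one m_{i,j}; the reference values of steps c) and d) are then derived from this short per-frame sequence of maxima.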
According to one characteristic, during step c), the following substeps are carried out on each frame i: c1) - the smoothed envelopes of maxima m̄_{i,j} are calculated in each subframe of index j of the frame i, with: m̄_{i,j} = λ·m̄_{i,j-1} + (1 − λ)·m_{i,j}, where λ is a predefined coefficient between 0 and 1; c2) - the variation signals Δ_{i,j} are calculated in each subframe of index j of the frame i, with: Δ_{i,j} = m_{i,j} − m̄_{i,j} = λ·(m_{i,j} − m̄_{i,j-1}), and where at least one so-called principal reference value Ref_{i,j} per subframe j is calculated from the variation signal Δ_{i,j} in the subframe j of the frame i. Thus, the variation signals Δ_{i,j} established in the subframes j are considered to make the decision (voice or absence of voice) on the entire frame i, making the detection of speech (or voice) reliable. According to another characteristic, during step c) and following substep c2), the following substeps are performed on each frame i: c3) - the maxima of variation s_{i,j} are calculated in each subframe of index j of the frame i, where s_{i,j} corresponds to the maximum of the variation signal Δ_{i,j} calculated on a sliding window of length Lm prior to said subframe j, said length Lm being variable depending on whether the subframe j of the frame i corresponds to a period of silence or of presence of speech; c4) - the variation deviations δ_{i,j} are calculated in each subframe of index j of the frame i, with: δ_{i,j} = Δ_{i,j} − s_{i,j}; and where, for each subframe j of the frame i, two principal reference values Ref_{i,j} are calculated from the variation signal Δ_{i,j} and the variation deviation δ_{i,j}, respectively. Thus, the variation signals Δ_{i,j} and the variation deviations δ_{i,j} established in the subframes j are considered jointly to select the value of the adaptive threshold Ωi and thus make the decision (voice or absence of voice) on the entire frame i, enhancing the detection of speech. In other words, the pair (Δ_{i,j}, δ_{i,j}) is studied to determine the value of the adaptive threshold Ωi.
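Substeps c1) to c4) can be sketched as follows on a flattened sequence of per-subframe maxima. The exponential-smoothing form of the envelope, its initialization to the first maximum, and the fixed-length window handling are assumptions reconstructed from the partially legible formulas.

```python
def variation_signals(m, lam, Lm):
    # m: per-subframe maxima m_{i,j} flattened into one sequence.
    # c1) smoothed envelope (assumed EMA form): mbar_k = lam*mbar_{k-1} + (1-lam)*m_k
    # c2) variation: Delta_k = m_k - mbar_k
    # c3) sliding max s_k of Delta over the Lm values strictly before k
    # c4) variation deviation: delta_k = Delta_k - s_k
    mbar, Delta, s, delta = [], [], [], []
    prev = m[0]
    for k, mk in enumerate(m):
        mb = lam * prev + (1 - lam) * mk
        mbar.append(mb)
        prev = mb
        Delta.append(mk - mb)
        window = Delta[max(0, k - Lm):k]       # values strictly before k
        sk = max(window) if window else 0.0
        s.append(sk)
        delta.append(Delta[k] - sk)
    return mbar, Delta, s, delta

m = [1.0] * 10 + [5.0] * 5          # amplitude jump mimicking a speech onset
mbar, Delta, s, delta = variation_signals(m, 0.9, 8)
```

At the jump, Δ spikes to λ·(5 − 1) = 3.6 while the sliding maximum of the preceding quiet zone stays at 0, so the deviation δ also spikes: exactly the behavior the pair (Δ, δ) is meant to capture.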
Advantageously, during step c) and following substep c4), a substep c5) is performed for calculating the normalized variation signals Δ'_{i,j} and normalized variation deviations δ'_{i,j} in each subframe of index j of the frame i, as follows: Δ'_{i,j} = (m_{i,j} − m̄_{i,j}) / m̄_{i,j} and δ'_{i,j} = δ_{i,j} / m̄_{i,j}, and where, for each subframe j of a frame i, the normalized variation signal Δ'_{i,j} and the normalized variation deviation δ'_{i,j} each constitute a principal reference value Ref_{i,j}, so that, in step d), the value of the threshold Ωi specific to the frame i is set as a function of the normalized variation signals Δ'_{i,j} and normalized variation deviations δ'_{i,j} in the subframes j of the frame i. In this way, the threshold Ωi varies independently of the levels of the signals Δ_{i,j} and δ_{i,j}, by normalizing them through the calculation of the normalized signals Δ'_{i,j} and δ'_{i,j}. Thus, the thresholds Ωi chosen from these normalized signals Δ'_{i,j} and δ'_{i,j} will be independent of the level of the discrete acoustic signal {xi}; that is, the pair (Δ'_{i,j}, δ'_{i,j}) is studied to determine the value of the adaptive threshold Ωi. Advantageously, during step d), the value of the threshold Ωi specific to the frame i is established by partitioning the space defined by the value of the pair (Δ'_{i,j}, δ'_{i,j}), and examining the value of this pair on one or more (for example between one and three) successive subframes according to the value zone of the pair. Thus, the procedure for calculating the threshold Ωi is based on an experimental partition of the space defined by the value of the pair (Δ'_{i,j}, δ'_{i,j}). To this is added a decision mechanism that scrutinizes the value of the pair on one, two or more successive subframes according to the value zone of the pair. The conditions for testing the positioning of the value of the pair (Δ'_{i,j}, δ'_{i,j}) depend mainly on the speech detection during the previous frame, and the scanning mechanism on the one, two or more successive subframes also uses an experimental partitioning.
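The point of substep c5) is level invariance, which the following sketch demonstrates: scaling the whole signal leaves the normalized variation unchanged. The EMA form of the envelope is the same assumption as above.

```python
def normalized_variation(m, lam):
    # Delta'_k = (m_k - mbar_k) / mbar_k : invariant to the overall
    # level of the acoustic signal, which is the point of step c5).
    out, prev = [], m[0]
    for mk in m:
        mb = lam * prev + (1 - lam) * mk
        prev = mb
        out.append((mk - mb) / mb if mb != 0 else 0.0)
    return out

m = [1.0, 1.2, 0.9, 4.0, 4.5, 1.1]
a = normalized_variation(m, 0.8)
b = normalized_variation([10 * v for v in m], 0.8)   # same signal, 20 dB louder
```

Since the envelope is linear in the input, both numerator and denominator scale by the same factor, so thresholds chosen on (Δ', δ') do not depend on how loud the user speaks.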
According to one characteristic, during substep c3), the length Lm of the sliding window corresponds to the following equations: Lm = L0 if the subframe j of the frame i corresponds to a period of silence; Lm = L1 if the subframe j of the frame i corresponds to a speech presence period; with L1 < L0, and in particular with L1 = k1·L and L0 = k0·L, L being the length of the subframes of index j and k0, k1 being positive integers. According to another characteristic, during substep c3), for each calculation of the maximum of variation s_{i,j} in the subframe j of the frame i, the sliding window of length Lm is delayed by Mm frames of length N with respect to said subframe j. According to another characteristic, the following improvements are carried out: - in substep c3), the normalized maxima of variation s'_{i,j} are calculated in each subframe of index j of the frame i, where s'_{i,j} corresponds to the maximum of the normalized variation signal Δ'_{i,j} calculated on a sliding window of length Lm prior to said subframe j, and where each normalized maximum of variation s'_{i,j} is calculated according to a minimization method comprising iterative steps in which two running maxima of the delayed signal Δ'_{i-Mm,j} are updated recursively, one of them being reset whenever rem(i, Lm) = 0, where rem is the remainder operator of the integer division of two integers, with the initial values s'_{0,1} = 0; - during step c4), the normalized variation deviations δ'_{i,j} are calculated in each subframe of index j of the frame i, as follows: δ'_{i,j} = Δ'_{i,j} − s'_{i,j}. Advantageously, during step c), a substep c6) is performed in which the maxima of maximum q_{i,j} are calculated in each subframe of index j of the frame i, where q_{i,j} corresponds to the maximum of the maximum value m_{i,j} calculated on a sliding window of fixed length Lq prior to said subframe j, where the sliding window of length Lq is delayed by Mq frames of length N with respect to said subframe j, and where another so-called secondary reference value MRef_{i,j} per subframe j corresponds
to said maximum of maximum q_{i,j} in the subframe j of the frame i. Thus, to avoid false detections, it is advantageous to also take into account this signal q_{i,j} (secondary reference value MRef_{i,j} = q_{i,j}), which is calculated in a manner similar to the calculation of the signal s_{i,j} above, but which operates on the maximum values m_{i,j} instead of operating on the variation signals Δ_{i,j} or on the normalized variation signals Δ'_{i,j}. In a particular embodiment, during step d), the threshold Ωi specific to the frame i is divided into several sub-thresholds Ω_{i,j} specific to each subframe j of the frame i, and the value of each sub-threshold Ω_{i,j} is set at least according to the reference value(s) Ref_{i,j}, MRef_{i,j} calculated in the subframe j of the corresponding frame i. Thus, we have Ωi = {Ω_{i,1}; Ω_{i,2}; ...; Ω_{i,T}}, translating the division of the threshold Ωi into several sub-thresholds Ω_{i,j} specific to the subframes j, providing additional finesse in the establishment of the adaptive threshold Ωi.
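The iterative recursion of substep c3) suggests a classic streaming implementation of a sliding maximum with two accumulators, one of which is reset whenever rem(k, Lm) = 0. The sketch below is an assumed reconstruction (the exact recursion is garbled in the source text) and assumes nonnegative inputs such as the variation maxima.

```python
def streaming_max(values, Lm):
    # Two-accumulator sliding maximum: `cur` is the max since the last
    # reset, `prev` the max of the previous completed block of Lm values;
    # the reset fires when rem(k, Lm) == 0, as suggested by substep c3).
    # The reported value covers between Lm and 2*Lm - 1 past samples,
    # so it always upper-bounds the exact Lm-sample sliding maximum.
    out, cur, prev = [], 0.0, 0.0
    for k, v in enumerate(values):
        if k % Lm == 0:
            prev, cur = cur, 0.0
        cur = max(cur, v)
        out.append(max(prev, cur))
    return out

import random
random.seed(1)
vals = [random.random() for _ in range(200)]
approx = streaming_max(vals, 16)
```

This costs O(1) per sample instead of O(Lm), a relevant saving for a real-time terminal; the price is that the effective window length fluctuates between Lm and 2·Lm.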
[0003] Advantageously, during step d), the value of each sub-threshold Ω_{i,j} specific to the subframe j of the frame i is established by comparing the values of the pair (Δ'_{i,j}, δ'_{i,j}) with several pairs of fixed thresholds, the value of each sub-threshold Ω_{i,j} being selected from several fixed values as a function of the comparisons of the pair (Δ'_{i,j}, δ'_{i,j}) with said fixed threshold pairs. These pairs of fixed thresholds are for example determined experimentally by a distribution of the space of the values (Δ'_{i,j}, δ'_{i,j}) into decision zones. In a complementary manner, the value of each sub-threshold Ω_{i,j} specific to the subframe j of the frame i is also established by performing a comparison of the pair (Δ'_{i,j}, δ'_{i,j}) on one or more successive subframes according to the initial zone of the pair. The conditions for testing the positioning of the value of the pair (Δ'_{i,j}, δ'_{i,j}) depend on the speech detection during the previous frame, and the comparison mechanism on the successive subframe(s) also uses an experimental partitioning. Of course, it is also conceivable that the value of each sub-threshold Ω_{i,j} specific to the subframe j of the frame i be determined by comparing: - the values of the pair (Δ'_{i,j}, δ'_{i,j}) (the principal reference values Ref_{i,j}) with several pairs of fixed thresholds; - the values of q_{i,j} (the secondary reference value MRef_{i,j}) with several other fixed thresholds. Thus, the decision mechanism based on comparing the pair (Δ'_{i,j}, δ'_{i,j}) with fixed threshold pairs is supplemented by another decision mechanism based on the comparison of q_{i,j} with other fixed thresholds.
Advantageously, during step d), a so-called decision procedure is carried out comprising the following substeps, for each frame i: - for each subframe j of the frame i, a decision index DECi(j) is established which occupies either a state "1" of detection of a speech signal or a state "0" of non-detection of a speech signal; - a temporary decision VAD(i) is established based on the combination of the decision indices DECi(j) with logical "OR" operators, so that the temporary decision VAD(i) occupies a state "1" of detection of a speech signal if at least one of said decision indices DECi(j) occupies this state "1" of detection of a speech signal. Thus, in order to avoid late detections (word clipping at the beginning of detection), the final decision (voice or no voice) is taken following this decision procedure, based on the temporary decision VAD(i), which is itself taken on the entire frame i by applying a logical "OR" operator on the decisions taken in the subframes j, and preferably in successive subframes j over a short and finite horizon from the beginning of the frame i.
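The OR-combination of per-subframe decisions can be sketched in one line:

```python
def temporary_decision(dec_ij):
    # VAD(i) = OR over the per-subframe decision indices DEC_i(j):
    # the frame is declared voiced as soon as one subframe is voiced,
    # which limits late detections (clipped word onsets).
    return 1 if any(d == 1 for d in dec_ij) else 0
```

A single voiced subframe is enough to flag the whole frame, which is exactly what catches a word onset that starts in the last subframe of a frame.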
[0004] During this decision procedure, it is possible to perform the following substeps, for each frame i: - a maximum threshold value Lastmax is stored which corresponds to the variable value of a comparison threshold for the amplitude of the discrete acoustic signal {xi} below which it is considered that the acoustic signal does not comprise a speech signal, this variable value being determined during the last frame of index k which precedes said frame i and in which the temporary decision VAD(k) occupied a state "1" of detection of a speech signal; - a maximum average value A_{i,j} is stored which corresponds to the average maximum value of the discrete acoustic signal {xi} in the subframe j of the frame i, calculated as follows: A_{i,j} = β·a_{i,j} + (1 − β)·A_{i,j-1}, where a_{i,j} corresponds to the maximum of the discrete acoustic signal {xi} contained in a window formed by the subframe j of the frame i and by at least one or more successive subframes that precede said subframe j, and β is a predefined coefficient between 0 and 1; - the value of each sub-threshold Ω_{i,j} is established as a function of the comparison between said maximum threshold value Lastmax and the average maximum values A_{i,j} and A_{i,j-1} considered over two successive subframes j and j-1. In many cases, the false detections arrive with a lower amplitude than the speech signal (the microphone being located next to the mouth of the communicator). Thus, this decision procedure aims at eliminating even more of the false detections, by storing the maximum threshold value Lastmax of the speech signal updated in the last activation period, and the average maximum values A_{i,j} and A_{i,j-1} corresponding to the average maximum value of the discrete acoustic signal {xi} in the subframes j and j-1 of the frame i. Taking into account these values (Lastmax, A_{i,j} and A_{i,j-1}), a condition is added at the level of the establishment of the adaptive threshold Ωi. It is important that the value of β be chosen lower than the coefficient λ to slow the fluctuations of A_{i,j}.
In the decision procedure mentioned above, the maximum threshold value Lastmax is updated each time the process has considered that a subframe p of a frame k contains a speech signal, by implementing the following procedure: - the detection of a speech signal in the subframe p of the frame k follows a period of absence of speech, and in this case Lastmax takes the value [α·(A_{k,p} + Lastmax)], where α is a predefined coefficient between 0 and 1, and for example between 0.2 and 0.7; - the detection of a speech signal in the subframe p of the frame k follows a period of presence of speech, and in this case Lastmax takes the updated value A_{k,p} if A_{k,p} > Lastmax. Updating the Lastmax value is thus done only during the periods of activation of the method (i.e. the detection periods of the voice). In a speech detection situation, the value Lastmax will be worth A_{k,p} when A_{k,p} > Lastmax. However, it is important that this update be done as follows when activating the first subframe p that follows a silence zone: the value Lastmax will be worth [α·(A_{k,p} + Lastmax)]. This mechanism of updating the maximum threshold value Lastmax allows the process to detect the user's voice even if the user has reduced the intensity of his voice (i.e. is speaking less loudly) compared to the last time the process detected that he had spoken. In other words, to further improve the elimination of false detections, a fine processing is performed in which the maximum threshold value Lastmax is variable and is compared with the average maximum values A_{i,j} and A_{i,j-1} of the discrete acoustic signal. Indeed, distant voices could otherwise be picked up by the method, since such voices have fundamental frequencies that can be detected, just like the voice of the user.
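The two Lastmax update rules can be sketched as follows; the value α = 0.5 is an assumption taken from the stated 0.2 to 0.7 range.

```python
def update_lastmax(lastmax, A_kp, after_silence, alpha=0.5):
    # alpha is a predefined coefficient in (0, 1), e.g. 0.2 to 0.7.
    if after_silence:
        # First voiced subframe after a silence zone: blend the new level
        # with the stored one, so a quieter speaker is still tracked.
        return alpha * (A_kp + lastmax)
    # Ongoing speech: track the maximum upward only.
    return A_kp if A_kp > lastmax else lastmax

lm = 2.0
lm = update_lastmax(lm, 4.0, after_silence=True)     # 0.5 * (4 + 2) = 3.0
lm2 = update_lastmax(lm, 5.0, after_silence=False)   # 5 > 3, so 5.0
lm3 = update_lastmax(lm2, 4.0, after_silence=False)  # below 5, unchanged
```

The blended update after silence is what lets the threshold decay toward a user who resumes speaking more quietly, while the upward-only rule during speech keeps it anchored to the loudest recent speech level.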
To ensure that the distant voices, which can be troublesome in several use cases, are not taken into account by the process, a processing is considered in which the average maximum value of the signal over two successive subframes, in this case A_{i,j} and A_{i,j-1}, is compared with Lastmax, which constitutes a variable threshold according to the amplitude of the voice of the user measured at the last activation. Thus, the value of the threshold Ωi is fixed at a very low minimum value when the signal is below the threshold. This condition for establishing the value of the threshold Ωi as a function of the maximum threshold value Lastmax is advantageously based on the comparison between: - the maximum threshold value Lastmax; and - the values [Kp·A_{i,j}] and [Kp·A_{i,j-1}], where Kp is a fixed weighting coefficient between 1 and 2. In this way, the maximum threshold value Lastmax is compared with the average maximum values of the discrete acoustic signal {xi} in the subframes j and j-1 (A_{i,j} and A_{i,j-1}) weighted with a weighting coefficient Kp of between 1 and 2, to enhance the detection. This comparison is done only when the previous frame did not give rise to a detection of voice. Advantageously, the method further comprises a so-called blocking phase comprising a step of switching from a state of non-detection of a speech signal to a state of detection of a speech signal only after having detected the presence of a speech signal on Np successive time frames i. Thus, the method implements a step of the hangover type configured in such a way that the transition from a voiceless situation to a situation with voice presence occurs only after Np successive frames with presence of voice. Likewise, the method further comprises a so-called blocking phase comprising a step of switching from a state of detection of a speech signal to a state of non-detection of a speech signal only after having detected no presence of a speech signal over NA successive time frames i.
Thus, the method implements a hangover-type step configured in such a way that the transition from a situation with voice presence to a voiceless situation occurs only after NA successive voiceless frames. Without these switching steps, the process may occasionally cut the acoustic signal during sentences, or even in the middle of spoken words. To remedy this, these switching steps implement a blocking or hangover step over a given series of frames. According to a possibility of the invention, the method comprises a step of interrupting the blocking phase in decision zones intervening at the end of words and in a non-noisy situation, said decision zones being detected by analyzing the minimum r(i) of the discrete detection function FDi(τ). Thus, the blocking phase is interrupted at the end of a sentence or word upon a particular detection in the decision space. This interruption occurs only in a situation with little or no noise. As such, the method provides for isolating a particular decision zone that occurs only at the end of words and in a non-noisy situation. To reinforce the detection decision of this zone, the method also uses the minimum r(i) of the discrete detection function FDi(τ), where the discrete detection function FDi(τ) corresponds to either the discrete difference function Di(τ) or the discrete normalized difference function DNi(τ). As a result, the voice will be cut faster at the end of speech, giving the system better audio quality. The invention also relates to a computer program comprising code instructions able to control the execution of the steps of the voice detection method as defined above when it is executed by a processor. The invention further relates to a data recording medium on which such a computer program is stored. Another object of the invention is to make such a computer program available over a telecommunication network with a view to downloading it.
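The blocking (hangover) phase described above can be sketched as a small state machine over the per-frame instantaneous decisions; the values of Np and NA in the demonstration are illustrative assumptions.

```python
def hangover(raw, Np, Na):
    # raw: per-frame instantaneous voice decisions (0/1).
    # Switch on only after Np consecutive voiced frames, switch off only
    # after Na consecutive unvoiced frames, to avoid chopping words.
    out, state, cv, cu = [], 0, 0, 0
    for r in raw:
        if r:
            cv, cu = cv + 1, 0
        else:
            cv, cu = 0, cu + 1
        if state == 0 and cv >= Np:
            state = 1
        elif state == 1 and cu >= Na:
            state = 0
        out.append(state)
    return out

decisions = hangover([1, 0, 1, 1, 0, 0, 0, 0], Np=2, Na=3)
```

An isolated voiced frame (a possible false detection) never turns the output on, and an isolated unvoiced frame inside a sentence never turns it off, which is the behavior the blocking phase is meant to provide.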
Other characteristics and advantages of the present invention will appear on reading the following detailed description of a non-limiting example of implementation, with reference to the appended figures, in which: - FIG. 1 is a synoptic diagram of the method according to the invention; - FIG. 2 is a schematic view of a limiting loop implemented by a decision blocking step, called the hangover-type step; - FIG. 3 illustrates the result of a voice detection method using a fixed threshold with, at the top, a representation of the curve of the minimum r(i) of the detection function and the fixed threshold line Ωfix and, at the bottom, a representation of the discrete acoustic signal {xi} and of the output signal DFi; - FIG. 4 illustrates the result of a voice detection method according to the invention using an adaptive threshold with, at the top, a representation of the curve of the minimum r(i) of the detection function and the adaptive threshold line Ωi and, at the bottom, a representation of the discrete acoustic signal {xi} and of the output signal DFi. The description of the voice detection method is made with reference to FIG. 1, which schematically illustrates the succession of the different steps necessary for detecting the presence of speech (or voice) signals in a noisy acoustic signal x(t) originating from a single microphone operating in a noisy environment. The method starts with a preliminary sampling step 101 comprising cutting the acoustic signal x(t) into a discrete acoustic signal {xi} composed of a sequence of vectors associated with time frames i of length N, N corresponding to the number of sampling points, where each vector translates the acoustic content of the associated frame i and is composed of the N samples x_{(i-1)N+1}, x_{(i-1)N+2}, ..., x_{iN-1}, x_{iN}, i being a positive integer. For example, the noisy acoustic signal x(t) is cut into frames of 240 or 256 samples, which at a sampling frequency Fe of 8 kHz corresponds to time frames of 30 or 32 milliseconds.
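The sampling step 101 amounts to cutting the sample stream into fixed-length frames, which can be sketched as follows (dropping an incomplete trailing frame is an assumption of this sketch):

```python
def frame_signal(x, N):
    # Cut the sampled signal into time frames {x_i} of N samples each
    # (incomplete trailing samples are dropped in this sketch).
    return [x[i * N:(i + 1) * N] for i in range(len(x) // N)]

Fe = 8000          # sampling frequency, Hz
N = 240            # 240 samples -> 30 ms frames at Fe = 8 kHz
x = list(range(1000))
frames = frame_signal(x, N)
```

With 1000 samples this yields four complete 240-sample frames; all subsequent steps (102 onward) operate frame by frame on this sequence.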
The method continues with a step 102 in which a discrete difference function Di(τ) relative to the frame i is calculated as follows: - each frame i is subdivided into K subframes of length H, with the following relation: K = ⌊(N - max(τ)) / H⌋, where ⌊·⌋ represents the integer rounding (floor) operator, so that the samples of the discrete acoustic signal {xi} in a subframe of index p of the frame i comprise the following samples: x(i-1)N+(p-1)H+1, x(i-1)N+(p-1)H+2, ..., x(i-1)N+pH, p a positive integer between 1 and K; then - for each subframe of index p, the following difference function ddp(τ) is calculated: ddp(τ) = Σ (for j = (i-1)N+(p-1)H+1 to (i-1)N+pH) |xj - xj+τ|; - the discrete difference function Di(τ) relative to the frame i is calculated as the sum of the difference functions ddp(τ) of the subframes of index p of the frame i, namely: Di(τ) = Σ (for p = 1 to K) ddp(τ). It is also possible that step 102 includes calculating a discrete normalized difference function DNi(τ) from the discrete difference function Di(τ), as follows: DNi(τ) = 1 if τ = 0, and DNi(τ) = Di(τ) / [(1/τ) Σ (for j = 1 to τ) Di(j)] if τ ≠ 0. The process continues with a step 103 in which, for each frame i: a) - the frame i comprising N sample points is subdivided into T subframes of length L, where N is a multiple of T so that the length L = N/T is an integer, and so that the samples of the discrete acoustic signal {xi} in a subframe of index j of the frame i comprise the following L samples: x(i-1)N+(j-1)L+1, x(i-1)N+(j-1)L+2, ..., x(i-1)N+jL, j a positive integer between 1 and T; b) - the maximum values mi,j of the discrete acoustic signal {xi} are calculated in each subframe of index j of the frame i, with: mi,j = max{x(i-1)N+(j-1)L+1, x(i-1)N+(j-1)L+2, ..., x(i-1)N+jL}. By way of example, each frame i of length 240 (i.e. N = 240) is subdivided into four subframes j of length 60 (T = 4, and L = 60).
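The difference function of step 102 and its normalized variant can be sketched as follows (a NumPy illustration assuming one frame is passed in with at least K·H + max(τ) samples; the function name is illustrative). DNi(τ) dips toward 0 at lags τ that match the fundamental period, which is what the later minimum search exploits:

```python
import numpy as np

def difference_functions(frame, tau_max, h):
    """Sketch of step 102: per-subframe difference functions dd_p(tau)
    summed into D_i(tau), then the cumulative-mean-normalized DN_i(tau)."""
    n = len(frame)
    k = (n - tau_max) // h                 # K = floor((N - max(tau)) / H)
    d = np.zeros(tau_max + 1)
    for tau in range(tau_max + 1):
        for p in range(k):                 # subframe p spans H samples
            lo, hi = p * h, (p + 1) * h
            d[tau] += np.abs(frame[lo:hi] - frame[lo + tau:hi + tau]).sum()
    dn = np.ones(tau_max + 1)              # DN_i(0) = 1 by convention
    running_mean = np.cumsum(d[1:]) / np.arange(1, tau_max + 1)
    dn[1:] = d[1:] / running_mean          # DN_i(tau) = D_i(tau) / mean of D_i(1..tau)
    return d, dn
```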
Then, in a step 104, the smoothed envelopes m̄i,j of the maxima mi,j are calculated in each subframe of index j of the frame i, defined as: m̄i,j = λ·m̄i,j-1 + (1 - λ)·mi,j, where λ is a predefined coefficient between 0 and 1. Then, in a step 105, the variation signals Δi,j are calculated in each subframe of index j of the frame i, defined by: Δi,j = mi,j - m̄i,j = λ(mi,j - m̄i,j-1). Then, in a step 106, the normalized variation signals Δ'i,j are calculated, defined by: Δ'i,j = (mi,j - m̄i,j) / m̄i,j. Then, in a step 107, the maxima of variation si,j are calculated in each subframe of index j of the frame i, where si,j corresponds to the maximum of the variation signal Δi,j calculated on a sliding window of length Lm prior to said subframe j. At this step 107, the length Lm is variable according to whether the subframe j of the frame i corresponds to a period of silence or of presence of speech, with: - Lm = L0 if the subframe j of the frame i corresponds to a period of silence; - Lm = L1 if the subframe j of the frame i corresponds to a speech presence period; with L1 < L0. By way of example, L1 = k1.L and L0 = k0.L, L being, as a reminder, the length of the subframes of index j, and k0, k1 being positive integers with k1 < k0. In addition, the sliding window of length Lm is delayed by Mm frames of length N with respect to said subframe j. During this step 107, the normalized maxima of variation s'i,j are also calculated in each subframe of index j of the frame i, with s'i,j = si,j / m̄i,j. It is conceivable to calculate the normalized maxima of variation s'i,j according to a minimization method comprising the following iterative steps: - calculation of s'i,j = max{s'i-1,j, Δ'i-Mm,j} and s̄'i,j = max{s̄'i-1,j, Δ'i-Mm,j}; - if rem(i, Lm) = 0, where rem is the remainder operator of the integer division of two integers, then s'i,j = s̄'i,j and s̄'i,j is reinitialized; end if; with s'0,1 = 0 and s̄'0,1 = 0.
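Steps 103 to 106 above can be sketched as follows (a NumPy illustration; the function names and the value λ = 0.9 are assumptions, the text only requiring λ in (0, 1)):

```python
import numpy as np

def subframe_maxima(frame, t=4):
    """Step 103 sketch: split a frame of N samples into T subframes
    (N a multiple of T) and take the per-subframe maxima m_{i,j}."""
    return np.asarray(frame).reshape(t, -1).max(axis=1)

def smooth_and_variations(maxima, lam=0.9):
    """Steps 104-106 sketch over a flat sequence of subframe maxima:
    smoothed envelope m̄ = lam*m̄_prev + (1-lam)*m, variation Δ = m - m̄
    (equal to lam*(m - m̄_prev)), normalized variation Δ' = Δ / m̄."""
    mbar = np.zeros(len(maxima))
    prev = 0.0
    for idx, m in enumerate(maxima):
        prev = lam * prev + (1.0 - lam) * m
        mbar[idx] = prev
    delta = np.asarray(maxima) - mbar
    return mbar, delta, delta / mbar
```

A rising amplitude yields a positive variation signal Δ, which is what the later analyses use to spot speech onsets against the smoothed noise envelope.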
Then, in a step 108, the variation deviations δi,j are calculated in each subframe of index j of the frame i, defined by: δi,j = Δi,j - si,j. In this same step 108, the normalized variation deviations δ'i,j are calculated in each subframe of index j of the frame i, defined by: δ'i,j = δi,j / m̄i,j = Δ'i,j - s'i,j. Next, in a step 109, the maxima of maximum qi,j are calculated in each subframe of index j of the frame i, where qi,j is the maximum of the maximum value mi,j calculated on a sliding window of fixed length Lq prior to said subframe j, where the sliding window of length Lq is delayed by Mq frames of length N with respect to said subframe j. Advantageously, Lq > L0, and in particular Lq = kq.L with kq a positive integer and kq > k0. Moreover, we have Mq > Mm. During this step 109, it is conceivable to calculate the maxima of maximum qi,j according to a minimization method comprising the following iterative steps: - calculation of qi,j = max{qi-1,j, mi-Mq,j} and q̄i,j = max{q̄i-1,j, mi-Mq,j}; - if rem(i, Lq) = 0, where rem is the remainder operator of the integer division of two integers, then qi,j = q̄i,j and q̄i,j is reinitialized; end if; with q0,1 = 0 and q̄0,1 = 0. Then, in a step 110, the value of the threshold Ω specific to each frame i is established, among several fixed values Ωa, Ωb, Ωc, etc. In a finer way, the values of the sub-thresholds Ωi,j specific to each subframe j of the frame i are established, the threshold Ω being divided into several sub-thresholds Ωi,j. By way of example, each threshold or sub-threshold Ωi,j takes a fixed value chosen from six fixed values Ωa, Ωb, Ωc, Ωd, Ωe, Ωf, these fixed values being for example between 0.05 and 1, and in particular between 0.1 and 0.7.
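The periodically reset running maximum used for si,j and qi,j above can be sketched as follows. This two-accumulator reading of the rem(i, Lm) = 0 reset is an assumption about the iterative procedure: each output is the maximum over roughly the last Lm to 2·Lm values, without storing the window:

```python
def periodic_reset_max(values, lm):
    """Sketch of the windowed maximum used for s_{i,j} and q_{i,j}:
    two running maxima are kept and one is restarted every lm steps."""
    out = []
    prev_block = 0.0   # max over the last completed block of lm values
    cur_block = 0.0    # max over the current, partial block
    for i, v in enumerate(values, start=1):
        cur_block = max(cur_block, v)
        out.append(max(prev_block, cur_block))
        if i % lm == 0:                # rem(i, lm) == 0: restart one accumulator
            prev_block, cur_block = cur_block, 0.0
    return out
```

An early peak therefore stops influencing the output once two full blocks have elapsed, which is the behavior a true sliding-window maximum would give at coarser granularity.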
Each threshold or sub-threshold Ωi,j is fixed at a fixed value Ωa, Ωb, Ωc, Ωd, Ωe, Ωf by the implementation of two analyses: - first analysis: the comparison of the values of the pair (Δ'i,j, δ'i,j) in the subframe of index j of the frame i with several pairs of fixed thresholds; - second analysis: the comparison of the maxima of maximum qi,j in the subframe of index j of the frame i with fixed thresholds. Following these two analyses, a so-called decision procedure will give the final decision on the presence of the voice in the frame i. This decision procedure comprises the following substeps, for each frame i: - for each subframe j of the frame i, a decision index DECi(j) is established which occupies either a state "1" of detection of a speech signal or a state "0" of non-detection of a speech signal; - a temporary decision VAD(i) is established based on the combination of the decision indices DECi(j) with logical "OR" operators, so that the temporary decision VAD(i) occupies a state "1" of detection of a speech signal if at least one of said decision indices DECi(j) occupies this state "1" of detection of a speech signal, that is to say the following relation: VAD(i) = DECi(1) + DECi(2) + ... + DECi(T), where "+" is the "OR" operator. Thus, as a function of the comparisons made during the first and second analyses, and according to the state of the temporary decision VAD(i), the threshold Ω is fixed at one of the fixed values Ωa, Ωb, Ωc, Ωd, Ωe, Ωf, and the final decision is deduced by comparing the minimum rr(i) with the threshold Ω fixed at one of its fixed values (see description below). In many cases, the false detections arrive with an amplitude lower than that of the speech signal, the microphone being located next to the mouth of the user.
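The OR-combination of the subframe decision indices can be sketched in one line (illustrative helper name):

```python
def temporary_decision(dec_indices):
    """Sketch of VAD(i) = DEC_i(1) OR ... OR DEC_i(T): the frame is
    provisionally flagged as speech if any subframe index is 1."""
    return 1 if any(dec_indices) else 0
```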
Taking this fact into account, it is conceivable to further eliminate false detections by storing the maximum threshold value Lastmax deduced from the speech signal in the last activation period of the "VAD" and by adding to the process a condition based on this maximum threshold value Lastmax. Thus, in the step 109 described above, the storage of the maximum threshold value Lastmax is added, which corresponds to the variable (or updated) value of a comparison threshold for the amplitude of the discrete acoustic signal {xi}, below which it is considered that the acoustic signal does not comprise a speech signal, this variable value being determined during the last frame of index k preceding said frame i and in which the temporary decision VAD(k) had a state "1" of detection of a speech signal. In this step 109, a mean maximum value Ai,j is also stored, which corresponds to the average maximum value of the discrete acoustic signal {xi} in the subframe j of the frame i, calculated as follows: Ai,j = θ·Ai,j-1 + (1 - θ)·ai,j, where ai,j corresponds to the maximum of the discrete acoustic signal {xi} contained in the theoretical frame formed by the subframe j of the frame i and by at least one or more successive subframes that precede said subframe j; and θ is a predefined coefficient between 0 and 1 with θ < λ. In this step 109, the maximum threshold value Lastmax is refreshed each time the method has considered that a subframe p of a frame k contains a speech signal, by implementing the following procedure: - the detection of a speech signal in the subframe p of the frame k follows a period of absence of speech, and in this case Lastmax takes the updated value [α·(Ak,p + Lastmax)], where α is a predefined coefficient between 0 and 1, for example between 0.2 and 0.7; - the detection of a speech signal in the subframe p of the frame k follows a speech presence period, and in this case Lastmax takes the updated value Ak,p if Ak,p > Lastmax.
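The Lastmax refresh rule can be sketched as follows (the default α = 0.5 is an assumed value inside the 0.2 to 0.7 range the text suggests; the function name is illustrative):

```python
def update_lastmax(lastmax, a_kp, after_silence, alpha=0.5):
    """Sketch of the Lastmax refresh when subframe p of frame k is
    declared speech."""
    if after_silence:
        # Detection follows a period of absence of speech.
        return alpha * (a_kp + lastmax)
    # Detection follows a speech presence period: Lastmax only rises.
    return max(lastmax, a_kp)
```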
Then, in the step 110 described above, a condition based on the maximum threshold value Lastmax is added to set the threshold Ω. For each frame i, this condition is based on the comparison between: - the maximum threshold value Lastmax, and - the values [Kp·Ai,j] and [Kp·Ai,j-1], where Kp is a fixed weighting coefficient of between 1 and 2. It is also conceivable to lower the maximum threshold value Lastmax after a given delay period (for example fixed between a few seconds and a few tens of seconds) between the frame i and the last frame of index k, to avoid the non-detection of speech if the user/speaker lowers the amplitude of his voice significantly. Then, in a step 111, for each current frame i, the minimum rr(i) of a discrete detection function FDi(τ) is calculated, in which the discrete detection function FDi(τ) corresponds to either the discrete difference function Di(τ) or the discrete normalized difference function DNi(τ).
[0005] Finally, in a last step 112, for each current frame i, this minimum rr(i) is compared with the threshold Ω specific to the frame i, in order to detect the presence or absence of a speech signal (i.e. a voiced signal), with: - if rr(i) ≤ Ω, then the frame i is considered to contain a speech signal and the method delivers an output signal DFi taking the value "1" (that is, the final decision for the frame i is "presence of voice in the frame i"); - if rr(i) > Ω, then the frame i is considered as not containing a speech signal and the method delivers an output signal DFi taking the value "0" (that is, the final decision for the frame i is "no voice in the frame i"). With reference to FIGS. 1 and 2, it is conceivable to provide an improvement to the method by introducing an additional decision blocking step 113 (or hangover step), in order to avoid the interruption of sound within a sentence and during the pronunciation of words, this decision blocking step 113 reinforcing the decision of presence/absence of voice by implementing the following two steps: - switching from a state of non-detection of a speech signal to a state of detection of a speech signal after detecting the presence of a speech signal on Np successive time frames; - switching from a state of detection of a speech signal to a state of non-detection of a speech signal after detecting no presence of a voiced signal on NA successive time frames.
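Steps 111 and 112 reduce to a minimum search followed by a threshold test, which can be sketched as follows (NumPy assumed; τ = 0 is excluded from the minimum since DNi(0) = 1 by convention):

```python
import numpy as np

def frame_decision(fd_values, omega):
    """Steps 111-112 sketch: rr(i) is the minimum of the discrete
    detection function FD_i(tau) over the frame, compared with the
    frame's adaptive threshold omega; DF_i = 1 flags presence of speech."""
    rr = float(np.min(fd_values[1:]))   # exclude tau = 0
    df = 1 if rr <= omega else 0
    return df, rr
```

A deep dip in FDi(τ) (strong periodicity at some lag, hence a fundamental frequency F0) drives rr(i) under the threshold and flags the frame as speech.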
Thus, this blocking step 113 makes it possible to output a voice detection decision signal Dv which takes the value "1" corresponding to a decision of detection of the voice and the value "0" corresponding to a decision of non-detection of the voice, where: - the voice detection decision signal Dv switches from a state "1" to a state "0" if and only if the output signal DFi takes the value "0" over NA successive time frames i; and - the voice detection decision signal Dv switches from a state "0" to a state "1" if and only if the output signal DFi takes the value "1" over Np successive time frames i. With reference to FIG. 2, assuming that we start from a state "Dv = 1", we switch to a state "Dv = 0" if the output signal DFi takes the value "0" over NA successive frames, otherwise the state remains at "Dv = 1" (Ni representing the number of the frame at the beginning of the series). Similarly, if we assume that we start from a state "Dv = 0", we switch to a state "Dv = 1" if the output signal DFi takes the value "1" over Np successive frames, otherwise the state remains at "Dv = 0". The final decision applies to the first H samples of the processed frame. Preferably NA is greater than Np, with for example NA = 100 and Np = 3, because it is better to risk detecting silence rather than cutting a conversation. The remainder of the description relates to two voice detection results obtained with a conventional method using a fixed threshold (FIG. 3) and with the method according to the invention using an adaptive threshold (FIG. 4). In FIGS. 3 and 4 (bottom), it will be noted that the two methods work on the same discrete acoustic signal {xi}, with the amplitude as ordinate and the samples as abscissa. This discrete acoustic signal {xi} has a single zone of presence of speech "PAR", and many zones of presence of noise such as music, drums, crowd cries and whistles.
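The hangover behavior of step 113 is a small two-state machine, which can be sketched as follows (illustrative class name; the defaults NA = 100 and Np = 3 are the example values from the text):

```python
class Hangover:
    """Step 113 sketch: the decision signal Dv flips 0 -> 1 only after
    Np consecutive frames with DF_i = 1, and 1 -> 0 only after NA
    consecutive frames with DF_i = 0 (NA >> Np)."""

    def __init__(self, np_frames=3, na_frames=100):
        self.np_frames, self.na_frames = np_frames, na_frames
        self.dv = 0     # current decision state
        self.run = 0    # length of the current run of opposing DF values

    def step(self, df):
        self.run = self.run + 1 if df != self.dv else 0
        needed = self.np_frames if self.dv == 0 else self.na_frames
        if self.run >= needed:
            self.dv, self.run = df, 0
        return self.dv
```

Because NA is much larger than Np, the machine is quick to open on speech and slow to close on silence, which is exactly the "better to risk detecting silence than cut a conversation" trade-off stated above.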
This discrete acoustic signal {xi} reflects an environment representative of interpersonal communication (such as between referees) within a stadium or gymnasium, where the noise is relatively loud and highly non-stationary.
[0006] In FIGS. 3 and 4 (top), it should be noted that the two methods exploit the same minimum function rr(i), corresponding, as a reminder, to the minimum of the selected discrete detection function FDi(τ). In FIG. 3 (top), the minimum function rr(i) is compared to a fixed threshold Ωfix optimally selected for voice detection. In FIG. 3 (bottom), we note the shape of the output signal DFi, which occupies a state "1" if rr(i) ≤ Ωfix and a state "0" if rr(i) > Ωfix. In FIG. 4 (top), the minimum function rr(i) is compared with an adaptive threshold Ω calculated according to the steps described above with reference to FIG. 1. In FIG. 4 (bottom), we note the shape of the output signal DFi, which occupies a state "1" if rr(i) ≤ Ω and a state "0" if rr(i) > Ω. It is noted in FIG. 3 that the conventional method allows detection of the voice in the speech presence zone "PAR", with the output signal DFi occupying a state "1", but that this same output signal DFi several times occupies a state "1" in the other zones where speech is nevertheless absent, which corresponds to undesired false detections with the conventional method. On the other hand, it is noted in FIG. 4 that the method according to the invention allows an optimal detection of the voice in the speech presence zone "PAR", with the output signal DFi occupying a state "1", and that this same output signal DFi occupies a state "0" in the other zones where speech is absent. Thus, the method according to the invention provides voice detection with a sharp reduction in the number of false detections. Of course, the implementation example mentioned above is not limiting in nature, and further improvements and details can be made to the process according to the invention without departing from the scope of the invention; other algorithms for calculating the detection function FD(τ) can for example be used.
Claims (27)
[0001]
1. A voice detection method for detecting the presence of speech signals in a noisy acoustic signal x(t) from a microphone, comprising the following steps: - calculating a detection function FD(τ) based on the computation of a difference function D(τ) varying as a function of the offset τ on an integration window of length W beginning at the time t0, with: D(τ) = Σ (for n = t0 to t0+W-1) |x(n) - x(n+τ)|, where 0 ≤ τ ≤ max(τ); - searching for the minimum of the detection function FD(τ) and comparing this minimum with a threshold, for τ varying in a determined time interval called the current interval, to detect the presence or absence of a fundamental frequency F0 characteristic of a speech signal in said current interval; said method being characterized in that it comprises, before the searching and comparing step, a step of adapting the threshold in said current interval, as a function of values calculated from the acoustic signal x(t) established in said current interval, and in particular maximum values of said acoustic signal x(t).
[0002]
2. The detection method according to claim 1, wherein the detection function FD(τ) corresponds to the difference function D(τ).
[0003]
3. The detection method according to claim 1, wherein the detection function FD(τ) corresponds to the normalized difference function DN(τ) calculated from the difference function D(τ) as follows: DN(τ) = 1 if τ = 0, and DN(τ) = D(τ) / [(1/τ) Σ (for j = 1 to τ) D(j)] if τ ≠ 0.
[0004]
4. The method according to any one of the preceding claims, comprising a preliminary sampling step comprising a cutting of the acoustic signal x(t) into a discrete acoustic signal {xi} composed of a sequence of vectors associated with time frames i of length N, N corresponding to the number of sampling points, where each vector translates the acoustic content of the associated frame i and is composed of the N samples x(i-1)N+1, x(i-1)N+2, ..., xiN-1, xiN, i a positive integer, so that: - the calculation of the detection function FD(τ) consists of a calculation of a discrete detection function FDi(τ) associated with the frames i; - the adaptation of the threshold consists of, for each frame i, adapting a threshold Ω specific to the frame i as a function of reference values calculated from the values of the samples of the discrete acoustic signal {xi} in said frame i; - the search for the minimum of the detection function FD(τ) and the comparison of this minimum with a threshold are carried out by searching, on each frame i, the minimum rr(i) of the discrete detection function FDi(τ) and comparing this minimum rr(i) with the threshold Ω specific to the frame i.
[0005]
5. The method according to claim 4, wherein the discrete difference function Di(τ) relative to the frame i is calculated as follows: - the frame i is subdivided into K subframes of length H, with for example K = ⌊(N - max(τ)) / H⌋, where ⌊·⌋ represents the integer rounding (floor) operator, so that the samples of the discrete acoustic signal {xi} in a subframe of index p of the frame i comprise the samples: x(i-1)N+(p-1)H+1, x(i-1)N+(p-1)H+2, ..., x(i-1)N+pH, p a positive integer between 1 and K; - for each subframe of index p, the following difference function ddp(τ) is calculated: ddp(τ) = Σ (for j = (i-1)N+(p-1)H+1 to (i-1)N+pH) |xj - xj+τ|; - the discrete difference function Di(τ) relative to the frame i is calculated as the sum of the difference functions ddp(τ) of the subframes of index p of the frame i, namely: Di(τ) = Σ (for p = 1 to K) ddp(τ).
[0006]
6. The method according to claims 3 and 5, wherein the calculation of the normalized difference function DN(τ) consists of a calculation of a discrete normalized difference function DNi(τ) associated with the frames i, where: DNi(τ) = 1 if τ = 0, and DNi(τ) = Di(τ) / [(1/τ) Σ (for j = 1 to τ) Di(j)] if τ ≠ 0.
[0007]
7. The method according to any one of claims 4 to 6, wherein the step of adapting the threshold Ω for each frame i comprises the following steps: a) - the frame i comprising N sampling points is divided into T subframes of length L, where N is a multiple of T so that the length L = N/T is an integer, and so that the samples of the discrete acoustic signal {xi} in a subframe of index j of the frame i comprise the following L samples: x(i-1)N+(j-1)L+1, x(i-1)N+(j-1)L+2, ..., x(i-1)N+jL, j a positive integer between 1 and T; b) - the maximum values mi,j of the discrete acoustic signal {xi} are calculated in each subframe of index j of the frame i, with: mi,j = max{x(i-1)N+(j-1)L+1, x(i-1)N+(j-1)L+2, ..., x(i-1)N+jL}; c) - at least one reference value Refi,j, MRefi,j specific to the subframe j of the frame i is calculated, the or each reference value Refi,j, MRefi,j per subframe j being calculated from the maximum value mi,j in the subframe j of the frame i; d) - the value of the threshold Ω specific to the frame i is established as a function of all the reference values Refi,j, MRefi,j calculated in the subframes j of the frame i.
[0008]
8. The method according to claim 7, wherein, in step c), the following substeps are performed on each frame i: c1) - the smoothed envelopes m̄i,j of the maxima are calculated in each subframe of index j of the frame i, with: m̄i,j = λ·m̄i,j-1 + (1 - λ)·mi,j, where λ is a predefined coefficient between 0 and 1; c2) - the variation signals Δi,j are calculated in each subframe of index j of the frame i, with: Δi,j = mi,j - m̄i,j = λ(mi,j - m̄i,j-1); and where at least one so-called principal reference value Refi,j per subframe j is calculated from the variation signal Δi,j in the subframe j of the frame i.
[0009]
9. The method according to claim 8, wherein, in step c) and following substep c2), the following substeps are performed on each frame i: c3) - the maxima of variation si,j are calculated in each subframe of index j of the frame i, where si,j corresponds to the maximum of the variation signal Δi,j calculated on a sliding window of length Lm prior to said subframe j, said length Lm being variable according to whether the subframe j of the frame i corresponds to a period of silence or of presence of speech; c4) - the variation deviations δi,j are calculated in each subframe of index j of the frame i, with: δi,j = Δi,j - si,j; and wherein, for each subframe j of the frame i, two principal reference values Refi,j are calculated from the variation signal Δi,j and the variation deviation δi,j, respectively.
[0010]
10. The method according to claim 9, wherein, in step c) and following substep c4), a sub-step c5) for calculating the normalized variation signals A'i j is carried out. and standardized deviations Es'i in each subscript of index j of the frame i, as follows: and where, for each subframe j of a frame i, the normalized variation signal j and the standardized variation deviation b '; each constitute a main reference value Refi j so that, during step d), the value of the threshold Cl specific to the frame i is set as a function of the torque (A'i j, j) of the standardized variation signals A'i j and standardized deviations of variation b ';,; in the 30 subframes j of the frame i. 1.1 m1.1 "30142 3 7 29
[0011]
11. The method according to claim 10, wherein, in step d), the value of the threshold Ω specific to the frame i is established by partitioning the space defined by the values of the pair (Δ'i,j, δ'i,j), and by examining the value of the pair (Δ'i,j, δ'i,j) on one or more successive subframes according to the value area of the pair (Δ'i,j, δ'i,j).
[0012]
12. The method according to claim 9, wherein, during substep c3), the length Lm of the sliding window satisfies the following relations: - Lm = L0 if the subframe j of the frame i corresponds to a period of silence; - Lm = L1 if the subframe j of the frame i corresponds to a period of presence of speech; with L1 < L0, and in particular with L1 = k1.L and L0 = k0.L, where L is the length of the subframes of index j and k0, k1 are positive integers.
[0013]
13. The method according to claim 10, wherein, during substep c3), for each calculation of the maximum of variation si,j in the subframe j of the frame i, the sliding window of length Lm is delayed by Mm frames of length N with respect to said subframe j.
[0014]
14. The method according to claims 10 and 13, wherein, during substep c3), the normalized maxima of variation s'i,j are also calculated in each subframe of index j of the frame i, where s'i,j is the maximum of the normalized variation signal Δ'i,j calculated on a sliding window of length Lm prior to said subframe j, with s'i,j = si,j / m̄i,j, and where each normalized maximum of variation s'i,j is calculated according to a minimization method comprising the following iterative steps: - calculation of s'i,j = max{s'i-1,j, Δ'i-Mm,j} and s̄'i,j = max{s̄'i-1,j, Δ'i-Mm,j}; - if rem(i, Lm) = 0, where rem is the remainder operator of the integer division of two integers, then s'i,j = s̄'i,j and s̄'i,j is reinitialized; end if; with s'0,1 = 0 and s̄'0,1 = 0; and wherein, in step c4), the normalized variation deviations δ'i,j are calculated in each subframe of index j of the frame i, as follows: δ'i,j = Δ'i,j - s'i,j.
[0015]
15. The method according to any one of claims 8 to 14, wherein, in step c), a substep c6) is performed in which the maxima of maximum qi,j are calculated in each subframe of index j of the frame i, where qi,j corresponds to the maximum of the maximum value mi,j calculated on a sliding window of fixed length Lq prior to said subframe j, where the sliding window of length Lq is delayed by Mq frames of length N with respect to said subframe j, and wherein another so-called secondary reference value MRefi,j per subframe j corresponds to said maximum of maximum qi,j in the subframe j of the frame i.
[0016]
16. The method according to any one of claims 8 to 15, wherein, in step d), the threshold Ω specific to the frame i is divided into several sub-thresholds Ωi,j specific to each subframe j of the frame i, and the value of each sub-threshold Ωi,j is at least established as a function of the reference value or values Refi,j, MRefi,j calculated in the subframe j of the corresponding frame i.
[0017]
17. The method according to claims 10 and 16, wherein, in step d), the value of each sub-threshold Ωi,j specific to the subframe j of the frame i is determined by comparing the values of the pair (Δ'i,j, δ'i,j) with several pairs of fixed thresholds, the value of each sub-threshold Ωi,j being selected from a plurality of fixed values as a function of the comparisons of the pair (Δ'i,j, δ'i,j) with said pairs of fixed thresholds.
[0018]
18. The method according to any one of claims 8 to 17, wherein, in step d), a so-called decision procedure is performed comprising the following substeps for each frame i: - for each subframe j of the frame i, a decision index DECi(j) is established which occupies either a state "1" of detection of a speech signal or a state "0" of non-detection of a speech signal; - a temporary decision VAD(i) is established based on the combination of the decision indices DECi(j) with logical "OR" operators, so that the temporary decision VAD(i) occupies a state "1" of detection of a speech signal if at least one of said decision indices DECi(j) occupies this state "1" of detection of a speech signal.
[0019]
19. The method according to claims 16 and 18, wherein, during the decision procedure, the following substeps are carried out for each frame i: - a maximum threshold value Lastmax is stored which corresponds to the variable value of a comparison threshold for the amplitude of the discrete acoustic signal {xi} below which it is considered that the acoustic signal does not comprise a speech signal, this variable value being determined during the last frame of index k which precedes said frame i and in which the temporary decision VAD(k) had a state "1" of detection of a speech signal; - a mean maximum value Ai,j is stored which corresponds to the average maximum value of the discrete acoustic signal {xi} in the subframe j of the frame i, calculated as follows: Ai,j = θ·Ai,j-1 + (1 - θ)·ai,j, where ai,j corresponds to the maximum of the discrete acoustic signal {xi} contained in a frame formed by the subframe j of the frame i and by at least one or more successive subframes that precede said subframe j, and θ is a predefined coefficient between 0 and 1 with θ < λ; - the value of each sub-threshold Ωi,j is established as a function of the comparison between said maximum threshold value Lastmax and the mean maximum values Ai,j and Ai,j-1 considered over two successive subframes j and j-1.
[0020]
20. The method according to claim 19, wherein, during the decision procedure, the maximum threshold value Lastmax is updated each time the method has considered that a subframe p of a frame k contains a speech signal, by carrying out the following procedure: - the detection of a speech signal in the subframe p of the frame k follows a period of absence of speech, and in this case Lastmax takes the updated value [α·(Ak,p + Lastmax)], where α is a predefined coefficient between 0 and 1, and for example between 0.2 and 0.7; - the detection of a speech signal in the subframe p of the frame k follows a period of presence of speech, and in this case Lastmax takes the updated value Ak,p if Ak,p > Lastmax.
[0021]
21. The method according to claim 19 or 20, wherein the value of the threshold Ω is established as a function of said maximum threshold value Lastmax based on the comparison between: - the maximum threshold value Lastmax; and - the values [Kp·Ai,j] and [Kp·Ai,j-1], where Kp is a fixed weighting coefficient between 1 and 2.
[0022]
22. The method according to any one of claims 4 to 21, further comprising a so-called blocking phase comprising a step of switching from a state of non-detection of a speech signal to a state of detection of a speech signal after detecting the presence of a speech signal on Np successive time frames i.
[0023]
23. The method according to any one of claims 4 to 22, further comprising a so-called blocking phase comprising a step of switching from a state of detection of a speech signal to a state of non-detection of a speech signal after detecting no presence of a voiced signal on NA successive time frames i.
[0024]
24. The method according to any one of claims 22 and 23, further comprising a step of interrupting the blocking phase in decision areas occurring at the end of words and in a non-noisy situation, said decision areas being detected by analyzing the minimum rr(i) of the discrete detection function FDi(τ).
[0025]
25. Computer program, characterized in that it comprises code instructions able to control the execution of the steps of the voice detection method according to any one of the preceding claims when executed by a processor.
[0026]
26. Data recording medium on which the computer program according to claim 25 is stored.
[0027]
27. Provision of a program according to claim 25 on a telecommunication network for download.
类似技术:
公开号 | 公开日 | 专利标题
JP6694426B2|2020-05-13|Neural network voice activity detection using running range normalization
KR100636317B1|2006-10-18|Distributed Speech Recognition System and method
KR101060533B1|2011-08-30|Systems, methods and apparatus for detecting signal changes
EP0867856A1|1998-09-30|Method and apparatus for vocal activity detection
EP3078027B1|2018-05-23|Voice detection method
Palomäki et al.2004|Techniques for handling convolutional distortion withmissing data'automatic speech recognition
EP2596496B1|2016-10-26|A reverberation estimator
US20110029310A1|2011-02-03|Procedure for processing noisy speech signals, and apparatus and computer program therefor
FR3002679A1|2014-08-29|METHOD FOR DEBRUCTING AN AUDIO SIGNAL BY A VARIABLE SPECTRAL GAIN ALGORITHM HAS DYNAMICALLY MODULABLE HARDNESS
WO2019232867A1|2019-12-12|Voice discrimination method and apparatus, and computer device, and storage medium
US9928852B2|2018-03-27|Method of detecting a predetermined frequency band in an audio data signal, detection device and computer program corresponding thereto
WO2003048711A2|2003-06-12|Speech detection system in an audio signal in noisy surrounding
US20190057705A1|2019-02-21|Methods and apparatus to identify a source of speech captured at a wearable electronic device
EP3192073B1|2018-08-01|Discrimination and attenuation of pre-echoes in a digital audio signal
Kumar2018|Comparative performance evaluation of MMSE-based speech enhancement techniques through simulation and real-time implementation
WO2000031728A1|2000-06-02|Speech recognition method in a noisy acoustic signal and implementing system
KR100766170B1|2007-10-10|Music summarization apparatus and method using multi-level vector quantization
CN110827795A|2020-02-21|Voice input end judgment method, device, equipment, system and storage medium
EP3627510A1|2020-03-25|Filtering of an audio signal acquired by a voice recognition system
FR2988894A1|2013-10-04|Method for detection of voice to detect presence of word signals in disturbed signal output from microphone, involves comparing detection function with phi threshold for detecting presence or absence of fundamental frequency
Chelloug et al.2016|An efficient VAD algorithm based on constant False Acceptance rate for highly noisy environments
WO2011012789A1|2011-02-03|Source location
Chermaz et al.2021|Compressed Representation of Cepstral Coefficients via Recurrent Neural Networks for Informed Speech Enhancement
FR2997250A1|2014-04-25|DETECTION OF A PREDETERMINED FREQUENCY BAND IN AUDIO CONTENT CODED BY SUB-BANDS ACCORDING TO PULSE-MODULATION-TYPE CODING
WO2006117453A1|2006-11-09|Method for attenuation of the pre- and post-echoes of a digital audio signal and corresponding device
Patent family:
Publication number | Publication date
WO2015082807A1|2015-06-11|
ES2684604T3|2018-10-03|
EP3078027B1|2018-05-23|
FR3014237B1|2016-01-08|
CA2932449A1|2015-06-11|
US9905250B2|2018-02-27|
CN105900172A|2016-08-24|
EP3078027A1|2016-10-12|
US20160284364A1|2016-09-29|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title
US20090076814A1|2007-09-19|2009-03-19|Electronics And Telecommunications Research Institute|Apparatus and method for determining speech signal|
FR2988894A1|2012-03-30|2013-10-04|Adeunis R F|Method for detection of voice to detect presence of word signals in disturbed signal output from microphone, involves comparing detection function with phi threshold for detecting presence or absence of fundamental frequency|
FR2825505B1|2001-06-01|2003-09-05|France Telecom|METHOD FOR EXTRACTING THE BASIC FREQUENCY OF A SOUND SIGNAL BY MEANS OF A DEVICE IMPLEMENTING A SELF-CORRELATION ALGORITHM|
FR2899372B1|2006-04-03|2008-07-18|Adeunis Rf Sa|WIRELESS AUDIO COMMUNICATION SYSTEM|
US8812313B2|2008-12-17|2014-08-19|Nec Corporation|Voice activity detector, voice activity detection program, and parameter adjusting method|
FR2947124B1|2009-06-23|2012-01-27|Adeunis Rf|TEMPORAL MULTIPLEXING COMMUNICATION METHOD|
FR2947122B1|2009-06-23|2011-07-22|Adeunis Rf|DEVICE FOR ENHANCING SPEECH INTELLIGIBILITY IN A MULTI-USER COMMUNICATION SYSTEM|
US8949118B2|2012-03-19|2015-02-03|Vocalzoom Systems Ltd.|System and method for robust estimation and tracking the fundamental frequency of pseudo periodic signals in the presence of noise|
FR3014237B1|2013-12-02|2016-01-08|Adeunis R F|METHOD OF DETECTING THE VOICE|
US10621980B2|2017-03-21|2020-04-14|Harman International Industries, Inc.|Execution of voice commands in a multi-device system|
CN107248046A|2017-08-01|2017-10-13|中州大学|A kind of moral and political science Classroom Teaching device and method|
JP6904198B2|2017-09-25|2021-07-14|富士通株式会社|Speech processing program, speech processing method and speech processor|
Legal status:
2015-10-15| PLFP| Fee payment|Year of fee payment: 3 |
2016-10-27| PLFP| Fee payment|Year of fee payment: 4 |
2017-09-18| PLFP| Fee payment|Year of fee payment: 5 |
2018-10-30| PLFP| Fee payment|Year of fee payment: 6 |
2020-10-16| ST| Notification of lapse|Effective date: 20200914 |
2021-02-26| RG| Lien (pledge) cancelled|Effective date: 20210114 |
Priority:
Application number | Filing date | Patent title
FR1361922A|FR3014237B1|2013-12-02|2013-12-02|METHOD OF DETECTING THE VOICE|
PCT/FR2014/053065| WO2015082807A1|2013-12-02|2014-11-27|Voice detection method|
US15/037,958| US9905250B2|2013-12-02|2014-11-27|Voice detection method|
CA2932449A| CA2932449A1|2013-12-02|2014-11-27|Voice detection method|
EP14814978.4A| EP3078027B1|2013-12-02|2014-11-27|Voice detection method|
CN201480065834.9A| CN105900172A|2013-12-02|2014-11-27|Voice detection method|
ES14814978.4T| ES2684604T3|2013-12-02|2014-11-27|Voice Detection Procedure|