Method and system for generating an optimized voice recognition solution
Patent abstract:
A method for generating an optimized voice recognition solution (105), comprising the steps of: receiving a plurality of speech recognition results (100), each comprising one or more elements (113) and each being a transcription of a message (50) said by a speaker (40); selecting, using a selection rule, one or more elements (113) of one or more results (100) of said plurality of speech recognition results (100); and generating said optimized solution (105) from the at least one selected element (113).
Publication number: BE1023458B1
Application number: E2016/5154
Filing date: 2016-03-02
Publication date: 2017-03-27
Inventor: Jean-Luc Forster
Applicant: Zetes Industries Sa
IPC main class:
Patent description:
Method and system for generating an optimized voice recognition solution

Field of the Invention

[0001] The invention relates to the field of speech recognition. In particular, according to a first aspect, the invention relates to a method for generating an optimized solution in voice recognition. According to a second aspect, the invention relates to a system (or device). According to a third aspect, the invention relates to a program. According to a fourth aspect, the invention relates to a storage medium comprising instructions (for example: a USB key, a CD-ROM or DVD type disk).

State of the art

[0002] A speech recognition engine makes it possible to provide, from a spoken or audio message, a result that is generally in the form of text or code usable by a machine. A speech recognition result can therefore be seen as a transcription of a spoken message. Speech recognition is now widespread and is considered very useful. Various applications of speech recognition are described in US6,754,629B1. A voice recognition result can be used, for example, to enter information into a computer system, for example an article number or an instruction to be carried out. A speech recognition result generally comprises a series of elements, for example words, separated by silences or by time intervals that do not contain words recognized by the speech recognition engine. Such a result is characterized by a beginning and an end, and its elements are temporally arranged between this beginning and this end. Some elements of a speech recognition result may be valid or correct, while others may be invalid or incorrect. For example, if a speaker says the word 'uh', a speech recognition engine may provide, for that spoken word, the word 'two'. In general, a speech recognition engine provides, for a spoken message, several hypotheses. In this case, the speech recognition engine also generally provides a confidence score associated with each hypothesis.
Such a confidence score is usually determined from confidence scores associated with the different elements of the hypotheses. A hypothesis with a high confidence score is a better hypothesis than a hypothesis with a low confidence score.

[0005] From the various hypotheses provided by a voice recognition engine, the one with the highest confidence score is generally selected, in the hope that such a high-confidence hypothesis includes the largest number of valid or correct elements. The inventors have found that such a procedure does not always provide the best performance. In particular, the elements of the selected hypothesis are not necessarily valid. Moreover, the number of valid or correct elements of the hypothesis selected on the basis of confidence scores related to the hypotheses is not always satisfactory. The inventors have therefore sought to develop a post-processing method making it possible to provide, from a plurality of speech recognition results, a better solution, that is to say a post-processed solution whose elements are more likely to be valid or correct.

SUMMARY OF THE INVENTION

[0006] According to a first aspect, one of the aims of the invention is to provide a method for providing a solution whose elements are more likely to be valid. For this purpose, the inventors propose the following method. A method for generating an optimized solution in voice recognition, comprising the steps of:
A. receiving (or obtaining, reading) a plurality of speech recognition results, each comprising one or more elements and each being a transcription of a message said by a speaker;
B. selecting one or more elements belonging to one or more of said plurality of voice recognition results of step A., using a selection rule;
C. generating said optimized solution from the at least one element selected in step B.
Thus, the inventors propose to use the content of the different results of the plurality of results to generate an optimized solution.
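The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the patented implementation: the element representation (dicts with a word, start/end times in seconds, and a confidence rate) and all function names are assumptions made for the example.

```python
# Sketch of steps A-C. Each recognition result is assumed to be a list of
# "elements": dicts with a word, start/end times and a confidence rate.

def generate_optimized_solution(results, selection_rule):
    """A: receive the results; B: apply the selection rule; C: assemble
    the optimized solution from the selected elements."""
    selected = []
    for result in results:                      # step B: apply the rule
        for element in result:
            if selection_rule(element, result, results):
                selected.append(element)
    # step C: order the retained elements along the time scale, t
    selected.sort(key=lambda e: e["start"])
    return [e["word"] for e in selected]

# Example rule: keep an element only if its confidence rate is high enough
# (threshold value is illustrative).
def min_confidence_rule(element, result, results, threshold=5000):
    return element["confidence"] >= threshold

hypotheses = [
    [{"word": "two", "start": 0.0, "end": 0.3, "confidence": 7200}],
    [{"word": "uh",  "start": 0.0, "end": 0.3, "confidence": 1500}],
]
print(generate_optimized_solution(hypotheses, min_confidence_rule))  # ['two']
```

Any of the selection rules described below can be plugged in as `selection_rule`, since each receives the element, its result, and the full plurality of results.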
With the selection rule of step B., one or more elements of one or more of said results that are most likely to be valid can be selected while, at the same time, one or more elements that are likely to be invalid are not kept. The inventors therefore propose to exploit the content of the various results to obtain an optimized solution. The method of the invention is more reliable and more efficient because the elements of said optimized solution are more likely to be valid or correct. The method of the invention has other advantages. It is easy to implement and does not require complex post-processing operations. It is also flexible, because different selection rules can be used in step B., depending for example on the identity of the speaker and/or the type of elements of the results.

[0009] A speech recognition result is generally in the form of text or code that can be used by a machine. An element of a result represents information of the result delimited by two different times along a time scale, t, associated with the result, and which is not considered to be a silence or background noise, for example. In general, two elements of a speech recognition result are separated by a time interval during which the speech recognition engine does not recognize an element. Preferably, an element is a word. Examples of words are: one, two, car, umbrella. According to this preferred variant, the method of the invention gives even better results. Each word is determined, from a message said by a user, by a speech recognition engine using a dictionary. Grammar rules may reduce the choice of possible words from a dictionary. Preferably, a speech recognition result is a hypothesis provided by a voice recognition engine.

[0010] Preferably, an element is a word, for example "a", "two". An element can also represent a group or combination of words. Preferably, the elements of said voice recognition results of step A. belong to the same vocabulary.
By vocabulary, we mean, for example, a set of words. One could also use the word 'lexicon' instead of the word 'vocabulary'. According to this preferred variant, the elements of the results of step A. are, for example, numbers, common nouns, proper nouns, numbers between one hundred and two hundred, or colors. Other examples are nevertheless possible. Preferably, the elements of said voice recognition results of step A. are words. With this preferred variant, the method of the invention is even more reliable and more efficient because the elements of the optimized solution are even more likely to be valid or correct. Preferably, the voice recognition results of step A. are each a transcription of the same message said by said speaker, provided by a voice recognition engine. The speech recognition results then generally represent different hypotheses provided by a voice recognition engine for this message. With this preferred version, one can take advantage of the content of these different hypotheses to obtain an optimized solution. Such a variant is particularly easy to implement because it uses results that are generally provided by a voice recognition engine. According to another possible variant, the voice recognition results of step A. are solutions provided by a voice recognition engine from different repeated messages, the different repeated messages relating to the same content but not being exactly identical. Ideally, these different repeated messages would be strictly identical. But in practice, an operator never says the same content several times in exactly the same way. In particular, the operator may stumble over, or make mistakes on, certain spoken elements of one or more of the repeated messages. Preferably, each of said plurality of speech recognition results of step A. is characterized by a confidence score greater than or equal to a minimum confidence score.
In general, the confidence score associated with each speech recognition result (each hypothesis provided by a voice recognition engine, for example) is provided by a voice recognition engine. In this case, a result with a high confidence score is better than a result with a low confidence score. For this preferred embodiment, there is therefore a preliminary step of selecting one or more results provided by a speech recognition engine that have a confidence score greater than or equal to a minimum confidence score. This has the advantage of further improving the reliability and efficiency of the method of the invention, because the optimized solution is generated from results which have passed a first filter and which are retained only if the (overall) confidence score associated with them is considered sufficient. Preferably, the voice recognition results of step A. are hypotheses provided by a voice recognition engine. Preferably, the different hypotheses are obtained from the same message said by a speaker. According to another possible variant, each result of said plurality of voice recognition results of step A. is a post-processed solution of a message said by a speaker. Such a post-processed solution is, for example, obtained by keeping only elements considered valid in a solution provided by a voice recognition engine from a message said by a speaker. Preferably, such a post-processed solution is obtained by keeping only consecutive valid elements of a solution provided by a voice recognition engine from a message said by a speaker, preferably starting from the end of said solution. An example of a solution is a hypothesis provided by a speech recognition engine. The selection rule of step B. may comprise different steps, which may furthermore be combined. Preferably, said selection rule of step B. comprises a step of selecting an element if it appears identically in at least two results of said plurality of results of step A.
and if it is included, within a margin of error, in the same time interval, along a time scale, t, associated with each of said two results. With this preferred variant, it is possible to use a possible redundancy of one or more elements in several results as a selection criterion. This makes the most of the information provided by the different results. For example, one would select the word 'three' if it occupies more or less the same temporal position in three results. Each result has a beginning and an end. The different elements of a result are distributed between this beginning and this end. An element represents information of a result delimited by two different times along a time scale, t, associated with the result. Preferably, the margin of error means that the beginning and the end of the element, which is common to at least two results, are the same to within 20% (more preferably to within 10%), the reference for calculating the percentage being the value of the beginning or of the end of such an element in one of the results. Preferably, it is also necessary for an element to have a confidence rate greater than or equal to a minimum confidence rate for it to be selected. Preferably, said selection rule of step B. comprises a step of selecting an element if its duration is greater than or equal to a minimum duration threshold. Each element of a result corresponds to a duration or time interval, which is generally provided by the speech recognition engine. With this preferred embodiment, it is possible to reject short-lived elements more effectively, such as parasitic noise that may come from a machine. Preferably, said selection rule of step B. comprises a step of selecting an element if its duration is less than or equal to a maximum duration threshold.
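The redundancy criterion above — keeping an element whose word occupies, within a 20% margin on its beginning and end times, the same time interval in at least two results — can be sketched as follows. The element representation (dicts with word, start and end times) is an assumption made for illustration.

```python
# Sketch of the redundancy criterion: keep an element if the same word
# occupies, within a margin, the same time interval in at least two results.

def same_interval(e1, e2, margin=0.20):
    """Beginning and end agree to within `margin`, e1's own times serving
    as the reference for computing the percentage."""
    def close(a, b, ref):
        return abs(a - b) <= margin * abs(ref) if ref else a == b
    return (close(e1["start"], e2["start"], e1["start"])
            and close(e1["end"], e2["end"], e1["end"]))

def redundant(element, all_results, margin=0.20):
    """True if the element's word appears at roughly the same temporal
    position in at least two results (the element itself included)."""
    matches = sum(
        1
        for result in all_results
        for e in result
        if e["word"] == element["word"] and same_interval(element, e, margin)
    )
    return matches >= 2

results = [
    [{"word": "three", "start": 1.00, "end": 1.30}],
    [{"word": "three", "start": 1.05, "end": 1.32}],
    [{"word": "tree",  "start": 1.00, "end": 1.30}],
]
print(redundant(results[0][0], results))  # 'three' appears twice -> True
print(redundant(results[2][0], results))  # 'tree' appears once   -> False
```

This mirrors the 'three' example in the text: the word is kept because it sits at more or less the same temporal place in more than one result.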
With this preferred embodiment, it is possible to reject more effectively elements of long duration, for example a hesitation of a speaker who says 'uh' but for which the voice recognition engine provides the word 'two'. With this preferred embodiment, it is easier to eliminate (not select) the invalid word 'two'. Preferably, each element of said results of step A. is characterized by a confidence rate, and said selection rule of step B. comprises a step of selecting an element if its confidence rate is greater than or equal to a minimum confidence rate. The reliability of the method of the invention is further increased in this case. The concept of 'confidence rate' is known to a person skilled in the art. It is a property or statistic associated with an element and can generally be provided by a speech recognition engine. A confidence rate is, in general, a probability that an element determined by a voice recognition engine from a spoken element is the correct one. An example of a voice recognition engine is the VoCon® 3200 V3.14 model from Nuance. In this case, the confidence rate varies between 0 and 10 000. A value of 0 corresponds to a minimum value of the confidence rate (very low probability that the element of the speech recognition result is the correct one) and 10 000 represents a maximum value of the confidence rate (very high probability that the element of the speech recognition result is the correct one). Preferably: each element of said results of step A. is characterized by a confidence rate, and said selection rule of step B. comprises a step of selecting an element of a result if its confidence rate is greater than or equal to the confidence rate of one or more other elements of one or more other results that are included, within a margin of error, in the same time interval as said element, along a time scale, t, associated with each of the results.
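The duration-window and minimum-confidence-rate criteria described above can be sketched together as a per-element filter. The threshold values below are illustrative assumptions, not values prescribed by the text; the 0-10 000 confidence scale follows the VoCon example.

```python
# Sketch of the per-element filters: a minimum and maximum duration
# threshold (rejecting parasitic noises and hesitations such as 'uh'
# transcribed as 'two') and a minimum confidence rate. Thresholds are
# illustrative; times are in seconds, confidence on a 0-10000 scale.

def passes_filters(element, min_duration=0.08, max_duration=2.0,
                   min_confidence=3000):
    duration = element["end"] - element["start"]
    return (min_duration <= duration <= max_duration
            and element["confidence"] >= min_confidence)

hesitation = {"word": "two", "start": 0.0, "end": 2.6, "confidence": 4000}
spoken_two = {"word": "two", "start": 0.0, "end": 0.3, "confidence": 7000}
print(passes_filters(hesitation))  # 2.6 s is too long -> False
print(passes_filters(spoken_two))  # -> True
```

The long 'two' produced from a hesitation is rejected by the maximum duration threshold even though its confidence rate alone would have let it through.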
Each result has a beginning and an end; the different elements of the result are distributed between this beginning and this end. An element represents information of a result delimited by two different times along a time scale, t, associated with said result. Preferably, said margin of error means that the beginning and the end of the element in the different results are the same to within 20% (more preferably to within 10%), the reference for calculating the percentage being the value of the beginning and/or the end of an element of a result.

[0023] Preferably, said selection rule of step B. comprises a step of selecting an element of a result if a time interval separating it from another, directly adjacent element belonging to the same result is greater than or equal to a minimum time interval. With this preferred embodiment, it is possible to reject more efficiently elements that are not generated by a human being but rather by a machine, for example, and which are temporally very close together. The other directly adjacent element may be located towards the end or the beginning of said result with respect to the element to which the selection rule is applied.

[0024] Preferably, said selection rule of step B. comprises a step of selecting an element of a result if a time interval separating it from another, directly adjacent element belonging to the same result is less than or equal to a maximum time interval. With this variant, it is possible to reject more effectively elements that are temporally far apart from each other. The other directly adjacent element may be located towards the end or the beginning of said result with respect to the element to which the selection rule is applied.

[0025] Preferably, said selection rule of step B.
comprises a step of selecting, for a given speaker, an element of a result if a statistic associated with this element of this result matches, within an interval, a pre-established statistic for the same element and for this given speaker. The statistic associated with said element is typically provided by a speech recognition engine. Examples of statistics associated with an element are: the duration of the element and its confidence rate. Other examples are possible. It is possible to record such statistics for different elements and for different speakers (or operators), for example during a preliminary enrollment step. If the identity of the speaker who recorded the message to which a result provided by a voice recognition engine corresponds is then known, it is possible to compare statistics associated with the different elements of said result with pre-established statistics for these elements and for this speaker. In this case, the method of the invention therefore preferably comprises an additional step of determining the identity of the speaker. With this preferred embodiment, the reliability and efficiency are further increased because it is possible to take into account the vocal specificities of the speaker. In particular, according to this preferred embodiment, the inventors do not propose to use a threshold value common to all speakers and all elements to select an element in step B. The inventors propose rather to use the profile of the speaker. It may be that, for certain elements and for a given speaker, the confidence rates associated with these elements are generally low even though the elements are valid and should therefore be selected in step B. This could happen in particular if the accent of the speaker is particularly marked for certain elements, so that their pronunciation by the speaker deviates strongly from the phonetic dictionary of the speech recognition engine, which then provides low confidence rates for these elements.
Using an absolute confidence rate, common to all speakers and to all elements, does not provide good results in this case. Preferably, said interval of the preferred embodiment of the preceding paragraph is between -20% and +20% relative to the pre-established (or predetermined) statistic (its value). Thus, in this preferred case, an element will be deemed valid if: (value of the statistic of the element) = (pre-established value of the statistic for the same element) +/- 20%. In particular, if the pre-established statistic is worth 100 ms (case where the statistic corresponds to a duration), an element will be deemed valid if its duration is between 80 and 120 ms. More preferably, said interval is between -10% and +10% with respect to the pre-established statistic. Even more preferably, said interval is between -5% and +5% with respect to the pre-established statistic. The selection rule of step B. may comprise a step of selecting, for a given speaker, an element of a result if several statistics associated with this element of the result match, within one (or more) interval(s), several pre-established (or predetermined) statistics for the same element and for that given speaker. For example, it can be decided that an element is to be selected if all its associated statistics correspond, within an interval, to the pre-established statistics. A statistic associated with an element of a result is provided, for example, by a voice recognition engine.

[0028] Preferably, if several elements of several results are a priori compliant, the element of a result whose statistic is closest to the one pre-established for the same element will be chosen. It is possible to obtain, for a speaker, one (or more) pre-established statistics in different ways. For example, one can imagine having a preliminary step of recording a calibration message by the speaker.
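The interval test above reduces to a simple bounds check. A sketch, using seconds and the text's 100 ms example (the function name is illustrative):

```python
# The +/- interval test: an element's statistic (here a duration) is
# deemed valid if it equals the pre-established value within the interval.
# With a pre-established duration of 100 ms, valid means 80-120 ms.

def matches_pre_established(value, pre_established, interval=0.20):
    return (pre_established * (1 - interval)
            <= value
            <= pre_established * (1 + interval))

print(matches_pre_established(0.090, 0.100))  # 90 ms  -> True
print(matches_pre_established(0.130, 0.100))  # 130 ms -> False
```

The same check applies unchanged to a confidence rate, or to any other per-element statistic with a pre-established value.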
Preferably, this message comprises a plurality of elements (words, for example) that said speaker must say. For the content of the message, one will preferably choose content including elements that said speaker is expected to say later, during handling operations for example. During this preliminary step, one or more statistics associated with each of said plurality of elements are then recorded, these values being obtainable from a speech recognition engine transcribing the calibration message. It is first of all ensured that the elements of the voice recognition result (which is derived from the calibration message), for which the value or values of one or more parameters are recorded, correspond to the 'true' elements of the calibration message said by the speaker. If this is the case, for example, the duration of each of the elements of the result derived from the calibration message said by the speaker is recorded. According to another example, a confidence rate associated with each element of the result derived from the calibration message is recorded. The values of such parameters, duration and confidence rate, can be provided by a voice recognition engine. Thanks to this preliminary step, it is possible to easily provide the pre-established statistics.

[0030] An identity of a speaker can be determined in different ways. According to a first possible variant, it is determined from a session opening by the speaker, who is asked to enter his name in a computer system. According to another possible variant, a voice recognition engine from which said result is derived is able to recognize the identity of said speaker saying a message. Other variants are possible. Preferably, different steps as described above are used in combination for the selection rule of step B.
In particular, one could imagine selecting an element of a speech recognition result if its duration is greater than or equal to a minimum duration threshold and if another statistic (for example its confidence rate) matches, within an interval and for the same speaker, the pre-established statistic for the same element. Other combinations are possible for the selection rule of step B. According to a second aspect, the inventors propose a system for generating an optimized solution in voice recognition and comprising: acquisition means for receiving a plurality of speech recognition results, each comprising one or more elements and each being a transcription of a message said by a speaker; processing means for selecting, using a selection rule, one or more elements of one or more results of said plurality of speech recognition results, and for generating said optimized solution from the at least one selected element. The advantages associated with the method according to the first aspect of the invention apply to the system of the invention, mutatis mutandis. Thus, in particular, it is possible to have a more reliable and more effective solution with the system of the invention because the elements of said solution are more likely to be valid or correct. The various embodiments presented for the method according to the first aspect of the invention apply to the system of the invention, mutatis mutandis. According to a third aspect, the invention relates to a program (preferably a computer program) for generating an optimized solution in speech recognition and comprising code enabling a device to perform the following steps: A. receiving (preferably reading) a plurality of speech recognition results, each comprising one or more elements and each being a transcription of a message said by a speaker; B. selecting one or more elements belonging to one or more of said plurality of voice recognition results of step A., using a selection rule; C.
generating said optimized solution from the at least one element selected in step B. The advantages associated with the method and the system according to the first and second aspects of the invention apply to the program of the invention, mutatis mutandis. Thus, in particular, it is possible to have a more reliable and more effective solution with the program of the invention, because the elements of said solution are more likely to be valid or correct. The various embodiments presented for the method and the system according to the first and second aspects of the invention apply to the program of the invention, mutatis mutandis. According to a fourth aspect, the invention relates to a storage medium that can be connected to a device and comprising instructions which, when read, cause said device to perform the following steps: A. receiving (preferably reading) a plurality of speech recognition results, each comprising one or more elements and each being a transcription of a message said by a speaker; B. selecting one or more elements belonging to one or more of said plurality of voice recognition results of step A., using a selection rule; C. generating said optimized solution from the at least one element selected in step B. The advantages associated with the method, the system and the program according to the first, second and third aspects of the invention apply to the storage medium of the invention, mutatis mutandis. Thus, in particular, it is possible to have a more reliable and more efficient solution with the storage medium of the invention, because the elements of said solution are more likely to be valid or correct. The various embodiments presented for the method, the system and the program according to the first, second and third aspects of the invention apply to the storage medium of the invention, mutatis mutandis. Preferably, said device is a speech recognition engine or a computer that can communicate with a voice recognition engine.
BRIEF DESCRIPTION OF THE FIGURES

[0035] These aspects, as well as other aspects of the invention, will be clarified in the detailed description of particular embodiments of the invention, reference being made to the drawings of the figures, in which: Fig. 1 schematically shows a speaker saying a message that is processed by a speech recognition engine; Fig. 2 schematically shows an example of a result from a speech recognition engine; Fig. 3 shows three speech recognition results; Fig. 4 schematically shows an example of a system according to the invention. The drawings of the figures are not to scale. Generally, similar elements are denoted by similar references in the figures. The presence of reference numbers in the drawings cannot be considered as limiting, even when these numbers are indicated in the claims.

DETAILED DESCRIPTION OF PARTICULAR EMBODIMENTS

[0036] Fig. 1 shows a speaker 40 (or user 40) saying a message 50 into a microphone 5. This message 50 is then transferred to a voice recognition engine 10, which is known to a person skilled in the art. Different models and different brands are available on the market. In general, the microphone 5 is part of the speech recognition engine 10. The latter processes the message 50 with speech recognition algorithms, based for example on a hidden Markov model (HMM). This yields a result 100 from the voice recognition engine 10. An example of a result 100 is a hypothesis generated by the voice recognition engine 10. Another example of a result 100 is a solution obtained from the speech recognition algorithms and from post-processing operations which are, for example, applied to one or more hypotheses generated by the voice recognition engine 10. Post-processing modules providing such a solution can be part of the voice recognition engine 10. The result 100 is generally in the form of text that can be deciphered by a machine, a computer or a processing unit, for example.
A speech recognition result 100 can therefore be seen as a transcription of a spoken message said by a speaker 40. Fig. 2 shows an example of a result 100. It comprises a beginning 111 and an end 112. The beginning 111 is before said end 112 along a time scale, t. The result 100 generally comprises a plurality of elements 113 temporally distributed between the beginning 111 and the end 112. An element 113 represents information between two different times along the time scale, t. In general, an element 113 is a portion of the result 100 that is not considered silence and/or background noise. An example of an element 113 is a word. In general, the various elements 113 are separated by portions of the result 100 representing a silence, a background noise, or a time interval during which no element 113 (a word, for example) is recognized by the speech recognition engine 10. The message 50 includes a certain number of spoken elements, for example spoken words such as: one, two, car, umbrella. A voice recognition result 100 comprises a number of elements 113 (or words). These elements 113 are transcribed by a voice recognition engine 10 from the message 50 comprising the spoken elements, generally in a format readable by a human or a computer, for example. If the message 50 includes the following spoken elements: one, two, car, umbrella, the speech recognition result 100 ideally includes the following elements 113: one, two, car, umbrella. Between its beginning 111 and its end 112, the result 100 comprises several elements 113, seven in the case illustrated in Fig. 2. In this figure, the elements 113 are represented as a function of time, t (abscissa). The ordinate, C, represents a confidence level or confidence rate. This concept is known to a person skilled in the art. It is a property or statistic generally associated with each element 113 and can generally be provided by a speech recognition engine 10.
A confidence rate is, in general, a probability that a transcribed element, determined by a speech recognition engine 10 from a spoken element, is the correct one. An example of a speech recognition engine 10 is the VoCon® 3200 V3.14 model from Nuance. In this case, the confidence rate varies between 0 and 10 000. A value of 0 corresponds to a minimum value of the confidence rate (very low probability that the element 113 of the speech recognition result 100 is the right one) and 10 000 represents a maximum value of the confidence rate (very high probability that the element 113 of the speech recognition result 100 is the correct one). The height of an element 113 in Fig. 2 indicates whether its confidence rate 160 is higher or lower. The method of the invention makes it possible to generate an optimized solution 105 in voice recognition. The first step, step A., consists of receiving (preferably reading) a plurality of voice recognition results 100. Preferably, step A. consists of receiving three, four or five voice recognition results. Preferably, these voice recognition results 100 are hypotheses provided by a voice recognition engine 10. Next, step B. consists in selecting one or more elements 113 of one or more results 100, using a selection rule. This selection rule may, for example, lead to the selection of one element 113 of one result 100. In another example, the selection rule leads to selecting several elements 113 of a single result 100. According to yet another example, the selection rule leads to selecting several elements 113 among several results 100. In this case, it is possible for each selected element 113 to belong to a different result 100 or for several selected elements 113 to belong to the same result 100. If the selection rule leads to selecting two elements 113, a possible example of selection is to select one element 113 of one result 100 and another element 113 of another result 100.
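One cross-result selection rule stated earlier keeps, among elements of different results occupying (within a margin) the same time interval, the one with the highest confidence rate. A minimal sketch follows; the alignment of elements by overlapping intervals, the tie-breaking by pool order, and the data layout are all simplifying assumptions, not the patented procedure.

```python
# Sketch: per aligned time slot across several results, keep the element
# with the highest confidence rate. Times in seconds, confidence 0-10000.

def select_by_confidence(results, margin=0.20):
    def overlap(e1, e2):
        ref = max(e1["end"] - e1["start"], 1e-9)  # e1's duration as reference
        return (abs(e1["start"] - e2["start"]) <= margin * ref
                and abs(e1["end"] - e2["end"]) <= margin * ref)

    pool = [e for result in results for e in result]
    chosen = []
    for e in pool:
        rivals = [o for o in pool if overlap(e, o)]   # same time slot
        if e["confidence"] >= max(o["confidence"] for o in rivals):
            if not any(overlap(e, c) for c in chosen):  # one winner per slot
                chosen.append(e)
    chosen.sort(key=lambda e: e["start"])             # order along t
    return [e["word"] for e in chosen]

results = [
    [{"word": "six",   "start": 0.0, "end": 0.3, "confidence": 4000},
     {"word": "six",   "start": 0.4, "end": 0.7, "confidence": 7000}],
    [{"word": "seven", "start": 0.0, "end": 0.3, "confidence": 8000},
     {"word": "five",  "start": 0.4, "end": 0.7, "confidence": 3000}],
]
print(select_by_confidence(results))  # -> ['seven', 'six']
```

In the first time slot 'seven' (8000) beats 'six' (4000); in the second, 'six' (7000) beats 'five' (3000).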
From the at least one element 113 selected in step B., an optimized solution 105 is generated in step C.

In general, a speech recognition engine 10 provides, for each result 100 it generates, a confidence score. In this case, the method of the invention preferably comprises a preliminary step of selecting the one or more results 100 provided by a voice recognition engine 10 that have a confidence score greater than or equal to a minimum confidence score. In this preferred embodiment, the optimized solution 105 is thus generated from results 100 which have passed a first filter and which are retained only if the overall confidence score associated with them is considered sufficient. Preferably, the minimum confidence score is 5000; more preferably, it is 6000 (values preferred for the Nuance VoCon® 3200 V3.14 model but which can be adapted to other models of voice recognition engines 10). Other preferred values are nevertheless possible.

The selection rule of step B. may comprise different steps. Figure 3 shows an example of three speech recognition results 100: a first result 101, a second result 102 and a third result 103. The first result 101 comprises a first element 1131, a second element 1132 and a third element 1133. The second result 102 likewise comprises a first element 1131, a second element 1132 and a third element 1133. The third result 103 comprises two first elements 1131, a second element 1132 and a third element 1133. As illustrated in FIG. 3, the different elements 113 of the three results (101, 102, 103) preferably have the same beginnings and ends, except for the first two elements 1131 of the third result 103.

Imagine that the elements 113 of the different results (101, 102, 103) are the words mentioned in the table below, the underlined words being the elements selected by the selection rule of step B. According to this example, the first elements 1131 of the first 101 and second 102 results (seven and six) are not retained because their duration 150 is greater than a threshold of greater duration. On the other hand, the first elements 1131 of the third result 103 (six and seven) are selected because their duration 150 is less than the threshold of greater duration. The second element 1132 of the first 101, second 102 and third 103 results is identical; it is therefore selected. The third element 1133 of the first result 101 is selected because its confidence rate 160 is greater than a minimum confidence rate 161. Finally, the elements 113 of the optimized solution 105 are those underlined in the table above.

According to this example, the selection rule therefore comprises three different types of criteria: one based on the duration 150 of the elements 113, one based on the redundancy of an element 113, and a third based on the confidence rate 160 associated with an element 113. Other types of criteria may be used.

Preferably, one or more statistics associated with each element 113 of each result 100 are compared with statistics that are pre-established for the speaker 40 who dictated the message or messages 50 from which the results 100 were obtained, and which are also pre-established for certain elements 113. For example, taking the case illustrated in FIG. 3, rather than using an absolute value for the threshold of greater duration of an element 113, it is possible to use a threshold of greater duration that depends on the speaker 40 and on the element 113 considered. For example, for a speaker X, it may be pre-established that he generally says the word 'two' in 300 ms and the word 'seven' in 350 ms. If the voice recognition engine 10 then provides the word 'two' with a duration of 600 ms, there can be serious doubts as to its validity. A pre-established statistic can be obtained, for example, through a preliminary step in which a speaker 40 dictates a calibration message 50 to a voice recognition engine 10, said calibration message 50 preferably comprising different words.
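By way of illustration, such a speaker-specific duration check could look as follows (a sketch: the 300 ms and 350 ms statistics mirror the example above, while the 1.5 tolerance factor and the data layout are assumptions):

```python
# Pre-established per-speaker statistics: typical spoken duration (ms) of a
# word, e.g. deduced from a calibration message (hypothetical layout).
CALIBRATION = {
    ("speaker_x", "two"): 300,
    ("speaker_x", "seven"): 350,
}

def duration_plausible(speaker, word, observed_ms, tolerance=1.5):
    """Return False when the observed duration exceeds the speaker's typical
    duration for that word by more than the tolerance factor; elements with
    no pre-established statistic are not rejected."""
    typical = CALIBRATION.get((speaker, word))
    if typical is None:
        return True  # no statistic pre-established: do not reject
    return observed_ms <= typical * tolerance
```

With these values, a 'two' recognized over 600 ms is flagged as doubtful (600 > 300 × 1.5), in line with the example above, while a 320 ms 'two' passes.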
With the result 100 of the calibration message 50 provided by the voice recognition engine 10, it is possible to deduce one or more statistics associated with the different elements.

This preferred version of the selection rule of step B. does not prevent, in certain variants, the additional use of absolute criteria, common to all the elements 113 and/or to all the speakers 40, to select an element 113. Thus, it is possible, for example, to reject any element 113 of a result 100 whose duration is greater than two seconds and/or whose confidence rate 160 is lower than an absolute minimum confidence rate 161, for example 3000 (a value preferred for Nuance's VoCon® 3200 V3.14 model, but which can be adapted to other models of voice recognition engines 10).

An example of a speech recognition result 100 is a solution of a message 50 to which one or more post-processing operations are applied. For example, a result 100 may be a hypothesis provided by a voice recognition engine 10 that is filtered so as to keep only the one or more elements 113 deemed valid. When the results 100 of step A. are obtained from a filtering operation, such a filtering operation is preferably applied from the end to the beginning of each hypothesis, preferably by retaining only the consecutive valid elements of each hypothesis. This allows for a more efficient and reliable filtering operation. FIG. 2 illustrates the principle of a filtering operation applied to a result 100, from the end to the beginning, as shown by the arrow at the top of the figure. The elements 113 in solid lines are considered valid; the elements 113 in broken lines are considered invalid.

According to a second aspect, the invention relates to a system 11 (or device) for determining an optimized solution 105. FIG. 4 diagrammatically illustrates such a system 11 in combination with a voice recognition engine 10, a screen 20 and an auxiliary device 15.
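The end-to-beginning filtering illustrated in FIG. 2 can be sketched as follows (a sketch assuming each element already carries a validity flag computed elsewhere):

```python
def filter_from_end(elements):
    """Walk the hypothesis from its end towards its beginning and keep only
    the uninterrupted run of valid elements at the end; the walk stops at
    the first invalid element encountered.

    `elements` is a list of (text, is_valid) pairs in temporal order.
    """
    kept = []
    for text, is_valid in reversed(elements):
        if not is_valid:
            break
        kept.append(text)
    kept.reverse()  # restore temporal order
    return kept
```

For instance, with a hypothesis whose second element is invalid, only the trailing valid elements survive: `filter_from_end([("a", True), ("b", False), ("c", True), ("d", True)])` keeps "c" and "d".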
In this figure, the system 11 and the voice recognition engine 10 are two separate devices. According to another possible version, the system 11 is integrated into a voice recognition engine 10 so that it is not possible to differentiate them. In such a case, a conventional speech recognition engine 10 is modified or adapted to perform the functions of the system 11 described below.

The auxiliary device 15 makes it possible to provide one or more pre-established statistics 140 for one or more elements 113 and for one or more speakers 40. In FIG. 4, this auxiliary device 15 is separate from the system 11. According to another possible version, the auxiliary device 15 is a module integrated into the system 11 so that it is not possible to differentiate them. It could also be provided that the auxiliary module 15 is integrated in the speech recognition engine 10. Preferably, the auxiliary module 15 is able to communicate with the voice recognition engine 10. The auxiliary module 15 is, for example, a memory module.

Examples of the system 11 are: a computer, a speech recognition engine 10 adapted or programmed to perform a method according to the first aspect of the invention, a hardware module of a voice recognition engine 10, or a hardware module capable of communicating with a voice recognition engine 10. Other examples are nevertheless possible.

The system 11 includes acquisition means 12 for receiving (preferably reading) a plurality of voice recognition results 100. Preferably, these acquisition means 12 are also able to receive (preferably read) one or more pre-established statistics. Examples of acquisition means 12 are: an input port of the post-processing system 11, for example a USB port, an Ethernet port, or a wireless port (e.g. Wi-Fi). Other examples of acquisition means 12 are nevertheless possible.
The system 11 further comprises processing means 13 for selecting, using a selection rule, one or more elements 113 of one or more results 100 and for generating an optimized solution 105 from the at least one selected element 113. The selection rule may comprise different selection steps, as presented above for the method according to the first aspect of the invention. Preferably, the system 11 is able to send an optimized solution 105 to a screen 20 for display. Examples of processing means 13 are: a control unit, a processor or central processing unit, a controller, a chip, a microchip, an integrated circuit, or a multi-core processor. Other examples known to those skilled in the art are nevertheless possible.

Preferably, the system 11 is also able to perform a filtering post-processing method on a voice recognition result 100. In this case, the processing means 13 are preferably capable of applying such a filtering post-processing method to it.

According to a third aspect, the invention relates to a program, preferably a computer program. Preferably, this program is part of a human-machine voice interface. According to a fourth aspect, the invention relates to a storage medium that can be connected to a device, for example a computer able to communicate with a voice recognition engine 10. According to another possible variant, this device is a voice recognition engine 10. Examples of a storage medium according to the invention are: a USB key, an external hard disk, or a CD-ROM type disk. Other examples are nevertheless possible.

The present invention has been described in relation to specific embodiments, which have a purely illustrative value and should not be considered limiting. In general, the present invention is not limited to the examples illustrated and/or described above.
The use of the verbs "to comprise", "to include", "to contain", or any other variant, as well as their conjugations, can in no way exclude the presence of elements other than those mentioned. The use of the indefinite article "a" or "an", or of the definite article "the", to introduce an element does not exclude the presence of a plurality of these elements. The reference numerals in the claims do not limit their scope.

In summary, the invention can also be described as follows. A method for generating an optimized speech recognition solution 105 and comprising the steps of: receiving a plurality of speech recognition results 100, each comprising one or more elements 113 and each being a solution of a message 50 spoken by a speaker 40; selecting, using a selection rule, one or more elements 113 of one or more results 100 of said plurality of speech recognition results 100; and generating said optimized solution 105 from the at least one selected element 113.
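The summarized steps A., B. and C. can be sketched end to end as follows (an illustrative combination: the data layout, the thresholds, the margin of error and the way the three criteria are combined are assumptions for illustration, not the engine's actual behavior):

```python
# Each element 113 is a dict with keys "text", "start_ms", "end_ms", "conf"
# (assumed names); each hypothesis is an (elements, overall_score) pair.
def generate_optimized_solution(scored_results,
                                min_overall_score=6000,
                                max_duration_ms=2000,
                                min_conf=3000,
                                margin_ms=50):
    # Step A. with the preliminary filter: keep only hypotheses whose
    # overall confidence score is considered sufficient.
    results = [elems for elems, score in scored_results
               if score >= min_overall_score]

    def redundant(elem, own):
        # An identical element occupies the same time slot (within a margin
        # of error) in at least one other retained result.
        return any(
            other is not own and any(
                e["text"] == elem["text"]
                and abs(e["start_ms"] - elem["start_ms"]) <= margin_ms
                for e in other)
            for other in results)

    # Step B.: keep an element when its duration is plausible AND it is
    # either redundant across results or confident enough on its own.
    selected = []
    for own in results:
        for elem in own:
            duration = elem["end_ms"] - elem["start_ms"]
            if duration <= max_duration_ms and (
                    redundant(elem, own) or elem["conf"] >= min_conf):
                selected.append(elem)

    # Step C.: assemble the optimized solution in temporal order, merging
    # copies of the same element selected from several results.
    selected.sort(key=lambda e: e["start_ms"])
    solution, last = [], None
    for e in selected:
        if last is not None and e["text"] == last["text"] \
                and abs(e["start_ms"] - last["start_ms"]) <= margin_ms:
            continue
        solution.append(e["text"])
        last = e
    return solution
```

On inputs shaped like the FIG. 3 example (over-long first elements in two hypotheses, one element identical across all three, one element retained by confidence alone), this sketch rejects the over-long elements, merges the redundant one, and keeps the confident one.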
Claims (13)

Amended claims

[1] A method for generating an optimized speech recognition solution (105) and comprising the following steps: A. receiving a plurality of voice recognition results (100), each comprising one or more elements (113) and each being a solution of a message (50) spoken by a speaker (40); B. selecting one or more elements (113) belonging to one or more results (100) of said plurality of voice recognition results (100) of step A. using a selection rule, said selection rule comprising a step of selecting an element (113) if its duration is greater than or equal to a threshold of shorter duration and/or if its duration is less than or equal to a threshold of greater duration; C. generating said optimized solution (105) from the at least one element (113) selected in step B.

[2] Method according to the preceding claim, characterized in that the elements (113) of said voice recognition results (100) of step A. belong to the same vocabulary.

[3] Method according to any one of the preceding claims, characterized in that the voice recognition results (100) of step A. are each a solution of the same message (50) spoken by said speaker (40).

[4] Method according to any one of the preceding claims, characterized in that each result (100) of said plurality of voice recognition results (100) of step A. is characterized by a confidence score greater than or equal to a minimum confidence score.

[5] Method according to any one of the preceding claims, characterized in that said selection rule of step B. comprises a step of selecting an element (113) if it is identical in at least two results (100) of said plurality of results (100) of step A. and if it is comprised, within a margin of error, in the same time interval, along a time scale t associated with each of said two results (100).

[6] Method according to any one of the preceding claims, characterized in that each element (113) of said results (100) of step A.
is characterized by a confidence rate (160), and in that said selection rule of step B. comprises a step of selecting an element (113) if its confidence rate (160) is greater than or equal to a minimum confidence level (161).

[7] Method according to any one of the preceding claims, characterized in that: - each element (113) of said results (100) of step A. is characterized by a confidence rate (160), and in that - said selection rule of step B. comprises a step of selecting an element (113) of a result (100) if its confidence rate (160) is greater than or equal to the confidence rate (160) of one or more other elements (113) of one or more other results (100) comprised, within a margin of error, in the same time interval as said element (113), along a time scale t associated with each of the results (100).

[8] Method according to any one of the preceding claims, characterized in that said selection rule of step B. comprises a step of selecting an element (113) of a result (100) if a time interval (170) separating it from another directly adjacent element (113) belonging to the same result (100) is greater than or equal to a minimum time interval.

[9] Method according to any one of the preceding claims, characterized in that said selection rule of step B. comprises a step of selecting an element (113) of a result (100) if a time interval (170) separating it from another directly adjacent element (113) belonging to the same result (100) is less than or equal to a maximum time interval.

[10] Method according to any one of the preceding claims, characterized in that said selection rule of step B. comprises a step of selecting, for a given speaker (40), an element (113) of a result (100), if a statistic associated with this element (113) of this result (100) conforms, within a margin, to a statistic pre-established for the same element (113) and for that given speaker (40).
[11] System (11) for generating an optimized solution (105) in speech recognition and comprising: - acquisition means (12) for receiving a plurality of voice recognition results (100), each comprising one or more elements (113) and each being a transcription of a message (50) spoken by a speaker (40); - processing means (13) for: selecting, using a selection rule, one or more elements (113) of one or more results (100) of said plurality of voice recognition results (100), said selection rule comprising a step of selecting an element (113) if its duration is greater than or equal to a threshold of shorter duration and/or if its duration is less than or equal to a threshold of greater duration; and generating said optimized solution (105) from the at least one selected element (113).

[12] A program for generating an optimized speech recognition solution (105) including code for enabling a device to perform the following steps: A. receiving a plurality of voice recognition results (100), each comprising one or more elements (113) and each being a transcription of a message (50) spoken by a speaker (40); B. selecting one or more elements (113) belonging to one or more results (100) of said plurality of voice recognition results (100) of step A. using a selection rule, said selection rule comprising a step of selecting an element (113) if its duration is greater than or equal to a threshold of shorter duration and/or if its duration is less than or equal to a threshold of greater duration; C. generating said optimized solution (105) from the at least one element (113) selected in step B.

[13]
A storage medium connectable to a device and including instructions which, when read, enable said device to generate an optimized voice recognition solution (105), said instructions causing said device to perform the following steps: A. receiving a plurality of voice recognition results (100), each comprising one or more elements (113) and each being a transcription of a message (50) spoken by a speaker (40); B. selecting one or more elements (113) belonging to one or more results (100) of said plurality of voice recognition results (100) of step A. using a selection rule, said selection rule comprising a step of selecting an element (113) if its duration is greater than or equal to a threshold of shorter duration and/or if its duration is less than or equal to a threshold of greater duration; C. generating said optimized solution (105) from the at least one element (113) selected in step B.
Family patents:
Publication number | Publication date
BE1023458A1 | 2017-03-27
EP3065133A1 | 2016-09-07
Priority:
Application EP15157924.0, filed 2015-03-06: "Method and system for generating an optimised solution in speech recognition" (published as EP3065133A1).