Belgian patent BE1023427B1: Method and system for determining the validity of an element of a speech recognition result

Abstract:
A method of determining the validity of an element (113) of a speech recognition result (100), comprising the steps of: providing a value of a speech recognition parameter which is predetermined for the speaker (40) and for an element (113) identical to said element (113) of said result (100) whose validity is sought; obtaining from said result (100) a value of the same speech recognition parameter for said element (113) whose validity is sought; determining that said element (113) of said result (100) is valid if the value of the speech recognition parameter obtained from said result (100) is identical, within an interval, to the value predetermined for said speaker (40) for that element (113).
Publication number: BE1023427B1
Application number: E2016/5153
Filing date: 2016-03-02
Publication date: 2017-03-16
Inventor: Jean-Luc Forster
Applicant: Zetes Industries Sa
Main IPC class:

Description:

Method and system for determining the validity of an element of a speech recognition result
Field of the Invention
The invention relates to the field of speech recognition. In particular, and according to a first aspect, the invention relates to a method for determining the validity of an element (for example a word) of a speech recognition result. According to a second aspect, the invention relates to a system (or device). According to a third aspect, the invention relates to a program. According to a fourth aspect, the invention relates to a storage medium comprising instructions (for example a USB key, or a disk of the CD-ROM or DVD type).
State of the Art
A speech recognition engine makes it possible to provide, from a spoken or audio message, a result that is generally in the form of a text or code that can be used by a machine. Such a result is, for example, a hypothesis provided by the speech recognition engine. Speech recognition is now widespread and is considered very useful. Various uses of speech recognition are taught in US6,754,629B1, among others.
Methods exist to improve the hypotheses provided by a speech recognition engine. For example, US2014/0278418A1 proposes using the identity of a speaker to adapt speech recognition algorithms to the specific pronunciation of said speaker. This adaptation is done within the speech recognition engine, for example by modifying its phonetic dictionary to take into account how the speaker or user speaks.
A speech recognition result generally comprises a sequence of elements, for example words, separated by silences or by time intervals not including words recognized by the voice recognition engine. Such a result is characterized by a beginning and an end and its elements are temporally arranged between this beginning and this end.
A speech recognition result can be used, for example, to enter information into a computer system, such as an article number or an instruction to be carried out. Rather than using a raw result, one or more post-processing operations are sometimes applied to the result to extract a post-processed solution. For example, it is possible to traverse a result from a speech recognition engine from the beginning to the end and to retain the first five elements considered valid, if it is known that the useful information does not include more than five elements (an element is, for example, a word). Indeed, knowing that the useful information (a code, for example) does not include more than five words or five digits, it is sometimes decided to retain only the first five valid elements of a result from a speech recognition engine.
One of the difficulties is to determine which elements of the speech recognition result are valid and which elements are invalid. It may be that a voice recognition engine provides a 'two' element, while the user has actually said 'uh'. It is important to be able to determine that the 'two' element provided by the voice recognition engine is invalid in this case.
A speech recognition engine can generally provide a confidence rate associated with each element (each word, for example) of the result. One way to determine the validity of an element is to consider it valid if the confidence rate associated with it is greater than or equal to a given threshold value (for example 6000 if the confidence rate can vary between 0 and 10 000). The inventors found that such a procedure was not satisfactory: some valid elements are rejected and some invalid elements are kept. Such a procedure is therefore not reliable enough.
Summary of the Invention
According to a first aspect, one of the objects of the invention is to provide a more reliable method for determining the validity of an element of a speech recognition result. For this purpose, the inventors propose the following method. A method of determining the validity of an element of a speech recognition result, said result being generated by a speech recognition algorithm applied to a message spoken by a speaker, said method comprising the steps of:
a) obtaining (or receiving) a value of a speech recognition parameter which is predetermined for said speaker, and for an element which is identical to said element of said result whose validity is sought;
b) obtaining a value of the same speech recognition parameter as that of step a) for said element of said speech recognition result whose validity is sought;
c) comparing the values obtained in steps a) and b) to determine the validity of said element of said speech recognition result.
Thus, the inventors do not propose to use a threshold value common to all speakers and all elements to determine whether an element is valid. In particular, the value of a speech recognition parameter associated with an element is not compared with a threshold value common to all speakers and all elements. The inventors propose instead to use the profile of the speaker when determining the validity of an element of the result in post-processing. Thus, a value of a speech recognition parameter associated with an element of the speech recognition result is compared with a value of the same speech recognition parameter that is predetermined for said speaker and for the element in question. It may be that, for certain elements and for a given speaker, the confidence rates associated with these elements are generally low even though the elements are valid. This could happen, for example, if the speaker's accent is particularly strong for such elements, so that their pronunciation differs strongly from the phonetic dictionary of the speech recognition engine, which then provides low confidence rates for these elements. Using an absolute confidence rate, common to all speakers and all elements, does not give good results in this case. By taking the profile of the user into account in post-processing, the method of the invention performs better at determining the validity of an element of a speech recognition result. It is more reliable: it produces fewer errors when determining the validity or invalidity of an element of a speech recognition result.
Preferably, several predetermined values of several speech recognition parameters are provided in step a) for said speaker and for an element that is identical to said element of said result whose validity is sought. For example, a duration value of the element and a value of a confidence rate of the element can be provided. In this case, it is also possible to obtain several values of the same voice recognition parameters in step b) for the element whose validity is sought. And finally, one can, in these cases, compare the values of the different voice recognition parameters in step c) to determine whether the element is valid or not. For example, it can be concluded that an element is valid if all the values of step b) correspond, within an interval, to the predetermined values of step a). In general, one value is used per voice recognition parameter, but it might be possible to use several values per voice recognition parameter for some of them.
In step b), one can obtain the value of the voice recognition parameter of the element whose validity is sought in several ways. Preferably, this value is obtained via the voice recognition engine which provides said result.
The method of the invention therefore does not relate to the adaptation of a phonetic dictionary of a voice recognition engine. The method of the invention relates rather to the post-processing of a speech recognition result.
The method of the invention applies to an element (eg word) of a speech recognition result. A speech recognition result generally comprises a plurality of elements. It is possible to apply the method to several elements of the result to determine the validity of each one of them. In this case, it applies to each of the several elements of the result. Generally, an element is a group of phonemes. A phoneme is known to those skilled in the art. Preferably, an element is a word. An element can also be a group or a combination of words. An example of a combination of words is 'cancel operation'.
The method of the invention can be used in a filtering post-processing method of a speech recognition result which aims to select valid elements of such a result. It makes it easier to reject false results and noise.
A result from a speech recognition engine is generally in the form of a text or a code that can be used by a machine. An element of a result represents information of the result delimited by two different times along a time scale, t, associated with the result, and which is not considered as a silence, or a background noise for example. In general, two elements of a speech recognition result are separated by a time interval during which the speech recognition engine does not recognize an element (word for example).
In the context of the invention, a voice recognition result can be of different types. According to a first possible example, a speech recognition result represents a hypothesis provided by a speech recognition engine from a message said by a user or speaker. In general, a speech recognition engine provides several (for example, three) hypotheses from a message said by a user.
In this case, it usually also provides a score (usually expressed as a percentage) for each hypothesis.
According to another possible example, a speech recognition result is a solution, generally comprising a plurality of elements, obtained from one or more post-processing operations applied to one or more hypotheses provided by a speech recognition engine. In this other possible example, the result is therefore derived from a speech recognition module and from one or more modules that post-process one or more hypotheses provided by a speech recognition engine.
The method of the invention has other advantages. In particular, it is simple to implement. In particular, it could be integrated in a post-processing module of a speech recognition engine.
The method of the invention also has the advantage of being based on a principle of self-learning, that is, self-determination of speech recognition parameter values specific to a given speaker for different elements. It is indeed possible to begin determining the validity of an element of a speech recognition result without using the method of the invention, for example by comparing one or more speech recognition parameters associated with the element whose validity is sought with one or more predefined thresholds that are not specific to the speaker. Subsequently, depending on the values obtained over time for said one or more speech recognition parameters for different elements and for different speakers, values of this or these parameters that are specific to these different speakers can be determined.
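One way to picture this self-learning, as a minimal sketch: keep a running mean of a parameter per speaker and element, updated each time an element has been accepted by the generic thresholds. The running mean, the `Profile` class, its method names and the sample values are assumptions made for illustration; the text does not specify how the speaker-specific values are derived from the accumulated observations.

```python
from collections import defaultdict


class Profile:
    """Running mean of one parameter per (speaker, element) pair (hypothetical helper)."""

    def __init__(self):
        self._count = defaultdict(int)
        self._mean = {}

    def update(self, speaker, element, value):
        """Fold one accepted observation into the running mean."""
        key = (speaker, element)
        self._count[key] += 1
        previous = self._mean.get(key, 0.0)
        self._mean[key] = previous + (value - previous) / self._count[key]

    def get(self, speaker, element):
        """Return the learned value, or None if nothing has been observed yet."""
        return self._mean.get((speaker, element))


# Three accepted observations of the confidence rate of 'two' for one speaker.
profile = Profile()
for confidence in (4000, 5000, 4500):
    profile.update("speaker_40", "two", confidence)
learned = profile.get("speaker_40", "two")
```

Once enough observations exist for a pair, the learned value could serve as the predetermined value of step a); how many observations count as "enough" is left open here.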
Preferably, the method of the invention further comprises a preliminary step of speech recognition of a calibration message spoken by said speaker (the same speaker), said calibration message comprising a spoken element identical to said element of said speech recognition result whose validity is sought. According to this preferred variant, the method of the invention comprises a preliminary step of recording a calibration message spoken by said speaker. Preferably, this message comprises a plurality of elements (words, for example) that said speaker must say. The content of the message is preferably chosen to include elements that said speaker is expected to say later, during handling operations for example. During this preliminary step, one or more values of one or more speech recognition parameters associated with each of said plurality of elements are then recorded; these values can be provided by the speech recognition engine used to transcribe the calibration message. It is first verified that the elements of the speech recognition result (derived from the calibration message) for which the value or values of one or more speech recognition parameters are recorded correspond to the 'true' elements of the message as spoken by the speaker. If this is the case, then, for example, the duration of each element of the result derived from the calibration message is recorded. According to another example, a confidence rate associated with each element of the result derived from the calibration message is recorded. The values of such speech recognition parameters, duration and confidence rate, may be provided by a speech recognition engine. With this preliminary step, the predetermined value of the speech recognition parameter of step a) of the method of the invention can easily be provided.
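The calibration step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `calibrate`, the tuple layout of the engine output and the sample values are hypothetical, and the sketch assumes the engine returns exactly one recognized element per expected word, in order.

```python
def calibrate(expected_words, recognized):
    """Record parameters of calibration elements that were recognized correctly.

    expected_words: words the speaker was asked to say, in order.
    recognized: list of (word, duration_ms, confidence) tuples from the engine
                (assumed to align one-to-one with expected_words).
    """
    profile = {}
    for expected, (word, duration_ms, confidence) in zip(expected_words, recognized):
        if word != expected:
            continue  # not the 'true' element: do not record its parameters
        profile[word] = {"duration_ms": duration_ms, "confidence": confidence}
    return profile


expected = ["one", "two", "three"]
engine_output = [("one", 310, 7200), ("too", 295, 3100), ("three", 420, 6800)]
calibration_profile = calibrate(expected, engine_output)
```

Here the second element was misrecognized, so only 'one' and 'three' end up in the speaker's profile.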
Preferably, step c) comprises the following step c)1): if the value of the speech recognition parameter obtained in step b) corresponds, within an interval, to the predetermined value of the speech recognition parameter obtained in step a), determining that said element of said speech recognition result is valid.
Preferably, step c) comprises the following step c)2): if the value of the speech recognition parameter obtained in step b) does not correspond, within an interval, to the predetermined value of the speech recognition parameter obtained in step a), determining that said element of said result (100) is invalid.
Preferably, said interval is between -20% and +20% with respect to the predetermined value of step a). Thus, in this preferred case, a word is deemed valid if: (value of step b)) = (value of step a)) +/- 20% of the value of step a). In particular, if the value of step a) is 100 ms (in which case the speech recognition parameter is a duration), an element is deemed valid if its duration is between 80 and 120 ms.
More preferably, said interval is between -10% and +10% with respect to the predetermined value of step a). Even more preferably, said interval is between -5% and +5% with respect to the predetermined value of step a).
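The interval test of step c) can be illustrated with a minimal sketch that reproduces the 100 ms example. The function name `within_interval` and its `tolerance` parameter are hypothetical names chosen for this illustration.

```python
def within_interval(observed, reference, tolerance=0.20):
    """True if observed lies within reference +/- tolerance * reference."""
    return abs(observed - reference) <= tolerance * abs(reference)


# Reference duration of 100 ms with the +/-20% interval: 80 to 120 ms is accepted.
checks = [within_interval(d, 100.0) for d in (80.0, 100.0, 120.0, 121.0)]
# checks == [True, True, True, False]
```

Passing `tolerance=0.10` or `tolerance=0.05` gives the narrower preferred intervals.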
Preferably, said element is a word. Examples of words are: one, two, car, umbrella. According to this preferred variant, the method of the invention gives even better results. Each word is determined from a message said by a user by a speech recognition engine using a dictionary. Grammar rules may reduce the choice of possible words from a dictionary.
Preferably, said speech recognition parameter of step a) is a confidence rate. Thus, in this preferred embodiment, a confidence rate of an element is used as the speech recognition parameter. Preferably, it is obtained in step b) from the speech recognition engine from which said result originates. This type of speech recognition parameter is in fact generally provided by a speech recognition engine. This variant of the method of the invention is therefore particularly easy to implement.
Preferably, said speech recognition parameter of step a) is a duration. Thus, in this preferred embodiment, a duration of an element is used as the speech recognition parameter. Preferably, it is obtained in step b) from the speech recognition engine from which said result originates. Each element of a speech recognition result corresponds to a duration or time interval. This type of speech recognition parameter is generally provided by the speech recognition engine. This variant of the method of the invention is therefore particularly easy to implement.
Preferably, two values of the two following speech recognition parameters, confidence rate and duration, are used to determine the validity of an element.
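When both parameters are used, one natural reading (an assumption here, in line with the earlier statement that an element may be deemed valid if all values correspond within an interval) is to require every parameter to fall inside its interval. The names `element_is_valid`, `reference_two`, `spoken_two` and `hesitation`, and all numeric values, are hypothetical.

```python
def element_is_valid(observed, reference, tolerance=0.20):
    """Valid only if every reference parameter is matched within the interval."""
    return all(
        abs(observed[name] - ref_value) <= tolerance * abs(ref_value)
        for name, ref_value in reference.items()
    )


# Predetermined values for one speaker and the element 'two'.
reference_two = {"duration_ms": 300.0, "confidence": 4500.0}

# A genuine 'two' and a long hesitation ('uh') that the engine labelled 'two'.
spoken_two = {"duration_ms": 320.0, "confidence": 4300.0}
hesitation = {"duration_ms": 650.0, "confidence": 4400.0}

valid_spoken = element_is_valid(spoken_two, reference_two)
valid_hesitation = element_is_valid(hesitation, reference_two)
```

In this sketch the genuine 'two' is accepted although its confidence rate (4300) is below the generic threshold of 6000 cited in the state of the art, while the hesitation is rejected on duration alone.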
According to another possible variant, said speech recognition parameter of step a) is a time interval separating an element from another directly adjacent element in a speech recognition result. The value of such a speech recognition parameter can be predetermined during a preliminary calibration step. Preferably, it is a time interval separating one element from the directly following element in a speech recognition result, i.e. the element directly adjacent to it towards the end of the speech recognition result.
Preferably, the method of the invention further comprises: a step of providing, for an element which is identical to said element of said result and for different speakers including said speaker of said message, different predetermined values of the same speech recognition parameter as that of step a), each of said different predetermined values being associated with one of said speakers; and a step of selecting, from among said different predetermined values of said speech recognition parameter, the one which corresponds to the speaker of said message, to obtain said predetermined value of step a).
With this preferred embodiment, the method of the invention can be easily used for different speakers. For this preferred embodiment, a predetermined value of the speech recognition parameter of said different predetermined values is therefore associated with a single speaker.
Preferably, the method of the invention further comprises: a step of providing, for said speaker and for different elements including said element whose validity is sought, different predetermined values of the same speech recognition parameter as that of step a), each of said different predetermined values being associated with one of said different elements; and a step of selecting, from among said different predetermined values of said speech recognition parameter, the one which corresponds to an element which is identical to said element of said speech recognition result, to obtain said predetermined value of step a). With this preferred embodiment, the method of the invention can be easily used for different elements (for example different words). For this preferred embodiment, a predetermined value of the speech recognition parameter of said different predetermined values is therefore associated with a single element.
Preferably, the method further comprises: a step of providing, for different speakers including said speaker of said message and for different elements including said element whose validity is sought, different predetermined values of the same speech recognition parameter as that of step a), each of said different predetermined values being associated with only one of said different speakers and only one of said different elements; and a step of selecting, from among said different predetermined values of said speech recognition parameter, the one which corresponds to the speaker of said message and to an element which is identical to said element of said speech recognition result, to obtain said predetermined value of step a).
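The providing and selecting steps amount to a lookup keyed by speaker and element. A minimal sketch, assuming a plain dictionary as storage; the speaker names, the table contents and the function name `select_reference` are hypothetical.

```python
def select_reference(table, speaker, element, parameter):
    """Return the predetermined value for this speaker and element, or None."""
    return table.get((speaker, element), {}).get(parameter)


# Each predetermined value is tied to exactly one speaker and one element.
reference_table = {
    ("alice", "two"): {"duration_ms": 300.0},
    ("alice", "five"): {"duration_ms": 340.0},
    ("bob", "two"): {"duration_ms": 260.0},
}

bob_two = select_reference(reference_table, "bob", "two", "duration_ms")
bob_five = select_reference(reference_table, "bob", "five", "duration_ms")
```

Returning `None` for a missing pair lets the caller fall back to another test, for example a generic threshold, when no speaker-specific value exists yet.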
With this preferred embodiment, the method of the invention can be easily used for different speakers and different elements (for example different words).
According to another example, the method of the invention further comprises a step of determining an identity of said speaker of said message. An identity of said speaker can be determined in different ways. According to a first possible variant, it is determined when the speaker opens a session and is asked to enter their name in a computer system. According to another possible variant, the speech recognition engine from which said result is derived is able to recognize the identity of said speaker of said message. Other variants are possible.
The inventors also propose a method of post-processing a speech recognition result that uses a method for determining the validity of an element of said result as described above. This method of post-processing a speech recognition result, which comprises a beginning, an end and a plurality of elements distributed between said beginning and said end, comprises the following steps:
i. receiving said result;
ii. isolating an element of said plurality of elements that has not yet undergone the validation test of step iii.a.;
iii. a. if an element was isolated in step ii., determining whether it is valid by using a validation test, b. otherwise, going directly to step v.;
iv. repeating steps ii. and iii. (in that order: step ii., then step iii.);
v. if at least one element has been determined valid in step iii.a., determining a post-processed solution by using at least one element determined valid in step iii.a.
This post-processing method is characterized in that said validation test of step iii.a., to determine if the element isolated in step ii. is valid, includes a method for determining the validity of an element of a result as described above.
This post-processing method is particularly efficient and reliable thanks to the advantages provided by the method of determining the validity of an element of the result.
If no element has been determined valid in step iii.a., step v. preferably comprises the following sub-step: determining a post-processed solution that does not include any element of said result. In this preferred variant, and when no element has been determined valid in step iii.a., various examples of a post-processed solution are: an empty message, that is to say one that includes no element (no word, for example); a message stating that the post-processing was unsuccessful; or the result provided by the speech recognition engine (no filtering of the result in this case).
Preferably, each element isolated in step ii. is selected consecutively, starting from said end of the result and moving towards the beginning of the result.
Consecutively means without skipping an element. Thus, the different elements are considered in reverse chronological order without skipping any. With this preferred variant, a speech recognition result is thus traversed from the end to the beginning. The inventors have indeed discovered that a person dictating a message to a speech recognition engine is more likely to hesitate and/or make a mistake at the beginning than at the end. By processing a speech recognition result from the end rather than from the beginning, the post-processing method of this preferred variant favors the part of the result that is most likely to contain the right information. In the end, this method is therefore more reliable.
Take the following example. Suppose the code to be read is 4531. The operator, reading it, says "5, 4, uh, 4, 5, 3, 1", and a speech recognition engine provides the result "5, 4, 2, 4, 5, 3, 1" as text to a post-processing system. Assume that this post-processing system (which can be built into a speech recognition engine) knows that the code should not contain more than four valid elements (digits in this case). A post-processing method that traverses the result from the beginning to the end will provide 5424 as the post-processed solution (and not 4531). The post-processing method of the invention will provide 4531, i.e. the correct solution.
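The 4531 example can be traced with a short sketch of the end-to-beginning traversal. Everything named here is an assumption made for illustration: the `Element` type, the function `postprocess_from_end`, the per-digit reference durations, the observed durations, and the choice of a +/-20% duration test as the validation test.

```python
from typing import NamedTuple


class Element(NamedTuple):
    text: str
    duration_ms: float


def postprocess_from_end(result, is_valid, max_elements):
    """Walk the result from its end, keeping valid elements up to max_elements."""
    kept = []
    for element in reversed(result):
        if len(kept) == max_elements:
            break
        if is_valid(element):
            kept.append(element)
    kept.reverse()  # restore chronological order
    return kept


# Hypothetical reference durations for one speaker, per digit.
reference_ms = {"1": 300.0, "2": 310.0, "3": 320.0, "4": 330.0, "5": 340.0}


def duration_matches(element, tolerance=0.20):
    ref = reference_ms[element.text]
    return abs(element.duration_ms - ref) <= tolerance * ref


# Engine output for "5, 4, uh, 4, 5, 3, 1": the 'uh' came back as a long '2'.
result = [
    Element("5", 335.0), Element("4", 325.0), Element("2", 600.0),
    Element("4", 332.0), Element("5", 338.0), Element("3", 318.0),
    Element("1", 305.0),
]
solution = "".join(e.text for e in postprocess_from_end(result, duration_matches, 4))
```

With these assumed values the traversal collects 1, 3, 5, 4 from the end and returns them in chronological order, giving 4531.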
The inventors have noticed that the situation illustrated by this example, that is to say an operator being more likely to hesitate or make a mistake at the beginning than at the end of the recorded sequence, is more frequent than the reverse. Thus, overall, the post-processing method according to this preferred variant is more reliable because it provides fewer incorrect results. The chances of obtaining a correct post-processed solution are also higher. The post-processing method according to this preferred variant is therefore also more effective.
The post-processing method of the invention has other advantages. It is easy to implement. In particular, it does not require many implementation steps, and the steps themselves are simple. These aspects facilitate its integration, for example into a computer system that uses a speech recognition result, or into a speech recognition engine.
The method of post-processing a speech recognition result can be seen as a method of filtering a speech recognition result: indeed, the invalid elements are not used to determine the post-processed solution.
Preferably, step iii.a. further comprises an instruction to proceed directly to step vi. if the element undergoing the validation test of step iii.a. is not determined to be valid. According to this preferred variant, a post-processed solution for which at least one element has been selected in step iv. includes only valid consecutive elements of the speech recognition engine result. The reliability of the method is then further improved because only a series of valid consecutive elements is kept. Preferably, the post-processing method comprises the following step: vi. determining whether said post-processed solution of step v. satisfies a grammar rule.
By using a grammar rule, the reliability of the method of the invention can be further increased. In particular, an aberrant result can be filtered out more effectively. An example of a grammar rule is an allowed range for the number of words in the post-processed solution. For example, one could define the following grammar rule: the post-processed solution must contain between three and six words. Preferably, when a grammar rule is used, the method of the invention further comprises the following step: vii. a. if the answer to the test of step vi. is positive, providing said post-processed solution, b. otherwise, providing said speech recognition result.
According to another possible variant, the method of the invention comprises the following step when a grammar rule is used: vii. a. if the answer to the test of step vi. is positive (that is, the post-processed solution satisfies the grammar rule), providing the post-processed solution, b. if the answer to the test of step vi. is negative (that is, the post-processed solution does not satisfy the grammar rule), providing no post-processed solution, or providing an empty message, or providing a message stating that no satisfactory post-processed solution could be determined.
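Steps vi. and vii. can be sketched with the word-count rule given as an example (between three and six words). The function names and the `fall_back_to_raw` flag are hypothetical; the flag switches between the two variants of step vii.b. described above, with `None` standing in for "no post-processed solution".

```python
def satisfies_word_count(solution, min_words=3, max_words=6):
    """Step vi.: grammar rule on the number of elements in the solution."""
    return min_words <= len(solution) <= max_words


def finalize(solution, raw_result, fall_back_to_raw=True):
    """Step vii.: return the solution if the rule holds, else apply a fallback."""
    if satisfies_word_count(solution):
        return solution
    # First variant: return the unfiltered result. Second variant: return nothing.
    return raw_result if fall_back_to_raw else None


raw = ["5", "4", "2", "4", "5", "3", "1"]
accepted = finalize(["4", "5", "3", "1"], raw)
too_short = finalize(["1"], raw)
rejected = finalize(["1"], raw, fall_back_to_raw=False)
```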
In addition to including a method for determining the validity of an element as described for the first aspect of the invention, the validation test of step iii.a. may comprise one or more other steps. According to a first possible example, said validation test of step iii.a. also includes a step of considering an element valid if its duration is greater than or equal to a lower duration threshold. Each element of the result corresponds to a duration or time interval, which is generally provided by the speech recognition engine. With this preferred embodiment, it is possible to reject more effectively elements that are of short duration, such as parasitic noise coming from a machine.
According to another possible example, said validation test of step iii.a. includes a step of considering an element valid if its duration is less than or equal to an upper duration threshold.
With this preferred embodiment, it is possible to reject more effectively elements that are of long duration, such as a hesitation of a speaker who says, for example, 'uh'. By using this preferred embodiment, it is easier to eliminate 'uh' and thus avoid confusing it with 'two'.
According to another possible example, said validation test of step iii.a. includes a step of considering an element valid if its confidence rate is greater than or equal to a minimum confidence rate.
The reliability of the post-processing method is further increased in this case.
According to another possible example, said validation test of step iii.a. comprises a step of considering an element valid if the time interval separating it from the directly adjacent element towards said end of the result is greater than or equal to a minimum time interval.
With this preferred embodiment, it is possible to reject more efficiently elements that are not generated by a human being but rather by a machine for example and which are temporally very close together.
According to another possible example, said validation test of step iii.a. comprises a step of considering an element valid if the time interval separating it from the directly adjacent element towards said end of the result is less than or equal to a maximum time interval.
With this variant, it is possible to more effectively reject elements that are temporally greatly separated from each other.
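The additional threshold tests described above can be combined in a single predicate, as in the following sketch. The `Limits` class, its field names, the function `passes_threshold_tests` and all numeric limits are hypothetical; only the 0 to 10 000 confidence scale comes from the text. The sketch requires all tests to pass, which is one possible way of combining them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical generic limits, not specific to any speaker."""
    min_duration_ms: float = 150.0   # rejects short parasitic noise
    max_duration_ms: float = 500.0   # rejects long hesitations such as 'uh'
    min_confidence: float = 3000.0   # on the 0 to 10 000 scale
    min_gap_ms: float = 50.0         # rejects machine-like rapid sequences
    max_gap_ms: float = 2000.0       # rejects widely separated elements


DEFAULT_LIMITS = Limits()


def passes_threshold_tests(duration_ms: float, confidence: float,
                           gap_to_next_ms: Optional[float],
                           limits: Limits = DEFAULT_LIMITS) -> bool:
    """Apply the duration, confidence and gap tests; all must pass."""
    if not limits.min_duration_ms <= duration_ms <= limits.max_duration_ms:
        return False
    if confidence < limits.min_confidence:
        return False
    # The last element has no neighbour towards the end, so no gap to test.
    if gap_to_next_ms is not None:
        if not limits.min_gap_ms <= gap_to_next_ms <= limits.max_gap_ms:
            return False
    return True


normal_word = passes_threshold_tests(300.0, 5000.0, 400.0)
short_noise = passes_threshold_tests(40.0, 5000.0, 400.0)
long_hesitation = passes_threshold_tests(650.0, 5000.0, 400.0)
```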
Preferably, all the elements determined valid in step iii.a. are used to determine said post-processed solution in step v.
The inventors also propose a method for generating an optimized solution in voice recognition and comprising the following steps: receiving a plurality of voice recognition results, each comprising one or more elements; selecting one or more valid elements of one or more results of said plurality of results, said one or more elements being determined valid by any of the preferred methods of the first aspect of the invention; generating said optimized solution from the at least one valid element selected in the previous step.
According to a second aspect, the invention relates to a system (or device) for determining the validity of an element of a speech recognition result, said result being derived from a speech recognition algorithm applied to a message spoken by a speaker, said device comprising: acquisition means for receiving a value of a speech recognition parameter which is predetermined for said speaker and for an element which is identical to said element of said speech recognition result whose validity is sought; acquisition means for receiving a value of the same speech recognition parameter for the element of said speech recognition result whose validity is sought; processing means for comparing the values received by the acquisition means to determine the validity of said element of said speech recognition result.
The advantages associated with the method according to the first aspect of the invention apply to the system of the invention, mutatis mutandis. Thus, in particular, it is possible to determine the validity of an item of a speech recognition result more reliably and more effectively with the system of the invention. The various embodiments presented for the method according to the first aspect of the invention, apply to the system of the invention, mutatis mutandis.
According to a third aspect, the invention relates to a program (preferably a computer program) for determining the validity of an element of a speech recognition result, said result being derived from a voice recognition algorithm applied to a speaker-spoken message, said program comprising code to enable a device to perform the following steps: a) read a value of one (or more) voice recognition parameters which is predetermined for said speaker and for an element that is identical to said element of said voice recognition result; b) reading one (or more) values of the same speech recognition parameter as that of step a) for said element of said speech recognition result; c) comparing the values obtained in steps a) and b) to determine the validity of said element of said speech recognition result.
Preferably, said device is a speech recognition engine or a computer that can communicate with a voice recognition engine.
The advantages associated with the method and the system according to the first and second aspects of the invention apply to the program of the invention, mutatis mutandis. Thus, in particular, it is possible to determine the validity of an element of a speech recognition result more efficiently and reliably with the program of the invention. The various embodiments presented for the method according to the first aspect of the invention, apply to the program of the invention, mutatis mutandis.
According to a fourth aspect, the invention relates to a storage medium that can be connected to a device and comprising instructions which, when read, enable said device to determine the validity of an element of a voice recognition result, said result being derived from a speech recognition algorithm applied to a message said by a speaker, said instructions causing said device to perform the following steps: a) reading a value of one (or more) voice recognition parameter(s) which is predetermined for said speaker and for an element which is identical to said element of said speech recognition result; b) reading one (or more) value(s) of the same voice recognition parameter as that of step a) for said element of said result; c) comparing the values obtained in steps a) and b) to determine the validity of said element of said speech recognition result.
Preferably, said device is a speech recognition engine or a computer that can communicate with a voice recognition engine.
The advantages associated with the method according to the first aspect of the invention, apply to the storage medium of the invention, mutatis mutandis. Thus, in particular, it is possible to determine the validity of an element of a speech recognition result more efficiently and more reliably with the storage medium of the invention. The various embodiments presented for the method according to the first aspect of the invention, apply to the storage medium of the invention, mutatis mutandis.
BRIEF DESCRIPTION OF THE FIGURES [0053] These aspects as well as other aspects of the invention will be clarified in the detailed description of particular embodiments of the invention, reference being made to the drawings of the figures, in which: Fig. 1 schematically shows a speaker saying a message that is processed by a speech recognition engine; Fig. 2 schematically shows an example of a result from a speech recognition engine; Fig. 3 schematically shows a preferred version of a post-processing method of a speech recognition result; Fig. 4 schematically shows an example of a system according to the invention.
The drawings of the figures are not to scale. Generally, similar elements are denoted by similar references in the figures. The presence of reference numbers in the drawings cannot be considered as limiting, even when these numbers are indicated in the claims.
DETAILED DESCRIPTION OF PARTICULAR EMBODIMENTS [0054] Fig. 1 shows a speaker 40 (or user 40) saying a message 50 into a microphone 5. This message 50 is then transferred to a voice recognition engine 10, which is known to a person skilled in the art. Different models and different brands are available on the market. In general, the microphone 5 is part of the voice recognition engine 10. The latter processes the message 50 with speech recognition algorithms, based for example on a hidden Markov model (HMM). This yields a result 100 from the voice recognition engine 10. An example of result 100 is a hypothesis generated by the voice recognition engine 10. Another example of result 100 is a solution obtained from the speech recognition algorithms and from post-processing operations which are, for example, applied to one or more hypotheses generated by the voice recognition engine 10. Post-processing modules providing such a solution can be part of the voice recognition engine 10. The result 100 is generally in the form of a text that can be deciphered by a machine, a computer or a processing unit, for example. The result 100 is characterized by a beginning 111 and an end 112. The beginning 111 is prior to said end 112 along a time scale, t. The result 100 generally comprises a plurality of elements 113 temporally distributed between the beginning 111 and the end 112. An element 113 represents information between two different times along the time scale, t. In general, an element 113 is a portion of the result 100 that is not considered silence and/or background noise. An example of element 113 is a word. In general, the various elements 113 are separated by portions of the result 100 representing a silence, a background noise, or a time interval during which no element 113 (word for example) is recognized by the voice recognition engine 10.
The message 50 includes a certain number of spoken elements, for example spoken words such as: one, two, car, umbrella. A voice recognition result 100 comprises a number of elements 113 (or words). These elements 113 are transcribed by a voice recognition engine 10 from the message 50 comprising the spoken elements, generally in a format readable by a human or a computer, for example. If the message 50 includes the following spoken elements: one, two, car, umbrella, the speech recognition result 100 ideally includes the following elements 113: one, two, car, umbrella.
Fig. 2 shows an exemplary voice recognition result 100. Between its beginning 111 and its end 112, the result 100 comprises several elements 113, seven in the case illustrated in Fig. 2. In this figure, the elements 113 are represented as a function of time, t (abscissa). The ordinate, C, represents a confidence level or confidence rate. This concept is known to a person skilled in the art. It is a property or statistic generally associated with each element 113 and can generally be provided by a voice recognition engine 10. A confidence rate represents, in general, a probability that an element of a speech recognition result 100 obtained from a spoken element is the correct one. An example of a voice recognition engine 10 is the VoCon® 3200 V3.14 model from Nuance. In this case, the confidence rate varies between 0 and 10 000. A value of 0 is the minimum confidence rate (very low probability that the element of the speech recognition result is the correct one) and 10 000 is the maximum confidence rate (very high probability that the element of the speech recognition result is the correct one). The higher an element 113 is drawn in Fig. 2, the higher its confidence level 160.
Rather than determining the validity of an element 113 in an absolute way, the inventors propose to use the profile of the speaker 40 who said the message 50 from which the result 100 is derived. In the example of Fig. 2, some elements 113 could be considered invalid under absolute criteria common to all speakers 40. For example, the third element 113 starting from the beginning 111 could be determined invalid because its confidence rate is lower than a minimum confidence rate 161. The fourth element 113 could be considered invalid because its duration is less than a lower duration threshold. Even if such absolute criteria can be used in combination with the method of the invention, the inventors propose to compare the value of one or more voice recognition parameters of an element 113 of the result 100 with a value of the same voice recognition parameter(s) which is predetermined for the same element 113 and for the same speaker 40.
For example, imagine that the speaker 40 is Mr X and that he said a message 50 including, among other things, the spoken word 'two'. Let us assume that a voice recognition engine 10 provides a result 100 from this message 50, comprising an element 113 'two'. The method of the invention requires obtaining a predetermined value of a statistic (for example the duration or the confidence rate) of the element 'two' for Mr X. Such a predetermined value can be obtained through a preliminary step, for example, in which the same speaker 40 said a calibration message 50 to a voice recognition engine 10, said calibration message 50 comprising the spoken word 'two'. From the result 100 of this calibration message 50 provided by the voice recognition engine 10, it is possible to deduce one (or more) value(s) of one (or more) voice recognition parameter(s) associated with the word 'two'. The inventors propose to compare this (or these) predetermined value(s) with one (or more) value(s) of the same voice recognition parameter(s) obtained for the word 'two' of the result 100 whose validity is sought. If the values are close, it can be concluded that the word 'two' of the voice recognition result 100 is valid. This does not prevent, according to certain variants of the invention, the combined use of certain absolute criteria, common to all the elements 113 and/or all the speakers 40, to determine whether an element 113 of the voice recognition result 100 is valid or not. Thus, one can for example reject any element 113 of the result 100 which has a duration greater than two seconds.
Preferably, the method of the invention comprises a step of providing, for different speakers 40 and for different elements 113, different predetermined values of one or more voice recognition parameters. Preferably, each of these predetermined values then corresponds to a single element 113 and to a single speaker 40. It is then a matter of selecting a predetermined value of a speech recognition parameter to determine the validity of an element 113 of a result 100 derived from a message 50 said by a speaker 40. This predetermined value of a speech recognition parameter is then compared with a value of the same voice recognition parameter for the element 113 of the result 100 whose validity is sought. If these two values are sufficiently close, which means for example that their absolute difference is less than 10% of the predetermined value of the speech recognition parameter, the element 113 is determined to be valid. For the implementation of this preferred variant, the method of the invention preferably comprises a preliminary step of voice recognition of a calibration message 50 said by several speakers 40 and comprising several spoken elements. For example, a calibration message 50 includes the following spoken elements: zero, one, two, three, four, five, six, seven, eight, nine, ten.
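The selection and comparison just described can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `build_profile`, `is_close` and `is_element_valid`, the parameter keys `duration_ms` and `confidence`, and all numeric values are hypothetical. The sketch assumes one stored set of values per (speaker, element) pair, and it assumes that an element with no profile entry is treated as not valid, a case the description leaves open.

```python
from typing import Dict, Iterable, Tuple

Params = Dict[str, float]                # e.g. {"duration_ms": 300.0, "confidence": 6000.0}
Profile = Dict[Tuple[str, str], Params]  # keyed by (speaker, element)


def build_profile(calibration: Iterable[Tuple[str, str, Params]]) -> Profile:
    """Store one set of predetermined values per (speaker, element) pair."""
    profile: Profile = {}
    for speaker, element, params in calibration:
        profile[(speaker, element)] = dict(params)
    return profile


def is_close(measured: float, predetermined: float, tolerance: float = 0.10) -> bool:
    """True if the absolute difference is less than `tolerance` times the predetermined value."""
    return abs(measured - predetermined) < tolerance * abs(predetermined)


def is_element_valid(profile: Profile, speaker: str, element: str,
                     measured: Params, tolerance: float = 0.10) -> bool:
    """Compare measured values with those predetermined for this speaker and element."""
    reference = profile.get((speaker, element))
    if reference is None:
        return False  # assumption: without a profile entry the element is not validated
    shared = [name for name in reference if name in measured]
    if not shared:
        return False  # assumption: at least one common parameter is required
    return all(is_close(measured[name], reference[name], tolerance) for name in shared)


# Hypothetical calibration values for the word 'two' said by two speakers.
profile = build_profile([
    ("mr_x", "two", {"duration_ms": 300.0, "confidence": 6000.0}),
    ("mr_y", "two", {"duration_ms": 450.0, "confidence": 5000.0}),
])
```

With these made-up values, a 'two' lasting 320 ms with a confidence rate of 6100 would be accepted for Mr X, whereas the same word lasting 340 ms would be rejected, because 40 ms exceeds 10% of his predetermined 300 ms.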
The inventors also propose a method of post-processing a speech recognition result 100. For this post-processing method, the input is a result 100 output by a speech recognition engine 10.
The first step of this post-processing method, step i., consists in receiving the result 100. Then, preferably starting from the end 112 of the result 100, the method isolates a first element 113. According to this preferred variant, the post-processing method therefore first isolates the last element 113 of the result 100 along the time scale, t. Once this element 113 has been chosen, the method determines whether it is valid using a validation test. The validation test may include several methods, multiple tests, or multiple steps. In any case, it includes at least one method as described above for determining whether an element 113 is valid, which uses a profile of the speaker 40 who said the message 50 from which the result 100 is derived. Other methods may be combined with this method during the validation test.
Then, a second element 113 starting from the end 112 is considered, and so on. According to a possible version of the post-processing method, all the elements 113 of the result 100 are thus traversed along the arrow shown at the top of Fig. 2, and the traversal stops once the first element 113 along the time scale, t, has been determined valid or not. According to another preferred variant, the traversal of the elements 113 of the result 100 along the arrow at the top of Fig. 2 stops as soon as an element 113 has been detected as not valid. A post-processed solution 200 is then determined from elements 113 which have been determined to be valid, preferably using all elements 113 that have been determined valid. When determining the post-processed solution 200, the correct order of the various selected elements 113 along the time scale, t, must be kept. Thus, it should be taken into account that the first element 113 processed by the post-processing method represents, for certain preferred versions, the last element 113 of the result 100 and must therefore be placed last in the post-processed solution 200 if it has been determined to be valid. In general, a voice recognition engine 10 provides, with the different elements 113 of the result 100, associated time information, for example the beginning and the end of each element 113. This associated time information can be used to put the elements 113 determined valid in step iii.a. in the correct order, that is to say in increasing chronological order.
[0063] Preferably, the post-processing method comprises a step of verifying that the post-processed solution 200 satisfies a grammar rule. An example of a grammar rule is a number of words. If the post-processed solution 200 does not satisfy such a grammar rule, it may be decided not to provide it. In this case, it is sometimes preferred to provide the result 100 of the voice recognition engine 10. If the post-processed solution 200 satisfies such a grammar rule, then it will be preferred to provide it.
Fig. 3 shows in schematic form a preferred version of the post-processing method where: - the method stops isolating (or choosing) further elements 113 for the validation test as soon as an invalid element 113 has been detected, where - it is verified that the post-processed solution 200 satisfies a grammar rule (step vii.), where - the post-processed solution 200 is provided if it satisfies said grammar rule, and where - the result 100 of the voice recognition engine 10 is provided if the post-processed solution 200 does not satisfy said grammar rule.
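The flow of this preferred version can be sketched in Python as follows. This is an illustrative sketch only: the `Element` type, the `post_process` function and its callback parameters `is_valid` and `satisfies_grammar` are hypothetical names, and returning the engine's result when no element at all is valid is an assumption, since the description only states the fallback for a failed grammar rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Element:
    text: str
    start_ms: int  # time information assumed to be supplied by the engine
    end_ms: int


def post_process(result: Sequence[Element],
                 is_valid: Callable[[Element], bool],
                 satisfies_grammar: Callable[[List[Element]], bool]) -> List[Element]:
    """Walk from the end of the result towards its beginning, stop at the first invalid element."""
    valid: List[Element] = []
    for element in sorted(result, key=lambda e: e.start_ms, reverse=True):
        if not is_valid(element):
            break  # preferred variant: stop as soon as an invalid element is found
        valid.append(element)
    if not valid:
        return list(result)  # assumption: fall back to the engine's result
    # Restore increasing chronological order using the time information.
    solution = sorted(valid, key=lambda e: e.start_ms)
    return solution if satisfies_grammar(solution) else list(result)


# Hypothetical result: a spurious leading element followed by three spoken digits.
sample = [Element("uh", 0, 100), Element("one", 200, 500),
          Element("two", 600, 900), Element("three", 1000, 1300)]
```

Here `is_valid` stands for the validation test of step iii.a., and `satisfies_grammar` could be as simple as a check on the number of words, for example `lambda s: len(s) == 3`.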
Step iii.a is to determine whether an element 113 selected in step ii. is valid using a validation test. In addition to the method for determining the validity of an element 113 according to the first aspect of the invention, the validation test may comprise other steps in combination.
An element 113 is characterized by a beginning and an end. It therefore has a certain duration 150. According to one possible variant, the validation test comprises a step of considering an element 113 valid if its duration 150 is greater than or equal to a lower duration threshold. The lower duration threshold is for example between 50 and 160 milliseconds. Preferably, the lower duration threshold is 120 milliseconds. The lower duration threshold can be adapted dynamically. According to another possible variant, the validation test comprises a step of considering an element 113 valid if its duration 150 is less than or equal to an upper duration threshold. The upper duration threshold is for example between 400 and 800 milliseconds. Preferably, the upper duration threshold is 600 milliseconds. The upper duration threshold can be adapted dynamically. Preferably, the lower duration threshold and/or the upper duration threshold is/are determined by a grammar.
In general, a confidence level 160 is associated with each element 113. According to another possible variant, the validation test comprises a step of considering an element 113 valid if its confidence level 160 is greater than or equal to a minimum confidence level 161. The minimum confidence level 161 may preferably vary dynamically. In such a case, it is possible that the minimum confidence level 161 used to determine whether one element 113 is valid differs from that used to determine whether another element 113 is valid. The inventors have found that a minimum confidence level 161 between 3500 and 5000 provides good results, a still more preferred value being 4000 (values for the VoCon® 3200 V3.14 model from Nuance, which can be adapted to other models of voice recognition engine 10).
According to another possible variant, the validation test comprises a step of considering an element 113 valid if a time interval 170 separating it from another element 113 directly adjacent towards the end 112 of the result 100 is greater than or equal to a minimum time interval. Such a minimum time interval is for example between zero and fifty milliseconds. According to another possible variant, the validation test comprises a step of considering an element 113 valid if a time interval 170 separating it from another element 113 directly adjacent towards the end 112 of the result 100 is less than or equal to a maximum time interval. Such a maximum time interval is for example between 300 and 600 milliseconds and a preferred value is 400 ms. For these two examples of validation test, the time interval 170 considered is therefore the one which separates an element 113 from its direct neighbor to the right in Fig. 2, that is to say its posterior neighbor along the time scale, t. A time interval separating two elements 113 is for example a time interval during which a speech recognition engine 10 does not recognize any element 113, for example no word.
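The absolute criteria of the last three paragraphs (duration, confidence level, and time interval to the next element) can be gathered in one short Python sketch. The description presents them as separate possible variants, so combining all of them in a single function is an assumption made here for illustration, as are the function name `passes_absolute_criteria` and the choice to skip the interval check when an element has no posterior neighbor. The default values are the preferred values quoted in the text, with 0 ms chosen from the stated 0 to 50 ms range for the minimum interval.

```python
from typing import Optional


def passes_absolute_criteria(duration_ms: float,
                             confidence: float,
                             gap_to_next_ms: Optional[float] = None,
                             *,
                             min_duration_ms: float = 120.0,
                             max_duration_ms: float = 600.0,
                             min_confidence: float = 4000.0,
                             min_gap_ms: float = 0.0,
                             max_gap_ms: float = 400.0) -> bool:
    """Apply speaker-independent thresholds to one element of a result."""
    # Duration must lie between the lower and upper duration thresholds.
    if not (min_duration_ms <= duration_ms <= max_duration_ms):
        return False
    # Confidence level must reach the minimum confidence level
    # (scale 0 to 10 000, as for the engine quoted in the text).
    if confidence < min_confidence:
        return False
    # Interval to the posterior neighbor; None means there is no such neighbor
    # (assumption: the check is then skipped).
    if gap_to_next_ms is not None and not (min_gap_ms <= gap_to_next_ms <= max_gap_ms):
        return False
    return True
```

Because the thresholds are keyword arguments, a caller can adapt them dynamically per element, which is the flexibility the description mentions for the duration thresholds and the minimum confidence level.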
The inventors also propose a method for generating an optimized solution in voice recognition and comprising the following steps: receiving a plurality of voice recognition results 100, each comprising one or more elements 113; selecting one or more valid elements 113 of one or more results 100 from said plurality of results 100; generating said optimized solution from the at least one valid element 113 selected in the previous step. The at least one valid element 113 is determined as such by the method according to the first aspect of the invention.
According to a second aspect, the invention relates to a system 11 (or device) for determining the validity of an element 113 of a speech recognition result 100. Fig. 4 schematically illustrates such a system 11 in combination with a voice recognition engine 10, an auxiliary device 15 and a screen 20. In this figure, the system 11 and the voice recognition engine 10 are two separate devices. According to another possible version, the system 11 is integrated into a voice recognition engine 10 so that it is not possible to differentiate them. In such a case, a conventional speech recognition engine 10 is modified or adapted to perform the functions of the system 11 described below. The auxiliary device 15 makes it possible to provide one (or more) predetermined value(s) of one (or more) voice recognition parameter(s) for an element 113 and for a speaker 40. In Fig. 4, this auxiliary device 15 is separate from the system 11. According to another possible version, the auxiliary device 15 is a module that is integrated into the system 11 so that it is not possible to differentiate them. It could also be provided that the auxiliary device 15 is integrated into the voice recognition engine 10. Preferably, the auxiliary device 15 is able to communicate with the voice recognition engine 10.
Examples of system 11 are: a computer, a speech recognition engine 10 adapted or programmed to perform a method according to the first aspect of the invention, a hardware module of a voice recognition engine 10, a hardware module capable of communicating with a voice recognition engine 10. Other examples are nevertheless possible. The system 11 comprises acquisition means 12 for receiving a value 140 of a voice recognition parameter, which is predetermined for said speaker 40 and for an element 113 which is identical to said element 113 of said voice recognition result 100 whose validity is sought. The acquisition means 12 are also able to receive a value of the same voice recognition parameter for the element 113 of said voice recognition result 100 whose validity is sought. In general, this value is provided by the voice recognition engine 10. Examples of acquisition means 12 are: an input port of the post-processing system 11, for example a USB port, an Ethernet port, or a wireless port (e.g. Wi-Fi). Other examples of acquisition means 12 are nevertheless possible.
The system 11 further comprises processing means 13 for determining that an item 113 of a speech recognition result 100 is valid if the value of one of its voice recognition parameters corresponds, within a range, to the predetermined value 140 provided by the auxiliary device 15, which is for example a memory module. Preferably, the system 11 is able to send the result of the validity determination of the element 113 to a screen 20 to display it.
Examples of processing means 13 are: a control unit, a processor or central processing unit, a controller, a chip, a microchip, an integrated circuit, a multi-core processor. Other examples known to those skilled in the art are nevertheless possible.
Preferably, the system 11 is also capable of performing a method of post-processing a voice recognition result 100 as described above. In this case, the acquisition means 12 are preferably able to receive a voice recognition result 100 and the processing means 13 are preferably able to apply such a post-processing method.
According to a third aspect, the invention relates to a program, preferably a computer program. Preferably, this program is part of a human-machine voice interface.
According to a fourth aspect, the invention relates to a storage medium that can be connected to a device, for example a computer that can communicate with a voice recognition engine 10. According to another possible variant, this device is a voice recognition engine 10. Examples of storage media according to the invention are: a USB key, an external hard disk, a CD-ROM type disk. Other examples are nevertheless possible.
The present invention has been described in relation to specific embodiments, which have a purely illustrative value and should not be considered as limiting. In general, the present invention is not limited to the examples illustrated and/or described above. The use of the verbs "to comprise", "to include", or any other variant, as well as their conjugations, can in no way exclude the presence of elements other than those mentioned. The use of the indefinite article "a" or "an", or the definite article "the", to introduce an element does not exclude the presence of a plurality of these elements. The reference numerals in the claims do not limit their scope.
In summary, the invention can also be described as follows. A method of determining the validity of an element 113 of a speech recognition result 100 comprising the steps of: providing a value of a speech recognition parameter, which is predetermined for the speaker 40 and for an element 113 identical to said element 113 of said result 100 whose validity is sought; obtaining from said result 100 a value of the same voice recognition parameter for said element 113 whose validity is sought; determining that said element 113 of said result 100 is valid if the value of the speech recognition parameter obtained from said result 100 is identical, within an interval, to the predetermined value for said speaker 40 for this element 113.

Claims (15)
[1]
1. A method for determining the validity of an element (113) of a speech recognition result (100), said result (100) being generated by a speech recognition algorithm applied to a message (50) said by a speaker (40), said method comprising the steps of: a) obtaining a value of a voice recognition parameter, which is predetermined: for said speaker (40), and for an element (113) which is identical to said element (113) of said result (100) whose validity is sought; b) obtaining a value of the same voice recognition parameter as that of step a) for said element (113) of said voice recognition result (100) whose validity is sought; c) comparing the values obtained in steps a) and b) to determine the validity of said element (113) of said voice recognition result (100).
[2]
2. Method according to the preceding claim, characterized in that it further comprises a preliminary step of voice recognition of a calibration message (50) said by said speaker (40), said calibration message (50) comprising a spoken element identical to said element (113) of said voice recognition result (100) whose validity is sought.
[3]
3. Method according to any one of the preceding claims, characterized in that step c) comprises the following step c) 1): if the value of the voice recognition parameter obtained in step b) corresponds, within an interval, to the predetermined value of the voice recognition parameter obtained in step a), determining that said element (113) of said voice recognition result (100) is valid.
[4]
4. Method according to any one of the preceding claims characterized in that said voice recognition parameter of step a) is a confidence rate.
[5]
5. Method according to any one of the preceding claims characterized in that said voice recognition parameter of step a) is a duration.
[6]
6. Method according to any one of the preceding claims, characterized in that: - the method further comprises a step of providing, for an element (113) which is identical to said element (113) of said result (100) and for different speakers (40) including said speaker (40) of said message (50), different predetermined values of the same speech recognition parameter as that of step a), each of said different predetermined values being associated with one of said speakers (40); and in that - the method further comprises a step of selecting, from among said different predetermined values of said speech recognition parameter, the one which corresponds to the speaker (40) of said message (50), to obtain said predetermined value of step a).
[7]
7. Method according to any one of claims 1 to 5, characterized in that: - the method further comprises a step of providing, for said speaker (40) and for various elements (113) including said element (113) whose validity is sought, different predetermined values of the same voice recognition parameter as that of step a), each of said different predetermined values being associated with one of said different elements (113); and in that - the method further comprises a step of selecting, from among said different predetermined values of said speech recognition parameter, the one which corresponds to an element (113) which is identical to said element (113) of said voice recognition result (100), to obtain said predetermined value of step a).
[8]
8. Method according to any one of claims 1 to 5, characterized in that: - the method further comprises a step of providing, for different speakers (40) including said speaker (40) of said message (50) and for different elements (113) including said element (113) whose validity is sought, different predetermined values of the same voice recognition parameter as that of step a), each of said different predetermined values being associated with only one of said different speakers (40) and with only one of said different elements (113); and in that - the method further comprises a step of selecting, from among said different predetermined values of said speech recognition parameter, the one which corresponds to the speaker (40) of said message (50) and to an element (113) which is identical to said element (113) of said voice recognition result (100), to obtain said predetermined value of step a).
[9]
9. A method of post-processing a voice recognition result (100), said result (100) comprising a beginning (111), an end (112), and a plurality of elements (113) distributed between said beginning (111) and said end (112), said post-processing method comprising the following steps: i. receiving said result (100); ii. isolating an element (113) from said plurality of elements (113) that has not undergone the validation test of step iii.a.; iii. then: a. if an element (113) has been isolated in step ii., determining whether it is valid using a validation test, b. otherwise, going directly to step v.; iv. repeating steps ii. and iii.; v. if at least one element (113) has been determined valid in step iii.a., determining a post-processed solution (200) using at least one element (113) determined valid in step iii.a.; characterized in that said validation test of step iii.a., to determine whether the element (113) isolated in step ii. is valid, includes a method according to any one of the preceding claims.
[10]
10. Method of post-processing a speech recognition result (100) according to the preceding claim, characterized in that the elements (113) isolated in step ii. are selected consecutively from said end (112) of the result (100) towards said beginning (111) of the result (100).
[11]
11. A method of post-processing a voice recognition result (100) according to any one of claims 9 to 10, characterized in that said validation test of step iii.a. includes a step of considering an element (113) valid if its duration is greater than or equal to a lower duration threshold.
[12]
12. A method of post-processing a speech recognition result (100) according to any one of claims 9 to 11, characterized in that said validation test of step iii.a. includes a step of considering an element (113) valid if its confidence level (160) is greater than or equal to a minimum confidence level (161).
[13]
13. A method for generating an optimized voice recognition solution, comprising the steps of: - receiving a plurality of voice recognition results (100) each comprising one or more elements (113); - selecting one or more valid elements (113) of one or more results (100) of said plurality of results (100), said one or more elements (113) being determined valid by a method according to any one of claims 1 to 8; - generating said optimized solution from the at least one valid element (113) selected in the preceding step.
[14]
14. System (11) for determining the validity of an element (113) of a speech recognition result (100), said result (100) being derived from a speech recognition algorithm applied to a message (50) said by a speaker (40), said system (11) comprising: - acquisition means (12) for receiving a value of a speech recognition parameter, which is predetermined for said speaker (40) and for an element (113) which is identical to said element (113) of said voice recognition result (100) whose validity is sought; - acquisition means (12) for receiving a value of the same voice recognition parameter for the element (113) of said voice recognition result (100) whose validity is sought; - processing means (13) for comparing the values received by the acquisition means (12) to determine the validity of said element (113) of said voice recognition result (100).
[15]
15. A program for determining the validity of an element (113) of a speech recognition result (100), said result (100) being derived from a speech recognition algorithm applied to a message (50) said by a speaker (40), said program comprising code for enabling a device to perform the following steps: a) reading a value of a speech recognition parameter that is predetermined for said speaker (40) and for an element (113) which is identical to said element (113) of said voice recognition result (100); b) reading a value of the same speech recognition parameter as that of step a) for said element (113) of said voice recognition result (100); c) comparing the values obtained in steps a) and b) to determine the validity of said element (113) of said voice recognition result (100).

Similar technologies:

Publication number | Publication date | Patent title

US9405741B1|2016-08-02|Controlling offensive content in output

EP1362343B1|2007-08-29|Method, module, device and server for voice recognition

EP1606796B1|2009-06-03|Distributed speech recognition method

EP0867856A1|1998-09-30|Method and apparatus for vocal activity detection

EP1154405A1|2001-11-14|Method and device for speech recognition in surroundings with varying noise levels

WO2004006222A2|2004-01-15|Method and apparatus for classifying sound signals

EP1585110A1|2005-10-12|System for speech controlled applications

FR2743238A1|1997-07-04|TELECOMMUNICATION DEVICE RESPONDING TO VOICE ORDERS AND METHOD OF USING THE SAME

CN108039181B|2021-02-12|Method and device for analyzing emotion information of sound signal

EP2772916A1|2014-09-03|Method for suppressing noise in an audio signal by an algorithm with variable spectral gain with dynamically scalable hardness

EP1647897A1|2006-04-19|Automatic generation of correction rules for concept sequences

BE1023427B1|2017-03-16|Method and system for determining the validity of an element of a speech recognition result

BE1023458B1|2017-03-27|Method and system for generating an optimized voice recognition solution

BE1023435B1|2017-03-20|Method and system for post-processing a speech recognition result

EP1723635A1|2006-11-22|Method for automatic real-time identification of languages in an audio signal and device for carrying out said method

EP1285435B1|2007-03-21|Syntactic and semantic analysis of voice commands

EP3627510A1|2020-03-25|Filtering of an audio signal acquired by a voice recognition system

EP1981021A1|2008-10-15|Method for estimating the mental health of a person

FR3111004A1|2021-12-03|Method of identifying a speaker

WO2021121784A1|2021-06-24|Method for identifying at least one person on board a motor vehicle by voice analysis

EP1665231B1|2008-03-05|Method for unsupervised doping and rejection of words not in a vocabulary in vocal recognition

EP3319085A1|2018-05-09|Method and system for user authentication by voice biometrics

FR3105499A1|2021-06-25|Method and device for visual animation of a voice control interface of a virtual personal assistant on board a motor vehicle, and a motor vehicle incorporating it

FR2988894A1|2013-10-04|Method for detection of voice to detect presence of word signals in disturbed signal output from microphone, involves comparing detection function with phi threshold for detecting presence of absence of fundamental frequency

FR2966635A1|2012-04-27|Method for displaying e.g. song lyrics of audio content under form of text on e.g. smartphone, involves recognizing voice data of audio content, and displaying recognized voice data in form of text on device

Patent family:

Publication number | Publication date

EP3065132A1|2016-09-07|

BE1023427A1|2017-03-16|

Cited references:

Publication number | Filing date | Publication date | Applicant | Patent title

EP1067512A1|1999-07-08|2001-01-10|Sony International GmbH|Method for determining a confidence measure for speech recognition|

US20020133346A1|2001-03-16|2002-09-19|International Business Machines Corporation|Method for processing initially recognized speech in a speech recognition session|

US8874438B2|2004-03-12|2014-10-28|Siemens Aktiengesellschaft|User and vocabulary-adaptive determination of confidence and rejecting thresholds|

US20070050190A1|2005-08-24|2007-03-01|Fujitsu Limited|Voice recognition system and voice processing system|

US20120209609A1|2011-02-14|2012-08-16|General Motors Llc|User-specific confidence thresholds for speech recognition|

US6754629B1|2000-09-08|2004-06-22|Qualcomm Incorporated|System and method for automatic voice recognition using mapping|

US20140278418A1|2013-03-15|2014-09-18|Broadcom Corporation|Speaker-identification-assisted downlink speech processing systems and methods|

Legal status:

Priority:

Application number | Filing date | Patent title

EP15157920.8|2015-03-06|

EP15157920.8A|EP3065132A1|2015-03-06|2015-03-06|Method and system for determining the validity of an element of a speech recognition result|
