Abstract:
Some known medical terms may function as non-medical terms depending on their particular context. Accordingly, the present inventors devised systems, methods, and software that facilitate determining whether a term that is found in a medical corpus is likely to be a medical term when found in another corpus. An exemplary embodiment receives a term and computes an ambiguity score based on language models for a medical and a non-medical corpus.
Publication number: AU2013213681A1
Application number: U2013213681
Filing date: 2013-08-02
Publication date: 2013-08-22
Inventors: Mark Chaudhary; Christopher C. Dozier; Ravi Kondadadi
Applicant: Thomson Reuters Global Resources ULC
Main IPC classification: G06F17-30
Description:
Systems, Methods, and Software for Assessing Ambiguity of Medical Terms

Copyright Notice and Permission

A portion of this patent document contains material subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyrights whatsoever. The following notice applies to this document: Copyright © 2005-2006, Thomson Global Resources.

Cross-Reference to Related Application

This application claims priority to U.S. provisional application 60/723,483 filed on October 4, 2005. The provisional application is incorporated herein by reference.

Technical Field

Various embodiments of the present invention concern systems, methods, and software for identifying medical content in documents and linking those documents to other documents based on the medical content.

Background

The fantastic growth of the Internet and other computer networks has fueled an equally fantastic growth in the data accessible via these networks. One of the seminal modes for interacting with this data is through the use of hyperlinks within electronic documents.

Hyperlinks are user-selectable elements, such as highlighted text or icons, that link one portion of an electronic document to another portion of the same document or to other documents in a database or computer network. With proper computer equipment and network access, a user can select or invoke a hyperlink and almost instantaneously view the other document, which can be located on virtually any computer system in the world.

Although many hyperlinks are created and inserted into documents manually, recent years have seen development of automated techniques for identifying specific types of document text and linking the identified text using hyperlinks to other related documents.
For example, to facilitate legal research, the Westlaw legal research system automatically identifies legal citations and attorney names in text and links the citations to corresponding legal documents in a database and the attorney names to biographical entries in an online directory. For further details, see U.S. Patent 7,003,719 and U.S. Published Patent Application US2003/0135826A1, both of which are incorporated herein by reference.

Although the automated linking technology in the Westlaw system is highly effective for legal citations and names, the present inventors have identified that this technology is not well suited for other types of content, such as medical terms. For example, the inventors recognize that identifying legal citations and entity names within a text is generally simpler than identifying medical terms because terms may function as medical terms in one context and as non-medical terms in another. Legal citations and person names, on the other hand, generally function as legal citations and person names regardless of context.

Accordingly, the present inventors have identified a need for automated methods of identifying whether terms are medical terms or non-medical terms.

A reference herein to a patent document or other matter which is given as prior art is not to be taken as an admission or a suggestion that the document or matter was known or that the information it contains was part of the common general knowledge as at the priority date of any of the claims.
Summary

One aspect of the invention provides a computer-implemented method for inserting in a document a hyperlink to a medical document associated with a medical term in the document, the method comprising: receiving a term included within a document; determining an ambiguity score for the term based on a probability of the term being a medical term using first and second language models for respective medical and non-medical corpuses of documents, determining an ambiguity score including: determining a first conditional probability based on a first language model related to a medical corpus of documents; determining a second conditional probability based on a second language model related to a non-medical corpus of documents; and determining an aggregate conditional probability representing the first and second conditional probabilities; and determining, based on the ambiguity score, whether to insert a hyperlink in the document to a medical document associated with the term; and inserting the hyperlink in the document if the determination to insert the hyperlink is affirmative.

Another aspect of the invention provides a computerized system comprising: an input for receiving a set of terms; a processor for executing code adapted to determine an ambiguity score for a term from the set of terms based on a probability of the term being a medical term using first and second language models for respective medical and non-medical corpuses of documents; and means for determining, based on the ambiguity score, whether to insert a hyperlink in a first document including the term to a medical document associated with the term; and means for inserting the hyperlink in the first document if the determination to insert the hyperlink is affirmative.
Another aspect of the invention provides a computer-readable medium comprising: a code set configured to receive by a computer a term in a document; a code set configured to determine by the computer an ambiguity score for the term based on a probability of the term being a medical term, wherein the ambiguity score is based on first and second language models for respective medical and non-medical corpuses of documents; and a code set configured to output by the computer the ambiguity score, whereby a hyperlink to a medical document associated with the term is inserted into the document if a determination to insert the hyperlink based on the ambiguity score is affirmative.
Brief Description of Drawings

Figure 1 is a block diagram of an exemplary system 100 which corresponds to one or more embodiments of the present invention.

Figure 2 is a flow chart of an exemplary method of operating system 100 which corresponds to one or more embodiments of the invention.

Detailed Description of Exemplary Embodiments

The following detailed description, which references and incorporates Figures 1 and 2, describes and illustrates one or more exemplary embodiments of the invention. These embodiments, offered not to limit but only to exemplify and teach the invention, are shown and described in sufficient detail to enable those skilled in the art to make and use the invention. Thus, where appropriate to avoid obscuring the invention, the description may omit certain information known to those of skill in the art.

Exemplary Computer System Embodying the Invention

Figure 1 shows a diagram of an exemplary computer system 100 incorporating a system, method, and software for assessing the ambiguity of terms, such as medical terms. Though the exemplary system is presented as an interconnected ensemble of separate components, some other embodiments implement their functionality using a greater or lesser number of components. Moreover, some embodiments intercouple one or more of the components through wired or wireless local- or wide-area networks. (Some embodiments implement one or more portions of system 100 using one or more mainframe computers or servers.) Thus, the present invention is not limited to any particular functional partition.

Generally, system 100 includes input terms 110, term-ambiguity calculator 120, and ambiguity scores output 130.

Input terms 110 includes one or more terms, such as a set of terms from a medical database. In the exemplary embodiment, input terms 110 includes terms from the Unified Medical Language System (UMLS).
The table below shows that UMLS includes a great number of terms in disease, injury, medical procedure, body part, and drug categories.

Category | Terms | Concepts
Disease | 189,712 | 69,948
Injury | 42,141 | 28,997
Medical procedure | 134,179 | 72,918
Body part | 38,041 | 22,260
Drugs | 244,752 | 129,959

In some embodiments, input terms 110 are terms extracted from one or more input documents, such as an electronic judicial opinion or other type of legal document.

Coupled to database 110 is term-ambiguity calculator 120. Calculator 120 includes one or more conventional processors 121, display device 122, interface devices 123, network-communications devices 124, and memory 125. Memory 125, which can take a variety of forms, such as coded instructions or data on an electrical, magnetic, and/or optical carrier medium, includes term-ambiguity software 126. Term-ambiguity software 126 includes various software and data components for determining or calculating for each input term t an ambiguity score, Score(term), defined as

$$\mathrm{Score}(term) = \lambda_1 \frac{\log P(t \mid \mathrm{News\_lang})}{\log P(t \mid \mathrm{UMLS\_lang})} + \lambda_2 \frac{\log P(t \mid \mathrm{Legal\_lang})}{\log P(t \mid \mathrm{UMLS\_lang})}$$

where

$$\log P(t \mid \mathrm{lang}) = \sum_{ngram \in t} \log P(ngram \mid \mathrm{lang})$$

and λ1 and λ2 are constants, which in some embodiments are used to normalize or smooth the scoring function. In some embodiments, λ1 and λ2 are set to 0.5. The exemplary embodiment uses ngram backoff with Witten-Bell smoothing to smooth the language models.

The exemplary scoring function is based on the intuition that medical ngrams, such as "hepatic," occur relatively more often in UMLS than in news or legal and that ngrams such as "drink" will occur relatively more often in news or legal than in UMLS. Because each log probability is negative, a term whose ngrams are more highly predicted by UMLS than by news or legal has a denominator closer to zero and therefore tends to yield a larger score, indicating that the given term is more likely a medical term than not a medical term when found in a news or legal document.
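The scoring function described above can be illustrated with a minimal sketch. It is not the exemplary embodiment: it assumes a toy unigram model with add-one smoothing in place of ngram backoff with Witten-Bell smoothing, and the names `UnigramModel`, `term_log_prob`, and `ambiguity_score` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


class UnigramModel:
    """Toy add-one-smoothed unigram model (an assumed stand-in for the
    ngram backoff model with Witten-Bell smoothing named in the text)."""

    def __init__(self, corpus_tokens):
        self.counts = Counter(corpus_tokens)
        self.total = sum(self.counts.values())
        # One extra vocabulary slot reserves probability mass for unseen words.
        self.vocab_size = len(self.counts) + 1

    def log_prob(self, word):
        # Always strictly negative, so it is safe to use as a denominator.
        return math.log((self.counts[word] + 1) / (self.total + self.vocab_size))


def term_log_prob(term, model):
    """log P(t | lang): sum of the log probabilities of the term's words."""
    words = term.lower().split()
    if not words:
        raise ValueError("term must contain at least one word")
    return sum(model.log_prob(w) for w in words)


def ambiguity_score(term, news, legal, umls, lambda1=0.5, lambda2=0.5):
    """Weighted sum of two log-probability ratios, as in Score(term)."""
    denom = term_log_prob(term, umls)
    return (lambda1 * term_log_prob(term, news) / denom
            + lambda2 * term_log_prob(term, legal) / denom)


if __name__ == "__main__":
    # Tiny made-up corpora, only to show the direction of the score.
    umls = UnigramModel("hepatic failure hepatic artery hepatic vein renal failure".split())
    news = UnigramModel("people drink coffee and drink tea in the morning".split())
    legal = UnigramModel("the court held that the defendant did drink before driving".split())
    for term in ("hepatic", "drink"):
        print(term, round(ambiguity_score(term, news, legal, umls), 3))
```

On these toy corpora "hepatic" scores higher than "drink", which matches the stated intuition. Any per-term length normalization of the log probabilities would cancel in the ratios, so the sketch omits it.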
Term-ambiguity calculator 120 outputs a set 130 of one or more ambiguity scores based on the input terms. (Figure 1 shows that the input terms 110 and output scores 130 are also retained in memory 125.) In the exemplary embodiment, the scores are output as a ranked list, with each score associated with corresponding terms. (Note that a term may include one or more words.)

The ambiguity scores can be used for a variety of purposes, including for example determining whether it is appropriate to insert a link in a document including a given term back to a UMLS document associated with the term. For example, in the output terms shown, the terms having an ambiguity score greater than 1.5 may be considered as clearly being medical terms and thus linked with high confidence back to related UMLS documents. On the other hand, terms such as "word salad" or "anticipatory vomiting" that have lower scores should not generally be linked back to a related UMLS document without contextual corroboration.

Exemplary Operation of System 100

Figure 2 shows a flowchart 200 illustrating an exemplary method of operating system 100. Flow chart 200 includes process blocks 210-230. Though these blocks (and those of other flow charts in this document) are arranged serially in the exemplary embodiment, other embodiments may reorder the blocks, omit one or more blocks, and/or execute two or more blocks in parallel using multiple processors or a single processor organized as two or more virtual machines or subprocessors. Moreover, still other embodiments implement the blocks as one or more specific interconnected hardware or integrated-circuit modules with related control and data signals communicated between and through the modules. Thus, this and other exemplary process flows in this document are applicable to software, firmware, hardware, and other types of implementations.

Block 210 entails receiving a set of terms.
In the exemplary embodiment, this entails receiving a set of terms from UMLS or an input news or legal document into memory 125 of term-ambiguity calculator 120. Execution continues at block 220.

Block 220 entails determining one or more ambiguity scores for one or more of the input terms. In the exemplary embodiment, this entails computing ambiguity scores according to the definition of Score(term) set forth in the equation above, which provides a sum of two conditional probability ratios. Each conditional probability is based on a language model of a set or corpus of documents. In some embodiments, one of the conditional probability ratios is omitted from the scoring function. Also, in some embodiments, the conditional probability ratios are inverted.

Block 230 entails outputting one or more of the determined ambiguity scores. In the exemplary embodiment, this entails outputting in printed or other human-readable form; however, in other embodiments, the output may also be used by another machine, component, or software module, or simply retained in memory.

Conclusion

The embodiments described above are intended only to illustrate and teach one or more ways of practicing or implementing the present invention, not to restrict its breadth or scope. The actual scope of the invention, which embraces all ways of practicing or implementing the teachings of the invention, is defined only by the following claims and their equivalents.

Throughout the description and claims of this specification, the word "comprise" and variations of the word, such as "comprising" and "comprises", are not intended to exclude other additives, components, integers or steps.

The discussion of documents, acts, materials, devices, articles and the like is included in this specification solely for the purpose of providing a context for the present invention.
It is not suggested or represented that any or all of these matters formed part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed before the priority date of each claim of this application.
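The operation described for blocks 210 to 230, together with the use of the scores to decide on linking, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the exemplary embodiment: the scoring function is passed in as a callable, the 1.5 cutoff is taken from the example in the description and treated as a tunable default, and `rank_terms`, `insert_links`, and `url_for` are hypothetical names.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Cutoff taken from the example in the description; treated here as tunable.
LINK_THRESHOLD = 1.5


def rank_terms(terms: Iterable[str],
               score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Blocks 210-230: receive terms, score each one, output a ranked list."""
    scored = [(term, score_fn(term)) for term in terms]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def insert_links(text: str,
                 ranked: List[Tuple[str, float]],
                 url_for: Callable[[str], str],
                 threshold: float = LINK_THRESHOLD) -> str:
    """Wrap each high-scoring term in an HTML anchor; leave the rest alone.

    url_for is a hypothetical lookup from a term to its related document.
    Limitation: overlapping terms, or terms that also occur inside an
    already inserted anchor, are not handled in this sketch.
    """
    for term, score in ranked:
        if score <= threshold:
            continue  # ambiguous term: do not link without further context
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text = pattern.sub(
            lambda m, t=term: f'<a href="{url_for(t)}">{m.group(0)}</a>', text)
    return text


if __name__ == "__main__":
    # Made-up scores standing in for the output of a real scoring function.
    scores = {"hepatic artery": 2.3, "word salad": 1.1}
    ranked = rank_terms(scores, scores.get)
    doc = "The hepatic artery was damaged; the report was word salad."
    print(insert_links(doc, ranked, lambda t: "doc:" + t.replace(" ", "_")))
```

Passing the scorer as a callable keeps the ranking and linking steps independent of how the language models are built.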
Claims:
Claims (20)
[1] 1. A computer-implemented method for inserting in a document a hyperlink to a medical document associated with a medical term in the document, the method comprising: receiving a term included within a document; determining an ambiguity score for the term based on a probability of the term being a medical term using first and second language models for respective medical and non-medical corpuses of documents, determining an ambiguity score including: determining a first conditional probability based on a first language model related to a medical corpus of documents; determining a second conditional probability based on a second language model related to a non-medical corpus of documents; and determining an aggregate conditional probability representing the first and second conditional probabilities; and determining, based on the ambiguity score, whether to insert a hyperlink in the document to a medical document associated with the term; and inserting the hyperlink in the document if the determination to insert the hyperlink is affirmative.
[2] 2. The computer-implemented method of claim 1, wherein the second language model is based on a legal or general news corpus of documents.
[3] 3. The computer-implemented method of either one of claims 1 or 2, wherein the ambiguity score is based on a ratio of a probability of the term given a non-medical corpus to a probability of the term given a medical corpus.
[4] 4. A computerized system comprising: an input for receiving a set of terms; a processor for executing code adapted to determine an ambiguity score for a term from the set of terms based on a probability of the term being a medical term using first and second language models for respective medical and non-medical corpuses of documents; and means for determining, based on the ambiguity score, whether to insert a hyperlink in a first document including the term to a medical document associated with the term; and means for inserting the hyperlink in the first document if the determination to insert the hyperlink is affirmative.
[5] 5. The system of claim 4, wherein the second language model is based on a legal or general news corpus of documents.
[6] 6. The system of either one of claims 4 or 5, wherein each ambiguity score is based on a ratio of a probability of the term given a non-medical corpus to a probability of the term given a medical corpus.
[7] 7. The method of any one of claims 1 to 3, wherein first language model is based on the Unified Medical Language System (UMLS).
[8] 8. The method of any one of claims 1 to 3, further comprising: determining that the ambiguity score meets a specified level.
[9] 9. The method of any one of claims 1 to 3, further comprising: linking the term to a related document.
[10] 10. The system of any one of claims 4 to 6, wherein first language model is based on the Unified Medical Language System (UMLS).
[11] 11. The system of any one of claims 4 to 6, further comprising: determining that the ambiguity score meets a specified level.
[12] 12. The system of any one of claims 4 to 6, further comprising: linking the term to a related document.
[13] 13. A computer-readable medium comprising: a code set configured to receive by a computer a term in a document; a code set configured to determine by the computer an ambiguity score for the term based on a probability of the term being a medical term, wherein the ambiguity score is based on first and second language models for respective medical and non-medical corpuses of documents; and a code set configured to output by the computer the ambiguity score, whereby a hyperlink to a medical document associated with the term is inserted into the document if a determination to insert the hyperlink based on the ambiguity score is affirmative.
[14] 14. The computer-readable medium of claim 13, wherein the first language model is based on a medical corpus of documents and the second language model is based on a legal or general news corpus of documents.
[15] 15. The computer-readable medium of either one of claims 13 or 14, wherein each ambiguity score is based on a ratio of a probability of the term given a non-medical corpus to a probability of the term given a medical corpus.
[16] 16. The computer-readable medium of any one of claims 13 to 15, wherein first language model is based on the Unified Medical Language System (UMLS).
[17] 17. The computer-readable medium of any one of claims 13 to 16, further comprising: determining that the ambiguity score meets a specified level.
[18] 18. The system of any one of claims 13 to 17, further comprising: linking the term to a related document.
[19] 19. The method of any one of claims 1 to 3 or 7 to 9, wherein the ambiguity score is based on
$$\lambda_1 \frac{\log P(t \mid \mathrm{News\_lang})}{\log P(t \mid \mathrm{UMLS\_lang})} + \lambda_2 \frac{\log P(t \mid \mathrm{Legal\_lang})}{\log P(t \mid \mathrm{UMLS\_lang})}$$
where t denotes the term; News_lang denotes a news corpus; UMLS_lang denotes a medical corpus; Legal_lang denotes a legal corpus; λ1 and λ2 are constants; and
$$\log P(t \mid \mathrm{lang}) = \sum_{ngram \in t} \log P(ngram \mid \mathrm{lang})$$
where lang is a placeholder for the corpus of interest and n denotes the number of words in the term t.
[20] 20. The system of any one of claims 4 to 6 or 10 to 12, wherein the means for determining the ambiguity score is adapted to determine the ambiguity score based on
$$\lambda_1 \frac{\log P(t \mid \mathrm{News\_lang})}{\log P(t \mid \mathrm{UMLS\_lang})} + \lambda_2 \frac{\log P(t \mid \mathrm{Legal\_lang})}{\log P(t \mid \mathrm{UMLS\_lang})}$$
where t denotes the term; News_lang denotes a news corpus; UMLS_lang denotes a medical corpus; Legal_lang denotes a legal corpus; λ1 and λ2 are constants; and
$$\log P(t \mid \mathrm{lang}) = \sum_{ngram \in t} \log P(ngram \mid \mathrm{lang})$$
where lang is a placeholder for the corpus of interest; and n denotes the number of words in the term t.
Similar technologies:
Publication number | Publication date | Patent title
CA2624816C|2016-01-26|Systems, methods, and software for assessing ambiguity of medical terms
US7333966B2|2008-02-19|Systems, methods, and software for hyperlinking names
KR100996311B1|2010-11-23|Method and system for detecting spam user created content (UCC)
US8825471B2|2014-09-02|Unsupervised extraction of facts
US8788260B2|2014-07-22|Generating snippets based on content features
US20140379743A1|2014-12-25|Finding and disambiguating references to entities on web pages
CN101501630B|2013-07-03|Method for ranking computerized search result list and its database search engine
US20020133483A1|2002-09-19|Systems and methods for computer based searching for relevant texts
US20190108183A1|2019-04-11|Title rating and improvement process and system
Grouin et al.2009|Testing tactics to localize de-identification
JP2009122807A|2009-06-04|Associative retrieval system
AU2013213681B2|2016-07-07|Systems, methods, and software for assessing ambiguity of medical terms
Rodriguez et al.2019|Non–English language availability of community health center websites
JP2008112310A|2008-05-15|Retrieval device, information retrieval system, retrieval method, retrieval program and recording medium
Thelwall2005|Text characteristics of English language university web sites
EA002016B1|2001-10-22|A method of searching for fragments with similar text and/or semantic contents in electronic documents stored on a data storage devices
JP2009098932A|2009-05-07|Associative retrieval system
Zhou et al.2019|Testing and Evaluating SNOMED CT Web Browsers' Textual Search Feature
JP6764991B1|2020-10-07|Sentence extraction system, sentence extraction method, and program
Ersgard et al.2014|Effectiveness of discharge interventions on readmissions for patients with chronic obstructive pulmonary disease: a systematic review protocol
Cusimano et al.2021|Adverse Fetal Outcomes and Maternal Mortality Following Non-Obstetric Abdominopelvic Surgery in Pregnancy: A Systematic Review and Meta-Analysis
Schares2008|Corpus-Based Word Information Systems for the German Language on the Internet
Adlassnig2009|Testing Tactics to Localize De-Identification
Patent family:
Publication number | Publication date
AU2013213681B2|2016-07-07|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title
Legal status:
2016-11-03| FGA| Letters patent sealed or granted (standard patent)|
2017-12-21| HB| Alteration of name in register|Owner name: THOMSON REUTERS GLOBAL RESOURCES UNLIMITED COMPANY Free format text: FORMER NAME(S): THOMSON REUTERS GLOBAL RESOURCES |
2020-04-23| PC| Assignment registered|Owner name: THOMSON REUTERS ENTERPRISE CENTRE GMBH Free format text: FORMER OWNER(S): THOMSON REUTERS GLOBAL RESOURCES UNLIMITED COMPANY |
Priority:
Application number | Filing date | Patent title
US60/723,483||2005-10-04||
AU2011202308A|AU2011202308A1|2005-10-04|2011-05-18|Systems, methods, and software for assessing ambiguity of medical terms|
AU2013213681A|AU2013213681B2|2005-10-04|2013-08-02|Systems, methods, and software for assessing ambiguity of medical terms|