METHOD AND SYSTEM FOR INTELLIGENTLY INDEXING DOCUMENTS OR TEXT IN A COMPLEX DATABASE.
Abstract:
Method for retrieving documents or texts in a complex database, characterized in that the retrieval is done on the basis of indexes created from one or more "causal" relationships between the text data of the relevant documents or texts, whether or not supplemented with indexes based on the frequency or occurrence in the full texts of words from a thesaurus.
Publication number: BE1018996A3 | Application number: E2009/0678 | Filing date: 2009-11-03 | Publication date: 2011-12-06 | Inventor: | Applicant: Group Dado 13 Bv Met Beperkte Aansprakelijkheid | IPC main class:
Description:
Method and system for intelligent indexing of documents or texts in a complex database.
This invention relates to a method for intelligently indexing documents or texts in a complex database, as well as to a system for carrying out this method. The invention is intended primarily for data systems that store documents or texts which must later be retrievable on the basis of search criteria. More generally, however, the invention can be applied to any complex database in which documents or texts are stored. In particular, the invention contemplates a method for quickly finding documents or texts by means of a combination of several search criteria.
Indexing documents or texts and subsequently retrieving data from them is generally known in the design of databases. In general, there are three methods for converting textual data into indexes. The first is the simple or automatic "non-intelligent" indexing method, in which words are automatically extracted from a text by means of an evaluation system and integrated into an index. The second is the "manual-intelligent" indexing method, in which the person who indexes the documents or texts assigns one or more labels to each document, on the basis of which the document can be found afterwards. The third is the automatic or "intelligent" indexing method, in which the manual assignment of labels of the second method is replaced by an automatic system.
The quality of a method, that is, how well the correct documents or texts can be found afterwards in a minimum of time, clearly depends on the criteria used for indexing. Two basic criteria can mainly be distinguished. The first basic criterion is "exhaustivity", meaning the extent to which the content of a document is fully captured by the index.
A second basic criterion is specificity, which determines the precision with which requested documents or texts can be found. After all, the time needed to find the right documents or texts depends on the way in which the indexes are chosen. To reduce the search time, it is therefore necessary to strike an optimal balance between the possibility of finding the documents or texts and the precision with which they can be found. It is very important here not to create the indexes too exhaustively, so as to avoid that a search for documents or texts on a certain subject brings up much useless information. In such a case, the retrieved documentation is said to contain a lot of "noise". High precision means that only useful information is indexed, by assigning very precise labels.
The indexing is usually done either by "single-term indexing", whereby indexes are assigned to single terms, that is to say words, or by "term relationship indexing", whereby the assigned indexes take into account relationships between different concepts. The known systems for indexing documents or texts have the disadvantage that they are all primarily based on statistical formulas and do not add indexes based on knowledge.
The present invention contemplates a method for automatically creating the indexes of documents or texts in such a way that documents or texts can be retrieved more efficiently, so that the user obtains the requested information very quickly, with the additional advantage that the search is done with great precision, without any noticeable useless information or "noise".
To achieve this goal, the present invention first of all provides a method for finding documents or texts in a complex database, wherein criteria are used for finding one or more relationships between the text data of the relevant documents or texts, characterized in that the aforementioned relationships consist of causal relationships. These causal relationships are used to assign indexes to the documents or texts entered, and the same causal relationships are used when searching, to automatically search for causal or other relationships based on these indexes. By establishing relationships, and more specifically causal relationships, the advantage is obtained that the semantic richness of a thesaurus can be optimally used for indexing documents or texts and/or for retrieving documents or texts from a database during a search.
Preferably, in connection with the present invention, use is made of one or more subject-oriented thesauri, more particularly thesauri relating to a specific field. In a preferred embodiment, in addition to the aforementioned thesaurus or thesauri, a file is also built up and/or applied in which causal relationships are recorded. This helps end users to look for causes and/or relationships in certain contexts.
The basic concept of the aforementioned method of the invention can be realized in practice in various ways. With a view to better demonstrating the characteristics of the invention, a practical and preferred embodiment is described below as an example. According to this preferred embodiment, use is made of a structure in which essentially five basic components can be recognized.
The first component is a synonym database with synonyms and related words. This synonym database is customizable and allows new words to be recorded, as well as new synonyms and equivalent terms.
The second component is a language parser that allows for a syntactic analysis. The purpose of this component is to analyze new documents or texts in order to index them automatically on the basis of semantic relationships, as a function of the specific specialization of the documents or texts.
The language parser automatically generates relevant indexes for each document or text.
The third component is the interactive interrogation component. This component allows the user to enter a number of questions. These interrogation means ensure that the system can always check how many documents or texts are returned when a certain index is entered, as well as how many hits are found when different questions are combined.
The fourth component is a search method based on "causal" or relationship indexes. Relationship indexes are search terms entered by a user based on knowledge of the subject area. They are compositions of indexes that are associated with causal relationships. They make it possible to ask a more specific question and to eliminate useless documents or texts from the result. The search result thus has less "noise".
The fifth component is the interactive interrogation component based on relative relationships. This component allows the user to enter a number of questions. These interrogation means ensure that the system can always check how many documents or texts are returned when a certain index is entered, as well as how many hits are found when different questions are combined. They ensure that the user can search on the basis of relative relationships between documents or texts that are indexed on the basis of relative relationships.
The known search algorithms applied to this set of components ensure that search operations can be performed quickly and efficiently in the specific subject area.
An application of the method is described in detail below to clarify the invention. Since synonyms and related terms are language-specific, the thesaurus is drawn up in one specific language. It is therefore necessary to build a thesaurus for the discipline in the predetermined language. This thesaurus stores important terms from the discipline, together with a number of synonyms or related terms per term.
As a result, when searching, only a limited file of search terms has to be searched, which increases the search speed.
The following processes are applied to a text or document to be indexed using this invention: (a) thesaurus-based indexation, (b) operator intervention in indexing, and (c) causal indexation.
The thesaurus-based indexation (a) is done automatically by letting a program run through the text or document and identifying the terms that occur in the thesaurus as indexes. The synonyms or related terms are not saved as separate indexes.
Terms that are not yet included in the thesaurus are shown to the operator, which requires operator intervention in the indexing (b). The operator checks whether these terms are relevant. Irrelevant terms are not used. Relevant terms are added, either as a synonym or related term of an existing index in the thesaurus, or as a new index. In this way the thesaurus grows in number of indexes, but also in synonyms and related terms.
In addition to the usual word-based indexing, a causal indexation (c) is added. This indexation is based on relationships and connections described in the text, and identifies relationships of the type "from A and B follows C" or "from A and not B follows D" in the text or document. This indexation makes it possible to search for relationships afterwards. The indexation takes place, in the first instance, on the basis of the order of certain conjunctions and phrases in a sentence from which a connection can be recognized, such as:
- If / if / at ... and ... then ....
- When using ... and ... then ....
- When use is made of ... and ... then ....
- ... and ... has the consequence that ....
- ... and ... generates ....
Such relationships are stored as a multiple index in a relationship thesaurus.
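The causal indexation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sample thesaurus entries, the names `THESAURUS`, `to_index` and `causal_indexes`, and the English regular expressions standing in for two of the sentence patterns. The text does not specify an implementation, and negated relationships ("from A and not B follows D") are not handled here.

```python
import re

# Hypothetical thesaurus: preferred index term -> synonyms / related terms.
THESAURUS = {
    "corrosion": ["rust", "oxidation"],
    "moisture": ["humidity", "damp"],
    "steel": ["iron alloy"],
}

# Reverse lookup: any known word -> its preferred index term.
TO_INDEX = {term: term for term in THESAURUS}
for term, synonyms in THESAURUS.items():
    for synonym in synonyms:
        TO_INDEX[synonym] = term

# Assumed English renderings of two of the sentence patterns.
PATTERNS = [
    re.compile(r"\bif (?P<a>.+?) and (?P<b>.+?),? then (?P<c>.+?)[.;]", re.I),
    re.compile(r"(?P<a>[\w ]+?) and (?P<b>[\w ]+?) generates (?P<c>.+?)[.;]", re.I),
]


def to_index(phrase):
    """Return the first thesaurus index term found in a phrase, or None."""
    lowered = phrase.lower()
    for word, index in TO_INDEX.items():
        if re.search(r"\b" + re.escape(word) + r"\b", lowered):
            return index
    return None


def causal_indexes(text):
    """Extract multiple indexes as (frozenset of causes, effect) pairs."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            a, b, c = (to_index(match.group(name)) for name in "abc")
            if a and b and c:  # keep only relations fully covered by the thesaurus
                found.append((frozenset({a, b}), c))
    return found
```

Because each component is mapped to its preferred thesaurus term before being stored, a sentence phrased with synonyms ("humidity", "rust") yields the same multiple index as one phrased with the index terms themselves.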
The multiple index is based on the indexes from the thesaurus, so that a search can also include the synonyms and related terms of each component of the index in the result. The operator also has the option of recognizing relationships himself and assigning them to a text or document.
When searching the database of texts or documents, the user can thus search for words from the text, whereby the result also includes the synonyms and related terms. In addition, a search option is available that allows searching for relationships between terms. This option allows the user to ask a question of the type "if A and B, then C". The system then extracts the indexes from the question on the basis of the thesaurus and uses these indexes to search the relationship thesaurus for a result. When the result is displayed, the user can opt to display only the texts or documents with the complete multiple index, or also the texts and documents to which the multiple index is only partly assigned. This additional option is needed to enable the user to find answers to open questions in the database, in addition to the texts and documents on a topic.
In the example described, the thesaurus is set up in one language. By translating the thesaurus into another language, the texts and documents also become searchable in other languages, without additional indexing of the texts and documents.
This invention is not limited to the example described but also applies to systems that use the method.
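The relationship search with its full and partial display options can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `RELATIONSHIP_THESAURUS`, the document identifiers, and the function `search_relationship` are hypothetical, and "partly assigned" is interpreted here as "at least one component of the query occurs in the stored multiple index", which the text does not define precisely.

```python
# Hypothetical relationship thesaurus: document id -> multiple indexes,
# each stored as (set of cause index terms, effect index term).
RELATIONSHIP_THESAURUS = {
    "doc1": [(frozenset({"moisture", "steel"}), "corrosion")],
    "doc2": [(frozenset({"moisture", "paint"}), "blistering")],
    "doc3": [(frozenset({"salt", "steel"}), "corrosion")],
}


def search_relationship(causes, effect, allow_partial=False):
    """Return ids of documents whose multiple index matches the query.

    A full match needs identical causes and the same effect. With
    allow_partial=True, a document is also returned when at least one
    component of the query (a cause or the effect) occurs in an index.
    """
    causes = frozenset(causes)
    hits = []
    for doc_id, indexes in RELATIONSHIP_THESAURUS.items():
        for index_causes, index_effect in indexes:
            full = index_causes == causes and index_effect == effect
            partial = bool(index_causes & causes) or index_effect == effect
            if full or (allow_partial and partial):
                hits.append(doc_id)
                break  # one matching index is enough for this document
    return hits
```

A query "if moisture and steel, then corrosion" would be passed as `search_relationship({"moisture", "steel"}, "corrosion")` after its terms have been mapped to thesaurus indexes; setting `allow_partial=True` widens the result to documents that share only part of the relationship.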
Claims:
Claims (14)
[1] Method for faster finding of documents or texts in a complex database, characterized in that the display of worthless information or "noise" is avoided by creating indexes based on one or more "causal" relationships between the text data of the documents or texts concerned, where the causal relationships are made up of indexes from the thesaurus, replacing the synonyms and related terms.
[2] Method according to claim 1, characterized in that an additional index is built up on the basis of the number of times the information is consulted, so that the display can take place as a function of this frequency.
[3] Method according to claim 1, characterized in that an additional index is built up, namely the importance of the information, so that the information can be displayed as a function of this additional criterion.
[4] Method according to claim 1, characterized in that the indexes and search criteria are based on the complete texts of said documents or texts.
[5] Method according to claim 4, characterized in that at least a filtering of the text data is carried out by eliminating stop words and determining explicit index terms with the aid of the unigrams and/or bigrams and/or trigrams occurring in the text.
[6] Method according to claim 5, characterized in that the bigrams and/or trigrams are built up by determining, after the stop words have been removed and starting from the retained unigrams, which terms are adjacent thereto.
[7] Method according to claim 2, characterized in that use is made of the indexation on the basis of the full text excluding the stop words.
[8] Method according to claim 7, characterized in that use is made of the additional index of the importance of the information.
[9] Method according to claim 1, characterized in that an additional index is made on the basis of the complete documents or texts, but limited to explicit index terms by comparing them with the content of a thesaurus.
[10] Method according to claim 9, characterized in that a list of the terms that do not occur in the thesaurus is created for updating the thesaurus.
[11] Method according to one of claims 5 to 8, characterized in that the indexing uses implicit index terms that are added to the explicit index terms, which added terms are taken from the thesaurus and can be both narrower and broader terms.
[12] Method according to claim 10, characterized in that use is made of means that allow an interactive update by the user.
[13] Method according to one of claims 2 to 12, characterized in that when indexing a document, the number of index terms is limited to a maximum of five.
[14] Method according to one of claims 1 to 13, characterized in that use is made of the combination of a thesaurus and a limitation of the search terms to keywords contained in the thesaurus.
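As an illustration of the filtering named in claims 5 and 6, the following minimal sketch removes stop words and then builds bigrams and trigrams from the retained unigrams that are adjacent to each other. The stop-word list, the simple whitespace tokenization, and the function name `ngrams` are assumptions made for this example only; the claims do not prescribe them.

```python
# Assumed, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "is", "to"}


def ngrams(text, n_max=3):
    """Return {n: list of n-grams} built from the retained unigrams."""
    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    unigrams = [t for t in tokens if t and t not in STOP_WORDS]
    result = {1: unigrams}
    for n in range(2, n_max + 1):
        # Adjacency is measured after stop-word removal.
        result[n] = [
            " ".join(unigrams[i:i + n])
            for i in range(len(unigrams) - n + 1)
        ]
    return result
```

In this reading, two terms count as adjacent once the stop words between them have been dropped, so "corrosion of the steel" contributes the bigram "corrosion steel".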
Family patents:
Publication number | Publication date
BE1018334A5 | 2010-09-07
Cited documents:
Publication number | Filing date | Publication date | Applicant | Title
EP0952535A1 | 1998-04-22 | 1999-10-27 | Het Babbage Instituut voor Kennis en Informatie Technologie "B.I.K.I.T." | Method and system for retrieving documents via an electronic data file
Legal status:
2022-01-19 | HC | Change of name of the owners | Owner name: BUREAU DE RYCKER BV; BE | Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), CHANGE OF OWNER(S) NAME; FORMER OWNER NAME: GROUP DADO 13, BESLOTEN VENNOOTSCHAP MET BEPERKTE AANSPRAKELIJKHEID | Effective date: 20211124
Priority:
Application number | Filing date | Title
BE200800604 | 2008-11-04 |
BE2008/0604A | BE1018334A5 | 2008-11-04 | 2008-11-04 | METHOD AND SYSTEM FOR INTELLIGENTLY INDEXING DOCUMENTS OR TEXT IN A COMPLEX DATABASE BY AVOIDING "NOISE".