Patent abstract:
The invention relates to a biomedical or bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets by means of corresponding drill-down and roll-up operations. The retrieval engine filters the heterogeneous data sets for bio-molecular 3D structures and compounds with specific properties. While data is loaded from a selected database and/or data source, the individual rows of the data source are indexed by means of a one-to-one nomenclature. The same indexing is applied to each additional data source that is selected and loaded; the selected data sources comprise heterogeneous, diverse databases, and the indexing serves as the primary key. By means of a resync operation, the selected heterogeneous data sources and the retrieval engine are coupled asynchronously. The content of the heterogeneous data sources is loaded in parallel into a staging table of a data warehouse; at the time of loading, each line of a loaded data file is indexed with the unique nomenclature of the primary key, the loading date and load time are stored with it, and the content of the entire line of the data file is copied to the corresponding table row. Using the bio-molecular retrieval engine, users perform search operations on the data stored in the staging table. The data is mapped from the staging table, query-specifically and dynamically, into dimensional models and/or Codd data models and/or object-oriented onto corresponding application objects and/or into evolutionary data models. Finally, an intelligent, interactive front-end tool of the bio-molecular retrieval engine generates the appropriate structural end-to-end drill-down and/or roll-up operations.
Publication number: CH712619A2
Application number: CH00771/17
Filing date: 2017-06-14
Publication date: 2017-12-29
Inventor: Putrino Nunzio
Applicant: Futureitcom Gmbh
Main IPC class:
Patent description:

TECHNICAL FIELD
The present application relates to finding and comparing complex structures in heterogeneous information management strategies, in particular to a system and method for handling spatial biological structures that are accessible through heterogeneous, relational and/or scalable data collection, modulation and access. In particular, the invention relates to the provision of a biomedical and/or bio-molecular retrieval engine and an intelligent, interactive, dynamic visualization method (REI2VA), together with a corresponding application and human-machine interface, for structures in extensive data sets held in heterogeneous storage systems, relational or non-relational databases and/or graph databases, for example in a cloud.
BACKGROUND ART
Fetching, managing, comparing and automatically analyzing information associated with complex structures, particularly complex spatial structures such as spatial biological structures, typically presents numerous technical problems of data integration and interpretation, especially if the data comes from several heterogeneous sources. For example, difficulties often arise when large data volumes from heterogeneous databases and data management structures are to be stored and analyzed together, e.g. for user visualization of complex structures. In addition, integrating complex information from one or more sources, which may in particular be configured to provide real-time data, with other desired information raises accessibility, bandwidth and latency issues, which limit the flexibility, scalability and accessibility of these systems as a whole or in part.
Conventional systems for monitoring and controlling various operating parameters when querying complex components and subcomponents of stored structures, particularly spatial structures, may need to access, process and/or analyze large amounts of real-time and/or near-real-time data, as is the case, e.g., in the continuous monitoring of chronically ill patients and/or patients with rare diseases. Data collected in real time are also referred to as point data. The data may come from independent, heterogeneous sources, each configured to provide processed, raw or native information in specific structures, such as numerical values associated with different measurements. On their own, these data may not provide a consistent context for their interpretation, and additional information must be assigned to them for meaningful processing and analysis. Moreover, it may be desirable to capture, store and distribute the data from these heterogeneous data sources to other processing components, which again requires that some degree of context be attributed to the data of complex structures.
A limitation found in many conventional systems is that they provide only limited capabilities for accessing, interpreting and/or manipulating data based on the acquisition of complex spatial structures, either collectively or in conjunction with other such data. In particular, these capabilities concern the context-providing information associated with the data, which can extend the functionality and meaning of the data for complex structures. Contextualization information may include, for example, descriptive and/or attribute information characterizing the data, as well as other information. In conventional systems, integral and flexible manipulation based on the detection of complex spatial structures is limited by the inherent differences and characteristics of the data sources. In particular, current solutions in the field of molecular biology cannot provide interactive, dynamic drill-down and roll-up for very large amounts of data and complex queries. In particular, detecting and directly linking preclinical or clinical laboratory data acquired in vivo with biomolecular data and structures of the most diverse "-omics" worlds and/or biopsy systems is impossible in the prior art.
Systems that allow uniform access to stored, complex spatial structures are crucial, especially in the development of new drugs, where active binding sites must be searched for. However, conventional tools are only suitable for static surfaces. Query and retrieval engines coupled with interactive, exploratory interfaces are needed to find and analyze spatial biological structures according to various criteria. In the prior art, the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) is such a database for complex, biological, macromolecular, spatial structures (see www.rcsb.org/pdb/home/home.do); it collects all known structures as text files, archives them and makes them accessible. The pdb text files contain information about the 3D structure of large biological molecules, including proteins and nucleic acids. These are the molecules found in all living organisms, including bacteria, yeasts, plants, insects and other animals, and especially humans. From an understanding of the structure and shape of these molecules, their structural role in human health and disease can be deduced and used to develop drugs and/or prescribe therapies. The structures stored in the pdb database range from tiny proteins and parts of DNA (deoxyribonucleic acid) to large, complex, spatial structures such as ribosomes, which are macromolecular complexes of proteins and ribonucleic acids (RNA) present in the cytoplasm, in mitochondria and in chloroplasts. Access to, and comparison and analysis of, these structures are therefore fundamental in many fields of technology, in particular in biomedicine and in agricultural engineering and development, from protein synthesis to the development of drugs and functional foods.
However, the text data files, pdb files or pdb structures for short, are of very little use to end users when it comes to data mining (intelligent data analysis), data analysis or prediction in the broader sense. Connecting the data from the pdb data files to other databases such as the Swissprot data files is also problematic: across relational databases, graph databases, heterogeneous storage systems and storage formats, and the associated applications, ontologies and taxonomies, it is difficult or impossible to extract comparable spatial information from the complex data. In the prior art, users must perform lengthy, costly, repetitive and time-consuming operations on each file, regardless of the data source, and develop their own recording system for the intermediate results. Even if a pdb organization offers ways to restrict the search according to certain criteria, the user is not spared the file-by-file search (more than 113,000 text files and more than 1.2 billion lines as of May 2016, and more than 530,000 Swissprot proteins, with a steadily rising trend). It is obvious to those skilled in the art that, from this starting position, (i) any connection to other heterogeneous data sources such as Swissprot, (ii) a comparison of the data at large scale or high complexity, (iii) complex queries on the extensive data sets, or (iv) a controlled, homogeneous enrichment of the data is virtually impossible.
TECHNICAL PROBLEM
It is an object of the invention to provide a technical solution which does not have the disadvantages discussed above. The solution is intended not only to function under laboratory conditions, but also to be applicable in practice to such heterogeneous relational databases and systems, e.g. systems developed on Oracle Exadata. (Oracle Exadata, or the Oracle Exadata Database Machine, is a combination of mutually optimized, jointly developed software and hardware for achieving high performance and availability when running Oracle databases. The Oracle Exadata architecture includes horizontally scaled servers, typically industry-standard, and intelligent storage servers with advanced flash technology and internal InfiniBand high-speed links, allowing systems to be configured flexibly for specific database workloads.) The solution is designed to allow a large variety of hypotheses about the stored spatial structures to be examined in a very short time (real-time or near-real-time, i.e. in the range of seconds to a few minutes), and to let the user focus on the essentials through simple handling. The solution should also be easily transferable to other large distributed databases, e.g. distributed, inexpensive NoSQL databases, provided that no claim to absolute security, absolute privacy or product protection is made; in that case operations are performed on disjoint data sets in order to prevent parallel or even massively parallel processes from being blocked by a few individual executions and compromising overall system performance.
SUMMARY OF THE INVENTION
According to the present invention, the above objects are achieved in particular by the features of the independent claims. Further advantageous embodiments follow from the dependent claims and the description.
According to the present invention, the above-mentioned objects are achieved, for a biomedical and/or bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets, in particular in that the heterogeneous data sets are filtered for bio-molecular 3D structures and compounds with certain properties; in that, during the loading of data from a selected database and/or data source, the individual rows of the data source are indexed using a one-to-one nomenclature, at least the loading date and load time being assigned automatically, the same indexing being applied to each additional data source that is selected and to be loaded, the selected data sources comprising heterogeneous, diverse databases, and the indexing being used as the primary key on which the bio-molecular retrieval engine and its operations are based; in that, by means of an automated resync module, the selected heterogeneous data sources and the retrieval engine are coupled asynchronously, the data in the retrieval engine being updated automatically with new or changed data from the selected heterogeneous data sources so as to ensure the consistency of the primary key and/or load date and load time and/or data content, with scheduling and/or versioning and historization of the data over its entire life cycle for an entire retrieval process; in that the content of the data from the heterogeneous data sources is loaded in parallel, by means of the bio-molecular retrieval engine, into a staging table of a data warehouse (DW) serving as a central database optimized for analysis purposes, wherein at the time of loading each line of a loaded data file is indexed with the unique nomenclature of the primary key, the loading date and load time are stored with each line, and the content of the entire line of the data file is copied to the table row at loading time; in that users perform search operations, by means of the bio-molecular retrieval engine, on the data stored in the staging table, the data relevant to each search operation being re-filtered, sorted and/or enriched and/or recombined and the resulting subsets of data being stored in further, dynamically created tables; in that the data from the staging table is modeled, query-specifically and dynamically, in dimensional models and/or Codd data models and/or object-oriented onto corresponding application objects and/or in evolutionary data models; in that data marts and/or systems in the cloud are created within the data warehouse (DW) as copies of a subset of the data warehouse, as specific, individual workspaces with their own defined access rights and security measures; and in that the appropriate structural end-to-end drill-down and/or roll-up operations are generated by means of an intelligent, interactive front-end tool of the bio-molecular retrieval engine. By means of the intelligent, interactive front-end tool of the biomedical and/or bio-molecular retrieval engine, users can, e.g., use the corresponding structural end-to-end drill-down and/or roll-up operations for visual navigation in the bio-molecular 3D structures, enriched with preclinical, clinical and laboratory data; the generated drill-down and/or roll-up operations can be updated dynamically by the retrieval engine. The retrieval engine can, e.g., visualize the data so that the user can assess the quality of the result independently. Retrieval results of the biomedical and/or bio-molecular retrieval engine may, e.g., be generated in pdb format and/or in a common communication format accessible to the user.
Common communication formats may include at least XML (Extensible Markup Language) of the World Wide Web Consortium (W3C) and/or text and/or Portable Document Format (pdf) of Adobe Systems and/or Hypertext Markup Language (HTML) of the World Wide Web Consortium (W3C) and the Web Hypertext Application Technology Working Group (WHATWG), or HL7 or DICOM. XML and HTML are markup languages for representing hierarchically structured data in the form of text files; pdf is a platform-independent document format. XML, pdf and HTML are particularly suitable for a platform- and implementation-independent exchange of data between the biomedical and/or bio-molecular retrieval engine and other electronic systems, in particular via the Internet. Furthermore, the retrieval results can be provided by the biomedical and/or bio-molecular retrieval engine for machine-to-machine transfer and/or deep learning and/or artificial intelligence, e.g. in encrypted form, in particular symmetrically or asymmetrically encrypted. The interactive retrieval, insertion, selection, roll-up and/or drill-down processes may, e.g., be based on and/or use metadata and ontologies to connect heterogeneous SQL and/or NoSQL systems of any kind. This enables, simplifies and accelerates the exchange of information and safeguards data quality. The evolutionary data models may, e.g., include data models according to JSON (JavaScript Object Notation). In general, the heterogeneous data sources can also include document-oriented databases, in which documents form the basic unit for storing the data. While relational databases consist of database tables that are subject to a fixed database schema, document-oriented databases contain individual documents. These documents may be structured files with a standard file format (such as a word processing file), but also, e.g., binary large objects that are not further structured in terms of database access (e.g. mpeg files).
Structured files with a freely definable schema consist of a series of data fields, each consisting of a key-value pair. Further possible data formats are, for example, JSON objects, YAML documents (YAML Ain't Markup Language) or XML documents (Extensible Markup Language). NoSQL databases, like the document-oriented databases, are databases that can store data in non-tabular form and without the limitations of the relational database; graph databases are one example. Finally, the data, the data marts or the heterogeneous systems in the cloud may, e.g., be transparently end-to-end encrypted according to an associated sensitivity. At least one of the heterogeneous databases could, e.g., be implemented as Oracle RDBMS 12c, using the properties of the Container Database (CDB) by assigning each user a separate Pluggable Database (PDB).
The concrete example below shows both the strength of the solution and its extension and improvement, such as the integration of heterogeneous resources into the biomedical and bio-molecular retrieval engine and intelligent, interactive, dynamic visualization platform (REI2VA). The platform is built as a horizontal layer over all possible types of information silos and is operated in all respects as a service; these silos include heterogeneous data sources and/or applications and/or taxonomies and/or ontologies and/or other areas such as genomics, metabolomics and "-omics" in general, as well as biopsy data silos, and the platform scales to the direct integration of preclinical, clinical and laboratory data. The advantages of this invention are, in particular, that the methods and procedures already described for preparing the data for the retrieval and presentation process are robust, as can be shown in the specific case of myelin-binding proteins. Another advantage is that the REI2VA platform forms a horizontal layer over all sorts of data, program application, taxonomy and ontology silos, which can be built, modified, extended and operated asynchronously, independently of location and time. The specific case of the zinc finger pattern (see FIG. 5) concerns proteins or enzymes containing two (or more) free cysteines whose SH groups point roughly in the same direction, but whose position in the polypeptide chain is difficult to find. The sulfur atoms should be between 3.5 and 5.0 Å apart and accessible from the surface. The I2VA front-end of the intelligent, interactive visualization process of the platform (see the example of the front page, FIG. 11) makes it possible to store, manage, operate, extend or even delete all logically related resources as metadata without compromising the functioning of the platform as a whole.
The prior art does not provide a system and method that makes it possible, through a front-end such as I2VA, to answer biomedical questions directly from preclinical and clinical laboratory data, biomolecular "-omics" data and/or biopsies using biomolecular structures and methods, and/or to establish relationships between them. In addition, for the first time, the retrieval engine provides technical means to find the zinc-finger pattern in all proteins within the PDB database, e.g. with more than 113,000 text files which together make up more than 1.2 billion lines, within a few seconds, e.g. less than 60 seconds (Fig. 9 and Fig. 10), while excluding those structure files that explicitly carry the label "ZINC PATTERN" in the header and/or Remark. This type and variety of proteins are known as myelin-binding proteins, which may be very important for the medical treatment of multiple sclerosis (MS) in particular. The pdb file names of these particular 4286 myelin-binding proteins, and the methods and procedures for finding them, which scale to any set of PDB files, are explicitly included in this extension (see Figure 11).
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is explained in more detail by the following examples with reference to the drawings, wherein:
Fig. 1 schematically shows the data flow, using the PDB data as an example, from the data source (1, 2) to the specific data marts (3, 4, 5), the linking of the RE with I2VA (6), and the propagation of the results as output (7) (see Fig. 2). As an extension, the data warehouse (with appropriate customizations) may become a cloud (private and/or public), with the data marts and the heterogeneous storage systems distributed in the cloud.
Figure 2 schematically illustrates the bio-molecular retrieval engine that interacts with the intelligent, interactive visualization application and data in both SQL databases and NoSQL distributed databases and in the cloud.
Fig. 3 shows examples of error-prone data stored in pdb files.
By way of example, Fig. 4 shows a staging table of a data warehouse (DW), a central database optimized for analysis purposes that is accessed in parallel; the content of the data from the heterogeneous data sources is loaded in parallel into the staging table of the data warehouse (DW) by means of the bio-molecular retrieval engine.
Fig. 5 shows an example of a zinc finger motif.
Fig. 6 shows a theoretical solution for determining all possible distances between the atoms Cα of residue CYS, Cβ of residue CYS and SG of residue CYS.
Fig. 7 shows the verification of the theoretical solution with the protein pdb2wut: the coordinates for the zinc finger pattern are found exactly to the line via the columns ATOM_NAME and RESIDUE_NAME, together with the columns ATMKVLJD and ATOM_SERIAL_NMBR.
Figure 8 shows the query code for obtaining the 4286 pdb file names with a resolution in the range between 0.5 Å and 3.5 Å and a next-to-next distance between 2.2 Å and 5.0 Å.
Figure 9 illustrates the method and procedure for finding the zinc-finger pattern within over 1.2 billion lines of all pdb data files.
Figure 10 shows the method and procedure for determining the distance between the Cα-CYS and SG-CYS atoms for the zinc finger pattern.
Figure 11 illustrates the front-end of the REI2VA visualization platform as a link and hub for scalable connectivity of information and resources as a service.
Fig. 12 shows an example of a searched pattern. In the example "DFG" of Fig. 11, this results in 196 possibilities. The color codes below the result "196 hits" correspond to a density distribution of the various real solutions as a function of the various parameters, which can be adjusted and combined using the sliders at the bottom center of the image and automatically affect the retrieval engine.
Fig. 13 shows that the view of the data is coupled between visualization and digitization. Buttons, "Table" and "Dia" in the example of Fig. 12, allow smooth switching between different views of the same question.
Figure 14 illustrates how changing the parameters by means of the sliders changes the number of correct solutions, in the example from 196 hits to 77 hits. This is immediately apparent from the thinning of the color codes, in line with the density distribution of the solutions.
Fig. 15 shows the visual change of the information on switching to "Table". Figure 13 shows that for some proteins (the name of the protein is in the first column of the table and the exact row in the second column) there is conspicuously no one-letter-code notation within a particular pdb file, although the pdb notation exists both in the original text file and in the relationally stored data. The retrieval engine has identified special cases in which DNA and/or RNA structures, i.e. genomics, are directly linked to protein structures, i.e. proteomics, on the different chains (A or B or ...) of the protein structure, or in which nomenclatures were found in the pdb files for which no uniquely identifiable one-letter code exists.
Fig. 16 shows that the REI2VA platform consists of scalable, extensible, loosely coupled components which can each be installed, operated and managed separately in a private and/or public cloud.
Fig. 17 shows how end-users can adjust the level of difficulty of the retrieval engine based on the respective expertise in the front-end (Figure 11). This gives end users the ability to independently control the complexity, number of records and / or resources.
Fig. 18 illustrates the underlying operation of the interactions between platform, visualization and end user.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
Fig. 1 schematically illustrates an architecture for a possible realization of an embodiment variant of the biomedical and/or bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets using appropriate drill-down and roll-up operations. By means of the retrieval engine, the heterogeneous data sets and/or data sources and/or databases are filtered for bio-molecular 3D structures and compounds with specific properties; the retrieval engine is an automated real-time or near-real-time system for interactively searching, comparing and aggregating heterogeneous, diverse databases. A user can access it through a single standardized request tool and interface. In particular, the heterogeneous distributed database systems may differ not only technically, e.g. in file formats and/or access protocols and/or query languages, but may also be based on different data models, e.g. different ways of storing the same data, such as different encodings of the column names. The bio-molecular retrieval engine can at the same time serve as a common query system for object-oriented and relational databases, since such differences between the data sets can also be bridged by it. The file format defines the syntax and semantics of data within a file. It thus represents the bidirectional mapping of the accessible stored information onto a one-dimensional binary memory. Knowledge of the file format is essential for interpreting the information stored in a file. Typically, files must be associated via their file format with applications that can interpret them. Protocols in database technology are rules that determine the format, content, meaning and/or order of the information exchanged between different technical units.
Finally, the query language or retrieval language is typically the database language used to search for information. The result of a query is a subset of the underlying information stock; this is also referred to as filtering data. In the prior art, query languages are distinguished by their expressive power. A query language is more powerful than another if it discriminates the data more sharply, i.e. if the set of result sets that can be formed in it includes the set of result sets that can be formed in the other query language. An example of a query language for XML information systems is XQuery. The database language SQL already contains a query language for corresponding database systems.
The biomedical and/or bio-molecular retrieval engine thus makes it possible for a user to search a very large amount of data for bio-molecular or other 3D structures and to find compounds with specifiable properties. Similarities between structures can also be included in the search. As an example, the bio-molecular retrieval engine may be based on the relational Oracle Database Release 12c. The retrieval engine operates on, e.g., more than a billion lines in the pdb data source and more than 520,000 proteins in the Swissprot data file. In this case, the indexing by means of the one-to-one nomenclature of the primary key takes place, for example, during the loading of a pdb data file, through automatic translation of the three-letter code of the amino acids from the pdb database into the one-letter code of the amino acids of the Swissprot database; the protein ID of Swissprot has a direct relation to clinical pictures. The reverse approach is also possible. The content of the data from the heterogeneous data sources is loaded automatically and in parallel into a column of the staging table of the data warehouse (DW) by means of the bio-molecular retrieval engine. This means that, while the pdb data files are loaded, the three-letter code of the amino acids from the pdb database is translated automatically into the one-letter code of the Swissprot database, and vice versa, and the result is loaded automatically into a column of the staging table of the data warehouse (DW).
The retrieval engine makes automatic data analysis between the pdb and Swissprot data sources (proteins and the associated disease IDs) possible in that, while the pdb data files are loaded, a mechanism automatically translates every three-letter code of the amino acids from all pdb files into the one-letter code and persists the result in a specific column. By means of the retrieval engine, both nomenclatures (one-letter and three-letter code of the amino acids) for each protein from the pdb data files can be compared with the one-letter code of the Swissprot data file. Conversely, the retrieval engine allows the one-letter code of the Swissprot proteins to be automatically associated with and translated into the three-letter code of the pdb notation. Analogously, and in particular where security, privacy and product protection aspects are not mandatory or need not be considered for regulatory purposes, the bio-molecular retrieval engine can also be extended to, linked with and transferred to distributed NoSQL systems. By means of the bio-molecular retrieval engine, SQL and NoSQL databases can in any case be connected, with the security measures in place, directly and/or by means of ontologies, taxonomies and metadata, and queried system-wide (see FIG. 3). It is important to note that NoSQL databases do not necessarily forgo the Structured Query Language (SQL). While some NoSQL systems dispense with relational functions entirely, others forgo only specific elements, such as fixed table schemas. For example, instead of tables, a NoSQL database can organize data as objects, value pairs, or ordered lists and series.
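The translation between the two amino acid nomenclatures can be sketched as follows. This is a minimal illustration in Python, not the mechanism of the retrieval engine itself: the mapping is the standard code table for the 20 proteinogenic amino acids, while the function names and the placeholder "X" for residues without a one-letter code (e.g. nucleotides in a pdb file) are assumptions made for this example.

```python
# Standard three-letter to one-letter codes of the 20 proteinogenic amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(residues, unknown="X"):
    """Translate a list of three-letter residue names into a one-letter string.

    Residues without a one-letter code are replaced by `unknown`
    (a hypothetical placeholder chosen for this sketch).
    """
    return "".join(THREE_TO_ONE.get(r.strip().upper(), unknown) for r in residues)


def to_three_letter(sequence):
    """Translate a one-letter sequence back into three-letter residue names.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    return [ONE_TO_THREE[letter] for letter in sequence.upper()]


def find_motif(residues, motif):
    """Return the 0-based start positions of `motif` in the translated sequence."""
    sequence = to_one_letter(residues)
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions
```

Once the one-letter string is persisted next to the original three-letter names, a motif such as "DFG" can be searched with plain substring matching, which is what `find_motif` illustrates.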
One of the core functions of the retrieval engine is, while loading the data into the database, to index the individual lines of the data sources, e.g. from the pdb data source, one-to-one and to assign the load date and load time automatically (see process step 2 in Fig. 1). This indexing and marking serves as the primary key and is fundamental to the entire data model and to all retrieval operations of the retrieval engine. The retrieval engine applies it one-to-one to all other heterogeneous data sources.
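A minimal sketch of such a load step is given below, using Python's built-in sqlite3 module in place of the Oracle database named in the text. The table layout, the column names and the key format `source:file:line` are assumptions made for illustration; the source only states that each line receives a one-to-one index together with load date and load time and that the whole line is copied into the table row.

```python
import sqlite3
from datetime import datetime, timezone


def create_staging(conn):
    """Create a hypothetical staging table (column names are assumptions)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS staging (
               row_key      TEXT PRIMARY KEY,  -- one-to-one index per source line
               source       TEXT NOT NULL,
               file_name    TEXT NOT NULL,
               line_no      INTEGER NOT NULL,
               load_date    TEXT NOT NULL,
               load_time    TEXT NOT NULL,
               line_content TEXT NOT NULL      -- the entire original line
           )"""
    )


def load_file(conn, source, file_name, lines, now=None):
    """Index every line of one data file and copy it into the staging table."""
    now = now or datetime.now(timezone.utc)
    load_date = now.date().isoformat()
    load_time = now.time().isoformat(timespec="seconds")
    rows = [
        (f"{source}:{file_name}:{n:08d}", source, file_name, n,
         load_date, load_time, line.rstrip("\n"))
        for n, line in enumerate(lines, start=1)
    ]
    with conn:  # one transaction per file
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return len(rows)
```

Because the key is built the same way for every source, a second source such as Swissprot can be loaded into the same table with `load_file(conn, "SWISSPROT", ...)` without changing the schema.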
The biomedical and/or bio-molecular retrieval engine is operated via a unified, intelligent, interactive visualization tool and interface (I2VA), which greatly simplifies the drill-down or roll-up process for users, even those without knowledge of SQL, PL/SQL or Java programming, and at the same time accelerates it by orders of magnitude. Based on the current results, users can judge the quality of the result independently and decide accordingly. The greatly reduced decision time also allows a user to evaluate different and multiple hypotheses iteratively, which was previously technically impossible with the systems of the prior art. The front-end transmits parameters to the database models, which are implemented for maximum performance and are created dynamically or ad hoc according to the use case.
Among other important possibilities, the interactive biomedical and/or bio-molecular retrieval engine also makes extended data analysis on the pdb and Swissprot data sources feasible. In particular, the retrieval engine provides reliable answers to two specific real-world questions: (i) "where in the entire set of all pdb files and Swissprot proteins does the DFG motif occur?" and (ii) "where in the entire set of all pdb files can the zinc finger pattern be located?". The answer to the first question concerns kinases. The answer to the second relates to myelin-binding proteins. The retrieval engine makes it possible to restrict the search range arbitrarily, e.g. by organism, by resolution interval of a protein, by calculation of the interatomic distances or by the distance interval between the relevant atoms, and much more. Again, this is not possible with the prior art systems, or at least not in real time or near real time.
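Question (i), together with a restriction of the search range by organism, can be illustrated with a minimal Python sketch. It assumes that the one-letter sequences are already available per protein; the function name `find_motif`, the record layout and the toy sequences are hypothetical and are not taken from the retrieval engine itself.

```python
from typing import Dict, List, Optional, Tuple


def find_motif(
    sequences: Dict[str, Tuple[str, str]],
    motif: str,
    organism: Optional[str] = None,
) -> Dict[str, List[int]]:
    """Return the 1-based start positions of `motif` per protein id.

    `sequences` maps a protein id to (organism, one-letter sequence).
    If `organism` is given, only proteins of that organism are searched.
    """
    hits: Dict[str, List[int]] = {}
    for protein_id, (org, seq) in sequences.items():
        if organism is not None and org != organism:
            continue
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(motif, start + 1)
        if positions:
            hits[protein_id] = positions
    return hits


# Invented toy data, not real pdb or Swissprot content.
SAMPLE = {
    "prot_a": ("HUMAN", "MKVDFGLARAADFGT"),
    "prot_b": ("MOUSE", "MDFGAAA"),
    "prot_c": ("HUMAN", "MKKLLAA"),
}
```

Calling `find_motif(SAMPLE, "DFG", organism="HUMAN")` limits the search to one organism; further restrictions such as a resolution interval would be additional filters of the same kind.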
The retrieval engine according to the invention also allows the quality of the data stored in the relational database(s) to be checked. Corresponding mechanisms allow the user, among other things, to reconstruct an original pdb file with more than 12,000 lines from the entire data set of the heterogeneous data sources, which comprises over one billion lines, within a few seconds and in the correct, logical order, identical to that of the original pdb file. This makes it possible for a user to identify unambiguously the row and column in which an error occurs in the data and what kind of error it is. After that, only the corrected lines of the new pdb file need to be stored in the RE (Retrieval Engine). The previous, erroneous lines may, for example, be timestamped, terminated, historicized and/or versioned. The corrected rows receive the one-to-one indexing as the primary key and the time stamp of the moment at which they are stored in the RE.
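The reconstruction of an original file from the stored lines could look roughly as follows. This is a sketch only: it assumes primary keys of the form <pdb-filename_keyword_linenumber>, the nomenclature named for process step 1, and represents the staging table as a plain list of (key, content) pairs. The function `reconstruct_file` and the sample rows are hypothetical.

```python
from typing import Iterable, List, Tuple

StagingRow = Tuple[str, str]  # (primary key, original line content)


def reconstruct_file(rows: Iterable[StagingRow], filename: str) -> List[str]:
    """Return the lines of `filename` in their original order."""
    selected = []
    for key, content in rows:
        # The key ends with "_<keyword>_<linenumber>"; split from the right.
        file_part, _keyword, line_no = key.rsplit("_", 2)
        if file_part == filename:
            selected.append((int(line_no), content))
    # The line number in the key restores the order lost in set-based storage.
    return [content for _, content in sorted(selected)]


# Invented rows, deliberately unordered and mixed with another file.
ROWS = [
    ("pdb4yzf_ATOM_3", "ATOM      1  N   CYS A   1"),
    ("pdb1abc_HEADER_1", "HEADER    OTHER ENTRY"),
    ("pdb4yzf_HEADER_1", "HEADER    DEMO ENTRY"),
    ("pdb4yzf_TITLE_2", "TITLE     DEMO STRUCTURE"),
]
```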
The data analysis between the pdb and Swissprot data sources is made possible in the retrieval engine by the fact that, during the loading process of the pdb data files, a mechanism automatically translates the three-letter code of the amino acids from all pdb data files into the one-letter code and persists the result in a specific column. By means of the retrieval engine, users can compare both nomenclatures (one-letter and three-letter code of the amino acids) of each protein from the pdb data files with the one-letter code of the Swissprot data file. Conversely, a mechanism allows the one-letter code of the Swissprot proteins to be translated into the three-letter code of the pdb nomenclature.
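The translation between the two nomenclatures can be sketched as follows, using the standard codes of the 20 proteinogenic amino acids. The function names are hypothetical, and the handling of non-standard residues (here a `KeyError`) is an assumption, since the text does not describe it.

```python
from typing import Iterable, List

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(residues: Iterable[str]) -> str:
    """Translate pdb-style residue names into a one-letter sequence."""
    return "".join(THREE_TO_ONE[name.upper()] for name in residues)


def to_three_letter(sequence: str) -> List[str]:
    """Translate a one-letter sequence back into pdb-style residue names."""
    return [ONE_TO_THREE[code] for code in sequence.upper()]
```

Persisting the output of `to_one_letter` next to the original residue names corresponds to the "specific column" mentioned above and makes a direct string comparison with a Swissprot sequence possible.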
The retrieval engine is suitable, for example, for single-computer systems with older database versions, such as Oracle version 11gR1, where correspondingly reduced data sets can be used according to the capacity of the computer, as well as for more modern database systems, such as the Engineered Systems of Oracle, e.g. Exadata quarter and full rack with Oracle RDBMS Release 12c. This makes it easy to load large data sets, such as the complete data set of the pdb data files, which comprises over 113,500 pdb text files with more than 1.2 billion lines (as of May 2016), and at the same time to use the equally complete Swissprot data set. All processes in the retrieval engine can be automated and, in particular, executed in parallel. In addition, the retrieval engine is scalable, robust, secure, of high performance compared to state-of-the-art systems, and extensible, so that it can handle complex and very specific queries on very large data sets. Furthermore, the retrieval engine can be realized as a web-based system, so that it can be used from a wide variety of end devices or network nodes and is accessible to a large group of users. The retrieval engine and the interactive visualization application can be offered individually or jointly by a corresponding service provider and used from a wide variety of browser-based terminals.
Process steps 1-7 of Figure 1 comprise the following:
(1) An automated "resync" mechanism asynchronously couples the most diverse heterogeneous data sources or resources, not just the pdb database, each uniquely identified with metadata of any kind, whereby the retrieval engine is dynamically updated to the latest data state. The primary key nomenclature consists of <pdb-filename_keyword_linenumber>. The principle is transferable to SQL or non-relational NoSQL systems, so that a uniquely and logically identifiable content of the original sources can be reconstructed from SQL or NoSQL database records.
(2) By means of parallel processes, the content of the data from text files or heterogeneous data sources is loaded into a staging table of the SQL or NoSQL database. At the time of loading, each line of the respective loaded data text file is indexed with the one-to-one primary key nomenclature explained in process step 1. In addition, each line from the pdb or other heterogeneous data source is provided with the loading date and loading time. The content of the entire pdb line is copied as is (tel quel) into the table row at the time of loading. This method is analogously applicable to any other heterogeneous data source.
(3) On the data stored in the staging table (see Figures 1 and 3), users can perform operations according to the use case in order to filter, sort or recombine data on a case-by-case basis, to store the respective resulting subsets of data in additional, even ad hoc, tables and/or views, or to generate ad hoc programs. The nomenclature for the created tables and/or views follows the pattern <organism_keyword-pdb-file_T>, where "keyword-pdb-file" is the first keyword defined according to the pdb rules, which identifies each line in the pdb file.
(4) Data from the staging table is mapped, depending on the case or also dynamically and/or ad hoc, into dimensional models, data models according to the rules of Codd, object-oriented application objects or evolutionary data models, e.g. JSON.
(5) Data, results, further individual sub-retrieval engines and procedures are stored in individually secured workspaces, so-called data marts, or in the cloud. Only explicitly granted permissions allow third-party access to other data marts or to parts of other individual workspaces or systems in the cloud. The data marts within the data warehouse are specific, individual workspaces with their own defined access rights and security measures. In the case of Oracle RDBMS 12c with the properties of the Container Database (CDB), it makes sense to assign a separate Pluggable Database (PDB) to each end user. Mechanisms allow track and trace, correction, historization and versioning of malformed data sets (see Fig. 3).
(6) The intelligent, interactive front-end (see process step 6 in Fig. 1) allows a large group of users to utilize the entire infrastructure from a variety of terminals.
(7) The intelligent, interactive visualization application allows a user to independently assess the quality of the results. Finally, in process step 7, results, information or knowledge are delivered from the retrieval engine to the outside world in pdb format but also in all common communication formats (XML, text, pdf, HTML, encrypted for the transfer from machine to machine, deep learning, artificial intelligence ...).
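Process step 2 can be made concrete with a small Python sketch of how one staging row per source line might be built. It is an illustration under assumptions, not the engine's implementation: the keyword is taken to be the record name in the first six columns of a pdb line, and the function name `stage_lines` and the column names `pk`, `load_date`, `load_time` and `content` are hypothetical.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Optional


def stage_lines(
    filename: str,
    lines: Iterable[str],
    loaded_at: Optional[datetime] = None,
) -> List[Dict[str, str]]:
    """Build one staging row per source line, keyed by file, keyword and line number."""
    loaded_at = loaded_at or datetime.now()
    rows = []
    for line_no, line in enumerate(lines, start=1):
        # Assumption: the keyword is the record name in the first six columns.
        keyword = line[:6].strip()
        rows.append(
            {
                "pk": f"{filename}_{keyword}_{line_no}",
                "load_date": loaded_at.strftime("%Y-%m-%d"),
                "load_time": loaded_at.strftime("%H:%M:%S"),
                "content": line,  # the whole line is copied unchanged
            }
        )
    return rows


# Invented two-line file with a fixed load time for reproducibility.
STAGED = stage_lines(
    "pdb4yzf",
    ["HEADER    DEMO ENTRY", "ATOM      1  N   CYS A   1"],
    loaded_at=datetime(2016, 5, 1, 12, 30, 0),
)
```

Because the line number is part of the key, the rows can be loaded by parallel workers in any order without losing the original sequence of the file.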
It should be noted that the biomedical and/or bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures replaces both process steps 4 and 5 of the known Oracle procedure (see docs.oracle.com/cd/B19306_01/datamine.102/b14340/blast.htm) and accelerates the loading process from the Swissprot data file into the relational database. Thus, steps four and five of the prior art are made obsolete by the biomedical and/or bio-molecular retrieval engine. Specifically, step four of the loading process known from Oracle reads "4. Create a control file named sprot.ctl with the following contents:", and the following step reads "5. Finally, load the data: sqlldr userid=<user_name>/<passwd> control=sprot.ctl log=sprot.log direct=TRUE data=sprot40_formatted.txt", in order to store the Swissprot data in the relational database table. The process integrated in the retrieval engine writes "malformed" lines from the original Swissprot data file to a separate file. After correction, the same mechanism stores the corrected data in the table. This process is iterated until all "malformed" Swissprot records have been written to the table.
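The iterative handling of malformed records can be sketched as a simple loop. The sketch below is hypothetical throughout: it assumes tab-separated records with a fixed number of fields as the validity rule and a caller-supplied `correct` function, neither of which is specified in the text.

```python
from typing import Callable, Iterable, List, Tuple


def split_records(lines: Iterable[str], expected_fields: int) -> Tuple[List[str], List[str]]:
    """Separate well-formed records from malformed ones."""
    good, malformed = [], []
    for line in lines:
        target = good if len(line.split("\t")) == expected_fields else malformed
        target.append(line)
    return good, malformed


def load_until_clean(
    lines: Iterable[str],
    expected_fields: int,
    correct: Callable[[str], str],
    max_rounds: int = 10,
) -> List[str]:
    """Load good records, correct the malformed ones and retry until none remain."""
    table: List[str] = []
    pending = list(lines)
    for _ in range(max_rounds):
        good, malformed = split_records(pending, expected_fields)
        table.extend(good)
        if not malformed:
            return table
        # Malformed records are set aside, corrected and fed back into the loop.
        pending = [correct(line) for line in malformed]
    raise ValueError(f"{len(pending)} records not loaded after {max_rounds} rounds")


# Invented records; the second one uses the wrong separator.
RAW = ["P1\tHUMAN\tMKV", "P2;MOUSE;MDF", "P3\tHUMAN\tMKK"]
LOADED = load_until_clean(RAW, 3, lambda line: line.replace(";", "\t"))
```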
The explicit example of finding the zinc finger pattern within the pdb data text files (more than 113,000) illustrates the core functionalities of this generic, evolutionary solution for determining any other structures ad hoc and quickly. The principles of this solution can be summarized as: unambiguous, consistent partitioning of the entire data set; parametrization of the desired patterns and their logical relationships; and massively parallel execution of the individual operations on clearly disjoint data sets that are logical and consistent in context. Thus, as shown in the example, the approach can be applied to any types and varieties of atoms, residues and much more within the entire, ever-growing amount of data.

    PROCEDURE findZNFingerPattern_P1 IS
      TYPE atom_key_val IS TABLE OF pdb_atom_keyvalues_t%ROWTYPE;
      keyval_tab atom_key_val := atom_key_val();
    BEGIN
      -- CA and CB atoms of cysteine residues in partition P1
      SELECT * BULK COLLECT INTO keyval_tab
        FROM pdb_atom_keyvalues_t PARTITION (P1) pa
       WHERE (pa.ATOM_NAME LIKE '%CA%' OR pa.ATOM_NAME LIKE '%CB%')
         AND pa.RESIDUE_NAME LIKE '%CYS%'
       ORDER BY pa.ATMKVL_ID;
      FORALL i IN keyval_tab.FIRST .. keyval_tab.LAST
        INSERT INTO cysteinRetrieved_P1_t VALUES keyval_tab(i);
      COMMIT;
      -- SG atoms of cysteine residues in partition P1
      SELECT * BULK COLLECT INTO keyval_tab
        FROM pdb_atom_keyvalues_t PARTITION (P1) pa
       WHERE pa.ATOM_NAME LIKE '%SG%'
         AND pa.RESIDUE_NAME LIKE '%CYS%'
       ORDER BY pa.ATMKVL_ID;
      FORALL i IN keyval_tab.FIRST .. keyval_tab.LAST
        INSERT INTO cysteinRetrieved_P1_t VALUES keyval_tab(i);
      COMMIT;
    END findZNFingerPattern_P1;

The parameterization consists in that (a) the core restrictions in the WHERE clause can be arbitrarily extended and (b) the Boolean conditions can be adapted to the situation, both between the patterns and between the restrictions. "On the fly", the method of this solution automatically generates the program to be executed (a) for the platform on which it needs to be executed and (b) exactly where the data to be analyzed is physically stored. This follows the maxim "bring the programs to the data and not vice versa". This achieves many advantages, e.g. short latency, optimal use of the bandwidth of the communication network (only the solution, a subset of the total amount of data, is transported), optimal use of computing power and storage space, no unnecessary data transport, and high availability in the event of a loss of resources, because the loosely coupled and networked resources are safer against hacker attacks such as DDoS (distributed denial of service), and much more.
As an example, the patterns and the Boolean operations between the patterns and/or restrictions are set ad hoc as parameters p_Muster_1, p_Muster_2, ... p_Muster_n (Muster: pattern) with the I2VA in the front-end of the RE (Retrieval Engine) and forwarded:

    pa.ATOM_NAME LIKE '%CA%' is replaced with p_Muster_1
    pa.ATOM_NAME LIKE '%CB%' is replaced with p_Muster_2
    Boolean operator b_Op = [or | and | xor | not]

Generic pattern example: this allows the generic procedure platform to construct an ad hoc procedure that finds other patterns with different conditions throughout the data set:

    PROCEDURE find_Generic_Pattern (p_Muster_1, p_Muster_2, p_Muster_3, b_Op1, b_Op2, ...)
    WHERE (pa.ATOM_NAME LIKE '%p_Muster_1%' b_Op1 pa.ATOM_NAME LIKE '%p_Muster_2%')
          b_Op2 pa.RESIDUE_NAME LIKE '%p_Muster_3%'
    ORDER BY pa.ATMKVL_ID;

Likewise, the calculation of the distances between the desired atoms, between atoms and residues, between residues, and so on, can be solved ad hoc for large amounts of data by parameterizing the generic solution by means of the front-end I2VA, as in the figure "Methods to determine distances between atoms in large data sets".
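The parameterization of patterns and Boolean operators can be mirrored in a few lines of Python. This is an analogy to the generic procedure, not a translation of it: rows are plain dictionaries, `LIKE '%x%'` is modeled as a substring test, and the binary reading of "not" as "a and not b" is an assumption made only for this sketch.

```python
import operator
from typing import Callable, Dict, Iterable, List

BOOL_OPS: Dict[str, Callable[[bool, bool], bool]] = {
    "or": operator.or_,
    "and": operator.and_,
    "xor": operator.xor,
    # Assumption: as a binary operator, "not" is read as "a and not b".
    "not": lambda a, b: a and not b,
}


def find_generic_pattern(
    rows: Iterable[dict],
    p_muster_1: str,
    p_muster_2: str,
    p_muster_3: str,
    b_op1: str = "or",
    b_op2: str = "and",
) -> List[dict]:
    """Select rows where (atom ~ p1 b_op1 atom ~ p2) b_op2 residue ~ p3."""
    op1, op2 = BOOL_OPS[b_op1], BOOL_OPS[b_op2]
    hits = [
        row
        for row in rows
        if op2(
            op1(p_muster_1 in row["ATOM_NAME"], p_muster_2 in row["ATOM_NAME"]),
            p_muster_3 in row["RESIDUE_NAME"],
        )
    ]
    return sorted(hits, key=lambda row: row["ATMKVL_ID"])


# Invented atom rows using the column names of the example above.
ATOMS = [
    {"ATMKVL_ID": 3, "ATOM_NAME": "SG", "RESIDUE_NAME": "CYS"},
    {"ATMKVL_ID": 2, "ATOM_NAME": "CB", "RESIDUE_NAME": "CYS"},
    {"ATMKVL_ID": 1, "ATOM_NAME": "CA", "RESIDUE_NAME": "CYS"},
    {"ATMKVL_ID": 4, "ATOM_NAME": "CA", "RESIDUE_NAME": "HIS"},
]
```

With the defaults, `find_generic_pattern(ATOMS, "CA", "CB", "CYS")` reproduces the first selection of the zinc finger example; changing only the parameters yields a different query without new code.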
The zinc finger example also illustrates the following problem in relational and non-relational heterogeneous systems. In the corresponding structure text files, the atoms of one and the same structure/protein are given in a strict sequential order. When the data is stored in a relational database, or anywhere else, as sets, any such order is lost. The order is therefore determined and reconstructed ad hoc, at any time and for all proteins that occur, so that distances are logically only ever calculated between atoms, or between atoms and residues, etc., of the same structure, i.e. the structure that was originally contained in a given text file.
This task is solved in the example with the following generic condition:

    IF (SUBSTR(v_filename(v_count), 1, 7) = SUBSTR(v_filename(v_next), 1, 7)) THEN
      v_occur := fu_count_occur_ATM_File(SUBSTR(v_filename(v_count), 1, 7));
This ensures that the distances are calculated between atoms that actually come from the same structure; otherwise, the following condition determines that a new structure begins:

    IF (v_next > v_atm_array.LAST) THEN
      v_next := v_atm_array.LAST;
      v_next := v_count + 1;
    END IF;

Thus it is always ensured (Fig. 10) that distances are only calculated between atoms that, in the context of the searched pattern, in every case originate from the same original data text file and belong together. From Fig. 10:

    OPEN cr_calcdist;
    FETCH cr_calcdist BULK COLLECT INTO v_atm_array, v_filename;
    FOR v_count IN v_atm_array.FIRST .. v_atm_array.LAST LOOP
      -- only atoms whose file name prefix matches belong to the same structure
      IF (SUBSTR(v_filename(v_count), 1, 7) = SUBSTR(v_filename(v_next), 1, 7)) THEN
        v_occur := fu_count_occur_ATM_File(SUBSTR(v_filename(v_count), 1, 7));
        WHILE (v_occur != 0) LOOP
          passing_coordinates(v_atm_array(v_count), v_atm_array(v_occur), v_filename(v_count), v_filename(v_occur));
          v_occur := v_occur - 1;
        END LOOP;
      END IF;
      IF (v_next > v_atm_array.LAST) THEN
        v_next := v_atm_array.LAST;
        v_next := v_count + 1;
      END IF;
    END LOOP;
    CLOSE cr_calcdist;

One embodiment variant is also based on explicit examples, such as finding the zinc finger patterns in proteins, which are not explicitly marked in the header/remarks of the pdb text files but are very important for the medical treatment of diseases such as multiple sclerosis (MS). It provides a generic solution for parameterizing patterns ad hoc, determining the logical relationships between the patterns by fuzzy search of the given patterns, and then determining the distances between the atoms and residues for related atomic groups and the associated residues. In addition, this solution shows how, with simple means such as resources, the platform REI2VA can also be extended to resources in the cloud.
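The rule that distances are only calculated within one structure can be illustrated independently of the cursor logic above. The following Python sketch is a simplified stand-in, not a port of the Fig. 10 code: it groups atoms by the first seven characters of their file key, mirroring `substr(v_filename, 1, 7)`, and computes Euclidean distances for all pairs within each group. The tuple layout, the labels and the coordinates are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Iterable, List, Tuple

Atom = Tuple[str, str, Tuple[float, float, float]]  # (file key, label, xyz)


def distances_within_structures(atoms: Iterable[Atom]) -> List[Tuple[str, str, str, float]]:
    """Return (structure, label_a, label_b, distance) for atom pairs of the same structure."""
    by_structure = defaultdict(list)
    for file_key, label, xyz in atoms:
        # The first seven characters identify the source file, e.g. "pdb4yzf".
        by_structure[file_key[:7]].append((label, xyz))
    result = []
    for structure, members in by_structure.items():
        for (label_a, a), (label_b, b) in combinations(members, 2):
            result.append((structure, label_a, label_b, math.dist(a, b)))
    return result


# Invented coordinates; the third atom belongs to a different structure.
SAMPLE_ATOMS = [
    ("pdb4yzf_ATOM_10", "SG1", (0.0, 0.0, 0.0)),
    ("pdb4yzf_ATOM_20", "SG2", (3.0, 4.0, 0.0)),
    ("pdb1abc_ATOM_5", "SG1", (1.0, 1.0, 1.0)),
]
```

No distance is produced between atoms of pdb4yzf and pdb1abc, which is the property the generic condition is meant to guarantee.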
With the front-end I2VA, the intelligent, interactive visualization application of the platform (see the example of the front page, Fig. 11), end users with different levels of competence can, starting from a complete overview, store, manage, operate, augment or even delete all logically and contextually related resources as metadata without compromising the rest of the platform.
Any action of any end user may, if the end user so desires, be logged and stored locally at a private location of the end user's choice. Thus, any end user may use data and/or services and/or applications and/or programs and/or computational power anywhere on any accessible cloud server by means of the respectively cached and/or addressable URIs (Uniform Resource Identifiers) (see Fig. 18). The logging is performed automatically by the platform, so that even several exploratory examinations and their dependencies can be recorded in order to focus on the desired results on the basis of a subset of substantially high quality. Even if the connection with the platform is interrupted, the application ensures that all confirmed actions are recorded. At the next logon, the platform reconstructs the states frozen before the communication was aborted.
Using the example of the pdb file name, e.g. pdb4yzf, the strength of the solution is demonstrated and explained with the help of resources, so that the procedures and methods are applicable mutatis mutandis to all other resources, and the platform REI2VA can be built and operated as a horizontal layer on all possible types of information silos, such as heterogeneous data sources and/or applications and/or taxonomies and/or other areas such as genomics, metabolomics and "-omics" in general, and can also be scaled to biopsy data silos. This makes it possible to present every other source to the platform REI2VA as a service, and to connect and use it. Only if the targeted source permits it may an end user of the REI2VA platform make changes to the permitted resources. Conversely, the platform REI2VA can be connected as a service for other services. In the concrete example of the pdb file pdb4yzf, the procedures and methods of this solution can be used on the one hand for all other pdb files and on the other hand for all other types and varieties of resources. The sub-strings applicable in the context for addressing the resources can be concatenated, taking into account the grammatical definitions or syntaxes, and the result assigned one-to-one to a resource. Example:

    "https://pdbj.org/emnavi/quick.php"  URI of the pdb database or the resource
    "?id="  syntax for the formation of the unique URL
    "pdb-"  prefix for the pdb database
    "4yzf"  unique name of the pdb file

https://pdbj.org/emnavi/quick.php?id=pdb-4yzf is thus the complete resource name that is stored as metadata in the platform, so that all resources available in the pdb database can be used without any assistance, while at the same time ad hoc information from the selected resource can be used to enrich the platform. The URL or, more generally, the URI (Uniform Resource Identifier) is to be understood within the meaning of RFC 3986 (a Uniform Resource Identifier (URI) is understood to mean an abstract or physical resource ...).
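The concatenation of the sub-strings into a resource name can be sketched with the Python standard library. The helper names `resource_uri` and `resource_name` and the dictionary used as a metadata store are hypothetical; only the URL parts come from the example above.

```python
from urllib.parse import parse_qs, urlsplit


def resource_uri(base: str, prefix: str, name: str, param: str = "id") -> str:
    """Concatenate base URI, query syntax, database prefix and file name."""
    return f"{base}?{param}={prefix}{name}"


def resource_name(uri: str, param: str = "id") -> str:
    """Extract the prefixed resource name from a URI built by resource_uri."""
    return parse_qs(urlsplit(uri).query)[param][0]


URI = resource_uri("https://pdbj.org/emnavi/quick.php", "pdb-", "4yzf")

# Only the resource name and its URI are kept as metadata, not the resource itself.
METADATA = {resource_name(URI): URI}
```

Storing only the name and URI keeps the platform loosely coupled to the resource: the entry can be changed or removed without touching anything else in the store.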
In the same way as in the example, the needs of an end user can be fully met: other existing, defined and developed types and varieties of resources can be used, connected, extended, integrated, edited and/or even deleted in the REI2VA (see the example https://pdbj.org/emnavi/quick.php?id=pdb-4yzf) without destroying the rest of the platform REI2VA or compromising the consistency, quality, quantity or security of the entire platform. All resources available through a URI, as shown in the example, can be used by an end user, and the platform REI2VA can record all activities, as described above, as far as the end user wants it. This creates a scalable platform of loosely coupled components, which stores and manages resource names as metadata in the REI2VA, while the actual resources, e.g. data, programs, applications, taxonomies and much more, remain loosely coupled to it. The integrity and operation of the REI2VA platform are always guaranteed (see Figures 11, 16, 18), because the resources associated with any particular case may be arbitrarily changed, supplemented, scaled or even deleted without destroying the REI2VA platform. Thus, it is possible, for example, to distribute computational tools and/or data in a private and/or public cloud, or to integrate them from there, so that the platform REI2VA can be kept lean and flexible. For example, end users may be enabled to support evidence-based medical decision making and patient outcomes precisely and ad hoc, e.g. by including pre-clinical and/or clinical and/or laboratory data on a particular disease pattern of a particular individual, whose identity is masked with personal pseudo-fingerprints as encryption, and to share and/or exchange information directly and straightforwardly with countless, widely distributed laboratories, research and development companies and/or pharmaceutical companies, while respecting the privacy, secrecy and anonymity of the affected individual and/or of the healthcare provider directly mandated by the patient. The examples in Figures 12-15 give an idea of how the user-friendly I2VA enables end users to run exploratory tests in order to focus on substantial details from large amounts of data (Figure 14, Figure 15). Thus, any resource can be connected as a service from outside but also from within the platform REI2VA. All that is needed is a connection, such as the Internet or any other permitted and certified connection between a resource and the REI2VA platform. The access rights to the connected resources can be adapted to the needs and implemented accordingly. End users can independently control the complexity of the queries in the front end of the REI2VA platform on the basis of their respective expertise (FIG. 16 and FIG. 17).
Claims:
Claims (11)
[1]
1. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets by means of corresponding drill-down and roll-up operations, wherein the heterogeneous data sets are filtered for bio-molecular 3D structures and compounds with specific properties, characterized in
that, in the loading process of data of a selected database and/or data source into the bio-molecular retrieval engine, the individual lines of the data sources are indexed by means of a one-to-one nomenclature, wherein at least the loading date and loading time are automatically assigned,
that the same indexing is applied to each additional data source selected and to be loaded, wherein the selected data sources comprise heterogeneous, diverse databases and the indexing is used as the primary key on which the bio-molecular retrieval engine and its operations are based,
that, by means of an automated resync operation in the retrieval engine, the data in the retrieval engine is automatically updated to new or changed data of the selected, heterogeneous data sources, in order to ensure the consistency of the primary key and/or of the loading date and loading time and/or of the data content, as well as the scheduling and/or versioning and historization of the data over the entire life cycle of the data for an entire retrieval process,
that the content of the data of the heterogeneous data sources is loaded in parallel into a staging table of a data warehouse (DW), as the central database optimized for analysis purposes, by means of the bio-molecular retrieval engine, wherein at the time of loading each line of a respective loaded data file is indexed with the one-to-one nomenclature of the primary key, each line of a data file is stored with the loading date and loading time assigned to it, and the content of the entire line of the data file is copied to the table row at the time of loading,
that, by means of the bio-molecular retrieval engine, users carry out search operations on the data stored in the staging table, wherein for each search operation relevant data is filtered, sorted and/or enriched and/or recombined anew and the respective resulting subsets of data are stored in further, dynamically created tables,
that the data from the staging table is mapped query-specifically and dynamically into dimensional models and/or Codd data models and/or, in an object-oriented manner, onto corresponding objects of applications and/or into evolutionary data models,
that data marts, each as a copy of a partial data set of the data warehouse (DW), are created within the data warehouse as specific, individual workspaces with their own defined access rights and security measures, and
that, by means of an intelligent, interactive front-end tool of the bio-molecular retrieval engine, the corresponding structural end-to-end drill-down operations and/or roll-up operations are generated.
[2]
2. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to claim 1, characterized in that, by means of the bio-molecular retrieval engine, the quality of the result is presented to a user by visualizing the data for independent assessment.
[3]
3. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to one of claims 1 or 2, characterized in that retrieval results of the bio-molecular retrieval engine are made accessible to the user in pdb format and/or in a common communication format.
[4]
4. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to claim 3, characterized in that the common communication formats comprise at least XML (Extensible Markup Language) and/or text and/or pdf (Portable Document Format) and/or HTML (Hypertext Markup Language).
[5]
5. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to one of claims 3 or 4, characterized in that the retrieval results are encrypted by means of the bio-molecular retrieval engine for the transfer from machine to machine.
[6]
6. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to claim 5, characterized in that the retrieval results are encrypted by means of the bio-molecular retrieval engine for the transfer from machine to machine, and the interactive retrieval, insertion, selection, roll-up and/or drill-down processes are based on and/or use metadata and ontologies, wherein heterogeneous SQL and/or NoSQL systems of any kind are connectable, and whereby the exchange of information is made possible and/or simplified and/or accelerated and/or the data quality is ensured.
[7]
7. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to one of claims 1 to 6, characterized in that the evolutionary data models comprise data models according to JSON (JavaScript Object Notation).
[8]
8. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to one of claims 1 to 7, characterized in that the data or the data marts are transparently end-to-end encrypted according to an associated sensitivity.
[9]
9. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to one of claims 1 to 8, characterized in that at least one of the heterogeneous databases is implemented as Oracle RDBMS 12c, wherein the properties of the Container Database (CDB) are used by assigning each user their own Pluggable Database (PDB).
[10]
10. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to one of claims 1 to 9, characterized in that, by means of the intelligent, interactive front-end tool of the bio-molecular retrieval engine, the user uses the corresponding structural end-to-end drill-down operations and/or roll-up operations for visual navigation in the bio-molecular 3D structures, wherein the generated drill-down operations and/or roll-up operations are updated dynamically by means of the bio-molecular retrieval engine.
[11]
11. Bio-molecular retrieval engine for searching, selecting and interactively analyzing complex structures in large, heterogeneous data sets according to one of claims 1 to 10, characterized in that, in the loading process of a pdb data file, the indexing by means of the one-to-one nomenclature of the primary key takes place by automatic translation of the pdb three-letter code of the amino acids from the pdb database into the one-letter code of the amino acids of the Swissprot database and vice versa, wherein the content of the data of the heterogeneous data sources is automatically loaded in parallel into a column of the staging table of the data warehouse (DW) by means of the bio-molecular retrieval engine.
Patent family:
Publication number | Publication date
CH712592A2|2017-12-29|
CH712619B1|2020-05-29|
Priority:
Application number | Filing date | Patent title
CH00772/16A|CH712592A2|2016-06-16|2016-06-16|Bio-molecular retrieval engine.|