Abstract:
System and method for the detection of structural genetic variants. A method, system and computer program for detecting structural variants, applicable in different scenarios, including targeted sequencing using capture methods (such as exomes or panels). The invention can be implemented with parallel threads and delivers good performance regardless of the size of the region of interest or the number of results. The method comprises characterizing (230) each sample by determining a gender as a function of differential chromosome coverage; calculating an experimental covariability through one or more correlation matrices; selecting (406) at least one control structure by iterative clustering (404); establishing (213) work points based on variations with respect to the control structures; and detecting (240) the structural genetic variants at the determined work points. (Machine translation by Google Translate, not legally binding)
Publication number: ES2711163A1
Application number: ES201731242
Filing date: 2017-10-23
Publication date: 2019-04-30
Inventor: Una Iglesias David De
Applicant: Health In Code S L;
Main IPC class:
Description:

[0001]
[0002] System and method of detection of structural genetic variants.
[0003]
[0004] Object of the invention
[0005]
[0006] The present invention relates to the healthcare and biotechnology sector and, more specifically, to a method and system for the detection, evaluation and exploration of structural genetic variants.
[0007]
[0008] Background of the invention
[0009]
[0010] Genetic information is encoded as a succession of nucleotides. The genome contains genes that encode specific proteins and regulate different functions of the organism. Genetic information is specific to each organism, so its variations condition the organism's physiology. Many diseases are associated with genetic variants that have deleterious effects, or mutations. Although many variants consist of alterations of a few nucleotides, others involve ranges from tens to thousands of nucleotides, in the form of inversions of sequences, relocations to different positions, deletions or duplications. Variations involving a large number of positions are called structural variants and, in particular, when they entail an increase or decrease in the number of copies of a certain sequence, they are called copy number variants (CNVs). Since these structural variants have frequently been described in association with adverse health consequences, it is critical to develop techniques for their detection and analysis.
[0011]
[0012] Since the appearance of the Sanger sequencing method in 1977, scientists have been able to sequence nucleic acids in a reproducible and reliable manner. A decade later, Applied Biosystems introduced the AB370, the first automatic sequencing instrument based on capillary electrophoresis, which became the main tool for completing the Human Genome Project and obtaining a consensus genetic sequence for Homo sapiens. After this first stage, new sequencing technologies known as next-generation sequencing (NGS) emerged in the 21st century. They have reached the power needed to sequence huge amounts of genetic material rapidly and at low cost by performing millions of sequencing reactions in parallel. Due to the high volume of information generated by NGS methods, the greatest complexity has now shifted from the sequencing processes to the computational analysis processes.
[0013]
[0014] The NGS methods currently established in the market, including the leading technology, Illumina, require a sample preparation that includes amplification (obtaining clonal copies of the sequences in the original sample) and fragmentation, in order to have a collection of relatively short sequences that cover the original sequences and can be sequenced in parallel in the NGS instrument. Once the nucleotides of each fragment (or of a part of it) have been resolved, a set of reads is available that provides the sequence of a section of the genetic material of the starting sample. When sequencing provides pairs of reads that resolve each of the ends of a fragment (the central part may remain unresolved), it is referred to as paired-end sequencing. A nucleotide at position p of the reference sequence will have been sequenced by all the reads covering that position p. Position p is then said to have a coverage X equal to the number of times it has been read or, equivalently, the number of reads that cover it.
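The definition of coverage above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each aligned read is represented by a hypothetical (start, end) pair of inclusive reference coordinates; the function name is not taken from the invention.

```python
from collections import Counter


def coverage_per_position(reads):
    """Count, for each reference position, how many reads cover it.

    `reads` is an assumed representation: (start, end) inclusive intervals.
    """
    coverage = Counter()
    for start, end in reads:
        for position in range(start, end + 1):
            coverage[position] += 1
    return coverage


# Three overlapping reads; position 104 is covered by all of them.
reads = [(100, 104), (102, 106), (104, 108)]
coverage = coverage_per_position(reads)
print(coverage[104], coverage[100], coverage[109])
```

Positions that no read covers simply report a coverage of zero, because a Counter returns 0 for missing keys.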
[0015]
[0016] Sequencing can be directed at the whole sample or limited to certain genetic material within it. The genetic material can be selected by selective amplification of genetic regions of interest ("amplicons") or by recovering fragments of the starting material that contain specific sequences ("capture techniques"). Both whole exome sequencing (WES) and panels are applications of selective sequencing. Whole genome sequencing (WGS) is a case of nonselective sequencing.
[0017]
[0018] Given the prevalence of NGS techniques in the industrial and clinical fields, and the growing importance of determining structural variants, different methods have been developed in recent years to determine them from the reads produced by NGS sequencing. The different approaches can be summarized in four main categories, including combinations of them. All of them require aligning the reads against a known consensus sequence associated with the original genetic material, in order to know which part of the original sequence each read covers (e.g. the consensus can be the reference sequence of the human genome). The methods of the first category (usually called 'read-pair') analyze the orientation and distance between the two end sequences of a fragment (they work on paired-end sequencing). The second category groups methods (commonly called 'split-read') that examine reads containing sequences that do not occur directly in the reference sequence but are combinations of distant sequences. The third group of methods (usually called 'read count') compares the number of reads that cover a certain region of the consensus sequence with other areas under similar conditions (from the same sample or from other control samples) to detect differential regions associated with a relative variation in the quantity of the starting material (e.g. the existence of deletions that include said sequences) or with variations in the sequence of the sample at said positions that make it impossible to find a correspondence with the reference. Finally, other methods try to reconstruct the complete (unfragmented) starting sequence from the reads and then compare it with the reference sequence, thus detecting structural variations.
[0019]
[0020] The traditional approach to identifying structural variants employs cytogenetic techniques such as karyotyping or fluorescence in situ hybridization (FISH). In 2003, comparative genomic hybridization (CGH) arrays and single-nucleotide polymorphism (SNP) arrays appeared. With the establishment of NGS technology, the different detection strategies have overcome some limitations: they rely on data intended for more general use, and they achieve greater resolution, precision and capacity to detect new structural variants. In spite of these advances, the results obtained with the different methods still vary and are unreliable, so there is no universal standard, satisfactory solution or widely used set of tools.
[0021] Targeted sequencing has become established as a cost-effective way to interrogate regions of interest in a large set of samples, especially in clinical practice for the study of gene panels related to the diseases under evaluation. Although gene panel sequencing today accounts for most of the studies carried out, mainly in clinical practice, practically none of the proposals for structural variant analysis with NGS supports this type of design. In addition, the small size, dispersion and discontinuous nature of the regions of interest in targeted sequencing make it very difficult to apply structural variant detection methods to the data.
[0022]
[0023] Methods based on paired-end or split-read information are unable to detect structural variants in the context of targeted sequencing unless the regions bordering the ends of the structural variants are covered by regions of interest, which is unusual. The discontinuity of the regions of interest also prevents assembly-based methods from working. In this context, approaches based on read count or coverage, alone or in combination with others, are the most appropriate to date. Even so, read-count methods have difficulties when they have not been specifically designed for this discontinuous setting, since those implemented for whole genome sequencing assume conditions, such as a normal distribution of coverage or the continuity of the search space, that do not hold in targeted sequencing.
[0024]
[0025] The few existing methods applicable to targeted sequencing do not have high precision and in practice require one or several exons to be completely affected. Both the false positive and false negative rates are high, and the results cannot be interpreted, so specialists cannot review them in depth to improve them. Some of the characteristics that contribute to this situation are:
[0026]
[0027] - The mathematical models selected for the data signals are poorly fitted, as a result of ignorance of the variables involved in the response signals analyzed (whether read count, coverage or others) and of a simplification of the information, in part by not considering information about the sample preparation and the biological context.
[0028] - The methodology for selecting controls in order to model the signals is oriented to whole genome or exome sequencing data, but not to panels. Even under ideal conditions, the selection methods and the models do not contemplate the necessary variables, such as the effects of sample fragmentation or the characteristics of the biological regions.
[0029] - Regions with high homology, pseudogenes and the effect of genetic variants, especially in these regions, cause reads to be systematically assigned to the wrong chromosomal regions, which alters the signal in the form of gains and losses that are detected as structural variants. Because the models do not track these circumstances for later review, such situations are very difficult to identify. In addition, these alterations introduce biases in the mathematical models and are not adequately controlled when selecting controls for the construction of the models.
- The proposals are very sensitive to the effects caused by the divergence of the samples and the experimental variability, which is particularly problematic in panel sequencing. These effects are simply treated as unexplainable noise.
[0030] - Attempts to correct the biases that affect the distribution of the analysis signal along the chromosomal regions have not been able to eliminate them.
[0031] - The search and detection strategies for CNVs depend heavily on certain parameters that are set experimentally with a data set that is not reproducible in practice, or that are often set directly by the user; this happens, for example, with the choice of calculation windows or bins.
[0032] - When the credibility or quality with which a structural variant has been detected is assessed at all, it is assessed according to the mathematical model alone, without additional tests that consider further information and without incorporating the necessary information about the context in which the variant was found, with a view to a second, in-depth evaluation by the specialist.
- The transformations applied to the data or signals for their modeling and presentation, when some type of graphic is available, cause the relationship between the signal and the biological characteristic to be lost, hiding causality.
[0033] Besides the limitations presented above, among others, the current proposals do not integrate well with clinical practice and, as a result:
[0034]
[0035] - They aim for high specificity at the expense of sensitivity, because they do not consider the process from a general perspective in which sensitivity must prevail in a first phase: the tool is a support whose results will later be evaluated and confirmed by experts, who have a great deal of supplementary biological and clinical information. The detection limits are problematic and there is a high risk of missing events.
- There are important restrictions on the choice of controls, for example regarding the inclusion of relatives or the detection of polymorphic CNVs.
- The results cannot be explored taking into account the individual information of each case or sample, so traceability is lost. Neither individual filters to be applied afterwards nor specific treatments according to the circumstances of the study are provided.
[0036]
[0037] In short, the state of the art still lacks a reliable tool for detecting genetic variants that supports panel sequencing and also facilitates management and exploitation for the comprehensive analysis of the results obtained.
[0038]
[0039] Description of the invention
[0040]
[0041] The present invention solves all the aforementioned problems by means of a technique for determining genetic variants with a global perspective, applicable to various scenarios, among them (in a non-limiting manner) medical genetics and panel sequencing.
[0042]
[0043] In a first aspect of the invention, a method of detecting structural genetic variants from sequencing data is presented. The data preferably come from NGS systems, but in certain embodiments they can be obtained with any other sequencing technology known in the state of the art. The method comprises the following steps for each sample:
[0044]
[0045] - Determine a gender of the sample as a function of, at least, a differential coverage of the X chromosome. Preferably, said gender determination is made by comparing the coverage of the X chromosome with that of the autosomal chromosomes, although other gender determination methods known in the state of the art can be implemented. This strategy makes the gender determination robust to low-quality or poorly sequenced samples.
[0046] - Calculate, from the sequencing data, at least one correlation matrix that represents the experimental covariability of the plurality of samples. Preferably, two correlation matrices are calculated. The first correlation matrix is preferably based on the coverage profile, but can alternatively be based on a range of positions, a number of mapped reads, a metric derived from a chromosomal region, or any combination of the above. The second correlation matrix is preferably based on the size distribution of the sequenced fragments, with or without adapters and other fragments of known length, but can alternatively be based on the separation between reads in paired-end sequencing, or on any other size metric known in the state of the art.
- Select at least one control structure through iterative clustering. The clustering is preferably performed iteratively until a convergent result is reached and, more preferably, it is based on a k-means algorithm, although other embodiments of the invention may use other clustering algorithms known in the state of the art.
[0047] - Establish work points (or "working points") based on the variability and reference values of the control structure (i.e. the model substitute) and the study samples. Preferably, the work points are determined on positions whose coverage exceeds a threshold, without excluding the possibility of working with ranges instead of positions, or with metrics other than coverage, such as the number of mapped reads. These characteristics are also preferably studied considering the local context, that is, the information of nearby positions. The work points represent positions where the model adequately captures the effects of potential structural variants, eliminating possible biases. Preferably, the work points are established as a function of the variations with respect to a reference value. More preferably, said reference value is a mean, median or other measure of central tendency, calculated on the coverage, the read count, or any other metric derived from said variables. Also preferably, the variation is calculated as the interquartile range, the variance, the standard deviation or another measure of variation with respect to said reference value.
[0048] - Preferably, normalize the data according to the gender determined for each sample, eliminating enrichment biases. Preferably, the data are normalized using the total coverage in the autosomal regions, although one or more specific ranges can be used. Alternatively, the data can be normalized by the number of aligned reads.
[0049] - Detect the structural genetic variants at the established work points, depending on the deviations from the control structures at those work points. Preferably, this step also comprises assigning to each detected structural variant a value indicative of a degree of confidence, depending on, at least, the deviation with respect to the control structure and the chromosomal region where the structural genetic variant is located. Also preferably, this step comprises merging multiple detected structural variants into a single one, depending on, at least, the deviation with respect to the control structure and the chromosomal region where the structural genetic variants are located.
[0050]
[0051] - Also preferably, filter the results according to at least one factor such as the ratios (number of copies) characteristic of the structural variants to be detected, information on coding areas or panels, or other annotations on the regions affected by the altered ranges detected. Also preferably, the results can be filtered according to the information provided by the alignment structure of the study sample, with the possibility of analyzing events associated with "split-read" techniques, such as the search for breakpoints.
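The first two steps of the list above (gender determination from X versus autosomal coverage, and a coverage-based correlation matrix) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names, the 0.75 threshold and the use of the Pearson coefficient are hypothetical choices, not values or formulas given in this description, and the iterative clustering step is not covered.

```python
from math import sqrt
from statistics import mean


def infer_gender(x_coverage, autosomal_coverage, threshold=0.75):
    """Call gender from the ratio of mean X coverage to mean autosomal coverage.

    The threshold is an assumed cut-off: a ratio near 1.0 suggests two copies
    of the X chromosome, a ratio near 0.5 suggests one copy.
    """
    ratio = mean(x_coverage) / mean(autosomal_coverage)
    return ("female" if ratio >= threshold else "male"), ratio


def pearson(a, b):
    """Pearson correlation between two equally long coverage profiles."""
    mean_a, mean_b = mean(a), mean(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return numerator / denominator if denominator else 0.0


def correlation_matrix(profiles):
    """Pairwise correlation of samples; `profiles` maps sample name to coverage."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names}


profiles = {
    "s1": [10, 20, 30, 40],
    "s2": [20, 40, 60, 80],  # same shape as s1, twice the depth
    "s3": [40, 30, 20, 10],  # opposite trend
}
matrix = correlation_matrix(profiles)
gender, ratio = infer_gender([48, 52, 50], [100, 98, 102])
print(gender, round(ratio, 2), round(matrix["s1"]["s2"], 3))
```

In this toy data, s1 and s2 correlate strongly because they differ only in sequencing depth, which is the kind of experimental similarity a control selection would look for.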
[0052]
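The work-point and detection steps of the method can likewise be sketched. The sketch assumes normalized per-position coverage for a set of control samples, uses the median as the reference value and the interquartile range as the measure of variation (both named as options in the description), and applies hypothetical thresholds (`min_cov`, `max_rel_iqr`, `low`, `high`) that are not taken from the invention.

```python
from statistics import median, quantiles


def work_points(control_cov, min_cov=30.0, max_rel_iqr=0.2):
    """Select positions where the controls are well covered and stable.

    control_cov: one coverage list per control sample, all of equal length.
    Returns {position_index: reference_value}, the reference being the median.
    """
    points = {}
    for i in range(len(control_cov[0])):
        values = [sample[i] for sample in control_cov]
        reference = median(values)
        if reference < min_cov:
            continue  # not enough coverage to model this position
        q1, _, q3 = quantiles(values, n=4)
        if (q3 - q1) / reference <= max_rel_iqr:
            points[i] = reference
    return points


def deviations(sample_cov, points, low=0.75, high=1.25):
    """Flag work points where the study sample deviates from the reference."""
    calls = {}
    for i, reference in points.items():
        ratio = sample_cov[i] / reference
        if ratio <= low:
            calls[i] = "loss"
        elif ratio >= high:
            calls[i] = "gain"
    return calls


controls = [
    [100, 100, 5],
    [102, 40, 6],
    [98, 160, 4],
    [101, 90, 5],
]
points = work_points(controls)
calls = deviations([50, 95, 5], points)
print(points, calls)
```

Only the first position survives as a work point: the second is too variable across controls and the third is too poorly covered, so neither can support a call in the study sample.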
[0053] Preferably, the method comprises storing the results obtained using an encoding that allows random access and that includes, at least, the following fields:
[0054]
[0055] - A header (1812) with a fixed-size section (1807) and a variable-size section (1808). The fixed-size section (1807) comprises data size information, while the variable-size section (1808) comprises metadata signaling information.
[0056] - A body (1813) comprising blocks of equal size with chromosomal coordinate information, a rescaled reference signal (1803) and rescaled signals (1804, 1805, 1806) associated with each sample.
[0057] - A tail (1814) comprising location information.
[0058]
[0059] Also preferably, the method comprises accessing the results by random access, facilitated by the described encoding, which allows fluid visualization of the results regardless of the total file size. In particular, the access method comprises:
[0060]
[0061] - Access the header.
[0062] - Obtain metadata from the results.
[0063] - Obtain information associated with each point to be represented by accessing the body (1813) and retrieving blocks of information of constant size.
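The access steps above can be sketched with a small in-memory container. The layout used here is an assumption made only for illustration: a fixed-size header section holding three unsigned integers (metadata length, block size, block count), a variable-size JSON metadata section, and a body of equal-size blocks. The tail section with location information is omitted, and none of the names or field widths come from the invention.

```python
import io
import json
import struct

# Hypothetical fixed-size header section: metadata length, block size, block count.
FIXED = struct.Struct("<III")


def write_container(metadata, blocks):
    """Serialize metadata plus equal-size blocks into one byte string."""
    meta_bytes = json.dumps(metadata).encode("utf-8")
    block_size = len(blocks[0])
    buffer = io.BytesIO()
    buffer.write(FIXED.pack(len(meta_bytes), block_size, len(blocks)))
    buffer.write(meta_bytes)
    for block in blocks:
        buffer.write(block)
    return buffer.getvalue()


def read_metadata(stream):
    """Read the header and return the decoded metadata."""
    stream.seek(0)
    meta_len, _, _ = FIXED.unpack(stream.read(FIXED.size))
    return json.loads(stream.read(meta_len).decode("utf-8"))


def read_block(stream, index):
    """Fetch one block by seeking directly to its offset."""
    stream.seek(0)
    meta_len, block_size, count = FIXED.unpack(stream.read(FIXED.size))
    if not 0 <= index < count:
        raise IndexError(index)
    stream.seek(FIXED.size + meta_len + index * block_size)
    return stream.read(block_size)


data = write_container({"samples": 2}, [b"AAAA", b"BBBB", b"CCCC"])
stream = io.BytesIO(data)
print(read_metadata(stream), read_block(stream, 1))
```

Because every block has the same size, the offset of block `index` is a simple multiplication, so the cost of fetching a block does not depend on how many blocks the container holds.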
[0064]
[0065] In a second aspect of the invention, a detection system for structural genetic variants is presented, comprising detection means that implement the steps of any implementation of the method of the first aspect of the invention. The detection means store the generated results (that is, the detected genetic variants and the information linked to them, optionally including information on the control structures as well as meta-information about the included data itself) in a data storage means, which in turn is accessed by exploration means (also referred to as scanning means). Said exploration means serve as an interface with the user or with other systems, obtaining the data that need to be visualized or transmitted at each moment. Depending on the particular implementation of the system, the detection means, storage means and exploration means may be integrated in the same device or be implemented in multiple devices connected by any wired or wireless connection known in the state of the art. In one implementation option, the storage means and exploration means can be integrated in the same standalone computer file generated by the detection means, usable from an external system, either local or remote.
[0066] Finally, in a third aspect of the invention, a computer program is presented comprising computer program code means adapted to implement the described method when executed on a digital signal processor, an application-specific integrated circuit, a microprocessor, a microcontroller or any other form of programmable hardware. Note that any preferred option and particular implementation of the device and system of the invention can be applied to the method and the computer program of the invention, and vice versa.
[0067]
[0068] The method, system and computer program of the invention therefore make it possible to detect genetic variants reliably and efficiently, in a manner compatible with panel sequencing, and also facilitate management and exploitation for the comprehensive analysis of the results obtained. The organization of the data also makes the computational load independent of the size of the data, speeding up the computation and allowing the results to be visualized through random access in constant time.
[0069]
[0070] Description of the figures
[0071]
[0072] To aid understanding of the characteristics of the invention according to a preferred practical embodiment thereof, and to complement this description, the following figures, which are illustrative and non-limiting, are included as an integral part thereof:
[0073]
[0074] Figures 1A and 1B schematically show the main elements of respective preferred embodiments of the system of the invention.
[0075]
[0076] Figure 2 presents a flow diagram of the steps performed by the detection subsystem, in accordance with a preferred embodiment of the method of the present invention.
[0077] Figure 3 is a flow chart of the sex assignment process, according to a preferred embodiment of the method of the present invention.
[0078]
[0079] Figure 4 exemplifies a flow chart of the process of assigning sets of control samples to each study sample, according to a preferred embodiment of the method of the present invention.
[0080]
[0081] Figure 5 illustrates a flow chart of the calculation of the correlation matrix based on the coverage profile, according to a preferred embodiment of the method of the present invention.
[0082] Figure 6 presents a flowchart of the procedure of clustering the samples and assigning control samples for each study sample from the correlation data based on coverage and fragmentation, according to a preferred embodiment of the method of the present invention.
[0083]
[0084] Figure 7 is a flowchart of the construction process of the reference model linked to each set of control samples, for a set of regions of interest assigned to a work block, according to a preferred embodiment of the method of the present invention.
[0085]
[0086] Figure 8 exemplifies a flow chart to establish whether a position of a region of interest pertains to the work points for the model associated with a reference set, in accordance with a preferred embodiment of the method of the present invention.
[0087] Figure 9 illustrates a flow diagram of the study of the behavior of a coverage signal against its corresponding reference model (search algorithm), according to a preferred embodiment of the method of the present invention.
[0088]
[0089] Figure 10 presents a flow diagram of processing of an outlier range, according to a preferred embodiment of the method of the present invention.
[0090]
[0091] Figure 11 is an adjustment flow diagram of the limits of the outlier, according to a preferred embodiment of the method of the present invention.
[0092]
[0093] Figure 12 exemplifies a flowchart of an outlier characterization process, in accordance with a preferred embodiment of the method of the present invention.
[0094]
[0095] Figure 13 illustrates a flow diagram for the extension of the lower limit of the range, according to a preferred embodiment of the method of the present invention.
[0096]
[0097] Figure 14 presents a flow chart for determining whether an outlier can reflect a structural variant, according to a preferred embodiment of the method of the present invention.
[0098]
[0099] Figure 15 is a flow diagram of detection and valuation of breakpoints compatible with a structural variant, according to a preferred embodiment of the method of the present invention.
[0100]
[0101] Figure 16 exemplifies a flow chart of the calculation and recording of a score that reflects the degree of confidence that the signal behavior for the candidate CNV effectively reflects an underlying structural variant, in accordance with a preferred embodiment of the method of the present invention.
[0102] Figure 17 illustrates a flow diagram of fusion of CNVs, according to a preferred embodiment of the method of the present invention.
[0103]
[0104] Figure 18 shows a preferred embodiment of the coding with which the scanning means stores the information in the data storage means.
[0105]
[0106] Figure 19 illustrates a flow chart of data flow generation, according to a preferred embodiment of the method of the present invention.
[0107]
[0108] PREFERRED EMBODIMENT OF THE INVENTION
[0109]
[0110] In this text, the term "comprises" and its derivations (such as "comprising", etc.) should not be understood in an excluding sense, that is, these terms should not be interpreted as excluding the possibility that what is described and defined can include more elements, stages, etc. Also, descriptions of functions and elements known in the state of the art may have been omitted for clarity and concision.
[0111]
[0112] Note that the preferred embodiments of the invention are described for the case of information extracted by NGS techniques, but they can be applied in general to any other genetic analysis technique known in the state of the art. Likewise, the preferred embodiments are described using specific names of files and variables to facilitate the understanding of the invention, but these in no case limit its scope as claimed, and the invention can be implemented with any other organization or nomenclature of data, files and/or databases that allows the described process to be implemented. In the same way, the person skilled in the art will understand that modifications can be introduced in the order and/or distribution of the steps described, as well as in the particular mathematical functions implemented, within said scope as claimed.
[0113]
[0114] Figure 1A schematically presents the elements of a first preferred embodiment of the invention, comprising a detection and annotation subsystem (101) (also called detection means), a data storage subsystem (102) (also called data storage means or data container subsystem), and a scanning subsystem (103) (also called scanning means). The detection subsystem (101) detects the structural genetic variants and encodes the resulting information in a structured manner in the data container subsystem (102). In turn, the scanning subsystem (103) acts as an interface with the user, receiving the user's commands and displaying the corresponding information stored in the data container subsystem (102).
[0115]
[0116] Although the format of the information stored in the data container subsystem (102) may vary between implementations, the byte stream for the entire scan interval associated with each candidate CNV preferably comprises the following metadata: the included positions, according to the export resolution configured in the detection and annotation subsystem, as well as the signals of the reference and of each sample for those positions. For a given scan interval, the byte stream is formed by a succession of blocks of the same size, which depends on the number of samples. Each block contains information associated with a chromosomal position, preferably: the upper digits of the chromosomal coordinate (the integer part of dividing said coordinate by 10000), the 5 least significant digits of said coordinate (the remainder of dividing the coordinate by 10000), the reference signal according to the model for said position, rescaled to the signal of the study sample for said position, and the rescaled signals associated with said position for each of the samples considered in the study.
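The block layout described above can be sketched with Python's struct module. The divisor 10000 follows the text; the field widths (two 32-bit unsigned integers for the coordinate parts and 32-bit floats for the signals), the byte order and the function names are assumptions made only for this illustration.

```python
import struct


def block_format(n_samples):
    """Assumed layout: two uint32 coordinate parts, then 1 + n_samples float32."""
    return "<II" + "f" * (1 + n_samples)


def pack_block(coordinate, reference, sample_signals, split=10000):
    """Encode one position: coordinate split by `split`, reference, sample signals."""
    high, low = divmod(coordinate, split)
    return struct.pack(
        block_format(len(sample_signals)), high, low, reference, *sample_signals
    )


def unpack_block(data, n_samples, split=10000):
    """Decode a block back into (coordinate, reference, sample signals)."""
    high, low, reference, *signals = struct.unpack(block_format(n_samples), data)
    return high * split + low, reference, signals


block = pack_block(1234567, 1.0, [0.5, 1.0, 1.5])
coordinate, reference, signals = unpack_block(block, n_samples=3)
print(len(block), coordinate, reference, signals)
```

With this layout the block size is 8 + 4 × (1 + number of samples) bytes, so it depends only on the number of samples, as the description requires.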
[0117]
[0118] All metadata of a data flow associated with a candidate CNV is provided directly to the scanning subsystem (103), without the need to examine all the data in the container. Once a data flow has been located, the scanning subsystem (103) requests blocks of data associated with specific coordinates, which are provided by random access, that is, without the need to access the rest of the blocks of the flow.
[0119]
[0120] Note that the main blocks of the system can be implemented following various alternative configurations, independent of the detection technique implemented in the detection subsystem (101). For example, Figure 1B presents an example in which the detection subsystem (101) is configured to generate a result that integrates the data container subsystem (102) and the scanning subsystem (103) into a single file. In the same way, the detection subsystem (101), the data storage subsystem (102) and the scanning subsystem (103) can be integrated in the same equipment, be implemented in various pieces of equipment connected through any wired or wireless communication subsystem known in the state of the art, or comprise any additional means of processing, data storage, interaction with the user, etc. generally known in the state of the art. For example, the scanning subsystem (103) may comprise its own viewer, or else communication means that provide a graphical interface to a web browser. According to other alternatives, the scanning subsystem (103) can be a client application of a server that provides the information of the data container subsystem (102); or alternatively the scanning subsystem (103) and the data storage subsystem (102) can be integrated into a single standalone result, which acts as a data provider for the scanning subsystem (103) embedded in the result itself, so that the user can explore the results offline with a browser.
[0121]
[0122] Figure 2 schematically presents the steps performed by the detection subsystem (101), in accordance with a preferred embodiment of the method of the invention. Once initialized (200), the steps can be grouped into: reading (210) the configuration parameters for the detection process, processing (220) the information and initializing the process, characterizing (230) the samples and assigning reference models, detecting and characterizing (240) CNVs, and generating (250) results. The step of reading (210) the configuration parameters includes the configuration and initialization of the process itself and the structuring of the work to be performed, including the following steps: loading (211) the processing parameters (which we will call the analysis configuration), determining (212) which samples are going to be analyzed and where to locate their NGS data for the analysis, establishing (213) the chromosomal intervals on which the CNV detection process is to be performed (which we will call the region of interest, or ROI), and obtaining (214) the annotations linked to said region of interest as well as contextual information.
[0123]
[0124] The configuration information of the system can be provided by means of a configuration file, without excluding a dedicated user interface with fields to set the desired parameters. The list of samples, the location of the linked NGS data, the region of interest and the annotations can be supplied in the same way. As a particular, non-exclusive case, the region of interest and the different chromosomal annotations can be specified in .bed format; other formats are also supported, among them .gff files and the formats used by UCSC (University of California, Santa Cruz), chosen for their popularity in the field of bioinformatics.
[0125] The basic NGS data for the analysis, related to the samples, typically consist of one alignment file (called bam) per sample. This file follows a standard format and can be obtained by aligning the reads generated during sequencing using various programs; the process of obtaining the bam is widely described in the state of the art. Once the samples are sequenced, it is enough to configure the execution of a data flow or execution chain (also called a "pipeline") prior to the CNV detection process in order to have said data.
[0126]
[0127] The region of interest supplied to the exploration and annotation system consists of one or more chromosomal ranges, for each of which the following is specified: chromosome, chromosomal position at the beginning of the interval, chromosomal position at the end of the interval, and two labels. The first label names a series of related regions (for example, it can be the name of a gene); the second, when combined with the first, identifies the region (for example, the number of the contiguous coding region associated with the gene). Other implementations may comprise alternative and/or additional identification levels. During the loading of the region of interest, overlapping ranges are merged into new ranges that comprise them, keeping a record of the original regions. A configuration parameter establishes a number of additional base pairs that are added at both ends of the intervals supplied to the subsystem and incorporated into the region of interest. When, after said addition, intervals become contiguous or overlap, they are merged into a new one, also preserving the record of their original regions. The resulting merged intervals are ordered according to the chromosome they occupy and their initial coordinates. At the end of the process there is a list of intervals ROI = {R1 ... Rn}. For each element Ri of the list of intervals, the following is available: a name (identifier), a group or category, a chromosome, the initial and final chromosomal positions of the range covered, a list of the initial coordinates of each of the original merged intervals (without repetitions), and a corresponding list of the final coordinates.
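The loading of the region of interest described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Interval` class, the `build_roi` function, the `padding` argument (standing in for the configuration parameter that sets the additional base pairs) and the naming of merged intervals are all hypothetical. Chromosomes are sorted as plain strings for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    # Hypothetical record for one element Ri of the ROI list.
    name: str
    group: str
    chrom: str
    start: int
    end: int
    orig_starts: list = field(default_factory=list)
    orig_ends: list = field(default_factory=list)


def build_roi(regions, padding=0):
    """regions: iterable of (chrom, start, end, group, label) tuples."""
    # Pad both ends, then order by chromosome and initial coordinate.
    padded = sorted(
        (chrom, max(1, start - padding), end + padding, start, end, group, label)
        for chrom, start, end, group, label in regions
    )
    roi = []
    for chrom, p_start, p_end, o_start, o_end, group, label in padded:
        last = roi[-1] if roi else None
        # Merge when the padded interval overlaps or is contiguous with the previous one.
        if last is not None and last.chrom == chrom and p_start <= last.end + 1:
            last.end = max(last.end, p_end)
        else:
            # Assumption: a merged interval keeps the name of its first member.
            last = Interval(f"{group}_{label}", group, chrom, p_start, p_end)
            roi.append(last)
        # Keep a record of the original (unpadded) coordinates, without repetitions.
        if (o_start, o_end) not in zip(last.orig_starts, last.orig_ends):
            last.orig_starts.append(o_start)
            last.orig_ends.append(o_end)
    return roi
```

With a padding of 5 bp, two exons at 100-200 and 205-300 on the same chromosome become a single interval 95-305 that still remembers both original coordinate pairs; with no padding they stay separate.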
[0128]
[0129] The annotations refer to chromosomal positions or ranges and indicate, in a way equivalent to the region of interest: chromosome, initial and final chromosomal positions, and the type of annotation in question. These annotations include the exonic regions (in chromosomal coordinates), both the coding regions and the untranslated regions or UTRs (Untranslated Regions), of the transcripts to be considered that cover the region of interest.
[0130]
[0131] Next, the step of processing (220) the information and initializing the process comprises generating (221) an execution plan and preparing the workload for its parallelization. The CNV detection subsystem (101) divides the work into blocks that can be executed in parallel. The number of bases covered by the region of interest is computed and divided into blocks that are assigned to multiple work branches, configured in the form of threads, subprocesses, etc. Each branch works sequentially on the blocks assigned to it. Each block is a set of ROI intervals such that all the intervals belonging to the same group (for example, to the same gene) are contained in a single block. The blocks are allocated to the branches in such a way that the difference in base pairs (bp) between the branches is minimal. The size of the blocks is controlled by a system parameter that defines the bp per iteration, which we will call Cfg.pbPerlteration, so that the intervals of a new gene will not be assigned to a work block if this block already exceeds the number of base pairs established by said parameter. This working model optimizes the available resources while maintaining the maximum context information (e.g. contiguous genes) during the processing of the data.
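The execution plan described above can be sketched as follows. The text does not state how the base-pair difference is minimized, so this sketch assumes a greedy strategy (largest block first, always to the least loaded branch). The functions `make_blocks`, `block_bp` and `assign_blocks` are hypothetical, and `bp_per_iteration` plays the role of the configured block size. Intervals are assumed to arrive ordered, so that the members of a group are adjacent.

```python
import heapq
from itertools import groupby


def block_bp(block):
    # Number of base pairs covered by a block of (group, start, end) intervals.
    return sum(end - start + 1 for _, start, end in block)


def make_blocks(intervals, bp_per_iteration):
    """intervals: ordered list of (group, start, end); returns a list of blocks."""
    blocks, current = [], []
    for _, members in groupby(intervals, key=lambda iv: iv[0]):
        # A new group is not added to a block that already exceeds the limit.
        if current and block_bp(current) > bp_per_iteration:
            blocks.append(current)
            current = []
        # All intervals of the same group stay in a single block.
        current.extend(members)
    if current:
        blocks.append(current)
    return blocks


def assign_blocks(blocks, n_branches):
    # Greedy balancing (assumption): biggest block goes to the lightest branch.
    heap = [(0, branch) for branch in range(n_branches)]
    heapq.heapify(heap)
    plan = [[] for _ in range(n_branches)]
    for block in sorted(blocks, key=block_bp, reverse=True):
        load, branch = heapq.heappop(heap)
        plan[branch].append(block)
        heapq.heappush(heap, (load + block_bp(block), branch))
    return plan
```

Each entry of the returned plan is the list of blocks that one branch (thread or subprocess) would process sequentially.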
[0132]
[0133] Prior to the search for CNVs by the detection and annotation subsystem, the study samples must be modeled. The step of characterizing (230) the samples and assigning reference models includes assigning (231) coverage, sex and variant profiles, and assigning (232) reference sets (control samples) for each sample. This process involves the characterization of the samples and the assignment, to each study sample, of a set of control samples from which a control model is generated. For each sample the following is recorded: sample identifier, sequencing plate, lane inside the plate and the index assigned to the sample, associated alignment file, and variant file (for example, an associated .vcf file). During the characterization of the samples the following is also recorded: the total coverage in the regions of interest (ROI), on the autosomal chromosomes, on the X chromosome and on the Y chromosome; the associated sex (male, female); and the list of linked variants, indicating chromosome, chromosomal position, alternative allele frequency and quality. Coverage can be obtained in any way generally known in the state of the art, for example from coverage reports resulting from a previously executed analysis pipeline (e.g. a pipeline for the detection of non-structural variants), by executing an external program, or by using code that extracts such information from the bam files.
[0134]
[0135] After characterizing (230) the samples and assigning reference models, the method comprises executing (241) the detection and annotation process, and generating (251) the corresponding results, before its completion (260).
[0136]
[0137] Figure 3 shows in more detail the process of gender assignment implemented in the step of characterizing (230) the samples and assigning reference models. Sex allocation is based on the differentiated proportion that men and women present in terms of coverage on the X chromosome versus autosomal chromosomes. The steps of said protocol after its initiation are the following:
[0138]
[0139] 1) Calculate (302) the total coverage in the ROI region of interest for each of the samples, on the X chromosome and on the autosomal chromosomes
[0140]
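[0141] $Cx = \{ Cx_i / M_i \in M \}, \quad Cx_i = \sum_{r \in R_X} \sum_{p = pini(r)}^{pfin(r)} cob(p, M_i), \quad R_X = \{ r \in ROI / crom(r) = X \}$
[0142] $Ca = \{ Ca_i / M_i \in M \}, \quad Ca_i = \sum_{r \in R_A} \sum_{p = pini(r)}^{pfin(r)} cob(p, M_i), \quad R_A = \{ r \in ROI / crom(r) \notin \{X, Y\} \}$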
[0143]
[0144] where:
[0145]
[0146] crom (r) is a function that returns the chromosome on which a region of interest r is located.
[0147]
[0148] RX represents the set of regions of interest located on the X chromosome.
[0149] RA represents the set of regions of interest located on the autosomal chromosomes (ie all excluding the X chromosome and the Y chromosome).
[0150] M represents the set of study samples. Mi is the sample "i" under study.
[0151] cob (p, m) is a function that returns the coverage in the position p for the sample m
[0152]
[0153] pini (r) and pfin (r) are functions that given a region r belonging to the set of regions of interest ROI respectively return the starting position of said region and the final position.
[0154]
[0155] Cx is the set of total coverages on the X chromosome for each of the samples. Each element of set (Cxi) is the total coverage on the X chromosome for the sample "i" in the study calculated as the sum of the coverages in each position covered by each of the regions of interest located on the X chromosome for the sample "i"
[0156]
[0157] Ca is the set of total coverages in the autosomal chromosomes (all with exception of the X chromosome and the Y chromosome) for each of the samples. Each element of set (Cai) is the total coverage in the autosomal chromosomes for the sample "i" in the study calculated as the sum of the coverages in each position covered by each of the regions of interest located on some autosomal chromosome for the sample "i"
[0158]
[0159] 2) Calculate (303) for each sample the ratio of coverage on X to coverage on the autosomal chromosomes, and order the ratios
[0160]
[0161] $Cr_j = Cx_j / Ca_j$
[0162] $Cr = \{ Cr_1, \dots, Cr_n \}$ with $Cr_j \le Cr_{j+1}$
[0163]
[0164] where:
[0165]
[0166] Cxj is the total coverage on the X chromosome for the "j" sample.
[0167]
[0168] Caj is the total coverage in the autosomal chromosomes for the "j" sample.
[0169]
[0170] Crj is the ratio for the sample "j" between the total coverage on the X chromosome and on the autosomal chromosomes for said sample.
[0171]
[0172] Cr is the set of ratios of total coverage associated with the different samples of studies ordered from lowest to highest.
[0173] 3) Take (304) a pair of elements of Cr that has not been previously explored, and take (305) the lower value as representative of the ratio associated with men and the higher value as representative of the ratio associated with women. If it is the first pair taken, initialize the previous representative values to -1.
[0174]
[0175] $P = \{ (Cr_i, Cr_j) \,/\, j > i \wedge Cr_i \in Cr \wedge Cr_j \in Cr \}$
[0176]
[0177] $r_H = Cr_i, \quad r_M = Cr_j, \quad r'_H = r'_M = -1$
[0178]
[0179] where:
[0180]
[0181] P is the set of possible pairs, formed by all possible pairs of different total coverage ratios (regardless of the order of the two terms) of the different samples. Crj is the ratio for the sample "j" between the total coverage on the X chromosome and on the autosomal chromosomes for said sample.
[0182]
[0183] rH is a variable that represents the reference ratio for men, which will be recalculated in the different iterations of the method. r'H is the value that rH took in the previous iteration to the current iteration (or -1 in the case of the first iteration).
[0184]
[0185] rM is a variable that represents the reference rate for women, which will be recalculated in the different iterations of the method. r'M is the value that took rM in the previous iteration to the current iteration (or -1 in the case of the first iteration).
[0186]
[0187] 4) Divide (306) the ratios into those belonging to men and those belonging to women. If the distance from a ratio to the reference value for men is less than its distance to the reference value for women, assign it to men; otherwise, assign it to women.
[0188]
[0189] $H = \{ Cr_i \,/\, |Cr_i - r_H| < |Cr_i - r_M| \}$
[0190]
[0191] $M = \{ Cr_i \,/\, |Cr_i - r_H| \ge |Cr_i - r_M| \}$
[0192]
[0193] where:
[0194] Cri is the ratio for the sample "i" between the total coverage on the X chromosome and on the autosomal chromosomes for said sample.
[0195]
[0196] rH is a variable that represents the reference ratio for men.
[0197]
[0198] rM is a variable that represents the reference ratio for women.
[0199]
[0200] H is the set of coverage ratios closest to rH.
[0201]
[0202] M is the set of coverage ratios closest to rM or the same distance as rH.
[0203]
[0204] 5) Eliminate (307) the atypical ratios assigned to the groups. In the cluster of men, a value is atypical if the result of dividing the minimum value in the group of women by that value is greater than 3. In the cluster of women, a value is atypical if the result of dividing it by the maximum value in the group of men is greater than 3.
[0205]
[0206]
[0207]
[0208]
[0209] where:
[0210]
[0211] Cri is the ratio for the sample "i" between the total coverage on the X chromosome and on the autosomal chromosomes for said sample.
[0212]
[0213] min (X), max (X) are functions that return the element with the lowest value and the highest value of an X set, respectively.
[0214]
[0215] Hok is the set of ratios belonging to H such that the distance between the lowest value in M and them is less than 3 times the value of these ratios. The elements of Hok represent valid values in H (excluding outliers).
[0216] Mok is the set of ratios belonging to M such that the distance between the highest value in H and them is less than 3 times said maximum value in H. The elements of Mok represent valid values in M (excluding atypical values).
[0217] 6) If either of the sets is empty after removing the outliers and there are pairs of values left to explore, go back to step 3, taking a new pair. If there are no pairs left to explore, assign (308) unknown sex to each sample, concluding (309) the protocol.
[0218]
[0219] 7) If the sets of values for men and women without atypical values are not empty, recalculate (310) the representative values for each group as the average of said sets. The current values are first stored as the previous values.
[0220]
[0221]
[0222] $r'_H = r_H, \quad r'_M = r_M, \quad r_H = \overline{H_{ok}}, \quad r_M = \overline{M_{ok}}$
[0223]
[0224] where:
[0225]
[0226] rH is a variable that represents the reference ratio for men, which will be recalculated in the different iterations of the method. r'H is the value that rH took in the previous iteration to the current iteration (or -1 in the case of the first iteration).
[0227]
[0228] rM is a variable that represents the reference rate for women, which will be recalculated in the different iterations of the method. r'M is the value that took rM in the previous iteration to the current iteration (or -1 in the case of the first iteration).
[0229]
[0230] $\overline{H_{ok}}$ and $\overline{M_{ok}}$ represent the average value of the elements belonging to the set Hok and the set Mok respectively.
[0231]
[0232] 8) If the recalculated values match the previous ones:
[0233] a. If the quotient between the representative ratio of women and that of men is within the range [1.8, 2.2], divide (311) the samples based on the proximity of their associated ratios to the reference values for men and women.
[0234]
[0235]
[0236]
[0237] where:
[0238] Cri is the ratio for the sample "i" between the total coverage on the X chromosome and on the autosomal chromosomes for said sample.
[0239]
[0240] rH is a variable that represents the reference ratio for men.
[0241]
[0242] rM is a variable that represents the reference ratio for women.
[0243]
[0244] H is the set of coverage ratios closest to rH considered ratios belonging to samples of male sex.
[0245]
[0246] M is the set of coverage ratios closest to rM or the same distance as rH, considered ratios belonging to female samples.
[0247]
[0248] b. If the quotient is not within the range [1.8, 2.2]: if there are pairs left to explore, go back to step 3, taking (304) a new pair; otherwise, assign (308) unknown sex to each sample, concluding (309) the protocol.
[0249]
[0250] Note that the registered sex is a function of the number of copies of the X chromosome, and makes it possible to correct the natural imbalance for said chromosome between both sexes when comparing the signals linked to the samples on said chromosome. A sample without a signal on the X chromosome will be taken as a man, and a sample with 2 or more copies of the X chromosome will be taken as a woman (regardless of whether the phenotypic sex is male, e.g. Klinefelter syndrome, XXY).
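The sex assignment protocol of Figure 3 can be sketched as follows. This is an illustrative reading of the steps, not the implementation of the invention. The function name `assign_sex`, the `max_iter` safety guard and the string labels are assumptions, and all ratios are assumed to be strictly positive. The outlier rule follows the quotient criterion given in the outlier elimination step (307).

```python
from itertools import combinations
from statistics import mean


def assign_sex(ratios, max_iter=100):
    """ratios maps sample id -> Cr (X coverage / autosomal coverage), all > 0."""
    values = sorted(ratios.values())
    for low, high in combinations(values, 2):   # take an unexplored pair
        if low == high:
            continue                            # P only holds different ratios
        r_h, r_m = low, high                    # men: lower value, women: higher
        for _ in range(max_iter):               # safety guard (assumption)
            # Split by proximity; ties go to the women set.
            men = [v for v in values if abs(v - r_h) < abs(v - r_m)]
            women = [v for v in values if abs(v - r_h) >= abs(v - r_m)]
            if not men or not women:
                break
            # Remove atypical values (quotient greater than 3).
            men_ok = [v for v in men if min(women) / v <= 3]
            women_ok = [v for v in women if v / max(men) <= 3]
            if not men_ok or not women_ok:
                break                           # try the next pair
            prev = (r_h, r_m)
            r_h, r_m = mean(men_ok), mean(women_ok)
            if (r_h, r_m) == prev:              # representatives are stable
                if 1.8 <= r_m / r_h <= 2.2:
                    return {s: "male" if abs(v - r_h) < abs(v - r_m) else "female"
                            for s, v in ratios.items()}
                break                           # quotient out of range: next pair
    return {s: "unknown" for s in ratios}
```

If every pair is exhausted without reaching a stable split whose quotient lies in [1.8, 2.2], all samples receive unknown sex, as in step (308).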
[0251]
[0252] Figure 4 shows the process of assigning sets of control samples to each study sample, performed by iterative clustering on the analysis samples until a stop condition is reached that determines the final groups. The procedure begins (401) by taking (402) the normalized regions of interest (ROI), ordered by chromosome and initial position. A first correlation matrix (MRC) is calculated (403) for the set of samples analyzed based on their coverage profile, and a second correlation matrix (MRT) is calculated (404) for the same set of samples based on the size distribution of the aligned fragments associated with each sample. For each sample under study (ME), the samples are divided into clusters (405) based on the pair of coverage and fragmentation correlation values, $R = \{ R_i / R_i = (RC_{i-E}, RT_{i-E}) \}$, where R is the set of pairs of values such that each pair represents the correlation, according to the MRC and MRT matrices, of the sample i with the study sample (E). Note that the pair for the sample under analysis with itself is (1,1). By performing the iterative clustering process described below, a set of control samples is selected (406), which we will call M.set. When the process has been carried out for all the samples under study, it is completed (407).
[0253]
[0254] Figure 5 shows in more detail the step of calculating (403) the correlation matrix based on the coverage profile. The procedure begins (501) by loading (502) the normalized interest regions, the number of desired coverage sampling sites to calculate the correlations based on the coverage profile and the study samples. Next, the size in base pairs of the region of interest (PBR) is calculated (503) as:
[0255]
[0256] $PBR = \sum_{i=1}^{n} \left( pfin(R_i) - pini(R_i) + 1 \right)$
[0257]
[0258] where:
[0259]
[0260] PBR is the size in base pairs of the region of interest.
[0261]
[0262] pini (r) and pfin (r) are functions that given a region r belonging to the set of regions of interest ROI respectively return the start position of said region and the final position.
[0263]
[0264] A sampling interval (IM) is also calculated, equal to the integer part of the quotient between PBR and the configuration parameter that indicates the number of desired sampling sites (Cfg.MUESTREO) used to correlate the samples based on coverage: $IM = floor(PBR / Cfg.MUESTREO)$, where the function floor returns the integer part of a number. If the quotient is less than one, the sampling interval is taken as 1. With this information, the list of chromosomal positions is constructed for which the coverage is obtained in each sample and on which the correlation matrix is calculated. For each of the intervals Ri contained in ROI, the number of sites to be sampled (np) is calculated (505) as the integer part of the quotient between the size of the interval Ri in base pairs and the sampling interval: $np = floor((pfin(R_i) - pini(R_i) + 1) / IM)$. If np is 0 for Ri, the intermediate chromosomal position of the interval Ri is calculated (506) as $p = pini(R_i) + floor((pfin(R_i) - pini(R_i) + 1) / 2)$, and the positions in the interval $[p - 3, p + 3]$ for the chromosome linked to Ri are added to the list of sampling positions. If np is greater than 0 for Ri, the positions in the interval $[pini(R_i) + (np \cdot i) - 3, \; pini(R_i) + (np \cdot i) + 3]$ are added (507) to the list, varying i in steps of 1 from 0 to np, for the chromosome linked to Ri. Once the list of chromosomal positions to be sampled is obtained, the coverage in these positions is obtained (508) for each of the study samples.
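As a worked illustration of the two quantities defined in prose above (the sampling interval and the number of sites per interval), the following sketch computes PBR, IM and np for a list of intervals. The function name `sampling_plan` and the argument `n_sites`, which stands in for Cfg.MUESTREO, are hypothetical. The sketch stops at the counts and does not enumerate the sampled positions.

```python
def sampling_plan(intervals, n_sites):
    """intervals: list of (start, end) pairs, both ends inclusive."""
    # Size in base pairs of the whole region of interest (PBR).
    pbr = sum(end - start + 1 for start, end in intervals)
    # Sampling interval: integer part of PBR / n_sites, never below 1.
    im = max(1, pbr // n_sites)
    # Number of sites to sample in each interval (np).
    np_per_interval = [(end - start + 1) // im for start, end in intervals]
    return pbr, im, np_per_interval
```

For two intervals of 1000 bp and 50 bp with 10 desired sites, PBR is 1050 and IM is 105, so the first interval gets 9 sites and the second gets 0; the second is therefore sampled only around its intermediate position.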
[0265]
[0266] For each of the samples m, the set of variation rates at each of the sampling positions is calculated (509) as $D = \{ D_{m,p} / p \in SM \}$, where $D_{m,p}$ is the variation rate of $NC_{m,p}$ with respect to $NC_{M,p}$,
[0267] $NC_{m,p}$ is the coverage obtained for the sample m in the position p divided by the total coverage for said sample in the autosomal chromosomes, and $NC_{M,p}$ is the set of $NC_{m,p}$ values for each sample m of the total set of samples (M).
[0268] Considering $\delta = med(D)$, where med represents the median, and $\varepsilon = iqr(D)$, where iqr represents the interquartile range, the correlation of each sample with the others is calculated using the coverages at those positions in SM such that $D_{m,p} < \delta + \varepsilon$, thus forming the correlation matrix.
[0269]
[0270] For the calculation of the correlation matrix based on the fragmentation profile of the samples, the alignment files are processed for each sample, registering for each of them, the number of aligned fragments of a certain size. This size corresponds to the data contained in the alignment files themselves. Once the frequencies are obtained by fragment size, for each sample the correlation matrix of said frequencies is calculated for the analysis samples.
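The fragmentation-based correlation described above can be sketched as follows. The sketch assumes that the fragment sizes have already been extracted from each alignment file into a plain list per sample; the functions `pearson` and `fragment_correlation` are hypothetical names. Pearson correlation is used because it is the coefficient named for the preferred implementation, and no sample is assumed to have a constant frequency profile.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    # Plain Pearson correlation coefficient of two equally long sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def fragment_correlation(fragment_sizes):
    """fragment_sizes: dict sample -> list of aligned fragment sizes."""
    counts = {s: Counter(sizes) for s, sizes in fragment_sizes.items()}
    # Common axis: every fragment size seen in any sample.
    all_sizes = sorted(set().union(*counts.values()))
    # Relative frequency of each fragment size per sample.
    freq = {s: [c[size] / sum(c.values()) for size in all_sizes]
            for s, c in counts.items()}
    samples = list(freq)
    return {(a, b): pearson(freq[a], freq[b]) for a in samples for b in samples}
```

The result plays the role of the MRT matrix: entry (a, b) is the correlation between the fragment size distributions of samples a and b.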
[0271]
[0272] Figure 6 shows in detail the procedure for clustering the samples and assigning control samples to each study sample from the correlation data based on coverage and fragmentation. The process starts (601) by taking (602) the minimum number of control samples for a given sample (Cfg.SMIN), $M = \{M_1 \dots M_n\}$ as the samples analyzed, $ME = M_i$ as the study sample, and $R = \{ R_i / R_i = (RC_{i-E}, RT_{i-E}) \}$ as the pairs of correlations (RC, RT) of each sample with the study sample. Next, the Euclidean distance of each pair Ri with respect to the pair corresponding to the study sample is calculated (603). For the sample i, the Euclidean distance is then $\sqrt{(RC_{i-E} - 1)^2 + (RT_{i-E} - 1)^2}$.
[0273] where:
[0274]
[0275] RCi-E is the correlation coefficient in terms of the coverage profile, according to the corresponding sampling points, between the sample i and the study sample (E). In the preferred implementation, the correlation coefficient is the Pearson correlation coefficient.
[0276]
[0277] RTi-E is the correlation coefficient in terms of the fragmentation profile between the sample i and the study sample (E). In the preferred implementation, the correlation coefficient is the Pearson correlation coefficient.
[0278]
[0279] Beginning with two clusters (604), a clustering algorithm, such as k-means, is applied to group (605) the samples based on the fragmentation and coverage correlation data. The centroids are initialized to the correlation values corresponding to the k pairs with the lowest Euclidean distance with respect to (1,1), without including this pair. The same strategy is applied, increasing the number of clusters by one each time, until the pair (1,1) corresponding to the study sample is the only member of the cluster to which it has been assigned when making the partition into k clusters. k-1 is then taken as the ideal number of clusters. If, in the iteration with the ideal number of clusters, the cluster assigned to the study sample has a number of members greater than the minimum number of control samples configured in the subsystem (by default, two samples), the samples whose correlation pairs have been assigned to the same cluster as the study sample are recorded as control samples for the study sample (without including the study sample itself as a control). If the number of members in iteration k-1 does not exceed the configured minimum, an exception is reported (606) that alerts to this situation, and the iteration with the number of clusters closest to k-1 whose cluster for the study sample contains more members than required is retrieved. The rest of the samples from this cluster are taken (607) as controls for the study sample and the process finishes (608). Note that other techniques for clustering the samples based on their coverage and fragmentation profiles can be used alternatively.
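A single partition of the procedure of Figure 6, for one fixed value of k, can be sketched as follows. This is a simplified illustration: the outer loop that increases k until the study sample is alone, and the fallback handled in step (606), are deliberately left out. The functions `kmeans` and `controls_for_k` are hypothetical names, and the k-means shown is a plain textbook variant, chosen only because k-means is the example algorithm named in the text.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2D points; returns the cluster label of each point."""
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        new = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new.append(tuple(sum(v) / len(members) for v in zip(*members))
                       if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return labels


def controls_for_k(pairs, study, k):
    """pairs: dict sample -> (RC, RT) with respect to the study sample."""
    # Order the other samples by Euclidean distance to (1, 1).
    others = sorted((s for s in pairs if s != study),
                    key=lambda s: dist(pairs[s], (1.0, 1.0)))
    # Seed the centroids with the k closest pairs, excluding (1, 1) itself.
    seeds = [tuple(pairs[s]) for s in others[:k]]
    samples = [study] + others
    labels = kmeans([tuple(pairs[s]) for s in samples], seeds)
    # Controls: members of the study sample's cluster, excluding the study sample.
    return [s for s, lab in zip(samples, labels) if lab == labels[0] and s != study]
```

An empty list returned for some k would correspond to the stop condition in which the study sample is the only member of its cluster.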
[0280]
[0281] After the characterization and modeling of the samples, each executing branch, in parallel, sequentially carries out the detection and annotation process of CNVs for each one of the corresponding work blocks. This process consists of two phases:
[0282] a) Generation of reference models for the comparison of signals associated with the samples.
[0283] b) For each sample, study of the behavior of the associated coverage signal to detect deviations from the reference model; recording of the affected intervals, characterizing them and selecting among them the candidates to be affected by structural variants; assessment of said effect; and collection of the relevant information for the generation of results.
[0284]
[0285] Figure 7 shows the construction process of the reference model linked to each set of control samples Si for a set of regions of interest R1 ... Rn assigned to a work block. The process starts (701) loading (702) the following variables:
[0286]
[0287] ROI = {R1, R2, ..., Rr}: Regions of the iteration ordered by chromosome and initial position.
[0288]
[0289] M = {M1, M2, ..., Mm}: Set of study samples.
[0290]
[0291] B = {Bi / Bi = bam (Mi)}: Set of alignments for the samples, Bi is the alignment for the sample i.
[0292]
[0293] S = {S1, S2, ..., SS}: Set of different sets of control samples (reference sets) assigned during the corresponding procedure already described.
[0294]
[0295] RC, RNC: Raw coverage matrices and normalized coverage respectively (m x positions (Ri)) by region.
[0296]
[0297] RRF, RUP, RDW, RVAR: Matrices associated with the control structure or model (s x positions(Ri)), respectively: standard reference coverage for the model; upper limit to consider a coverage normal (not atypical) according to the model; lower limit to consider a coverage normal according to the model; and variability associated with the model. s is the number of control sets generated.
[0298]
[0299] r: active region, initialized to 1.
[0300]
[0301] R: number of regions
[0302]
[0303] The following steps are followed below:
[0304]
[0305] 1. For each sample m:
[0306] a. For each chromosomal position p in the interval corresponding to the first region R1, obtain (702) the coverage for each study sample by consulting the corresponding alignment (bam file) (Bm). Register these coverages, linked to sample, position and region, as raw coverage.
[0307]
[0308] $RC_r = C_{M \times pos(R_r)} \,/\, C_{m,p} = getCov(B_m, p)$
[0309]
[0310] where:
[0311]
[0312] getCov (Bm, p) is a function that returns the coverage in the position p according to the alignment Bm.
[0313]
[0314] RCr = CM x pos (Rr) is a matrix with as many rows as study samples and as many columns as positions have a region r where each value Cm, p is calculated as getCov (Bm, p).
[0315]
[0316] b. Normalize (703) the raw coverages. The standardized coverage for a position and study sample is calculated by dividing the raw coverage by the total coverage in the autosomal regions of interest for said sample and, additionally, by 2 when the region to which the position belongs is located on one of the sex chromosomes and the sex associated with the sample is female. Register this coverage, linked to sample, position and region, as standardized coverage.
[0317]
[0318]
[0319] $RNC_r = NC_{M \times pos(R_r)} \,/\, NC_{m,p} = \dfrac{C_{m,p}}{cobAutosomal(M_m) \cdot g}$
[0320]
[0321] $g = 2$ if $sex(M_m) = Woman \wedge crom(R_r) \in \{X, Y\}$
[0322]
[0323] $g = 1$ otherwise.
[0324]
[0325] where:
[0326] RNCr = NCM x pos(Rr) is a matrix with as many rows as there are study samples and as many columns as there are positions in region r, whose values are the corresponding standardized coverages.
[0327] Cm,p represents the raw coverage in the position p for the sample m.
[0328] cobAutosomal(m) is a function that returns the total coverage in the autosomal chromosomes within the region of interest for the sample m.
[0329]
[0330] sex(m) is a function that returns the sex associated with a sample m.
[0331]
[0332] crom(r) is a function that returns the chromosome associated with the region r.
[0333]
[0334] g is the correction factor for the differential number of copies of X in men and women; it takes the value 2 for regions on the X chromosome when the sex of the sample under calculation is female, and 1 otherwise.
[0335]
[0336] 2. For each reference set Si, generate (705) the data associated with its model:
[0337] a. Linked to position and region, the following is calculated and recorded:
[0338] i. The reference signal (standardized coverage) for the model. The reference signal for a given position is the median of the standardized coverages associated with said position (one for each study sample). However, other realizations could work with other statistical parameters such as average, truncated average, etc.
[0339]
[0340] $RRF_r = RF_{s \times pos(R_r)} \,/\, RF_{s,p} = med(\{ NC_{m,p} \,/\, M_m \in S_s \})$
[0341]
[0342] where:
[0343]
[0344] RRFr = RFs x pos(Rr) is a matrix with as many rows as there are reference sets and as many columns as there are positions in region r; for the row associated with the reference set s, the values of the columns correspond to the median of the standardized coverages of the samples belonging to that set at each of the positions of region r.
[0345]
[0346] med (X) represents the median of the values of the set X.
[0347]
[0348] ii. Upper limit of variation typical of the reference signal, calculated as the value that determines the first quartile of the values associated with the standardized coverage for a given position. However, other realizations could work with other statistical parameters such as standard deviation, variance, etc.
[0349]
[0350] $RUP_r = UP_{s \times pos(R_r)} \,/\, UP_{s,p} = Q1(\{ NC_{m,p} \,/\, M_m \in S_s \})$
[0351]
[0352] where:
[0353]
[0354] RUPr = UPs x pos(Rr) is a matrix with as many rows as there are reference sets and as many columns as there are positions in region r; for the row associated with the reference set s, the values of the columns correspond to the first quartile of the standardized coverages of the samples belonging to that set at each of the positions of region r.
[0355]
[0356] Q1 (X) represents the first quartile of the values of the set X.
[0357]
[0358] iii. Lower limit of variation typical of the reference signal, calculated as the value that determines the third quartile of the values associated with the standardized coverage for a given position.
[0359]
[0360] $RDW_r = DW_{s \times pos(R_r)} \,/\, DW_{s,p} = Q3(\{ NC_{m,p} \,/\, M_m \in S_s \})$
[0361] where:
[0362]
[0363] RDWr = DWs x pos (Rr) is a matrix with as many rows as reference sets and as many columns as positions have a region r, for a row associated with reference set s, the values of the columns correspond to the third quartile of the standardized coverages of the samples belonging to each of the positions of the region r.
[0364]
[0365] Q3 (X) represents the third quartile of the values of the set X.
[0366]
[0367] iv. Typical variation of the reference signal for a position calculated as the interquartile range of the values associated with the standardized coverage for said position.
[0368]
[0369] $RVR_r = VR_{s \times pos(R_r)} \,/\, VR_{s,p} = UP_{s,p} - DW_{s,p}$
[0370]
[0371] where:
[0372]
[0373] RVRr = VRs x pos (Rr) is a matrix with as many rows as reference sets and as many columns as positions have a region r, for a row associated with reference set s, the values of the columns correspond to the interquartile range of the standardized covers of the samples belonging to each of the positions of the region r.
[0374]
[0375] v. If the position belongs to the work positions, work points (also referred to as working points) linked to the model are registered (706), i.e. the signals in said region meet the criteria necessary for their measurement to be considered satisfactory for the detection procedure.
[0376] 3. Repeat the previous steps for each of the regions included in the work block, recording the data of each model for each position of each region. When these regions are finished (r = R), the process ends.
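The normalization and the per-position model statistics of Figure 7 can be sketched as follows. The functions `normalize` and `reference_model` are hypothetical names, and the inclusive quartile method is an assumption, since the text does not say how quartiles are computed. The sketch returns the quartiles under neutral names and computes the interquartile range as Q3 minus Q1. At least two control samples are assumed.

```python
from statistics import median, quantiles


def normalize(raw, autosomal_total, sex, chrom):
    """Standardized coverage for one sample over the positions of one region."""
    # Correction factor: 2 for female samples on a sex chromosome, 1 otherwise.
    g = 2 if sex == "female" and chrom in ("X", "Y") else 1
    return [c / (autosomal_total * g) for c in raw]


def reference_model(normalized_controls):
    """normalized_controls: one equally long list per control sample."""
    model = []
    # zip(*...) walks the positions, yielding the control values at each one.
    for values in zip(*normalized_controls):
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        model.append({"ref": median(values), "q1": q1, "q3": q3, "iqr": q3 - q1})
    return model
```

Each dictionary in the returned list corresponds to one position of the region: the reference signal, the two quartiles used as limits of typical variation, and the interquartile range used as typical variation.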
[0377]
[0378] Figure 8 presents the process to establish whether a position p of a region of interest R belongs to the work points for the model associated with a reference set Si. The determination of work points supposes a strategy of selection of positions of analysis of the signals, to control the effect of the experimental variability and certain artifacts that are produced by the design and the technology used. A position is considered a work point if it meets certain conditions, some of which involve the study of its local chromosomal context. The range of positions included in the local chromosome context is determined by a prefixed window of positions (we will call this variable Cfg.WPwindow). The process starts (801) loading (802) the following variables:
[0379]
[0380] M = {M1, M2, ..., Mm}: Study samples.
[0381]
[0382] $MS = \{ M_i \,/\, M_i \in S \}$: controls for the model S.
[0383]
[0384] $R = [p_1 \dots p_n]$, S: Region covering a succession of positions p1 ... pn, and control samples for the model for which the work points are calculated.
[0385]
[0386] RF, VR: vectors of standard reference coverage and of variation for the region R and the model S.
[0387]
[0388] WP: Work points associated with region R and model S.
[0389]
[0390] WP is initialized to the empty set and p is initialized to 1.
[0391]
[0392] Then the following steps are executed:
[0393]
[0394] 1. For a model, the average of the coverage correction factors (fc) used for the normalization of the control samples is calculated (803):
[0395]
[0396]
[0397] $$fc = \frac{1}{|MS|} \sum_{i=1}^{|MS|} \varphi_i \cdot \mathrm{cobAutosomal}(MS_i)$$
[0398]
[0399] $\varphi_i = 2$ if $\mathrm{sex}(MS_i) = \text{Woman} \wedge \mathrm{chrom}(R_r) \in \{X, Y\}$
[0400]
[0401] $\varphi_i = 1$ otherwise.
[0402] where:
[0403]
[0404] MS is the set of samples that make up the reference set Si. MSi represents the sample i included in said set and |MS| is its cardinality (the number of samples included in it).
[0405] cobAutosomal(m) is a function that returns the total coverage in the autosomal chromosomes and the region of interest for the sample m.
[0406]
[0407] sex(m) is a function that returns the sex associated with a sample m.
[0408]
[0409] chrom(r) is a function that returns the chromosome associated with the region r.
[0410]
[0411] φ is a correction factor for the differential number of copies of X in men and women: it takes the value 2 for regions on the X chromosome when the gender of the sample under calculation is female, and 1 otherwise.
[0412]
[0413] 2. If the size in base pairs of the region under study is equal to or less than that determined by the configuration parameter Cfg.WPwindow, every position in the region is considered local to every other position in it:
[0414] a. Obtain (804) the variation rates of the signal associated with the model in the region, considering only the values corresponding to "well-measured" positions. "Well-measured" positions are those that belong to the original intervals of interest recorded for the region R and for which the standardized reference coverage for the model multiplied by fc (the average correction factor) is equal to or greater than the coverage limit established by the configuration parameter Cfg.WPmincover (default 50). The variation rate for a position is the quotient between the variation and the reference signal associated with the model for said position.
[0415]
[0416] $$\mathrm{VariationRates}_{ok} = \left\{ \frac{VR_i}{RF_i} \;/\; (RF_i \cdot fc \geq cfg.WPmcb) \wedge \left(p_i \in \bigcup_j R.icore_j\right) \right\}$$
[0417] where:
[0418]
[0419] R.icore_j are the regions originally selected as targets for the analysis, from which the method has generated the study region. ∪ R.icore_j represents the set (union) of all these regions.
[0420]
[0421] pi represents a given position "i" included in the region under study.
[0422] RFi is the reference value for the model at position "i".
[0423]
[0424] VRi is the variation associated with the model at position "i".
[0425] VariationRates_ok is the set of quotients between the variation and the reference value (that is, the variation rates) for the model, calculated for the positions "i" that meet the aforementioned conditions.
[0426]
[0427] b. Register (805) as work points those "well-measured" positions in the region whose variation rate is lower than the average calculated over the set VariationRates_ok defined above plus the standard deviation of that set multiplied by a factor (established with Cfg.WPstringence, by default 2), provided there is more than one "well-measured" position in the region; otherwise no position is considered a work point.
[0428]
[0429] $$p \in WP \ \text{ if } \ TV_p < \overline{TV_{ok}} + \sigma(TV_{ok}) \cdot cfg.WPstringence$$
[0430]
[0431] where:
[0432]
[0433] TVp represents the variation rate associated with a reference model for the position p.
[0434]
[0435] TVok is an abbreviation of the set VariationRates_ok defined above. $\overline{TV_{ok}}$ represents the average of the set of "ok" variation rates calculated for the region under study, and $\sigma(TV_{ok})$ represents the standard deviation of that set.
[0436] WP is the set of work points for the region and model under study.
[0437]
[0438] 3. If the size of the region is greater than Cfg.WPwindow:
[0439] a. Calculate (806) the initial variation rate limit, limizq, as the average of the variation rates linked to the "well-measured" positions among the first Cfg.WPwindow positions of the region plus their standard deviation multiplied by the coefficient established in Cfg.WPstringence. If there are not two or more "well-measured" positions among those first positions, limizq is taken as 0.
[0440]
[0441] $$lim_{izq} = \overline{TV_{ok\_izq}} + \sigma(TV_{ok\_izq}) \cdot cfg.WPstringence, \qquad lim_{izq} = 0 \ \text{ if } \ |TV_{ok\_izq}| < 2$$
[0442]
[0443] where:
[0444]
[0445] TVok_izq represents the set of variation rates calculated considering only the first cfg.WPwindow positions of the region under study. $\overline{TV_{ok\_izq}}$ represents the average of said rates and $\sigma(TV_{ok\_izq})$ their standard deviation.
[0446]
[0447] limizq is the limit value for the variation rate of a position located among the first cfg.WPwindow positions of the region under study to be considered a work point (WP, working point).
[0448]
[0449] b. Calculate the final variation rate limit, limder, as the average of the variation rates linked to the "well-measured" positions among the last Cfg.WPwindow positions of the region plus their standard deviation multiplied by the coefficient established in Cfg.WPstringence. If there are not two or more "well-measured" positions among those last positions, limder is taken as 0.
[0450]
[0451] $$lim_{der} = \overline{TV_{ok\_der}} + \sigma(TV_{ok\_der}) \cdot cfg.WPstringence, \qquad lim_{der} = 0 \ \text{ if } \ |TV_{ok\_der}| < 2$$ where:
[0452]
[0453] TVok_der represents the set of variation rates calculated considering only the last cfg.WPwindow positions of the region under study. $\overline{TV_{ok\_der}}$ represents the average of said rates and $\sigma(TV_{ok\_der})$ their standard deviation.
[0454]
[0455] limder is the limit value for the variation rate of a position located among the last cfg.WPwindow positions of the region under study to be considered a work point (WP, working point).
[0456]
[0457] c. Take as WP those "well-measured" positions among the first ceil(Cfg.WPwindow / 2) positions of the region whose variation rate is less than limizq, where the function ceil(x) returns the smallest integer equal to or greater than x.
[0458] d. Take as WP those "well-measured" positions among the last ceil (Cfg.WPwindow / 2) positions of the region whose variation rate is less than limder.
[0459] e. For each intermediate position p of the region [included neither among the first ceil(Cfg.WPwindow / 2) nor among the last ceil(Cfg.WPwindow / 2) positions of the region]:
[0460] i. Calculate the limit variation rate for the position p, limp.
[0461] limp is calculated from the "well-measured" positions in the interval [p - Cfg.WPwindow / 2, p + Cfg.WPwindow / 2]: when this set contains two or more positions, limp is the average of the variation rates linked to said positions plus their standard deviation multiplied by Cfg.WPstringence; otherwise it is 0.
[0462] ii. Incorporate (807) the position p as a "work point" when its variation rate is less than limp.
[0463]
[0464] $$lim_p = \overline{TV_{ok\_p}} + \sigma(TV_{ok\_p}) \cdot cfg.WPstringence, \qquad lim_p = 0 \ \text{ if } \ |TV_{ok\_p}| < 2$$
[0465] where:
[0466]
[0467] TVok_p represents the set of variation rates calculated considering only the positions in the range [p - Cfg.WPwindow / 2, p + Cfg.WPwindow / 2] of the region under study. $\overline{TV_{ok\_p}}$ represents the average of said rates and $\sigma(TV_{ok\_p})$ their standard deviation.
[0468]
[0469] limp is the limit value for the variation rate of the position p, calculated over the range [p - Cfg.WPwindow / 2, p + Cfg.WPwindow / 2] of the region under study, for p to be considered a work point (WP, working point).
[0470]
[0471] The process ends (808) when verifying p = n, where n is the number of positions in the region under study.
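The work point selection of Figure 8 can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names, the 0-based indexing, the use of the sample standard deviation, and the integer half-window `window // 2` for intermediate positions are assumptions made for illustration, and the region size is taken to be its number of positions.

```python
from math import ceil
from statistics import mean, stdev


def rate_limit(rates, stringence=2.0):
    """Mean plus stringence times the standard deviation; 0 with fewer than two rates."""
    if len(rates) < 2:
        return 0.0
    return mean(rates) + stringence * stdev(rates)  # sample deviation: an assumption


def work_points(rf, vr, fc, in_core, window, stringence=2.0, min_cover=50.0):
    """Return the 0-based indices of the positions accepted as work points.

    rf, vr: reference coverage and variation per position (hypothetical inputs);
    fc: average correction factor; in_core: True where the position lies in an
    original interval of interest.
    """
    n = len(rf)
    # "Well-measured": inside a target interval and with enough absolute coverage.
    ok = [in_core[i] and rf[i] * fc >= min_cover for i in range(n)]
    rate = [vr[i] / rf[i] if ok[i] else None for i in range(n)]

    def limit(lo, hi):
        idx = range(max(lo, 0), min(hi, n))
        return rate_limit([rate[i] for i in idx if ok[i]], stringence)

    half = ceil(window / 2)
    selected = []
    for p in range(n):
        if not ok[p]:
            continue
        if n <= window:                    # small region: one limit for all positions
            lim = limit(0, n)
        elif p < half:                     # limizq, from the first `window` positions
            lim = limit(0, window)
        elif p >= n - half:                # limder, from the last `window` positions
            lim = limit(n - window, n)
        else:                              # limp, from the window centred on p
            lim = limit(p - window // 2, p + window // 2 + 1)
        if rate[p] < lim:
            selected.append(p)
    return selected
```

Because the comparison is strict and the limit falls to 0 when fewer than two "well-measured" positions are available, an isolated position is never accepted as a work point in this sketch.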
[0472]
[0473] Figure 9 presents the search algorithm: the study, for each sample, of the behavior of its coverage signal against the reference model that corresponds to it, once the coverage for the study samples in the regions belonging to a work block has been obtained and the models corresponding to the different control sets have been calculated. The study protocol reviews the signals of the samples along the valid positions ("work points") in the regions of interest to detect candidate CNVs. It registers the chromosomal ranges whose signals are altered with respect to the reference signal for the corresponding model, characterizes said ranges and finally merges (907) those considered to be part of a larger altered area that includes them.
[0474]
[0475] The process begins (901) by initializing (902) the following variables:
[0476]
[0477] ROI = {R1, R2, ..., Rn}: Regions of the iteration ordered by chromosome and initial position.
[0478]
[0479] extensions := Cfg.maxExten: Variable that indicates the number of extensions available, used to control the flow of the algorithm; it is initially set according to a configuration parameter. An extension means treating a position whose signal is not altered with respect to the corresponding model as part of a candidate structural variant event.
[0480]
[0481] outlierActivo := NO: Logical variable that indicates, during the search, whether the position being examined, if it is altered, continues a previously detected alteration or is the first of a new block.
[0482] wpexten := 0; wpTam := 0: Counters; the first indicates the number of extensions that have been used and the second the number of work points covered.
[0483] Then, the process is carried out in the following way:
[0484]
[0485] 1. Keep a counter of the unaltered positions that may still be ignored (extensions), initialized to the maximum value configured in Cfg.maxExtensions, and a situation record (outlierActive) that is initially in the "inactive outlier" state. Keep also two work point counters, extended (wpexten) and covered (wpTam), both initialized to 0.
[0486] 2. Move to the first position of the first region assigned to the work block (regions are ordered by chromosome and chromosomal coordinate).
[0487] 3. If the situation record is active (outlierActive = YES) and the group to which the current region belongs differs from the group assigned to said outlier, close (903) the outlier record (set the state to "inactive outlier"), assigning the last significant position as the final position and the region that contains it as the final region. Then, if the number of work points covered (wptam-wpext) is equal to or greater than Cfg.minTamano, process (904) said outlier (the procedure for processing an outlier is described below).
[0488] 4. Verify whether the current position is valid (a work point). If it is not, advance the current position and the current region to the next position of the regions to be processed in the work block and return to step 3.
[0489] 5. Increase the value of wpTam by one unit (the current position is a WP).
[0490] 6. Calculate (905) the distance between the sample signal and the reference in the model for the position. The distance is the absolute value of the difference between the standardized coverage of the sample and the reference for its control model, divided by the variation of the model. When the standardized coverage of the sample is greater than or equal to the reference for the model, the sign of the distance is positive; otherwise it is negative.
[0491]
[0492] $$d = \frac{|NC_{m,p} - RF_{s,p}|}{VR_{s,p}}$$
[0493] $$sd = \mathrm{signodistance}(p) = \begin{cases} + & \text{if } NC_{m,p} \geq RF_{s,p} \\ - & \text{otherwise} \end{cases}$$
[0494]
[0495] where:
[0496]
[0497] NCmp is the normalized coverage at the position p for the sample m.
[0498] RFsp is the reference value for the position p according to the control set s (assigned to sample m).
[0499]
[0500] VRsp is the variation value for the position p according to the control set s (assigned to sample m).
[0501]
[0502] 7. If the current situation is "active outlier":
[0503] 7.1. If the sign associated with the active outlier differs from the sign associated with the current position (sd), close the outlier record (set the state to "inactive outlier"), assigning the last significant position as the final position and the region that contains it as the final region. Then, if the number of work points covered (wptam-wpext) is equal to or greater than Cfg.minTamano, process said outlier, and continue in step 9.
[0504] 7.2. If the sample-reference distance (d) is less than the limit set by the configuration parameter Cfg.distmin (default 1.5): when the value of the extensions counter is greater than zero, decrease it by one unit, increase the counter of extended work points and continue in step 9; otherwise close the outlier record (set the state to "inactive outlier"), assigning the last significant position as the final position and the region that contains it as the final region. Then, if the number of work points covered (wptam-wpext) is equal to or greater than Cfg.minTamano, process said outlier, and continue in step 9.
[0505] 8. If the current situation is "inactive outlier" and the sample-reference distance (d) is equal to or exceeds the established limit (Cfg.distmin), open (906) a new outlier record (set the state to "active outlier"), associating with it as initial region, initial position, chromosome, group and signodistance those linked to the current position, and set the extensions counter to Cfg.maxExtensions.
[0506] 9. If positions remain to be explored in the regions assigned to the work block, take the one following the current position, in the next region if necessary, and continue in step 3. If no valid positions remain to be explored and the current situation is "active outlier", close the outlier record (set the state to "inactive outlier"), assigning the last significant position as the final position and the region that contains it as the final region. Then, if the number of work points covered (wptam-wpext) is equal to or greater than Cfg.minTamano, process said outlier.
[0507] 10. Merge (907) those outliers that are part of a larger area that would be altered and include them, and finalize (908) the protocol.
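The search loop of Figure 9 can be sketched in Python as a small state machine. This is a simplified illustration under stated assumptions, not the protocol itself: the input is assumed to contain only work points, already ordered; the counters are kept per outlier record rather than globally; the final merge of step 10 is omitted; and the function and field names are hypothetical.

```python
def scan_outliers(points, dist_min=1.5, max_extensions=2, min_size=3):
    """Scan work points and return (start, end, sign) for each retained outlier.

    points: iterable of (position, group, nc, rf, vr) tuples, where nc is the
    normalized coverage of the sample, rf the reference and vr the variation.
    """
    found = []
    active = None  # record of the outlier currently open, if any

    def close():
        nonlocal active
        # Keep the outlier only if it covers enough non-extended work points.
        if active["covered"] - active["extended"] >= min_size:
            found.append((active["start"], active["end"], active["sign"]))
        active = None

    for pos, group, nc, rf, vr in points:
        if active and active["group"] != group:       # step 3: group change
            close()
        d = abs(nc - rf) / vr                         # step 6: distance
        sign = "+" if nc >= rf else "-"
        if active is None:
            if d >= dist_min:                         # step 8: open a new record
                active = {"start": pos, "end": pos, "group": group, "sign": sign,
                          "covered": 1, "extended": 0,
                          "extensions": max_extensions}
            continue
        if active["sign"] != sign:                    # step 7.1: sign change
            close()
        elif d >= dist_min:                           # altered: the outlier grows
            active["end"] = pos
            active["covered"] += 1
        elif active["extensions"] > 0:                # step 7.2: spend an extension
            active["extensions"] -= 1
            active["extended"] += 1
            active["covered"] += 1
        else:                                         # no extensions left
            close()
    if active:                                        # step 9: end of the block
        close()
    return found
```

For instance, a run of altered positions interrupted by a single unaltered one is reported as one range when at least one extension is available, and the unaltered position does not count toward the minimum size.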
[0508]
[0509] Figure 10 shows the processing protocol for an outlier range, whose objective is to reveal the presence of a CNV, to assign a confidence value that the chromosomal range with altered signals corresponds to a range affected by a CNV, and to retrieve and record contextual information to report on the candidate CNV. The process starts (1001) by adjusting (1002) the limits of the outlier region. If the range still exceeds the length established by a threshold that we will call Cfg.minTamano, it is characterized (1003); otherwise it is dismissed as a CNV. The characterization is based on measuring certain parameters of the model and the sample, as well as of the chromosomal region where the outlier is located. Once the outlier range has been characterized, it is determined (1004) whether it is a CNV candidate; if not, it is rejected. The candidate ranges for CNVs are annotated (1005) with information on the experimental and chromosomal context (for example: affected genetic zones, chromosomal regions with special characteristics, variants in said region, the number of times the altered region has been seen, database information on said region). Once all the data have been obtained, the signals of the affected and neighboring area are encoded and exported (1006) so that they can be retrieved during the creation of the final report. The last step of the process assigns (1007) a degree of confidence that the candidate CNV really is a structural variant, after which the process ends (1008).
[0510]
[0511] Figure 11 shows the adjustment of the limits of the outlier, which begins (1101) with the estimation (1102) of the sample-model ratio characteristic of the outlier, R. R is calculated as the median of the set of ratios between the normalized coverage of the sample and the reference for the corresponding model, considering the positions that are work points within the outlier interval initially detected.
[0512]
[0513]
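[0514] $$R = md\left(\left\{ \frac{NC_{m,p}}{RF_{s,p}} \;/\; p \in WP \wedge O_{ini} \leq p \leq O_{fin} \right\}\right)$$
[0515]
[0516] where:
[0517]
[0518] NCmp is the standardized coverage at position p for sample m.
[0519]
[0520] RFsp is the reference value for the position p according to control set s (assigned to sample m).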
[0521]
[0522] WP is the set of work points for the sample under study and p a position included in said set.
[0523]
[0524] Oini and Ofin are respectively the start and end position associated with the outlier.
[0525]
[0526] If R is greater than 1, the outlier corresponds to a signal gain. In what follows, the ratio at a position p is NCm,p / RFs,p.
[0527] When the ratio associated with the initial position exceeds (1 + R) / 2, the starting position is recalculated by decreasing it one position at a time until the starting position of the region in which the initial outlier begins is reached or the ratio for the new position does not exceed (1 + R) / 2. When the ratio associated with the initial position does not exceed (1 + R) / 2, the starting position is recalculated by increasing it one position at a time while the ratio for the new position is lower than (1 + R) / 2 and the position is lower than the last one belonging to the initial region of the outlier.
[0528] For the recalculation of the final position, proceed in reverse. When the ratio associated with the final position is greater than (1 + R) / 2, the end position is recalculated by increasing it one position at a time until the end position of the region in which the initial outlier ends is reached or the ratio for the new position does not exceed (1 + R) / 2. When the ratio associated with the final position does not exceed (1 + R) / 2, the end position is recalculated by decreasing it one position at a time while the ratio for the new position is lower than (1 + R) / 2 and the limit of the last region affected by the initial outlier has not been reached.
[0529]
[0530] When R does not exceed 1, the outlier corresponds to a signal loss. When the ratio associated with the initial position is less than (1 + R) / 2, the starting position is recalculated by decreasing it one position at a time until the starting position of the region in which the initial outlier starts is reached or the ratio for the new position is not lower than (1 + R) / 2. When the ratio associated with the initial position is equal to or exceeds (1 + R) / 2, the starting position is recalculated by increasing it one position at a time while the ratio for the new position exceeds (1 + R) / 2 and the final position of the initial region assigned to the outlier has not been reached.
[0531] For the recalculation of the final position, proceed in reverse. When the ratio associated with the final position is less than (1 + R) / 2, the end position is recalculated by increasing it one position at a time until the end position of the region in which the initial outlier ends is reached or the ratio for the new position is not less than (1 + R) / 2. When the ratio associated with the final position is not less than (1 + R) / 2, the end position is recalculated by decreasing it one position at a time while the ratio for the new position is greater than (1 + R) / 2 and the limit of the last region affected by the outlier has not been reached. After the update (1103) of Oini and Ofin, the process ends (1104).
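The limit adjustment of Figure 11 can be sketched as follows. This Python fragment is only an illustration under simplifying assumptions: the outlier lies within a single region whose positions are all work points, `ratio` holds NC/RF for every position of that region, and each limit stops on the last position that is still on the altered side of (1 + R) / 2. The function name and the 0-based indices are hypothetical.

```python
from statistics import median


def adjust_limits(ratio, start, end):
    """Return (start, end, R) after moving both limits against (1 + R) / 2."""
    r = median(ratio[start:end + 1])      # characteristic ratio R of the outlier
    threshold = (1 + r) / 2
    gain = r > 1                          # R > 1: signal gain, otherwise signal loss

    def altered(i):
        return ratio[i] > threshold if gain else ratio[i] < threshold

    if altered(start):                    # extend outwards over altered positions
        while start > 0 and altered(start - 1):
            start -= 1
    else:                                 # shrink inwards over unaltered positions
        while start < end and not altered(start):
            start += 1
    if altered(end):
        while end < len(ratio) - 1 and altered(end + 1):
            end += 1
    else:
        while end > start and not altered(end):
            end -= 1
    return start, end, r
```

With ratios `[1.0, 1.4, 1.5, 1.5, 1.5, 1.1, 1.0]` and an initial outlier at indices 2 to 4, R is 1.5, the threshold is 1.25, and the start moves out to index 1 while the end stays at index 4.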
[0532] Once the adjustment of the limits of the outlier is complete, if the number of base pairs that it covers is lower than the value defined by Cfg.minTamano, the interval is discarded as a CNV and the search for other affected intervals continues, where applicable, as described above. When the minimum size is satisfied, the characterization of the outlier begins, as shown in Figure 12.
[0533]
[0534] The process begins (1201) by loading (1202) the following variables:
[0535]
[0536] O, m, s: Outlier that is being characterized, sample and model linked.
[0537]
[0538] WP: Work points for s in the chromosomal range covered by O.
[0539]
[0540] NCm, p: Standardized coverage for the sample (m) in the p position.
[0541]
[0542] RFs p: Standard reference coverage for s in the p position.
[0543]
[0544] VRsp: Variation (interquartile range) associated in s with the position p.
[0545]
[0546] Then, the sex is maintained (1203) or corrected (1204):
[0547]
[0548] $\varphi = 2$ if $\mathrm{sex}(M_m) = \text{Woman} \wedge \mathrm{chrom}(R_r) \in \{X, Y\}$; $\varphi = 1$ otherwise.
[0549]
[0550] The characterization consists in registering the following parameters linked to the outlier O for a sample and its corresponding reference model s:
[0551] a) Coverage (1205) of the characteristic model for O, O.cob. This value is calculated as the product of the total autosomal coverage of the sample by the median of the standardized reference coverages for the model linked to each of the positions included as Work Points in the chromosomal range covered by O. The result of this product is further multiplied by 2 when the sex assigned to the sample m is female and the chromosomal range linked to the outlier is located on a sex chromosome.
[0552]
[0553] $$O.cob = \varphi \cdot \mathrm{cobAuto}(m) \cdot md(\{RF_{s,p} \;/\; p \in WP\})$$
[0554] where:
[0555]
[0556] φ is a correction factor for the differential number of copies of X in men and women: it takes the value 2 for regions on the X chromosome when the gender of the sample under calculation is female, and 1 otherwise.
[0557]
[0558] cobAuto (m) is a function that returns the total coverage in the autosomal chromosomes and the region of interest for the sample m.
[0559]
[0560] p is a position belonging to the set of work points.
[0561] md (X) is median for the set X.
[0562]
[0563] RFsp is the reference value for the position p according to the control set s (assigned to sample m).
[0564]
[0565] WP is the set of work points for the sample under study and p a position included in said set.
[0566]
[0567] b) Variation (1206) of the characteristic model for O, O.var. This value is calculated as the median of the standardized coverage variations for the model linked to each of the positions included as Work Points in the chromosomal range covered by O.
[0568]
[0569] $$O.var = md(\{VR_{s,p} \;/\; p \in WP\})$$
[0570] where:
[0571]
[0572] p is a position belonging to the set of work points.
[0573]
[0574] md (X) is median for the set X.
[0575]
[0576] VRsp is the variation linked to the position p according to the control set s (assigned to sample m).
[0577]
[0578] WP is the set of work points for the sample under study and p a position included in said set.
[0579]
[0580] c) Rate of variation (1207) of the characteristic model for O and its variation along O, O.var and O.vtvar. O.var is calculated as the median of the quotients between the standardized coverage variations and the standardized reference coverage values for the model at each of the positions included as Work Points in the chromosomal range covered by O. O.vtvar is calculated as the interquartile range of the set of quotients considered for the calculation of O.var.
[0581]
[0582] $$O.var = md\left(\left\{ \frac{VR_{s,p}}{RF_{s,p}} \;/\; p \in WP \right\}\right)$$
[0583] $$O.vtvar = iqr\left(\left\{ \frac{VR_{s,p}}{RF_{s,p}} \;/\; p \in WP \right\}\right)$$
[0584]
[0585] where:
[0586]
[0587] p is a position belonging to the set of work points.
[0588]
[0589] md (X), iqr (X) are respectively the median and the interquartile range for the set X.
[0590]
[0591] RFsp is the reference factor for the position p according to the control set s (assigned to sample m).
[0592]
[0593] VRs, p is the variation linked to the position p according to the control set s (assigned to sample m).
[0594] WP is the set of work points for the sample under study and p a position included in said set.
[0595]
[0596] d) Coverage ratio (1208) characteristic for O and its variation along O, O.ratio and O.ratiovar. O.ratio is calculated as the median of the quotients between the standardized coverage for the sample and the reference for the linked model, considering the positions included as Work Points in the chromosomal range covered by O. O.ratiovar is calculated as the interquartile range of the set of quotients considered for the calculation of O.ratio.
[0597]
[0598]
[0599] $$O.ratio = md\left(\left\{ \frac{NC_{m,p}}{RF_{s,p}} \;/\; p \in WP \right\}\right)$$
[0600]
[0601]
[0602] $$O.ratiovar = iqr\left(\left\{ \frac{NC_{m,p}}{RF_{s,p}} \;/\; p \in WP \right\}\right)$$
[0603]
[0604] where:
[0605]
[0606] p is a position belonging to the set of work points.
[0607] md (X), iqr (X) are respectively the median and the interquartile range for the set X.
[0608]
[0609] RFsp is the reference factor for the position p according to the control set s (assigned to sample m).
[0610]
[0611] NCmp is the normalized coverage at the position p for the sample m.
[0612]
[0613] WP is the set of work points for the sample under study and p a position included in said set.
[0614]
[0615] e) Characteristic distance (1209) for O and its variation along O, O.distancia and O.distanciavar. O.distancia is calculated as the median of the quotients between the differences of standardized coverage (between the sample and the reference value according to the linked model) and the variations of standardized coverage for the model, at the positions considered Work Points in the chromosomal interval covered by O. O.distanciavar is calculated as the interquartile range of the set of quotients considered for the calculation of O.distancia.
[0616]
[0617] where:
[0618]
[0619] p is a position belonging to the set of work points.
[0620] md (X), iqr (X) are respectively the median and the interquartile range for the set X.
[0621]
[0622] RFsp is the reference factor for the position p according to the control set s (assigned to sample m).
[0623]
[0624] VRsp is the variability associated with the position p according to the control set s (assigned to sample m).
[0625]
[0626] NCmp is the normalized coverage at the position p for the sample m.
[0627]
[0628] WP is the set of work points for the sample under study and p a position included in said set.
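The characteristic values a) to e) can be gathered in a short Python sketch. It is an illustration only: the dictionary keys are hypothetical names, the quartiles are computed with the "inclusive" method of the standard library (the text does not fix a quartile convention), and the distance of e) is taken here as a signed difference, which is an assumption because the text does not say whether the absolute value is used. The inputs are the values at the work points covered by the outlier.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range (needs at least two values); quartile method assumed."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def characterize(nc, rf, vr, auto_cov, phi=1):
    """Summarize an outlier from per-work-point coverage, reference and variation.

    nc, rf, vr: sample coverage, reference and variation at each work point;
    auto_cov: total autosomal coverage of the sample; phi: sex correction factor.
    """
    rates = [v / r for v, r in zip(vr, rf)]                # variation rates
    ratios = [c / r for c, r in zip(nc, rf)]               # sample/model ratios
    dists = [(c - r) / v for c, r, v in zip(nc, rf, vr)]   # signed: an assumption
    return {
        "cob": phi * auto_cov * median(rf),                     # a) model coverage
        "var": median(vr),                                      # b) variation
        "rate": median(rates), "rate_iqr": iqr(rates),          # c) variation rate
        "ratio": median(ratios), "ratio_iqr": iqr(ratios),      # d) coverage ratio
        "distance": median(dists), "distance_iqr": iqr(dists),  # e) distance
    }
```

Using medians and interquartile ranges rather than means and standard deviations keeps these summaries robust to a few extreme positions inside the outlier.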
[0629]
[0630] In addition to the previous values, the outlier is associated (1210) not only with the detected chromosomal positions but also with the extreme positions compatible with a structural variant, that is, both the widest chromosomal range that the variant could reach and the minimum confidence interval for the ends of the detected outlier (consider, for example, that there are areas that do not belong to the region of interest or for which there is no coverage). The protocol for determining the extreme positions of the widest range that could be affected starts from the limits detected for the outlier (O.posini and O.posfin) and extends them across the adjacent positions not included in O until it finds an aggregation of positions (whose size is established by a variable Cfg.BRKnoCNV, by default 20) that are work points, that have a variation rate not higher than the characteristic rate for the outlier region plus three times its characteristic variation, and for which the increase/decrease in the sample-model coverage ratio is equal to or less than 1/3 of the characteristic one for the initial outlier (the ratio being greater than 1 when the characteristic ratio of the outlier is greater than 1, or lower than 1 when that of the outlier is lower than 1). The extreme positions of the minimum interval considered to be affected are determined by a similar process, but traversing the positions adjacent to the ends that are included in the outlier until finding an aggregation of positions that are Work Points, that have a variation rate not higher than the characteristic rate for the outlier region plus three times its characteristic variation, and for which the increase/decrease in the sample-model coverage ratio is equal to or greater than 2/3 of the characteristic one for the outlier (the ratio being greater than 1 when the characteristic ratio of the outlier is greater than 1, or lower than 1 when that of the outlier is lower than 1). After recording the extreme positions of the affected range, both observed and potential, the process ends (1211).
[0631]
[0632] Figure 13 illustrates a possible implementation of the extension of the lower limit of the interval. The process starts (1301) with the initialization (1302) to 0 of a counter of non-altered positions, noCNVs. Let p be the starting position of the outlier and r its starting region:
[0633]
[0634] 1) If p is the initial position of the first of the regions available in the block of regions analyzed, go to step 6 to terminate the protocol. Otherwise set p to the previous position within the regions of interest, that is: if p is greater than the starting position of the current region r, decrease p by one unit; otherwise decrease r by one unit (explore the previous region) and set p to its final position.
[0635] 2) If the scanned position p does not satisfy both that p is a work point for the model linked to the sample being analyzed and that the variation rate for the model (variation divided by reference value) does not exceed the variation rate of the outlier plus 3 times the variation associated with that rate, return to step 1.
3) When the proposition w is not satisfied, decrease the counter noCNVs by one unit if its value is greater than 0, and return to step 1.
[0636]
[0637]
[0638]
[0639]
[0640] where:
[0641]
[0642] RFsp is the reference factor for the position p according to the control set s (assigned to sample m).
[0643]
[0644] NCmp is the normalized coverage at the position p for the sample m.
[0645]
[0646] O.ratio is the characteristic coverage ratio for the outlier under review.
[0647] 4) If the noCNVs counter is less than Cfg.BRKnoCNV go to step 1.
[0648] 5) Assign (1303) as the minimum initial position, O.imin (lower limit of the interval), the position p minus the number of positions established by the configuration value Cfg.BRKdecay (by default 25); assign r as the minimum starting region, O.irmin; and finalize (1304) the protocol.
[0649]
[0650] To calculate the maximum final position, an analogous procedure can be followed, advancing positions from the end assigned to the outlier. The minimum interval is calculated analogously, but using the following as condition w (in this case, siCNVs positions are counted instead of noCNVs positions):
[0651]
[0652]
[0653]
[0654]
[0655] where:
[0656]
[0657] RFsp is the reference factor for the position p according to the control set s (assigned to sample m).
[0658]
[0659] NCmp is the normalized coverage at the position p for the sample m.
[0660]
[0661] O.ratio is the characteristic coverage ratio for the outlier under review.
[0662]
[0663] Figure 14 shows the process of determining if an outlier can reflect a structural variant, once its characterization is completed. The process is initiated (1401) by individually selecting (1402) each outlier. It is considered that the selected outlier may reflect a structural variant in the following cases:
[0664]
[0665] 1) If the size in base pairs covered by the region detected as an outlier is equal to or greater than the minimum required (Cfg.minTamano), the model coverage assigned to the outlier is equal to or greater than the limit established for work point positions (cfg.WPmcb), and the characteristic distance of the outlier is equal to or greater than the minimum required (cfg.dmin), evaluate the characteristic ratio (point 2); otherwise dismiss the outlier as a candidate.
[0666] 2) When the characteristic ratio is outside the interval [0.65, 1.35], consider the outlier a CNV candidate; otherwise consider it a candidate only if its characteristic distance exceeds a minimum required distance that is exponential with base cfg.dmin and has its maximum at a ratio of 1. The minimum distance for a given ratio is calculated (1403) according to the function:
[0667]
[0668] $$\psi = cfg.dmin^{\,(0.6 - |O.ratio - 1|) \cdot 15}$$
[0669]
[0670] where:
[0671]
[0672] cfg. dmin is the minimum distance required (by configuration) to consider an outlier as a CNV candidate.
[0673]
[0674] O.ratio is the characteristic coverage ratio for the outlier under review.
[0675]
[0676] The candidate outliers are recorded (1404) in the corresponding register and the process ends. For those outliers considered candidates for CNVs, contextual information is retrieved and recorded, and a series of filters is evaluated and labeled. The annotation involves retrieving the annotations from the different sources of information supplied to the system for all positions included within the regions covered by the outlier (even partially), as well as within the previous and following regions (provided they are available). All annotations made on chromosomal ranges that are contained in or overlap the range covered by the outlier are also retrieved. Information sources include databases, .bed files, .gff files, .vcf variant files and .bam alignment files.
[0677]
[0678] Among the information sources required to resolve some filters are the chromosomal mappings of the different isoforms of the genes to be annotated. For the regions in the context of the outlier, the start and end coordinates of each UTR, intronic and coding region are retrieved and associated with the record of the candidate CNV outlier.
[0679]
[0680] Once the information on genetic regions is retrieved, a compendium isoform is calculated in which the coding regions of the different isoforms are merged, as are the UTRs for the remaining bases. The following are then recorded for the outlier: the number of coding bases it potentially covers (i.e. those included between the maximum ends O.mini and O.mfin), the number of coding regions, the size in base pairs of said coding regions, and the minimum distance from either end of the outlier to a coding base (a signed distance, negative in one case and positive in the other). Depending on how this distance compares with the value set by Cfg.FILTdcdna (default 10), the outlier is marked with a soft filter (which can be toggled when showing the results in the exploration subsystem); this feature can also be implemented as a hard filter that removes the affected results directly.
[0681]
[0682] When the genes or chromosomal ranges belonging to different panels are supplied to the system, this information is processed to establish a per-panel filter: an outlier is active for a given panel if any of the regions of interest for that panel overlaps the chromosomal region potentially covered by the outlier.
[0683]
[0684] As for the structural variant annotation resources from different sources (e.g. the DGV database), the maximum percentage of overlap between any of the described structural variants of the same type (copy gains or losses) and the range covered by the outlier is recorded. For each structural variant, this maximum is determined either by the percentage of bases of the variant described in the database that are potentially covered by the outlier, or by the percentage of the detected bases of the outlier that are covered by the described structural variant.
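The overlap annotation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary keys and the use of closed (inclusive) coordinate intervals are assumptions made for the example.

```python
def overlap_len(a_start, a_end, b_start, b_end):
    """Number of bases shared by two closed intervals [start, end]."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def max_overlap_pct(variant, potential, detected):
    """Largest of the two overlap percentages for one described variant.

    variant, potential and detected are (start, end) tuples: the described
    structural variant, the potential range of the outlier and its
    detected range, respectively.
    """
    variant_len = variant[1] - variant[0] + 1
    detected_len = detected[1] - detected[0] + 1
    # Share of the described variant potentially covered by the outlier.
    pct_variant = 100.0 * overlap_len(*variant, *potential) / variant_len
    # Share of the detected outlier covered by the described variant.
    pct_outlier = 100.0 * overlap_len(*variant, *detected) / detected_len
    return max(pct_variant, pct_outlier)


def annotate_overlap(described_variants, potential, detected, outlier_type):
    """Maximum overlap over all described variants of the same type."""
    return max(
        (
            max_overlap_pct((v["start"], v["end"]), potential, detected)
            for v in described_variants
            if v["type"] == outlier_type  # "gain" or "loss"
        ),
        default=0.0,
    )
```

Only variants of the same type as the outlier contribute, and an outlier with no matching described variant is annotated with 0.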
[0685]
[0686] The event detection subsystem establishes a connection with a database or a set of records in which each detected CNV is registered, keeping for each record the detected and potential ranges, the related characteristics and the assigned degree of reliability. The sample for which the event has been detected is also recorded. This information repository is queried during the annotation phase to retrieve the number of times that the range affected by the outlier (the detected range, not the potential one) is overlapped by a record in the repository for a sample other than the one being processed, and likewise the number of times this occurs when considering only those annotations whose assigned degree of credibility is equal to or greater than that of the outlier under study. Based on these counts, a filter is assigned by frequency among the samples in which the chromosomal region affected by the outlier has been studied; the limit frequency is set by the parameter Cfg.FILTCNVfreq (default 5%).
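A minimal sketch of the frequency annotation follows. The record layout, the function name and the decision to mark the outlier as filtered when the frequency exceeds the limit are assumptions for illustration; the text only states that a limit frequency (default 5%) is configured.

```python
def frequency_filter(records, outlier, current_sample, n_samples_studied,
                     max_freq=0.05):
    """Count repository records overlapping the detected outlier range.

    records: iterable of dicts with keys sample, chrom, start, end, score.
    outlier: dict with keys chrom, start, end (detected range) and score.
    Returns (overlap count, overlap count with credibility at least that
    of the outlier, whether the assumed frequency filter is triggered).
    """
    hits = [
        r for r in records
        if r["sample"] != current_sample          # other samples only
        and r["chrom"] == outlier["chrom"]
        and r["start"] <= outlier["end"]          # closed-interval overlap
        and r["end"] >= outlier["start"]
    ]
    n_all = len(hits)
    n_credible = sum(1 for r in hits if r["score"] >= outlier["score"])
    # Assumption: frequency is the overlap count divided by the number of
    # samples in which the region was studied.
    freq = n_all / n_samples_studied if n_samples_studied else 0.0
    return n_all, n_credible, freq > max_freq
```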
[0687]
[0688] In addition to retrieving information from the sources that provide annotations on chromosomal positions or intervals, the variant files of each sample (.vcf) are also consulted, if available, for variants that overlap or are contained in the context regions of the outlier being annotated. For each such variant, its chromosomal coordinates are recorded, as well as its quality and the frequency of the alternative allele.
[0689]
[0690] Figure 15 illustrates the process of detecting breaks (commonly known by their English name, "breakpoints"), or their absence, from the sequenced reads:
[0691]
[0692] 1) The process starts (1501) by retrieving from the alignment file of the study sample the reads that overlap the interval [O.imin, O.imax] (1502A), that is, the chromosomal range between the two extreme positions for the lower limit of the outlier. These reads form the set READS5. For each recovered read, its alignment information is processed (1503A) by examining the operations field of the alignment file (the field known as CIGAR, from the English "Compact Idiosyncratic Gapped Alignment Report"). For reads with masked ends (bases to which a logical erasure has been applied, commonly known by the English term "soft clipping") affecting more than 10 bases, or with a chimeric alignment involving more than 10 segregated bases (the system can be configured to require a number other than 10), the sequence that cannot be aligned contiguously with the primary alignment (the soft-clipped sequence) is recovered (1504A) and the chromosomal position where the break occurs is calculated. The read is assigned to a breakpoint cluster (1505A) according to its break position, and the soft-clipped sequence is registered in that cluster (under its cluster position).
[0693] To determine the break position in soft clipping cases, the CIGAR field of the read is examined. When the first operation of the CIGAR (ignoring hard clipping operations) is a soft clip of sufficient size, the break position is the alignment start position of the read plus the number of bases covered by that operation. When the soft clipping block is the last operation (again ignoring hard clipping), the break position is the alignment start position of the read plus the number of bases covered by the CIGAR match (M) or deletion (D) operations. If p is the break position, the cluster position cp is calculated as cp = floor((p + 5) / 10) * 10, that is, p plus 5 units, divided by 10, truncated to its integer part and multiplied by 10.
[0694]
[0695] The soft-clipped sequence, seqS, is taken from the read sequence by extracting as many bases as the CIGAR indicates for the soft clipping block, from the beginning or the end of the read sequence depending on where the soft clipping block is located in the CIGAR.
[0696]
[0697] 2) Retrieve from the alignment file of the study sample the reads that overlap the interval [O.fmin, O.fmax] (1502B), that is, the chromosomal range that delimits the end of the outlier, formed by the two extreme positions already mentioned: the minimum coordinate that is affected and the maximum one that could be affected by an underlying structural variant. These reads form the set READS3. For each recovered read, process the alignment information (1503B) to study the CIGAR and, as in the previous step, when the requirement in bases is met, extract the soft-clipped sequences (1504B) and assign them to the corresponding cluster positions (1505B).
[0698] 3) Remove from the register those clusters (cluster positions and their associated data) whose number of registered sequences is lower than the limit established for the system, Cfg.BRKminsop (default 20) (1506). The purpose of this filtering is to exclude from review the regions where there is no pile-up of soft clipping or breaks (that is, operations suspected to be caused by sequencing noise rather than by biological events).
[0699] 4) For each of the clusters that remain in the register after the filtering of step 3, retrieve the sequences linked to them (1507) and align those sequences to find the paired clusters.
[0700] a. If the cluster being considered has a cluster position equal to or greater than the minimum position calculated for the end of the outlier (O.fmin), its sequences are aligned (alignment is allowed in forward, reverse, complementary or reverse-complementary orientation) against the reference genome DNA sequence between the chromosomal positions [O.imin, O.imax] (1508); that is, the pair is searched for in the range delimited by the extreme potential positions for the beginning of the outlier. When an alignment occurs, with ap1 and ap2 being the start and end positions of the alignment against the target sequence, the paired cluster position is calculated as pcp = floor((O.imin + ap2 + 5) / 10) * 10, that is, the alignment end position, plus the chromosomal position of the beginning of the sequence against which it is aligned, plus 5, divided by 10, truncated to its integer part and multiplied by 10. When there is an alignment, the counter for the pcp-cp pair is incremented by one, or opened and set to 1 if there is no previous record.
[0701] b. When the cluster considered has a cluster position equal to or less than the position calculated as maximum for the beginning of the outlier (O.imin), its sequences are aligned against the reference genome DNA sequence between the chromosomal positions [O.fmin, O.fmax] (1509); that is, the pair is searched for in the range delimited by the extreme positions for the end of the outlier. When an alignment occurs, with ap1 and ap2 being the start and end positions of the alignment against the target sequence, the paired cluster position is calculated as pcp = floor((O.imin + ap1 + 5) / 10) * 10, that is, the alignment start position, plus the chromosomal position of the beginning of the sequence against which it is aligned, plus 5, divided by 10, truncated to its integer part and multiplied by 10. When there is an alignment, the counter for the cp-pcp pair is incremented by one, or opened and set to 1 if there is no previous record.
[0702] 5) Register for the outlier (1510), that is, for the candidate CNV, the cluster position of the individual cluster with the highest number of linked sequences (O.brk), together with that number of sequences (O.cbrk), which records the number of reads supporting the cluster. Also register the pair of positions of the cluster pair with the highest counter (O.pbrk), together with that counter (O.cpbrk), which records the number of reads supporting the pair of clusters.
[0703] A filter is also established from the information recorded during the review of the alignments. A CNV is marked as filtered by breakpoints when, the start and end regions of the outlier being the same, for either of the two intervals [O.imin, O.imax], [O.fmin, O.fmax] both ends belong to the same region, all of its bases have a raw coverage for the study sample equal to or higher than that required for a position belonging to the Work Points (cfg.WPmcb), and the number of reads supporting the breakpoint cluster (O.cbrk) is less than cfg.FILTmbrk.
[0704]
[0705] 6) Finish (1511) the process.
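The break position and cluster position rules of Figure 15 can be illustrated with a short sketch. It follows the text literally (leading soft clip: alignment start plus clipped bases; trailing soft clip: alignment start plus the bases covered by M and D operations). The function names and the choice to report only the first qualifying soft clip of a read are assumptions of this example.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_break(read_start, cigar, read_seq, min_clip=10):
    """Return (break_position, soft_clipped_sequence), or None.

    Hard clipping operations (H) are ignored, as described in the text.
    A soft clip qualifies when it affects more than min_clip bases.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return None
    first_len, first_op = ops[0]
    last_len, last_op = ops[-1]
    if first_op == "S" and first_len > min_clip:
        # Leading soft clip: start position plus the clipped bases.
        return read_start + first_len, read_seq[:first_len]
    if last_op == "S" and last_len > min_clip:
        # Trailing soft clip: start position plus bases covered by M and D.
        ref_span = sum(n for n, op in ops if op in ("M", "D"))
        return read_start + ref_span, read_seq[-last_len:]
    return None


def cluster_position(p):
    """cp = floor((p + 5) / 10) * 10: break position rounded to the nearest 10."""
    return (p + 5) // 10 * 10
```

Rounding break positions to the nearest multiple of 10 groups reads whose clipping points differ by a few bases into the same cluster.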
[0706]
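The cluster filtering, pairing and selection steps of Figure 15 can be sketched as below. The alignment of soft-clipped sequences against the reference is not implemented here: its results are assumed to arrive as tuples, and all function names and the tuple layout are hypothetical.

```python
from collections import Counter


def filter_clusters(clusters, min_support=20):
    """Drop clusters with fewer sequences than the limit (Cfg.BRKminsop).

    clusters maps a cluster position to the list of its soft-clipped
    sequences.
    """
    return {cp: seqs for cp, seqs in clusters.items()
            if len(seqs) >= min_support}


def paired_cluster_position(ref_start, aligned_pos):
    """pcp = floor((ref_start + aligned_pos + 5) / 10) * 10."""
    return (ref_start + aligned_pos + 5) // 10 * 10


def count_pairs(hits):
    """Count supporting alignments per pair of cluster positions.

    hits: iterable of (cp, ref_start, aligned_pos, at_end) tuples, one per
    aligned sequence. at_end is True when the cluster lies at the end of
    the outlier, so the pair is keyed as (pcp, cp); otherwise (cp, pcp).
    """
    pairs = Counter()
    for cp, ref_start, aligned_pos, at_end in hits:
        pcp = paired_cluster_position(ref_start, aligned_pos)
        pairs[(pcp, cp) if at_end else (cp, pcp)] += 1
    return pairs


def best_support(clusters, pairs):
    """Select the values registered as O.brk, O.cbrk, O.pbrk and O.cpbrk."""
    if clusters:
        brk, seqs = max(clusters.items(), key=lambda item: len(item[1]))
        cbrk = len(seqs)
    else:
        brk, cbrk = None, 0
    pbrk, cpbrk = pairs.most_common(1)[0] if pairs else (None, 0)
    return {"brk": brk, "cbrk": cbrk, "pbrk": pbrk, "cpbrk": cpbrk}
```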
[0707] For each candidate CNV, a byte stream is persisted in a storage space whose nature depends on the chosen deployment of the system (for example, a temporary file or, permanently, a database). The data export stage encodes all the positions of the regions in the CNV context (the regions covered by the CNV plus the two preceding and the two following regions, as long as they are available in the iteration; otherwise, those that are available within this range). Together with the positions, the values of the processed signals for the different samples and of the reference signal are encoded and persisted. The way in which this information is encoded, structured and persisted in the data container is specified later, in the description of the data container element. This dump of information, whether to temporary or permanent storage, allows resources to be released at the end of an iteration, and its structure and encoding allow the information to be accessed and used efficiently later. Once an outlier has been identified as a CNV candidate, the data export can be done at any time during the corresponding iteration.
[0708]
[0709] Figure 16 presents the calculation and recording of a score that reflects the degree of confidence that the behavior of the signals for the candidate CNV actually reflects an underlying structural variant. This confidence is calculated considering the distance, the sample-reference ratio and its variability as well as the model coverage in the region and the support (in size or by evidence of split-reads).
[0710]
[0711] The first step after the start (1601) is the calculation (1602) of the distance term associated with the candidate CNV, $\Omega_{dist}$. It is scored by a function that takes the value 0 at 1 and grows symmetrically and exponentially as the distance decreases or increases, until it reaches asymptotes at 1; mathematically:
[0712]
[0713]
[0714] The default value for the constant a has been set to 1.7, although it can be varied depending on the particular implementation.
[0715]
[0716] The ratio term associated with the CNV, $\Omega_{ratio}$, is calculated (1603) by a function that results from 4 Gaussian functions centered at O.ratio = 0, 0.5, 1.5 and 2; mathematically:
[0717]
[0718]
[0719]
[0720]
[0721] It is also important to assess the fluctuation of the ratio along the chromosomal range associated with the candidate CNV. For this purpose, the term $\Omega_{ratiovar}$ is calculated (1604) by applying the function:
[0722]
[0723]
[0724]
[0725]
[0726] The scoring factor due to coverage, $\Omega_{cob}$, is evaluated (1605) according to the following function:
[0727]
[0728]
[0729]
[0730]
[0731] The support is given the maximum value in those cases where a breakpoint with sufficient support has been located, that is, O.brk > cfg.BRKmin; in this case the term $\Omega_{sop}$ takes (1606) the value 1. When this condition is not met, the support is evaluated according to the real and potential size of the CNV, said size being O.ptam = 0.5 · [(O.mfin - O.mini + 1) + (O.fin - O.ini + 1)]. The support is evaluated according to the function:
[0732]
[0733]
[0734]
[0735] where:
[0736]
[0737] O.ini, O.fin are the initial and final positions detected for the outlier.
[0738] O.mini, O.mfin are the minimum initial and maximum final positions that the outlier could reach.
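As a worked illustration of the size used in the support term, the sketch below averages the detected and potential sizes of the CNV. It assumes that each size is counted inclusively (end minus start plus 1), which is one reading of the size expression; the function name is hypothetical.

```python
def potential_size(ini, fin, mini, mfin):
    """Mean of the detected and potential sizes of a candidate CNV.

    ini, fin: initial and final positions detected for the outlier.
    mini, mfin: minimum initial and maximum final positions it could reach.
    Assumption: both sizes are inclusive, i.e. end - start + 1.
    """
    detected = fin - ini + 1
    potential = mfin - mini + 1
    return 0.5 * (potential + detected)


# A CNV detected over 1000 bases that could span up to 1200 bases.
size = potential_size(ini=1000, fin=1999, mini=900, mfin=2099)
```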
[0739]
[0740] Finally, the score is calculated (1608) using the following formula, after which the process ends (1609):
[0741]
[0742]
[0743]
[0744]
[0745] This score takes values between 0 and 10. Note, however, that other formulations, ranges, weights, etc., can be used to evaluate the degree of confidence in each particular embodiment.
[0746]
[0747] Figure 17 shows the protocol for merging candidate CNVs into new ones that span them, applied once the signals of the regions included in a work iteration for a model have been explored and all the candidate CNVs have been characterized, filtered and scored. The fusion protocol is initiated (1701) by sorting (1702) the list of candidate CNVs by chromosome and initial detected position. The following steps are then followed:
[0748] 1) Initialize (1703) the merge list by taking the first item from the general sorted list of candidate CNVs that has not been previously explored.
[0749] 2) If there are no CNVs left to analyze in the general list, or the group and type of the last candidate CNV included in the merge list and of the next candidate CNV in the general list do not coincide, carry out the fusion, that is, go to step 4 (a candidate CNV is of type gain when its characteristic ratio is greater than one, and of type loss when it is less than one). When neither of these two conditions holds:
[0750] a. Calculate (1704) a fusion ratio, rf, as the mean of the characteristic ratios of the last item included in the merge list and of the item that follows it in the general list. Also calculate the fusion variation, vf, as the maximum of the characteristic ratio variations of these same two elements.
[0751] b. If the absolute value of the difference of the ratio variations exceeds the fusion variation, the last element taken cannot be included in the current merge list; the process continues in step 4, where the fusion set is resolved, and this element is used as the first member of a new list. When the fusion variation is not exceeded, the process continues in step 3 with the search for intermediate blocks not compatible with CNVs.
[0752] 3) Search (1705) for blocks of positions, between those affected by the last candidate CNV included in the merge list and the next item in the general list, that would be incompatible with the existence of CNVs.
[0753] a. Initialize a counter (noCNV) to 0, and take as the starting region and position for the exploration the last region and position covered by the last detected CNV of the merge list (CNVe).
[0754] b. Evaluate the current position:
[0755] i. If it is not lower than the first position covered by CNVe:
[0756] When the value of the noCNV counter does not exceed the threshold specified by the parameter cfg.FUSmnoCNV (default 10), incorporate CNVe and continue in step 2, taking as CNVe the candidate CNV in the general list that follows the one just included in the merge list. When the noCNV counter is equal to or greater than cfg.FUSnoCNV, CNVe cannot be added to the merge list; the process continues in step 4, where the fusion set is resolved, and this element becomes the first member of a new merge list (unless it is the last one in the general list).
[0757] ii. When the current position is lower than the first one covered by the detected candidate CNV, CNVe:
[0758] 1. If the position belongs to the Work Points and the ratio for the study sample at that position is not within the range defined by the fusion ratio plus or minus twice the fusion variation, or the following inequality is not met:
[0759]
[0760]
[0761]
[0762] where:
[0763]
[0764] $RF_{s,p}$ is the reference factor for position p according to the control set s (assigned to sample m).
[0765] $NC_{m,p}$ is the normalized coverage at position p for sample m.
[0766]
[0767] rf is the ratio of fusion.
[0768]
[0769] increase the noCNV counter. If it is then lower than the threshold cfg.FUSnoCNV, evaluate the next position by going to step 3.b; otherwise, continue in step 4, since CNVe cannot be added to the merge list: it is taken as the first member of a new merge list, and the fusion of the current members of the merge list is resolved.
[0770]
[0771] 2. When the three criteria required in the previous point (ii.1) are not met, the noCNV counter is decremented by one unit (unless it is already 0) and the next position is evaluated by returning to step 3.b.
[0772] 4) Resolve a fusion set. When the merge list consists of only one member, no merging occurs and the merge list is simply cleared. Otherwise, a new outlier is processed whose extreme detected positions are the initial position of the first candidate CNV included in the merge list and the final detected position of the last candidate CNV included in the merge list. When several elements are merged, the maximum credibility is recovered (1706): if the credibility score of the new candidate CNV resulting from the merger is lower than that of any of the merged members, it is reset to that higher value. Once the resolution is complete, the records of the candidate CNVs that have been merged are discarded (1707), leaving a single record for the new candidate CNV resulting from the merger. If there are unexplored items in the general list of candidate CNVs, a new merge list is initialized with the candidate CNV that follows, in that list, the last one included in the previous merge list, and the process continues in step 2; otherwise the fusion is processed (1708) and the process ends (1709).
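Two pieces of the fusion protocol lend themselves to a short sketch: the fusion ratio and variation of step 2.a, and the resolution of a fusion set in step 4. The dictionary keys and the `rescore` callback (standing in for the full characterization and scoring of the merged outlier) are assumptions of this example.

```python
def fusion_ratio_and_variation(last, nxt):
    """rf is the mean of the two characteristic ratios; vf is the larger
    of the two characteristic ratio variations (step 2.a)."""
    rf = (last["ratio"] + nxt["ratio"]) / 2
    vf = max(last["ratio_var"], nxt["ratio_var"])
    return rf, vf


def resolve_fusion_set(merge_list, rescore):
    """Merge the candidate CNVs of a merge list into one record (step 4).

    merge_list: candidate CNVs sorted by initial position, each a dict
    with keys chrom, ini, fin and score.
    rescore: callable returning the credibility score of the merged
    outlier (hypothetical stand-in for the full scoring process).
    """
    if len(merge_list) == 1:
        return merge_list[0]  # a single member: nothing to merge
    merged = {
        "chrom": merge_list[0]["chrom"],
        "ini": merge_list[0]["ini"],    # start of the first member
        "fin": merge_list[-1]["fin"],   # end of the last member
    }
    # The merged score never falls below that of any merged member.
    merged["score"] = max(rescore(merged),
                          max(m["score"] for m in merge_list))
    return merged
```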
[0773]
[0774] The information generated by the detection and annotation subsystem (101) after all the described processes is finally stored in the data container subsystem (102), so that it can be accessed from the scanning subsystem (103). The data container subsystem (102) holds a byte stream for each candidate cnv. The byte stream encodes, for the entire scan interval associated with the candidate cnv, the included positions according to the export resolution configured in the detection and annotation subsystem, as well as the reference signal and the signal of each sample for those positions. For a scan interval, the byte stream is formed by a succession of blocks of equal size, which depends on the number of samples. Each block contains the information associated with one chromosomal position, and the blocks preferably succeed one another in increasing order of chromosomal coordinate.
[0775]
[0776] Figure 18A shows a non-limiting example of coding a block, which the detection subsystem (101) is configured to follow when saving the results of its analysis in the data container subsystem (102). In this example, the following parameters appear encoded in binary and according to the machine representation of a 2-byte integer:
[0777]
[0778] - The upper digits (1801) of the chromosomal coordinate, that is, the integer part of the result of dividing said coordinate by 10000.
[0779] - The 4 least significant digits (1802) of said coordinate, that is, the remainder of dividing the coordinate by 10000.
[0780] - The reference signal (1803) according to the model for said position, rescaled to the signal of the study sample, and the rescaled signals (1804, 1805, 1806) associated with said position for each of the samples considered in the study.
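The block layout of Figure 18A can be sketched with Python's `struct` module. The byte order (little-endian) and the use of unsigned 2-byte integers are assumptions of this example; the text only specifies a 2-byte integer representation.

```python
import struct


def encode_block(coordinate, ref_signal, sample_signals):
    """Encode one block: coordinate split by 10000, reference signal and
    one rescaled signal per sample, each as an unsigned 2-byte integer."""
    values = [coordinate // 10000, coordinate % 10000,
              ref_signal, *sample_signals]
    return struct.pack(f"<{len(values)}H", *values)


def decode_block(block):
    """Inverse of encode_block: returns (coordinate, ref_signal, samples)."""
    high, low, ref, *samples = struct.unpack(f"<{len(block) // 2}H", block)
    return high * 10000 + low, ref, samples


block = encode_block(123456789, 250, [240, 260, 0])
```

Splitting the coordinate keeps both halves within the 2-byte range: the remainder is always below 10000, and the quotient stays below 65536 for any human chromosomal coordinate. Every block for a given number of samples has the same size, here 2 · (3 + 3) = 12 bytes.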
[0781]
[0782] The rescaled signal (1804, 1805, 1806) for the sample Mi, where MR is the sample under study, is calculated according to the formula:
[0783]
[0784]
[0785]
[0786] where the function floor returns, for a given number, the largest integer less than or equal to it; cobAutosomal(M) represents the total coverage in the autosomal chromosomes (1 to 22) in the region of interest for sample M; sexCor(O, M), for a candidate cnv O and a sample M, takes the value 2 when the chromosome associated with the candidate cnv is X or Y and the sex linked to sample M is "Woman", and 1 otherwise; and $C_{M,p}$ represents the raw coverage for sample M at chromosomal position p.
[0787]
[0788] The rescaled reference signal (1803) according to the model is calculated according to:
[0789]
[0790] $Cob_{REF} = RF_{s,p} \cdot \mathrm{cobAutosomal}(M_R) \cdot \mathrm{sexCor}(O, M_R)$
[0791]
[0792] where $RF_{s,p}$ represents the reference value for position p according to the control set s assigned to the sample under study.
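The reference coverage expression can be written out as a small sketch. The function and parameter names are hypothetical; the sex correction follows the definition given in the text (2 for chromosomes X or Y when the sex linked to the sample is "Woman", 1 otherwise).

```python
def sex_cor(chromosome, sex):
    """Sex correction factor sexCor(O, M) as defined in the text."""
    return 2 if chromosome in ("X", "Y") and sex == "Woman" else 1


def reference_coverage(rf_sp, cob_autosomal_mr, chromosome, sex_mr):
    """Cob_REF = RF_{s,p} * cobAutosomal(M_R) * sexCor(O, M_R).

    rf_sp: reference value for position p according to control set s.
    cob_autosomal_mr: total autosomal coverage of the study sample M_R.
    chromosome: chromosome associated with the candidate cnv O.
    sex_mr: sex linked to the study sample M_R.
    """
    return rf_sp * cob_autosomal_mr * sex_cor(chromosome, sex_mr)
```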
[0793]
[0794] The blocks included in the data flow are those linked to coordinates in the regions affected by the candidate cnv and in the two margin regions (if available). Depending on the resolution configured for viewing, data is taken either for all of these coordinates or for equally spaced coordinates, according to the configured resolution, within each covered region, starting from the initial coordinate of each region.
[0795]
[0796] Each data flow associated with a candidate cnv also has associated metadata: its size in bytes, the maximum value of the signal it contains, and the number of coordinates for which information has been encoded. All the metadata of a data flow are provided directly to the exploration subsystem, without the need to examine all the data in the container. Once a data flow is located, the scanning subsystem (103) requests blocks of data associated with specific coordinates, which are provided by random access, that is, without the need to access the rest of the blocks of the flow.
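Because all blocks of a flow have the same size, a block can be fetched by seeking directly to its offset. The sketch below assumes little-endian unsigned 2-byte fields and that the caller already knows the byte offset of the flow and the index of the wanted block; how a requested coordinate is mapped to that index is not specified here.

```python
import io
import struct


def block_size(n_samples):
    """Bytes per block: 2 coordinate fields, 1 reference, n_samples signals."""
    return 2 * (3 + n_samples)


def read_block(stream, flow_offset, index, n_samples):
    """Read one block by random access, without touching the other blocks."""
    size = block_size(n_samples)
    stream.seek(flow_offset + index * size)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("block index out of range")
    high, low, ref, *samples = struct.unpack(f"<{3 + n_samples}H", data)
    return high * 10000 + low, ref, samples


# Example flow: a 4-byte prefix followed by three blocks for two samples.
flow = io.BytesIO(
    b"\x00" * 4
    + b"".join(struct.pack("<5H", 12345, 6789 + i, 250, 240 + i, 260)
               for i in range(3))
)
third = read_block(flow, flow_offset=4, index=2, n_samples=2)
```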
[0797]
[0798] When the detection subsystem (101) is configured to generate a result that integrates the data container subsystem (102) and the scanning subsystem (103) into a single file, or when the data container subsystem (102) is not a management system (for example, a database) but a file to be processed by an external scanning subsystem (103), the data flows for each candidate cnv are organized sequentially within the file, as illustrated in Figure 18B. The succession of data flows, contained in a body (1813), is preceded by a metadata header (1812) and followed by a tail with localization information (1814).
[0799]
[0800] The header (1812) is structured in two parts: a non-variable part (1807), 22 bytes in size, and a variable part (1808) located immediately after it. The non-variable part (1807) of the header encodes a number of fields in binary, each represented as a 2-byte integer, shown in detail in Figure 18C. In particular, the non-variable part (1807) includes the integer division (1815) of the data size by 10000, the modulus of said division (1816), the version of the data format (1817), the number of samples (1818), the maximum resolution (1819), the scale factor (1820), the list delimiter (1821), the list content delimiter (1822), the item delimiter (1823), the attribute delimiter (1824) and the value delimiter (1825). The integer division (1815) and the modulus of said division (1816) together encode the total size of the data. The version of the data format (1817) makes it possible to identify the structure of the data in case it is modified or extended in later versions of the method. The maximum resolution (1819) establishes the step between the coordinates of the data collected in the data flow. The scale factor (1820) indicates the factor by which the values encoded in the data must be multiplied to obtain the actual values (with a possible loss of precision when it is greater than 1), which allows values exceeding the largest natural number representable in 2 bytes to be encoded at the cost of reduced precision. The list delimiter (1821), list content delimiter (1822), item delimiter (1823), attribute delimiter (1824) and value delimiter (1825) fields encode the codes of the characters used to delimit the information contained in the variable section of the header.
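A sketch of reading the 22-byte non-variable part of the header follows, assuming its eleven fields (1815 to 1825) appear in the order listed, each as a little-endian unsigned 2-byte integer. The field names, the byte order and the example delimiter characters are assumptions of this example.

```python
import struct
from collections import namedtuple

FixedHeader = namedtuple(
    "FixedHeader",
    "size_div size_mod version n_samples max_resolution scale_factor "
    "list_delim list_content_delim item_delim attr_delim value_delim",
)


def parse_fixed_header(data):
    """Decode the first 22 bytes and rebuild the total data size.

    The size is stored as an integer division by 10000 plus its modulus,
    so that each part fits in a 2-byte field.
    """
    fields = FixedHeader(*struct.unpack("<11H", data[:22]))
    total_size = fields.size_div * 10000 + fields.size_mod
    return fields, total_size


# Example header: size 123456, format version 1, 8 samples, and the ASCII
# codes of ';', ':', ',', '=' and '|' as (hypothetical) delimiters.
raw = struct.pack("<11H", 12, 3456, 1, 8, 10, 2, 59, 58, 44, 61, 124)
header, size = parse_fixed_header(raw + b"variable part follows")
```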
[0801]
[0802] The variable part (1808) of the header (1812) is composed of a set of lists, each with a number of items. Each item is associated with a set of attributes and their values. Each character of the lists is encoded by its ASCII code, in the representation of a 16-bit integer. As an example, a possible implementation of the header comprises a first list with information on the study sample, a second list with the other samples included in the analysis, and a third list with the candidate cnvs of the sample.
[0803] The first list contains [...]. Each item of the second list has attributes, with their corresponding values, that record the sample identifier, the sequencing plate, the lane and index within the plate, the sex, and the set of control samples for that sample. Each item of the third list has [...] locus and range [...]; size (a compound attribute: size in [...]); deviation [...]; ratio [...]; information on breakpoints (a compound attribute: location of the breakpoint [...]); [...] the start coordinates of the chromosomal ranges between regions of interest covered by the candidate cnv; the number of samples to omit in the representation of the candidate cnv by the exploration subsystem; the values of the filters associated with the candidate cnv and the assigned credibility score; number of [...]; maximum value of [...].
[0804]
[0805] Finally, the tail (1814) is formed by a first tail field (1810) and a second field (1811) encoded in 4 bytes. The first field (1810) contains information on the header (1812), while the second field (1811) repeats the first two data items of the non-variable part (1807) of the header. In other words, the second field (1811) repeats the integer division (1815) and the modulus of said division (1816), which together encode the total size of the data. The function of this field can be to check that the data block is correctly formatted.
[0806] Figure 19 shows a possible implementation of the procedure for generating the described data flow. After its initialization (1901):
[0807]
[0808] 1. Take (1902) the regions affected by the cnv, as well as the regions immediately before and after the event, if available. Sort those regions according to the initial position associated with each of them.
[0809]
[0810] 2. Set (1903) the byte counter, the maximum coverage and the total number of points to 0.
[0811]
[0812] 3. As long as there are regions to be processed, take the positions linked to the region in progress, ordered from lowest to highest, and process them starting with the lowest (1904). For each position in this list whose distance from the first is a multiple of the maximum resolution established for the exploration, and in any case also for the final position:
[0813]
[0814] 3.1. Calculate (1905) the integer division of the position by 10000, exporting the result as an integer to a temporary storage using a binary representation of 2 bytes. Export the remainder of dividing the position by 10000 to a temporary storage, also using a binary representation of 2 bytes.
[0815]
[0816] 3.2. Retrieve (1906) the set of control samples for the sample under study and the associated reference value in the model for said set. Calculate the reference coverage Cob_REF according to:
[0817]
[0818] $Cob_{REF} = RF_{s,p} \cdot \mathrm{cobAutosomal}(M_R) \cdot \mathrm{sexCor}(O, M_R)$
[0819] where $RF_{s,p}$ represents the reference value for the position p according to the control set assigned to the sample under study.
[0820]
[0821] 3.3. Update (1907) the byte register by adding 6 extra bytes.
[0822]
[0823] 3.4. For each of the study samples:
[0824]
[0825] 3.4.1. Calculate (1908) the coverage for each sample according to:
[0826]
[0827]
[0828]
[0829] where the function floor, given a number, returns the largest integer less than or equal to it; cobAutosomal(M) represents the total coverage in the autosomal chromosomes (1 to 22) in the region of interest for sample M; sexCor(O, M), for a candidate cnv O and a sample M, takes the value 2 if the chromosome associated with the candidate cnv is X or Y and the sex linked to the sample M is "Woman", and 1 otherwise; and C_{m,p} represents the raw coverage at position p for the sample m.
[0830] Export this value as the result to a temporary storage using a binary representation of 2 bytes.
[0831]
[0832] 3.4.2. Update (1909) the byte register by increasing it by 2 units and the point register by increasing it by 1 unit.
[0833]
[0834] 3.4.3. Update (1910) the maximum coverage record to the Cob_i value when this is greater than the stored value.
[0835]
[0836] 4. Prior to completing the procedure (1911), associate the candidate cnv with the metadata of total points exported, maximum coverage value and number of bytes of the data flow.
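The generation procedure above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the implementation of the invention: the names `encode_block` and `build_data_flow` are hypothetical, the fields are assumed to be unsigned 16-bit integers in big-endian order, values are assumed to be clamped to the 2-byte range, and the reference and per-sample coverages are taken as already computed inputs.

```python
import struct

POSITION_BASE = 10000  # base used to split a position into two 2-byte fields
MAX_UINT16 = 0xFFFF    # largest value that fits in a 2-byte field


def encode_block(position, ref_coverage, sample_coverages):
    """Encode one point as 6 + 2 * n_samples bytes.

    Assumed layout: quotient, remainder, reference coverage,
    then one coverage value per sample (big-endian, unsigned 16-bit).
    """
    quotient, remainder = divmod(position, POSITION_BASE)
    values = [quotient, remainder, ref_coverage, *sample_coverages]
    # Assumption: values outside the 2-byte range are clamped.
    clamped = [max(0, min(int(v), MAX_UINT16)) for v in values]
    return struct.pack(f">{len(clamped)}H", *clamped)


def build_data_flow(points):
    """Concatenate the blocks and collect the metadata of step 4.

    `points` yields (position, ref_coverage, sample_coverages) tuples,
    already sorted by position.
    """
    stream = bytearray()
    total_points = 0
    max_coverage = 0
    for position, ref_coverage, sample_coverages in points:
        stream += encode_block(position, ref_coverage, sample_coverages)
        total_points += 1
        max_coverage = max([max_coverage, *sample_coverages])
    metadata = {
        "total_points": total_points,
        "max_coverage": max_coverage,
        "n_bytes": len(stream),
    }
    return bytes(stream), metadata
```

With two samples, each block occupies 6 + 2 * 2 = 10 bytes, so a flow of two points is 20 bytes long.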
[0837]
[0838] Finally, the exploration subsystem (103) is responsible for accessing the data container subsystem (102), retrieving the data that must be displayed through the user interface at any given time. Said user interface can be, for example, a web browser to which the results of a detection process are provided in the form of graphics. The user interface preferably has four areas when candidate cnvs are reported: header, table of candidate cnvs and controls, cnv map area and detail area. The detail area is preferably composed of three sections: response signals, annotations and ratios.
[0839]
[0840] The header shows information related to the study and the sample, such as, for example, an identifier of the sample and of the experiment indicating the sequencing plate; the index associated with the sample; the sex of the sample assigned by the detector; and data linked to the precision of the result, such as the degree of correlation with the control set, the fragment size, general variation ratios, etc.
[0841]
[0842] The candidate cnv table shows the list of candidate cnvs associated with the result that is being explored. Each column holds the values of an attribute or of a related set of attributes (hereinafter composite values; for example, a measure and its rate of variation). The values in a given row correspond to those associated with one candidate cnv. The attributes (or sets of attributes, hereinafter composite attributes) available for display are determined by the information contained in the data container (102), which depends on the configuration established for the detection system (101).
[0843]
[0844] [...] In a non-exclusive way, the table can display the attributes "confidence", "locus", "region", "size", "deviation", "ratio", "alignments" and "filters". "Confidence" is a value indicating the degree of credibility that the candidate is a real variant and not an artifact. "Locus" is an attribute composed of the detectable chromosomal start position of the candidate cnv and the detectable chromosomal end position of the candidate cnv. "Region" is an attribute composed of the name of the affected gene and the affected region or regions (indicated according to the labels established for the region of interest in the configuration of the detection and annotation subsystem). [...] "Size" is an attribute composed of the detectable size spanned by the variant in base pairs (from the initial position indicated in locus to the final one) and the size of the coding region covered (counting only the coding positions within said interval).
"Deviation" indicates the degree of divergence of the values of the response signal in the area covered by the candidate cnv for the study sample with respect to the reference value calculated from the control samples, and is composed of a reference value for the interval and another value indicating the variation along it. "Ratio" indicates the proportion of loss or gain of signal resulting from the copy number of the study sample at the locus assigned to the candidate cnv, with respect to the reference values established by the control set, and is composed of a reference value for the whole affected interval and another for the variation along it. "Alignments" is a composite attribute that reports the maximum number of breakpoints found in the chromosomal regions close to the ends of the interval affected by the candidate cnv, and their chromosomal location [...]. "Filters" indicates the status of [...].
[0845]
[0846] For the selected candidate, the response signal (for example, the coverage) is shown for the study sample. The signal covers the regions affected by the candidate and the two flanking regions, provided such data are available (the first region has no preceding region, nor does the last one have a following region). The signal is displayed in a two-dimensional graph, with the signal value on the Y axis and the chromosomal position on the X axis. The signals show net values (reference values). The X axis is presented discontinuously, omitting the ranges between regions of interest. In addition, the normalized response signal of the rest of the samples is shown together with that of the study sample, including the signal associated with each control sample. The normalized signals of the other samples are rescaled to the scale of the study sample. The signal of the study sample is displayed in detail. The value of the model for the regions represented is also adjusted to the scale of the study sample.
[0847]
[0848] [...]
[0849]
[0850] The sections of the detail area present the signals following the same procedure (including the reference), but restricting the X-axis domain to the selected interval. If no selection has been made, the interval is shown in its entirety. For each of the samples, the detected genetic variants are displayed (when this information is available in the data container). For each variant, its location and its frequency are reflected graphically. The identified variants are [...]. Each variant is drawn at its chromosomal coordinate, represented on the X axis (abscissa), with a Y value (ordinate) equal to that of the signal at that position for the given sample [...].
[0851] A textual description of the elements in the range represented in the detail area is also shown. This information may be provided by the data container (102), or may be loaded from an external source.
[0852]
[0853] In the [...] section [...]. In this section [...].
[0854]
[0855] [...]
[0856]
[0857] In addition, [...]:
[0858]
[0859] - Toggle button to hide [...].
[0860]
[0861] - Toggle button to hide [...].
[0862]
[0863] - Toggle button to [...].
[0864]
[0865] - Selector [...].
[0866]
[0867] The exploration subsystem (103) can be accessed as a result file or integrated in a server. Whichever element acts as data provider for the exploration subsystem (103), the encoding described for the data container subsystem (102) supports random access to the metadata as well as to the data associated with any specific chromosomal position. That is, it is possible to go directly to the required bytes without reading the file in full.
[0868]
[0869] When the exploration subsystem (103) loads a result and there are candidates, the associated metadata are consulted. First, the header (1812) is read by accessing its location directly. Where that location is not known beforehand to the exploration subsystem (103), the location of the header (1812) is encoded in the tail (1814). For a total file size Ttotal, if the size of the header is Tcabecera and the size of the coded data is Tdatos, the data provider accesses and returns the bytes in the range [Ttotal - Tdatos, Ttotal - Tdatos + Tcabecera]. Once the metadata have been obtained, the information to be plotted in the map and detail areas is requested: values associated with a number of points on the X axis that depends on the graphic resolution configured for the exploration subsystem. The information corresponding to each point is encoded in a block of the data flow corresponding to the candidate cnv selected in the candidate table. As the number of points to request is constant for each candidate cnv, and the size of the information relative to each point is also constant, access is random and the time required to plot the information does not depend on the size of the data or on the specific points selected.
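The header range given above can be illustrated with a short Python sketch. The function names are hypothetical, the end of the range is treated as exclusive (an assumption, since the text writes it as a bracketed interval), and an in-memory buffer stands in for the result file.

```python
import io


def header_byte_range(t_total, t_datos, t_cabecera):
    """Return (start, end) of the header bytes, with `end` exclusive.

    Follows [Ttotal - Tdatos, Ttotal - Tdatos + Tcabecera].
    """
    start = t_total - t_datos
    return start, start + t_cabecera


def read_range(fileobj, start, end):
    """Read only the bytes in [start, end) without loading the whole file."""
    fileobj.seek(start)
    return fileobj.read(end - start)


# Usage: a 10-byte "file" whose last 6 bytes are coded data
# that begin with a 3-byte header.
demo = io.BytesIO(bytes(range(10)))
header_bytes = read_range(demo, *header_byte_range(t_total=10, t_datos=6, t_cabecera=3))
```

Because only `seek` and `read` are used, the cost of fetching the header does not depend on the total size of the file.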
[0870]
[0871] Initially, the points plotted in the detail area correspond to those plotted in the map area, so it is not necessary to make an additional request to obtain them. Once the information is obtained, the exploration subsystem (103) plots the corresponding information in the detail and map areas (either sequentially, in any order, or in parallel). For a candidate cnv, the list P of points to request in order to visualize a graphic interval [pini, pfin], where pini and pfin are the blocks corresponding to two positions in the data flow, is calculated according to the following procedure:
[0872]
[0873] 1. Include pini in P
[0874]
[0875] 2. Calculate the point sampling interval as int = (pfin - pini + 1) / Cfg.res, where Cfg.res is the configuration value of the number of points to be plotted, or resolution.
[0876]
[0877] 3. Include the points pini + i * int, for i = 0 .. (Cfg.res - 1).
[0878]
[0879] 4. Include pfin if it has not been previously included.
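The four steps above can be sketched in Python as follows. The function name `points_to_request` is hypothetical, and two details are assumptions not stated in the text: sampled offsets are truncated to whole block indices, and repeated indices are included only once.

```python
def points_to_request(p_ini, p_fin, resolution):
    """Build the list P of block indices to request for [p_ini, p_fin]."""
    if p_fin < p_ini:
        raise ValueError("p_fin must not be smaller than p_ini")
    if resolution < 1:
        raise ValueError("resolution must be at least 1")

    # Step 2: sampling interval between consecutive requested points.
    interval = (p_fin - p_ini + 1) / resolution

    # Step 1: the initial point is always included.
    points = [p_ini]

    # Step 3: sample `resolution` points, skipping repeated indices.
    for i in range(resolution):
        candidate = p_ini + int(i * interval)
        if candidate != points[-1]:
            points.append(candidate)

    # Step 4: the final point is included if it is not there yet.
    if points[-1] != p_fin:
        points.append(p_fin)
    return points
```

For an interval of 100 blocks and a resolution of 10, this yields the points 1, 11, 21, ..., 91 and the final point 100, so the number of requested blocks stays bounded by the resolution regardless of the size of the interval.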
[0880]
[0881] Once requested by the exploration subsystem (103) the points included in P = {p1 .. pn}, the data provider will access the corresponding bytes according to the procedure:
[0882]
[0883] 1. Read or calculate the size in bytes of the data block associated with a point: block = 6 + (number of samples * 2), where the number of samples is available in the header of the data block. Depending on the deployment architecture, it will have been processed during metadata loading or it will be a known value.
[0884] 2. Add to the requested byte buffer the bytes (or the values that they encode) in the corresponding byte ranges of the data flow: [i * block, (i + 1) * block] for i = 0 ... (|P| - 1).
[0885]
[0886] 3. Return the bytes or the values that they encode included in the buffer to the exploration subsystem (103) according to the chosen deployment.
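As a hedged illustration of this access procedure, the following Python sketch computes the block size and the byte ranges and slices them out of a data flow held in memory. The function names are hypothetical, and it is assumed that each requested point is identified by a zero-based block index and that the end of each range is exclusive.

```python
def block_size(n_samples):
    """Size in bytes of one block: 6 fixed bytes plus 2 bytes per sample."""
    return 6 + 2 * n_samples


def block_byte_ranges(point_indices, n_samples):
    """Return the (start, end) byte range of each requested block."""
    size = block_size(n_samples)
    return [(i * size, (i + 1) * size) for i in point_indices]


def fetch_blocks(data_flow, point_indices, n_samples):
    """Slice the requested blocks out of the data flow."""
    ranges = block_byte_ranges(point_indices, n_samples)
    return [data_flow[start:end] for start, end in ranges]
```

Since every block has the same size, the offset of any block is a single multiplication, which is what makes the access time independent of the position requested.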
[0887]
[0888] For the request of map points, pini and pfin correspond to 1 and the number of points associated with the data flow (available in the metadata already read) respectively, that is, the first and last point or block of the data flow.
[0889]
[0890] When the user selects a new cnv in the table, the exploration subsystem (103) directly requests the map points, since it has already read the metadata during the loading of the result. Once the information associated with the points has been obtained, it proceeds to repaint the map and detail areas.
[0891]
[0892] When the user selects a region for zooming in the map area, the exploration subsystem (103) requests the points to be plotted (calculated according to the procedure already described) for the interval [pzoom_ini, pzoom_fin], where pzoom_ini is the map point (a point represented in the map area, in the X-axis domain) closest to the start of the selected interval and pzoom_fin is the map point closest to the end of the selected interval. Once the information for the requested points has been obtained, the exploration subsystem (103) proceeds to repaint the detail area, showing a mark on the map of the region on which the zoom is being made.
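Snapping the ends of a zoom selection to the closest map points could be done as in the sketch below. The names `closest_point` and `zoom_interval` are hypothetical, the map points are assumed to be given as a sorted list of block indices, and ties are resolved towards the lower point, which the text does not specify.

```python
import bisect


def closest_point(map_points, value):
    """Return the element of the sorted list `map_points` closest to `value`."""
    pos = bisect.bisect_left(map_points, value)
    if pos == 0:
        return map_points[0]
    if pos == len(map_points):
        return map_points[-1]
    before, after = map_points[pos - 1], map_points[pos]
    # Assumption: on a tie, the lower map point is chosen.
    return before if value - before <= after - value else after


def zoom_interval(map_points, selection_start, selection_end):
    """Return (pzoom_ini, pzoom_fin) for a selection made on the map."""
    return (
        closest_point(map_points, selection_start),
        closest_point(map_points, selection_end),
    )
```

The resulting pair can then be used as the interval for which the list of points to plot is calculated.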
[0893]
[0894] In deployments that do not integrate the exploration subsystem (103) and the data in the same file (for example, when the data provider is a database management system), random access to the metadata and the byte flows is managed by said data provider. Likewise, the structure of the byte stream of the data container subsystem (102) allows random access to the required bytes according to the flow points demanded by the exploration subsystem (103). In deployments where the exploration subsystem (103) and the data are integrated in a single file, the data packet is placed as the last element of the file, while the logic of the exploration subsystem (103) is located at the beginning.
[0895] A possible implementation of the deployment in which the exploration subsystem (103) and the data are integrated into a single file consists of the detection subsystem (101) generating a file that implements the exploration subsystem (103) in HTML/JavaScript. So that the result loads quickly, and the data are loaded in the web browser only on demand, all the logic that provides graphics and data is encoded as a function associated with a timeout, and an instruction that aborts the loading and interpretation of the rest of the file is placed before the data block. This solution allows the entire logic of the exploration subsystem (103) and the data container subsystem (102) to remain active in the browser, randomly accessing the required data within the file according to the interaction of the user. This strategy minimizes memory consumption and increases the efficiency of the process, making it possible to work in an agile manner with heavy files containing millions of points. For example, random access to a range of bytes is supported in HTML5 by the slice function.
[0896]
[0897] The representation of discontinuous coordinate ranges in the graphics presented for the detail and zoom areas is native and transparent to the exploration subsystem (103). Once the blocks to be represented have been obtained, the exploration subsystem (103) assigns correlative natural values on the X axis, starting at 1, to each of the blocks according to their relative order in the sequence of the data flow. The domain of the X axis is always [1..n], where n is the number of points requested according to the algorithm described for a selection range. The labels on the X axis are not the values in X but the associated chromosomal coordinates. For an X axis position, the associated chromosomal coordinate is resolved by interpreting the leading bytes of the data block associated with the corresponding point, since they contain the encoded chromosomal coordinate. The values on the Y axis for each signal are encoded in the corresponding bytes of the data block associated with a point.
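A minimal sketch of how the X-axis labels could be resolved from the blocks is shown below. It assumes that the position is stored as the two 2-byte values written in step 3.1 of the generation procedure (quotient and remainder with respect to 10000), followed by the 2-byte reference value and one 2-byte value per sample, all big-endian; the byte order and the names `decode_block` and `x_axis_labels` are assumptions.

```python
import struct

POSITION_BASE = 10000  # base assumed for the quotient/remainder encoding


def decode_block(block):
    """Decode one block into its position, reference and sample values."""
    n_values = len(block) // 2
    quotient, remainder, reference, *samples = struct.unpack(
        f">{n_values}H", block
    )
    return {
        "position": quotient * POSITION_BASE + remainder,
        "reference": reference,
        "samples": samples,
    }


def x_axis_labels(blocks):
    """Map correlative X values (starting at 1) to chromosomal coordinates."""
    return {
        x: decode_block(block)["position"]
        for x, block in enumerate(blocks, start=1)
    }
```

The X values are consecutive even when the decoded coordinates jump from one region of interest to the next, which is how the discontinuous ranges become transparent to the plotting logic.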
[0898]
[0899] To divide the graphs into continuous chromosome sections, dividers are drawn on said sections. Among the metadata associated with each candidate cnv are the locations, in chromosomal coordinates, of different elements included in the chromosomal range of visualization linked to the candidate cnv. As indicated, a mapping between discontinuous ranges of chromosomal coordinates and continuous values on the X axis of the graphic areas is established transparently, based on the order relation of the data blocks linked to the chromosome positions shown. Each chromosomal coordinate is assigned a value of X according to the following procedure:
[0900]
[0901] 1. Take the list of blocks corresponding to the requested points (ordered according to their position in the original data flow) P = {P1 ... Pn}, together with the coordinates to be translated [...].
[0902]
[0903] 2. Establish as the initial chromosomal coordinate of a translation range the one encoded in the block of the first point of the list, P1, and as the final coordinate the one encoded for P2.
[0904]
[0905] 3. If the chromosomal coordinate to be translated lies within the translation range (it is not below the start of the range), return as its X value the one corresponding to the initial point of the range (1 in the first iteration).
[0906]
[0907] 4. If the chromosomal coordinate is not included in the translation range and the point used to establish the end of the range is not the last one of the list P, steps 2 and 3 are repeated, taking as the initial point the final point used in the previous repetition and as the final point the next one in the list.
[0908]
[0909] 5. If the chromosomal coordinate is greater than the chromosomal coordinate encoded for Pn, the value in X is not defined for that coordinate.
[0910] 6. If the chromosomal coordinate is greater than the one encoded in the data block linked to Pn, or lower than the one encoded for P1, it lies outside the X domain and an exception is generated to be processed. Alternatively, the coordinate is assigned to the nearest end of the domain: 1 when it is lower than the coordinate encoded for P1, and n when it is greater than the coordinate encoded for Pn.
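The translation procedure can be sketched in Python as shown below. Instead of walking the ranges one by one as in steps 2 to 4, this sketch uses a binary search over the decoded block coordinates, which gives the same result on a sorted list. The function name, the `clamp` flag and the choice of returning the X value of the point that opens the range are assumptions made for illustration.

```python
import bisect


def coordinate_to_x(block_coords, coordinate, clamp=True):
    """Translate a chromosomal coordinate into an X value in [1..n].

    `block_coords` holds the coordinates decoded from P1..Pn, ascending.
    """
    n = len(block_coords)
    if n == 0:
        raise ValueError("no blocks to translate against")

    # Step 6: coordinates outside the domain are clamped or rejected.
    if coordinate < block_coords[0]:
        if clamp:
            return 1
        raise ValueError("coordinate below the X domain")
    if coordinate > block_coords[-1]:
        if clamp:
            return n
        raise ValueError("coordinate above the X domain")

    # Number of block coordinates <= coordinate, i.e. the 1-based index
    # of the point that opens the translation range containing it.
    return bisect.bisect_right(block_coords, coordinate)
```

For blocks at coordinates 100, 200, 300 and 5000, a variant at coordinate 250 falls in the range opened by the second block and is drawn at X = 2, even though the next block lies thousands of positions further along the chromosome.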
[0911]
[0912] Using the procedure described, the X-axis coordinates corresponding to the chromosomal coordinates of the variants linked to the data flow of each candidate cnv are mapped for plotting. The same procedure is followed to place and draw the location of exons, introns and other regions in the map and annotation areas. For the calculation of the values on the Y axis, the values of the reference signal are used, contained for each data block in bytes 4 and 5.
[0913]
[0914] In view of this description and the figures, the person skilled in the art will understand that the invention has been described according to some preferred embodiments thereof, but that multiple variations can be introduced in said preferred embodiments without departing from the object of the invention as claimed.
Claims (17)
[1]
1. Method for detecting structural genetic variants from sequencing data of a plurality of samples, characterized in that it comprises, for each sample:
- characterizing (230) each sample by determining a gender as a function of, at least, a differential chromosomal coverage of the X chromosome;
- calculating at least one correlation matrix associated with the experimental covariability of the plurality of samples, from the sequencing data of said samples;
- selecting (406) at least one control structure by iterative clustering (404) of the plurality of samples;
- establishing (213) work points as a function of variations with respect to a reference value of the control structures; and
- detecting (240) structural genetic variants at the determined work points.
[2]
2. Method according to claim 1, characterized in that the sequencing data comprise reads from new sequencing technologies.
[3]
3. Method according to any of the preceding claims, characterized in that the step of determining the gender of the sample comprises evaluating a relative measure of coverage between the X chromosome and the autosomal chromosomes.
[4]
4. Method according to any of the preceding claims, characterized in that the step of calculating at least one correlation matrix comprises calculating (403) a first correlation matrix of the samples based on at least one variable selected from the coverage profile, a range of positions, a number of mapped reads and a metric derived from a chromosomal region.
[5]
5. Method according to any of the preceding claims, characterized in that the step of calculating at least one correlation matrix comprises calculating (404) a second correlation matrix of the samples based on at least one variable selected from the size of the sequenced fragments including adapters, the size of the sequenced fragments without adapters, and a separation of reads in paired-end reads.
[6]
6. Method according to any of the preceding claims, characterized in that the clustering applies a k-means algorithm.
[7]
7. Method according to any of the preceding claims, characterized in that it comprises normalizing (703) the data of the plurality of samples according to the determined gender and eliminating enrichment biases.
[8]
8. Method according to any of claims 1 to 7, characterized in that it comprises normalizing (703) the data of the plurality of samples as a function of a value of aligned reads, said value being assigned to a range of positions.
[9]
9. Method according to any of the preceding claims, characterized in that the reference value of the step of establishing (213) the work points is measured on a variable selected from coverage, read count and derived metrics.
[10]
10. Method according to any of the preceding claims, characterized in that the reference value of the step of establishing (213) the work points is calculated by a metric selected from the mean, the median and other measures of central tendency; and the variation is calculated by a metric selected from the interquartile range, the variance, the standard deviation and other measures of variation.
[11]
11. Method according to any of the preceding claims, characterized in that it comprises applying a screening based on a retention of ratios, information of variants, coding areas and panels.
[12]
12. Method according to any of the preceding claims, characterized in that the step of detecting (240) structural genetic variants comprises assigning (1007) a value indicative of a degree of confidence to each detected structural variant, depending on, at least, the deviation with respect to the control structure and the chromosomal region where the structural genetic variants are located.
[13]
13. Method according to any of the preceding claims, characterized in that it comprises grouping multiple detected structural variants into a single variant as a function of, at least, the deviation with respect to the control structure and the chromosomal region where the structural genetic variants are located.
[14]
14. Method according to any of the preceding claims, characterized in that it further comprises storing the results of the detection of structural genetic variants in a data storage means (102) according to a coding comprising:
- a header (1812) with an invariable-size section (1807) comprising data size information and a variable-size section (1808) comprising metadata signaling information;
- a body (1813) comprising blocks of equal size with chromosomal coordinate information, a rescaled reference signal (1803) and rescaled signals (1804, 1805, 1806) associated with each sample; and
- a tail (1814) comprising location information.
[15]
15. Method according to claim 14, characterized in that it further comprises accessing, from an exploration means (103), the results stored in the data storage means (102) through a random access comprising:
- accessing the header (1812);
- obtaining metadata of the results; and
- obtaining information associated with each point to be represented by accessing the body (1813) and retrieving blocks of information of constant size.
[16]
16. System for detecting structural genetic variants from sequencing data of a plurality of samples, comprising detection means (101), data storage means (102) and exploration means (103) of the stored data, characterized in that the detection means are configured to implement the steps of the method according to any of claims 1 to 15.
[17]
17. Computer program comprising computer program code means adapted to perform the steps of the method of any of claims 1 to 15, when said program is executed in a digital signal processor, an application-specific integrated circuit, a microprocessor, a microcontroller or any other form of programmable hardware.
Family patents:
Publication number | Publication date
ES2711163B2 | 2021-04-14
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title
US20090098547A1 | 2002-11-11 | 2009-04-16 | Affymetrix, Inc. | Methods for Identifying DNA Copy Number Changes Using Hidden Markov Model Based Estimations
US8600718B1 | 2006-11-17 | 2013-12-03 | Microsoft Corporation | Computer systems and methods for identifying conserved cellular constituent clusters across datasets
CA2739457A1 | 2008-10-31 | 2010-05-06 | Abbott Laboratories | Genomic classification of non-small cell lung carcinoma based on patterns of gene copy number alterations
WO2016187051A1 | 2015-05-18 | 2016-11-24 | Regeneron Pharmaceuticals, Inc. | Methods and systems for copy number variant detection
Legal status:
2019-04-30 | BA2A | Patent application published | Ref document number: 2711163; Country of ref document: ES; Kind code of ref document: A1; Effective date: 2019-04-30
2019-12-16 | FC2A | Grant refused | Effective date: 2019-12-10
Priority:
Application number | Filing date | Patent title
ES201731242A | ES2711163B2 | 2017-10-23 | 2017-10-23 | System and method for detecting structural genetic variants.