Abstract:
Procedure and system for the classification and detection of the most influential domains within the dark Tor network (1). The method includes: tracking domains and downloading the raw text (3) of a domain (7) by means of a computer (2) connected to the internet; obtaining an HTML file (4) from the raw text (3); preprocessing the raw text (5) to obtain a preprocessed text (6) with the incoming and outgoing hyperlinks of the domain (7); performing an automatic classification (6a) of the preprocessed text (6) by means of feature vectors and a regression-based machine learning algorithm, obtaining a series of domain categories (6b); constructing a graph of activities of interest (8) from the incoming and outgoing hyperlinks; determining the rank value (9) of each domain, obtained by weighting the sum of links of each node; ordering the nodes; and labeling the most influential domains. (Machine-translation by Google Translate, not legally binding)
Publication number: ES2719123A1
Application number: ES201831145
Filing date: 2018-11-26
Publication date: 2019-07-08
Inventors: Wesam Al-Nabki Mhd;Fernández Eduardo Fidalgo;Gutiérrez Enrique Alegre;Robles Laura Fernández
Applicant: Universidad de Leon;
Main IPC class:
Description:

[0001]
[0002]
[0003] MOST INFLUENTIAL DOMAINS INSIDE THE DARK NETWORK TOR
[0004]
[0005] OBJECT OF THE INVENTION
[0006] The object of the present invention is an automated method and system for classifying and detecting the most influential domains within the Tor (The Onion Router) dark network based on the hyperlinks they contain. The invention makes it possible to identify the most influential domain within the network, that is, the domain whose removal would cause the greatest destabilization of the network. Such destabilization would greatly reduce or obstruct the transmission of information across the different domains. The invention also makes it possible to identify the categories of the domains analyzed and the most influential domain in each category.
[0007]
[0008] BACKGROUND OF THE INVENTION
[0009] Currently, information on the Internet is accessed mainly through search engines such as Google or Bing. Although they are efficient and very powerful, they cannot index all the content on the Web. The content that is indexed belongs to the Surface Web, and the rest, which is not indexed, belongs to the Deep Web. Within the Deep Web, one part is formed by several networks and is called the dark network; Tor (The Onion Router) is the best known of these dark networks because of the level of anonymity it provides to its users.
[0010]
[0011] Due to Tor's topology and the fact that domain traffic cannot be recorded, it is not possible to establish a measure of the popularity or influence of its domains within the network. In addition, given the anonymity it provides, the dark network Tor hosts domains with different contents, both legal and illegal. Given the illegal content hosted on Tor, there is a need to classify such content into different types of illegal activities and to identify which domains are the most influential within the network, both globally and for each category. Removing the most influential domains would destabilize the network and hinder the transmission of information across the different domains.
[0012]
[0013] Manual classification of domains has a number of drawbacks. First, because the number of possible domains is high, on the order of tens of thousands, it requires a large investment of time and personnel. In addition, the classification depends largely on the person who performs it, which introduces subjectivity and generates discrepancies between experts. The classification is also not always reliable, since errors frequently result from fatigue and lapses of attention on the part of the expert. Finally, it is an expensive process because of the cost of the time of the person making the classification.
[0014]
[0015] Given these characteristics of the Tor dark network, only the textual content of its domains can be used to measure their influence, since certain visual contents, such as child pornography, are illegal under Spanish regulations. Taking this limitation into account, the relevance of the domains is analyzed using graph theory, where the domains are the nodes and the hyperlinks between them are the links. According to several studies (K. Taha and P. Yoo, “Using the Spanning Tree of a Criminal Network for Identifying Its Leaders”, IEEE Transactions on Information Forensics and Security, vol. 12, no. 2, pp. 445-453, 2017), destabilizing a network requires removing the most influential nodes of the graph in order to achieve a significant reduction in the flow of information through it.
[0016]
[0017] The influence or relevance of the nodes within a graph began to be measured more than 60 years ago. Initially, measures of centrality were used (Bavelas, A. (1948). "A mathematical model for group structures." Human Organization 7: 16-30.), which over time evolved into more complex algorithms, such as Katz (Katz, L. (1953). A New Status Index Derived from Sociometric Analysis. Psychometrika, 39-43.), HITS (Jon M. Kleinberg. 1999. Hubs, authorities, and communities. ACM Comput. Surv. 31, 4es, Article 5, December 1999) or PageRank (US6285999B1, 1997).
[0018]
[0019] Although the previous influence measurement algorithms achieve acceptable results when destabilizing a network, a method and system for the classification and subsequent detection of the most influential domains in the dark Tor network has not been specifically described.
[0020]
[0021] DESCRIPTION OF THE INVENTION
[0022]
[0023] The object of the present invention is an automated method and system for classifying and detecting the most influential domains within the dark network Tor (The Onion Router) based on the hyperlinks that exist between them. Hyperlinks are, therefore, properties of the domains.
[0024]
[0025] The method and system for the classification and detection of the most influential domains within the dark Tor network (The Onion Router) of the present invention make it possible to automatically classify and sort large repositories of domains, obtained by means of a computer with internet access that is connected to the Tor network to collect textual information.
[0026]
[0027] Automatic classification, as an intermediate step before the detection of influential domains, has several advantages over manual classification by an expert: it removes subjectivity, errors due to fatigue and lack of attention, the disparity of criteria among experts and the costs associated with the expert's time, while it decreases the time required for classification and increases the reliability of the labeling. For this reason, this procedure can be implemented in tools used by companies and by the FFCCSSEE (State Security Forces and Bodies) to classify the domains of the dark Tor network into different categories and subsequently detect the most influential domains, both within the Tor network as a whole and in each selected illegal category.
[0028]
[0029] The present invention can also be applied in the training or distance learning of personnel specialized in the different categories considered illegal. The availability of large sets of already classified domains, together with the current possibilities for collecting new domains and sending them to a system remotely, would allow FFCCSSEE or company personnel to improve their knowledge of the illegal categories present in the dark Tor network and their ability to distinguish them from the legal categories present in that network.
[0030]
[0031] In a stage prior to the process of the invention, the system allows the automatic classification of domains by encoding the text with Term Frequency - Inverse Document Frequency (TF-IDF), a metric that indicates the relevance of a word in a document, and then classifying it with Logistic Regression (LR). Classified domains belong to the following categories: (i) pornography, (ii) cryptocurrency, (iii) credit card smuggling, (iv) sale of illegal drugs, (v) violent activities, (vi) cyber attacks (hacking), (vii) counterfeit currency, (viii) contraband of personal identification and (ix) others.
[0032]
[0033] Next, the process of the invention allows the influence of illegal domains to be measured both (i) globally and (ii) within each of the above categories. For this, a Graph of Activities of Interest is constructed, where each domain is represented by a node and the links of the graph come from the incoming and outgoing hyperlinks contained in those domains. The links therefore belong to the nodes: a node has as many incoming links as there are hyperlinks pointing to it and as many outgoing links as it has hyperlinks pointing to other nodes. In this process, duplicate links, that is, those that have the same origin and destination, are removed to avoid the creation of multigraphs, as are links to the Surface Web. Next, the algorithm is applied to identify the most influential domains within the entire network, and also within each category, whose removal would affect the flow of information within the network.
[0034]
[0035] In a preferred embodiment of the invention, this procedure applies to illegal domains of the Tor network, both globally and by the different categories, although it can be applied without categorizing illegal activities. It can also be extended to legal domains of the same network, to its application on the Surface Web, to contact networks and, in general, to any type of network where there are incoming and outgoing links between its different elements.
[0036]
[0037] In the present description, the text is preprocessed before it is encoded for classification. After a scan of all Tor domains, the resources of those that are active are downloaded and their HTML file is extracted. Next, the domains that are in English are selected, and HTML tags, special characters and stop words are removed. In the present invention, the term "raw text" refers to the content of the HTML of the domain to which no preprocessing has been applied. The term "text", in turn, is generally used to refer to the text that results from preprocessing the "raw text" as described above.
[0038]
[0039] The preferable procedure for the classification and subsequent detection of the most influential illegal domains within the dark Tor network of the present invention comprises the following steps:
[0040] 1. Domain tracking and raw text download. Starting from a public list of domains of the Tor network, these domains are tracked and, for each domain that is active at the time of the scan, its HTML file, which contains the raw text, is downloaded. This domain tracking and downloading is done through a computer with an internet connection and access to the dark Tor network.
[0041]
[0042] 2. Preprocessing of raw text: on the same computer, the raw text contained in each HTML file obtained is preprocessed to obtain the text. Next, and according to a preferred embodiment of the invention, the domains that are in English are selected by means of a language detection library, in order to improve the subsequent training of the classification system, since English is the most widely used language in the Tor network. In a preferred embodiment, the HTML tags, special characters and stop words are removed, resulting in the final text.
[0043] 3. Classification of the text: in accordance with a preferred embodiment of the invention, an automatic text classification process is carried out in order to identify the most influential domains of the dark Tor network within each of the categories of illegal activities it contains. In a preferred embodiment, the text is encoded with Term Frequency - Inverse Document Frequency (TF-IDF) and the domains are classified with Logistic Regression (LR) into the following categories: (i) pornography, (ii) cryptocurrency, (iii) credit card smuggling, (iv) sale of illegal drugs, (v) violent activities, (vi) cyber attacks (hacking), (vii) counterfeit currency, (viii) personal identification contraband and (ix) others. According to a preferred embodiment of the invention, a minimum vector length of three and a maximum of 10,000 elements is used for TF-IDF, and the balancing of weights between classes is activated for the classification with Logistic Regression.
[0044]
[0045] 4. Construction of the Graph of Activities of Interest: once the text is ready, the Graph of Activities of Interest is built for all the domains of the Tor network. In a preferred embodiment of the invention, a Graph of Activities of Interest is also constructed for the domains classified in each of the nine categories indicated above. In said preferred embodiment, each domain is associated with a node of the graph and the links between nodes are established from the incoming and outgoing hyperlinks of each domain. In the preferred embodiment, duplicate hyperlinks, that is, those having the same origin and destination, are removed.
[0046]
[0047] 5. Calculation of the most influential domains with the influence algorithm. Finally, the most influential domains of the dark Tor network are calculated. According to a preferred embodiment of the invention, the most influential domains are also calculated for each of the illegal categories of the dark Tor network for which a Graph of Activities of Interest was generated. The calculation is carried out in two phases. In the first phase, the influence algorithm, that is, the domain influence measurement algorithm, is applied to calculate the rank value of the domains extracted from the Tor network. The rank value of a domain is obtained as the weighted combination of the sum of the number of hyperlinks of its follower domains and of the domains followed by the analyzed domain. According to several studies, a network is destabilized by removing the nodes with the highest ranking, which obstructs the flow of information through the graph. According to a preferred embodiment of the invention, influence within the dark Tor network is interpreted as the amount of obstruction that removing a node causes to the Graph of Activities of Interest. In the second phase, the domains are sorted in descending order of rank value, so that the first domains are those with the highest value and are therefore considered the most influential.
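The second phase of step 5 (descending ordering, globally and per category) can be illustrated with a minimal Python sketch. The rank values, domain names and category labels below are hypothetical placeholders, and the function name `order_domains` is not taken from the source; how the rank values themselves are computed is not shown here.

```python
from collections import defaultdict

def order_domains(rank_values, categories):
    """Return the global ordering and one ordering per category, highest rank first."""
    # Ties are broken alphabetically so that the result is deterministic.
    ordered = sorted(rank_values, key=lambda d: (-rank_values[d], d))
    per_category = defaultdict(list)
    for domain in ordered:
        # Domains without a label fall into "others" (an assumption of this sketch).
        per_category[categories.get(domain, "others")].append(domain)
    return ordered, dict(per_category)

# Hypothetical rank values and category labels, for illustration only.
rank_values = {"a.onion": 2.5, "b.onion": 7.1, "c.onion": 4.0}
categories = {"a.onion": "cryptocurrency", "b.onion": "drugs", "c.onion": "cryptocurrency"}
global_order, by_category = order_domains(rank_values, categories)
```

Because the per-category lists are filled while walking the global ordering, each category list is itself sorted by descending rank value.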
[0048]
[0049] A second aspect of the present invention relates to a system for the classification and subsequent detection of the most influential illegal domains within the dark Tor network from domains retrieved from that network. The system comprises data processing means, such as a computer with an internet connection, configured to: track and download the raw text or HTML of domains of the Tor network; preprocess the raw text to obtain text prepared for analysis; optionally perform an automatic classification of these domains by Term Frequency - Inverse Document Frequency and Logistic Regression; generate a Graph of Activities of Interest, the nodes being the domains and the links being the hyperlinks between the different domains; and apply the influence algorithm to obtain the most influential domains within the Tor network, these being the ones that obtain the highest value from the algorithm.
[0050]
[0051] In a preferred embodiment of the invention, the system comprises a computer connected to the internet and configured to access the dark Tor network. The system can also comprise data storage media that store the raw HTML or text files, the files containing the preprocessed text, the categories of the domains, the Graphs of Activities of Interest and the list of domains of the Tor network sorted by rank value.
[0052]
[0053] Finally, the present invention also relates to a program product comprising program instruction means for carrying out the procedure described above when the program is executed in a processor. The program product is preferably stored in a program support medium. The program instruction means may take the form of source code, object code, or code intermediate between source and object code, for example in partially compiled form, or any other form suitable for use in implementing the processes according to the invention.
[0054]
[0055] The program support medium can be any entity or device capable of supporting the program. For example, the medium could include a storage medium, such as a ROM, a CD ROM or a semiconductor ROM, a flash memory, a magnetic recording medium, for example a hard disk, or a solid-state drive (SSD). In addition, the program instruction means stored in the program support can be carried, for example, by an electrical or optical signal that could be conveyed via electrical or optical cable, by radio or by any other means.
[0056]
[0057] When the program product is incorporated into a signal that can be directly transported by a cable or other device or medium, the program support may be constituted by said cable or other device or means.
[0058]
[0059] As a variant, the program support can be an integrated circuit in which the program product is included, the integrated circuit being adapted to execute, or to be used in the execution of the corresponding processes.
[0060]
[0061] BRIEF DESCRIPTION OF THE DRAWINGS
[0062]
[0063] Next, a series of figures that help to better understand the invention and that expressly relate to an embodiment of said invention that is presented as a non-limiting example thereof are described very briefly.
[0064] Fig. 1 shows a simplified scheme of a system capable of carrying out the process of the invention.
[0065]
[0066] Fig. 2 shows an example of the HTML content or raw text of a domain in the dark Tor network.
[0067]
[0068] Fig. 3 shows an example of the resulting text after preprocessing the domain of the dark Tor network presented in Fig. 2.
[0069]
[0070] Fig. 4 shows the Interest Activity Graph for all illegal domains of the Tor network.
[0071]
[0072] Fig. 5 shows the Graph of Activities of Interest for the domains belonging to one of the previously mentioned categories, including “smuggling of credit cards” and “drug sales” of the Tor network.
[0073]
[0074] Fig. 6 shows the output of the data file that would contain the domains of the dark Tor network ordered by rank value, indicating their influence within said dark network.
[0075]
[0076] PREFERRED EMBODIMENT OF THE INVENTION
[0077]
[0078] An example of a method according to the invention is described below, with reference to the attached figures. Figure 1 shows a simplified scheme of the system for tracking, classifying and detecting the most influential domains. All of this could be implemented on a computer 2 (for example, any desktop or portable computer with one core, 512 MB of RAM and 8 GB of hard disk). Computer 2 connects to the internet and is configured to access the dark Tor network 1. Next, domain tracking 3 is performed and the raw text 3 of the domains that are active is downloaded, obtaining an HTML text file 4. This file undergoes preprocessing of the raw text 5 to obtain the final text 6 used in the subsequent steps. In the example of the method according to the invention, an automatic classification of the text 6a is also carried out, resulting in a series of labels that correspond to the categories of the domains analyzed 6b. From the preprocessed text 6, the incoming and outgoing hyperlinks of each domain are extracted and a Graph of Activities of Interest 8 is built for the entire dark Tor network. Additionally, in this example, a Graph of Activities of Interest is constructed for each of the categories resulting from the automatic classification. Finally, the influence algorithm 9 is applied, resulting in a data file 10 in which the Tor domains appear ordered by their rank value. The domains in the first positions are considered the most influential in the network. Additionally, in this example, a data file is generated for each of the categories resulting from the automatic classification, in which the domains belonging to the same category appear sorted by their rank value. Each step of the process of the invention is described next.
[0079]
[0080] The connection of the computer 2 to the internet can be made through a wireless connection or through an Ethernet network cable. The connection of the computer 2 to the dark network Tor 1 comprises a process of installing special software that allows connecting to said dark network, such as the installation of the Tor browser, "Tor Browser". The purpose of this connection and configuration is to obtain the raw text necessary to perform the automatic classification and subsequent calculation of the most influential domains.
[0081]
[0082] Next, the domains are tracked and the raw text is downloaded. First, a public list of domains of the dark Tor network is obtained, for example from the Surface Web. Given the life cycle of the domains of the Tor network, no link is provided in this document. This list of domains is read by the tracking program and the raw text 3 of those domains that are active is downloaded, thus obtaining one HTML file per active domain analyzed.
[0083]
[0084] Figure 2 shows an example of the HTML content or raw text of a domain in the dark Tor network. As can be seen, it contains many textual elements that do not belong to natural language, such as the tags of the HTML markup language, which must be removed before continuing with the procedure in order to achieve greater classification accuracy.
[0085]
[0086] In the next stage, the raw text contained in the HTML files retrieved from the dark Tor network is preprocessed. First, the HTML tags are removed and, in the case of tags that reference images, the extension is removed and the name of the image is kept.
[0087]
[0088]
[0089] Next, the domains whose language is English are selected, since English is the dominant language of the Tor network, although the procedure could be carried out with other languages. In this preferred embodiment of the invention, this selection is made with the Langdetect library (https://pypi.python.org/pypi/langdetect). Then, special characters and stop words are removed using the SMART stop word list (http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/). Because of the scope of the work, that is, the dark Tor network, this list is modified and 100 new words are added to improve compatibility. Finally, all emails, web addresses and currencies are unified into a single textual resource. Figure 3 shows an example of the text that results from preprocessing the domain of the dark Tor network presented in Figure 2.
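The preprocessing stage described above can be sketched in Python using only the standard library. This is an illustrative sketch under several assumptions, not the implementation of the invention: the stop-word set is a tiny stand-in for the modified SMART list, the regular expressions and the placeholder tokens (`emailtoken`, `urltoken`, `currencytoken`) are hypothetical choices for the "unification" step, and language selection is omitted.

```python
import re
from html.parser import HTMLParser

# Tiny stand-in for the modified SMART stop-word list (assumption).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "for", "is", "at", "us"}

# Hypothetical patterns for the elements that are unified into one token each.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"(?:https?://|www\.)\S+")
CURRENCY = re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*")

class TextExtractor(HTMLParser):
    """Drops HTML tags; for <img> tags keeps the image name without its extension."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            name = src.rsplit("/", 1)[-1].rsplit(".", 1)[0]
            if name:
                self.parts.append(name)

    def handle_data(self, data):
        self.parts.append(data)

def preprocess(html):
    """Turn raw HTML into a cleaned, space-separated string of tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    text = EMAIL.sub(" emailtoken ", text)
    text = URL.sub(" urltoken ", text)
    text = CURRENCY.sub(" currencytoken ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

sample = ('<html><body><h1>Buy the best cards!</h1><img src="/img/visa_card.png">'
          '<p>Contact us at seller@mail.onion or http://abcd.onion/shop for $25.50</p>'
          '</body></html>')
cleaned = preprocess(sample)
```

Emails are replaced before web addresses so that the address pattern never sees the part of an email after the `@` sign; the order of the other steps follows the description.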
[0090]
[0091] After the raw text has been preprocessed, the domains are classified automatically, so that the most relevant domains can be calculated within each category and not only at the level of the entire dark Tor network. The processed text is preferably encoded with TF-IDF (Akiko Aizawa. 2003. An information-theoretic perspective of tf-idf measures. Information Processing & Management, 39 (1): 45-65.), using a minimum vector length of three and a maximum of 10,000 elements. The system is then trained with LR (David W. Hosmer Jr. and Stanley Lemeshow. 2004. Applied logistic regression. John Wiley & Sons), activating the balancing of weights between classes. The resulting categories are: (i) pornography, (ii) cryptocurrency, (iii) credit card smuggling, (iv) sale of illegal drugs, (v) violent activities, (vi) cyber attacks (hacking), (vii) counterfeit currency, (viii) personal identification smuggling and (ix) others.
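To make the TF-IDF encoding concrete, the following Python sketch computes one common textbook variant, tf × log(N / df), on a toy corpus. The exact variant, the vocabulary limits and the smoothing used in the invention are not stated in the text, so the formula and the sample documents are assumptions; the Logistic Regression classifier is not shown.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Toy corpus of already preprocessed texts (hypothetical).
docs = ["buy cards cards", "buy bitcoin", "bitcoin wallet"]
vectors = tfidf(docs)
```

In this toy corpus the term `cards` appears only in the first document and twice within it, so it receives a higher weight there than `buy`, which is shared with another document. This is the sense in which TF-IDF indicates the relevance of a word in a document.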
[0092]
[0093] Once the raw text has been preprocessed and the domains classified, the Graphs of Activities of Interest are constructed. Initially, the incoming and outgoing hyperlinks belonging to the HTTP and HTTPS protocols are extracted for each domain. During this process, duplicate hyperlinks (and therefore duplicate graph links), that is, those that have the same origin and destination, are removed to avoid the creation of multigraphs, as are the hyperlinks that point to the Surface Web. Next, the Graph of Activities of Interest is constructed, where the nodes correspond to domains and the links correspond to the incoming and outgoing hyperlinks contained in those domains. A link between two nodes A and B is generated as long as domain A refers to domain B, or vice versa, at least once.
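The graph construction just described can be sketched as follows in Python. The input format (a mapping from each domain to the hyperlink URLs found in its HTML), the use of the `.onion` suffix to recognize Tor domains and the removal of links from a domain to itself are assumptions of this sketch; the text does not specify them.

```python
from urllib.parse import urlparse

def build_graph(outgoing):
    """Build (nodes, edges) from a mapping: domain -> list of hyperlink URLs."""
    nodes = set(outgoing)
    edges = set()  # a set removes duplicate (origin, destination) pairs
    for source, urls in outgoing.items():
        for url in urls:
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https"):
                continue  # only HTTP and HTTPS hyperlinks are kept
            target = (parsed.hostname or "").lower()
            if not target.endswith(".onion"):
                continue  # hyperlink to the Surface Web
            if target == source:
                continue  # assumption: internal links of a domain are ignored
            nodes.add(target)
            edges.add((source, target))
    return nodes, edges

# Hypothetical hyperlinks extracted from two domains.
outgoing = {
    "aaa.onion": ["http://bbb.onion/", "http://bbb.onion/market",
                  "https://example.com/", "mailto:x@y.z"],
    "bbb.onion": ["http://aaa.onion/", "http://ccc.onion/page"],
}
nodes, edges = build_graph(outgoing)
```

Storing links as ordered pairs keeps the direction of each hyperlink, so the incoming links of a domain are simply the pairs in which it appears as destination.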
[0094] Figure 4 shows an overview of the Interest Activity Graph for all domains considered in the Tor network.
[0095]
[0096] Figure 5 shows a more detailed view of the Interest Activities Graph, where you can see how the nodes that represent each domain are categorized and the multiple links between them.
[0097]
[0098] Finally, the influence of the domains is calculated both for all the analyzed domains of the dark Tor network and for the domains within the following categories: (i) pornography 11, (ii) cryptocurrency 12, (iii) smuggling of credit cards 13, (iv) sale of illegal drugs 14, (v) violent activities 15, (vi) cyber attacks 16 (hacking), (vii) counterfeit currency 17, (viii) personal identification contraband 18. The category "others" is not included because it encompasses activities of multiple types and a list of domains is already calculated for the entire network, so it is considered not to contribute a relevant list of domains.
[0099]
[0100] This measure of influence is based on calculating the rank value associated with each domain and then sorting the domains in descending order of rank value, so that the first domains are those with the highest value and are therefore considered the most influential. The influence algorithm identifies the most central node of a graph by measuring the number of nodes to which traffic can propagate and the number of nodes from which it receives traffic. The calculation of the rank value consists of two phases, an initialization phase and a weight update phase. Given a Graph of Activities of Interest that contains N nodes and E links, the algorithm is initialized by assigning an initial weight Wn to each node n, using the following formula:
[0101] $W_n = D_i \;[\ldots]\; D_o \qquad (1)$
[0102]
[0103] where $D_i$ is the in-degree value and $D_o$ is the out-degree value of the node. The value $D_i$ is related to the number of incoming links of a node and the value $D_o$ represents the number of outgoing links of that node.
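As an illustration of the two quantities used in the initialization, the Python sketch below counts incoming and outgoing links per node from a set of directed links. It assumes that the in-degree and out-degree are plain link counts, and the sample links are hypothetical; how the two values are combined into the initial weight is not reproduced here.

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) counters for directed (origin, destination) links."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree

# Hypothetical links between three domains.
edges = {("aaa.onion", "bbb.onion"), ("bbb.onion", "aaa.onion"), ("bbb.onion", "ccc.onion")}
in_degree, out_degree = degrees(edges)
```

A `Counter` returns 0 for nodes that never appear as origin or destination, so nodes without outgoing links need no special handling.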
[0104]
[0105] Next, each node n is assigned the cumulative weight of its followers, which are the nodes that point to node n, and the cumulative weight of the nodes that node n follows. The rank value $R_n$ is assigned to node n taking into account the initial weight $W_n$ from formula (1), according to the following formula:
[0106] $R_n = W_n \;[\ldots]\; \log(\alpha W_F + \beta W_f) \qquad (2)$
[0107]
[0108] where $W_F$ is the accumulated weight of the followers and $W_f$ is the accumulated weight of the nodes that n follows. The parameters $\alpha$ and $\beta$ set the contribution of the weights of the followers and of the followed nodes, respectively.
[0109]
[0110] The influence algorithm makes it possible to identify the most influential domain within the network, that is, the domain whose removal would cause the greatest destabilization of the network. Such destabilization would greatly reduce or obstruct the transmission of information across the different domains. To measure how the transmission of information within a graph is affected after removing a node, a density measure is used according to the following formula:
[0111]
[0112] $D_g = \frac{E}{N(N-1)} \qquad (3)$
[0113]
[0114] where E represents the number of links and N the number of nodes in the graph. Therefore, a good ordering of domains by influence is one that reaches the lowest possible density after removing the smallest possible number of nodes, taken from among those that obtained a high rank or score.
[0115]
[0116] Once the rank value has been obtained for all nodes, an evaluation procedure is carried out in which the nodes with the highest rank value are removed from the graph one by one and the density is recalculated after each removal. This process continues until the graph is completely disconnected, that is, until its density is 0. After experimenting with different values, the values 1.0 and 0.2 are assigned, respectively, to the two parameters of formula (2) as the final values for the influence algorithm, since they yield the lowest area under the curve (AUC, “Area Under the Curve”), which corresponds to reaching the lowest possible density while removing the fewest top-ranked nodes.
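The evaluation procedure can be sketched in Python as follows. The sketch assumes the directed-graph density E / (N (N - 1)), a trapezoidal area with unit spacing between removals, and a small hypothetical graph; the function names are illustrative and the rankings are supplied by hand rather than computed by the influence algorithm.

```python
def density(n_nodes, n_edges):
    """Directed-graph density E / (N * (N - 1)); 0 for fewer than two nodes."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))

def removal_curve(nodes, edges, ranking):
    """Remove nodes in ranking order, recording the density after each removal."""
    remaining_nodes = set(nodes)
    remaining_edges = set(edges)
    curve = [density(len(remaining_nodes), len(remaining_edges))]
    for node in ranking:
        if not remaining_edges:
            break  # the graph is completely disconnected
        remaining_nodes.discard(node)
        remaining_edges = {(s, t) for s, t in remaining_edges if node not in (s, t)}
        curve.append(density(len(remaining_nodes), len(remaining_edges)))
    return curve

def area_under_curve(curve):
    """Trapezoidal area under the density curve, one unit per removed node."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))

# Hypothetical graph in which node "b" connects all the others.
nodes = {"a", "b", "c", "d"}
edges = {("a", "b"), ("c", "b"), ("b", "d")}
good = area_under_curve(removal_curve(nodes, edges, ["b", "a", "c", "d"]))
poor = area_under_curve(removal_curve(nodes, edges, ["a", "c", "d", "b"]))
```

An ordering that places the hub `b` first disconnects the graph after a single removal and therefore yields a smaller area than an ordering that leaves `b` for last, which is the criterion used above to compare parameter values.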
[0117]
[0118] Figure 6 shows an example of the output of the data file that would contain the domains of the dark Tor network arranged according to the influence algorithm, indicating the influence within said dark network.
[0119]
[0120]
Claims:
Claims (12)
[1]
1. Procedure for the classification and detection of the most influential domains within the dark Tor network (1), characterized in that it comprises the following steps:
- track a plurality of domains within the dark Tor network (1) and download raw text (3) of at least one domain (7), through a computer (2) connected to the internet and configured to access the dark Tor network (1);
- obtain an HTML file (4) from the raw text (3);
- preprocess the raw text (5) to obtain a preprocessed text (6) and a plurality of incoming and outgoing hyperlinks extracted from the at least one domain (7);
- perform an automatic classification (6a) of the preprocessed text (6) through a coding of the text using feature vectors and a machine learning algorithm based on regression, and obtain a series of categories of domains (6b);
- construct a graph of activities of interest (8) from the incoming and outgoing hyperlinks extracted from the at least one domain (7);
- determine the rank value (9) for each domain, which is obtained as the weighted combination of the sum of the number of links of each node, the links being obtained from the hyperlinks to the follower domains of each domain (7) and to the domains followed by it;
- perform an ordering of the nodes, and therefore of the corresponding domains, according to their rank value;
- label the domains with the highest rank value as the most influential within the dark Tor network.
[2]
2. The method according to claim 1, wherein the preprocessing of the raw text (5) of the HTML files (4) comprises:
- remove the HTML tags;
- in the case of tags that reference images, remove the extension and keep the name of the image;
- select domains in English with a language detection library;
- remove special characters and stop words by means of a stop word list;
- edit the list and add a plurality of new stop words to improve compatibility with the domain of the dark Tor network;
- unify all emails, web addresses and currencies in a single textual resource.
[3]
3. The method according to claim 1, wherein the step of performing an automatic classification of the domains (6a) comprises:
- encode the processed text with the Term Frequency - Inverse Document Frequency technique using a feature vector length of between three and 10,000 elements;
- train a machine learning algorithm with a labeled training set using Logistic Regression, activating the balancing of weights between classes;
- classify, using the trained machine learning algorithm, the domains of the dark network Tor (1) into at least one of the nine defined classes:
• pornography,
• cryptocurrency,
• credit card smuggling,
• sale of illegal drugs,
• violent activities,
• cyber attacks,
• counterfeit currency,
• personal identification smuggling, and
• others.
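The encoding step and the class weight balancing can be sketched in plain Python. The sketch makes several assumptions: the smoothed IDF variant, the rule for truncating the vocabulary to `max_features` terms, and the weighting formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn applies for `class_weight="balanced"`. The Logistic Regression training itself is left out; the weights would be passed to whichever implementation is used.

```python
import math
from collections import Counter


def tfidf_vectors(documents, max_features=10000):
    """Encode tokenised documents as TF-IDF vectors of bounded length."""
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Keep the terms that occur in the most documents (assumed truncation rule).
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    vocabulary = [term for term, _ in ranked[:max_features]]
    n_docs = len(documents)
    # Smoothed inverse document frequency (one common variant).
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        for term in vocabulary
    }
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the training set."""
    counts = Counter(labels)
    return {cls: len(labels) / (len(counts) * n) for cls, n in counts.items()}


# Hypothetical toy training set of already preprocessed domains.
docs = [["buy", "drugs", "cheap"], ["bitcoin", "wallet", "buy"], ["bitcoin", "mixer"]]
labels = ["drugs", "cryptocurrency", "cryptocurrency"]
vocabulary, vectors = tfidf_vectors(docs)
weights = balanced_class_weights(labels)
```

With these weights, a misclassified sample from a rare class contributes more to the training loss than one from a frequent class, which is the purpose of balancing weights between classes.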
[4]
4. The method according to claim 1, comprising displaying the label of each block of text (6) once it has been classified.
[5]
5. The method according to claim 1, comprising the construction of a graph of activities of interest (8) for the entire dark Tor network (1) and a graph of activities of interest (8) for each category resulting from the classification (6b), using the incoming and outgoing hyperlinks (7) of each domain, where the construction of the graph of activities of interest includes:
- extract for each domain the incoming and outgoing hyperlinks belonging to the HTTP and HTTPS protocols;
- remove hyperlinks that have the same origin and destination, and hyperlinks that point to the Surface Web;
- build a graph of activities of interest in which the nodes correspond to domains and the links correspond to the incoming and outgoing hyperlinks contained in those domains, so that a link is generated between two nodes A and B provided that domain A refers to domain B, or vice versa, at least once.
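The graph construction can be sketched in Python with the standard library. Two assumptions are made: a hyperlink is treated as pointing to the Surface Web when its host does not end in `.onion`, and links are stored as directed (source, target) pairs so that follower and followed domains can be told apart later. The function names are hypothetical.

```python
from urllib.parse import urlparse


def onion_host(url):
    """Return the lower-case .onion host of an HTTP(S) URL, else None."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None  # protocol other than HTTP or HTTPS
    host = (parsed.hostname or "").lower()
    return host if host.endswith(".onion") else None  # None for Surface Web


def build_graph(hyperlinks):
    """Build (nodes, edges) from an iterable of (source_url, target_url)."""
    edges = set()
    for source_url, target_url in hyperlinks:
        source, target = onion_host(source_url), onion_host(target_url)
        if source is None or target is None:
            continue  # wrong protocol or Surface Web link
        if source == target:
            continue  # same origin and destination
        edges.add((source, target))  # repeated references collapse to one link
    nodes = {node for edge in edges for node in edge}
    return nodes, edges


# Hypothetical hyperlinks between placeholder domains.
links = [
    ("http://abc.onion/", "http://def.onion/market"),
    ("http://abc.onion/", "http://abc.onion/about"),
    ("http://abc.onion/", "https://example.com/"),
    ("http://def.onion/", "ftp://ghi.onion/file"),
    ("http://abc.onion/x", "http://def.onion/y"),
]
nodes, edges = build_graph(links)
```

Using a set for the edges means that a link exists once a domain refers to another at least once, regardless of how many hyperlinks connect them.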
[6]
6. The method according to claim 1, comprising the calculation of the influence algorithm for all the domains of the global graph of activities of interest (8), identifying as the most relevant domains within the dark Tor network those that obtain the highest rank values, where the calculation of the rank value for any domain comprises:
- initialize the rank value with an initial weight based on the number of incoming and outgoing links of said node within the graph of activities of interest;
- update the rank value taking into account the initial weight and the weighted sum of the accumulated weight of its follower and followed domains, where the influence of the weights of the follower and followed domains is calculated through two parameters α, β.
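The claim does not state the exact update formula, so the following Python sketch shows only one possible reading. It assumes that the initial weight is the normalised sum of in-degree and out-degree, that the update adds to the initial weight α times the accumulated rank of the follower domains and β times that of the followed domains, and that the values are renormalised after each of a fixed number of iterations. The function name, the default parameter values and the iteration count are hypothetical.

```python
def rank_values(edges, alpha=0.8, beta=0.2, iterations=20):
    """Iteratively compute a rank value per node from directed edges."""
    followers, followed = {}, {}
    for source, target in edges:
        followers.setdefault(target, set()).add(source)  # source follows target
        followed.setdefault(source, set()).add(target)   # source is followed by none yet
        followers.setdefault(source, set())
        followed.setdefault(target, set())
    nodes = sorted(followers)

    # Initial weight: incoming plus outgoing links, normalised to sum to 1.
    degree = {n: len(followers[n]) + len(followed[n]) for n in nodes}
    total = sum(degree.values())
    initial = {n: degree[n] / total for n in nodes}

    rank = dict(initial)
    for _ in range(iterations):
        updated = {
            n: initial[n]
            + alpha * sum(rank[u] for u in followers[n])
            + beta * sum(rank[u] for u in followed[n])
            for n in nodes
        }
        norm = sum(updated.values())
        rank = {n: value / norm for n, value in updated.items()}
    return rank


# Hypothetical graph: three domains link to "hub", which links back to "a".
example_edges = {("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")}
ranks = rank_values(example_edges)
```

In this toy graph the node `hub` obtains the highest value because it accumulates the weight of three followers, which matches the intuition that a heavily referenced domain is the most influential.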
[7]
7. The method according to claim 1, comprising the calculation of the influence algorithm for all the domains of the graph of activities of interest (8) calculated for each category, identifying as the most relevant domains within each category of the dark Tor network those that obtain the highest rank values within the selected category, where the calculation of the rank value for any given domain comprises:
- initialize the rank value with an initial weight based on the in-degree and out-degree of said node, that is, the number of incoming and outgoing links of said node within the graph of activities of interest;
- update the rank value taking into account the initial weight and the weighted sum of the accumulated weight of its follower and followed domains, where the influence of the weights of the follower and followed domains is calculated through two parameters α, β.
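A per-category ranking needs one graph per category. The Python sketch below splits a set of directed links by category; it assumes that a link belongs to a category only when both of its endpoints carry that category label, since the claim does not say how links between domains of different categories are handled. The function name and the example labels are hypothetical.

```python
from collections import defaultdict


def split_edges_by_category(edges, category_of):
    """Group directed edges into one edge set per category."""
    per_category = defaultdict(set)
    for source, target in edges:
        category = category_of.get(source)
        # Keep the edge only if both endpoints share a known category.
        if category is not None and category == category_of.get(target):
            per_category[category].add((source, target))
    return dict(per_category)


# Hypothetical links and classification results for four placeholder domains.
example_edges = {("a", "b"), ("b", "c"), ("c", "d")}
example_categories = {
    "a": "sale of illegal drugs",
    "b": "sale of illegal drugs",
    "c": "cryptocurrency",
    "d": "cryptocurrency",
}
graphs = split_edges_by_category(example_edges, example_categories)
```

Each resulting edge set can then be ranked independently, yielding one ordered list of domains per category.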
[8]
8. System for the classification and detection of the most influential domains within the dark Tor network, characterized in that it comprises data processing means (2) configured to:
- crawl a plurality of domains within the dark Tor network (1) and download raw text (3) of at least one domain (7), by means of a computer (2) connected to the internet and configured to access the dark Tor network (1);
- obtain an HTML file (4) from the raw text (3);
- preprocess the raw text (5) to obtain a preprocessed text (6) and a plurality of incoming and outgoing hyperlinks extracted from the at least one domain (7);
- perform an automatic classification (6a) of the preprocessed text (6) by encoding the text as feature vectors and applying a regression-based machine learning algorithm, and obtain a series of domain categories (6b);
- construct a graph of activities of interest (8) from the incoming and outgoing hyperlinks extracted from the at least one domain (7);
- determine the rank value (9) for each domain, obtained as the weighted combination of the sum of the number of links of each node, the links being obtained from the hyperlinks to the domains that follow, and are followed by, each domain (7);
- perform an ordering of the nodes, and therefore of the corresponding domains, according to their rank value;
- label the domains with the highest rank value as the most influential within the dark Tor network.
[9]
9. System according to claim 8, comprising a computer (2) connected to the internet and configured to access the dark Tor network (1), containing the program for crawling and downloading raw text (3), the program for the automatic classification of text (6a), the construction of the graph of activities of interest (8), the calculation of the rank value for all the domains analyzed and the indication of the most relevant domains of the dark Tor network (10) at a global level and by category.
[10]
10. System according to any of claims 8 to 9, comprising data storage means in which the HTML files (4), the preprocessed text (6), the categories of the analyzed domains (6b) and the data file with the lists of domains sorted according to their level of influence are stored.
[11]
11. A program product comprising program instruction means for carrying out the method defined in any one of claims 1 to 7 when the program is run on a processor.
[12]
12. A program product according to claim 11, stored in a program support medium.
Family patents:
Publication number | Publication date
ES2719123B2|2020-09-10|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title
US7454430B1|2004-06-18|2008-11-18|Glenbrook Networks|System and method for facts extraction and domain knowledge repository creation from unstructured and semi-structured documents|
US20090204610A1|2008-02-11|2009-08-13|Hellstrom Benjamin J|Deep web miner|
Legal status:
2019-07-08| BA2A| Patent application published|Ref document number: 2719123 Country of ref document: ES Kind code of ref document: A1 Effective date: 20190708 |
2020-09-10| FG2A| Definitive protection|Ref document number: 2719123 Country of ref document: ES Kind code of ref document: B2 Effective date: 20200910 |
Priority:
Application number | Filing date | Patent title
ES201831145A|ES2719123B2|2018-11-26|PROCEDURE AND SYSTEM FOR THE CLASSIFICATION AND DETECTION OF THE MOST INFLUENTIAL DOMAINS WITHIN THE DARK TOR NETWORK|