Open In App

Data Mining Graphs and Networks

Data mining is the process of collecting and processing data from a heap of unprocessed data. When the patterns are established, various relationships between the datasets can be identified and they can be presented in a summarized format which helps in statistical analysis in various industries. Among the other data structures, the graph is widely used in modeling advanced structures and patterns. In data mining, the graph is used to find subgraph patterns for discrimination, classification, clustering of data, etc. The graph is used in network analysis. By linking the various nodes, graphs form network-like communications, web and computer networks, social networks, etc. In multi-relational data mining, graphs or networks is used because of the varied interconnected relationship between the datasets in a relational database. 

Graph mining

Graph mining is a process in which the mining techniques are used in finding a pattern or relationship in the given real-world collection of graphs. By mining the graph, frequent substructures and relationships can be identified which helps in clustering the graph sets, finding a relationship between graph sets, or discriminating or characterizing graphs. Predicting these patterning trends can help in building models for the enhancement of any application that is used in real-time. To implement the process of graph mining, one must learn to mine frequent subgraphs.



Frequent Subgraph Mining 

Let us consider a graph h with an edge set E(h) and a vertex set V(h). Let us consider the existence of subgraph isomorphism from h to h’ in such a way that h is a subgraph of h’. A label function is a function that plots either the edges or vertices to a label. Let us consider a labeled graph dataset, Let us consider s(h) as the support which means the percentage of graphs in F where h is a subgraph. A frequent graph has support that will be no less than the minimum support threshold. Let us denote it as min_support.

Steps in finding frequent subgraphs:



There are two steps in finding frequent subgraphs. 

There are two methods for frequent substructure mining.

The Apriori-based approach: The approach to find the frequent graphs begin from the graph with a small size. The approach advances in a bottom-up way by creating candidates with extra vertex or edge. This algorithm is called an Apriori Graph. Let us consider Qk as the frequent sub-structure set with a size of k. This approach acquires a level-wise mining technique. Before the Apriori Graph technique, the generation of candidates must be done. This is done by combining two same but slightly varied frequent subgraphs. After the formation of new substructures, the frequency of the graph is checked. Out of that, the graphs found frequently are used to create the next candidate. This step to generate frequent substructure candidates is a complex step. But, when it comes to generating candidates in itemset, it is easy and effortless. Let’s consider an example of having two itemsets of size three such that  and  So, the itemset derived using join would be pqrs. But when it comes to substructures, there is more than one method to join two substructures.

Algorithm: 

This approach is based on frequent substructure mining.

Input:
F= a graph data set.
min_support= minimum support threshold
Output:
Q1,Q2,Q3,....QK,
 a frequent substructure set graphs with the size range from 1 to k.
Q1 <- all the frequent 1 subgraphs in F;
k <- 2;
while Qk-1 ≠ ∅ do
 Qk <- ∅;
 Gk <- candidate_generation(Qk-1);
 foreach candidate l ∈ Gk do
  l.count <- 0;
  foreach Fi ∈ F do
    if isomerism_subgraph(l,Fi) then
    l.count <- l.count+1;
    end
  end
  if l.count ≥ min_support(F) ∧ l∉Qk then
     Qk = Qk U l;
  end
 end
 k <- k+1;
end

It is an iterative method in which the first candidate generation takes place followed by the support computation. The subgraphs are generated using subgraph isomorphism. Thus frequent subgraphs are generated by efficiently using this approach which helps in FSM. Apriori approach uses BFS(Breadth-First Search) due to the iterative level-wise generation of candidates. This is necessary because if you want to mine the k+1 graph, you should have already mined till k subgraphs.   

The Pattern- growth approach: This pattern-growth approach can use both BFS and DFS(Depth First Search). DFS is preferred for this approach due to its less memory consumption nature. Let us consider a graph h. A new graph can be formed by adding an edge e. The edge can introduce a vertex but it is not a need. If it introduces a vertex, it can be done in two ways, forward and backward. The Pattern-growth graph is easy but it is not that efficient. Because there is a possibility of creating a similar graph that is already created which leads to computation inefficiency. The duplicate graphs generated can be removed but it increases the time and work. To avoid the creation of duplicate graphs, the frequent graphs should be introduced very carefully and conservatively which calls the need for other algorithms. 

Algorithm:

The below algorithm is a pattern-growth-based frequent substructure mining with a simplistic approach. If you need to search without duplicating, you must go with a different algorithm with gSpan.

Input:
q= a frequent graph
F= a graph data set.
min_support= minimum support threshold
Output: 
P = the frequent graph set
P <- ∅;
Call patterngrow_graph(q, F, min_support, P);
procedure patterngrow_graph(q, F, min_support, P)
 if q ∈ P then return;
 else insert q into P;
 scan F once, find all the edges e such that q can be extended to q -> e;
 for each frequent q -> e do
 PatternGrowthGraph(q -> e, D, min_support, P);
 return;

An edge e is used to extend a new graph from the old one q. The newly extend graph is denoted as  The extension can either be backward or forward. 

Constraint-based substructure mining

According to the request of the user, the constraints described changes in the mining process. But, if we generalize and categorize them into specific constraints, the mining process would be handled easily by pushing them into the given frameworks.  constraint-pushing strategy is used in pattern growth mining tasks. Let’s see some important constraint categories.

Network Analysis

In the concept of network analysis, the relationship between the units is called links in a graph. From the data mining outlook, this is called link mining or link analysis. The network is a diversional dataset with a multi-relational concept in form of a graph. The graph is very large with nodes as objects, edges as links which in turn denote the relationship between the nodes or objects. Telephone networking systems, WWW( World Wide Web) are very good examples. It also helps in filtering the datasets and providing customer-preferred services. Every network consists of numerous nodes. The datasets are widely enormous. Thus by studying and mining useful information from a wide group of datasets would help in solving problems and effective transmission of data.

Link Mining

There are some conventional methods of machine learning in which taking homogeneous objects from one relationship is taken. But in networks, this is not applicable due to a large number of nodes and its multi-relational, heterogeneous nature. Thus the link mining has appeared as a new field after many types of research. Link mining is the convergence of multiple research held in graph mining, networks, hypertexts, logic programming, link analysis, predictive analysis, and modeling. Links are nothing but the relationship between nodes in a network. With the help of links, the mining process can be held efficiently. This calls for the various functions to be done. 

Challenges in Link Mining

Community Mining

Network analysis includes the finding of objects which are in groups that share similar attributes. This process is known as community mining. In the web page linkage, the introduction of community where a group of web pages is made which follow a common theme. Many community mining algorithms decide that there is only one network and it tries to establish a homogeneous relationship. But in the real world web pages, there are multiple networks with heterogeneous relationships. This proves the need for multi-relational community mining.


Article Tags :