Complex Data Types in Data Mining
Complex data types require advanced data mining techniques. One important group is sequence data, which includes time series, symbolic sequences, and biological sequences. Mining these complex data types requires additional preprocessing steps.
1. Time-Series Data Mining:
In time-series data, values are recorded as a long series of numerical or textual data at equal time intervals, such as per minute, per hour, or per day. Time-series data mining is performed on data such as stock market, scientific, and medical data. In time-series mining, a query rarely has an exact match in the data. Instead, we employ a similarity search that finds the data sequences that differ only slightly from the given query sequence. Within similarity search, subsequence matching finds the subsequences of a longer series that are similar to the query. Because time series are typically high-dimensional, a dimensionality reduction step usually comes first: it transforms the time-series data into a more compact numerical representation on which the similarity search is performed.
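Subsequence matching can be illustrated with a minimal sketch. It assumes Euclidean distance as the similarity measure and a brute-force sliding window; the function name `best_subsequence_match` and the sample values are hypothetical, and real systems would normally apply dimensionality reduction and indexing first.

```python
import math

def best_subsequence_match(series, query):
    """Return (start_index, distance) of the window in `series` closest to `query`.

    Assumption: similarity is measured with plain Euclidean distance.
    """
    m = len(query)
    if m == 0 or m > len(series):
        raise ValueError("query must be non-empty and no longer than the series")
    best_start, best_dist = -1, math.inf
    # Slide a window of the query's length over the series
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - q) ** 2 for w, q in zip(window, query)))
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

# Hypothetical hourly readings and a short query pattern
series = [1.0, 2.0, 3.0, 8.0, 9.5, 7.9, 3.0, 2.0]
query = [8.0, 9.0, 8.0]
start, dist = best_subsequence_match(series, query)
print(start, round(dist, 3))
```

No window matches the query exactly, yet the search still returns the closest one, which is the point of similarity search.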
2. Sequential Pattern Mining in Symbolic Sequences:
Symbolic sequences are long sequences of nominal data whose behavior changes dynamically over time. Examples include online customer shopping sequences and sequences of events in experiments. Mining symbolic sequences is called sequential pattern mining. A sequential pattern is a subsequence that occurs frequently in a set of sequences, so the mining task is to find the frequent subsequences in that set. Many scalable algorithms have been developed to find frequent subsequences. There are also algorithms for mining multidimensional and multilevel sequential patterns.
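The idea of a frequent subsequence can be made concrete with a small brute-force sketch. It is not one of the scalable algorithms mentioned above; the function names, the `min_support` threshold, and the shopping sessions are all hypothetical, and the enumeration is only practical for short sequences.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Number of sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)

def frequent_patterns(sequences, min_support, max_len=3):
    """Enumerate every subsequence up to `max_len` items and keep the frequent ones."""
    counts = {}
    for seq in sequences:
        for length in range(1, max_len + 1):
            for idx in combinations(range(len(seq)), length):
                candidate = tuple(seq[i] for i in idx)
                if candidate not in counts:
                    counts[candidate] = support(candidate, sequences)
    return {p: s for p, s in counts.items() if s >= min_support}

# Hypothetical online shopping sessions
sessions = [
    ["home", "search", "cart", "checkout"],
    ["home", "cart", "checkout"],
    ["search", "home", "cart"],
    ["home", "search", "checkout"],
]
patterns = frequent_patterns(sessions, min_support=3)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

Order matters here: the pattern ("home", "search") is not supported by the third session, because "search" comes before "home" in it.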
3. Data Mining of Biological Sequences:
Biological sequences are long sequences of nucleotides or amino acids, and mining them helps uncover features of an organism's DNA, such as human DNA. Biological sequence analysis starts with sequence alignment, which lines sequences up so that they can be compared. Two species are considered similar when their nucleotide (DNA, RNA) and protein sequences are close. During the mining of biological sequences, the degree of similarity between nucleotide sequences is measured. The degree of similarity obtained by sequence alignment is essential in determining the homology between two sequences.
Alignment may involve two sequences (pairwise alignment) or more than two (multiple sequence alignment), and it works by identifying long subsequences that the inputs share. Protein sequences, which are chains of amino acids, are compared and aligned in the same way.
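One common way to measure the degree of similarity is a global alignment score. The sketch below uses the Needleman-Wunsch dynamic programming recurrence, a technique the text does not name; the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not biologically calibrated.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of sequences `a` and `b`.

    Assumption: a simple linear gap penalty and fixed match/mismatch scores.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]

print(global_alignment_score("GATTACA", "GATTACA"))  # identical sequences
print(global_alignment_score("ACGT", "ACT"))         # one deletion
```

A higher score means the sequences are closer. Only the score is computed here; recovering the alignment itself would need a traceback through the table.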
4. Graph Pattern Mining:
Graph pattern mining can be done using Apriori-based and pattern-growth-based approaches. We can mine the frequent subgraphs of a set of graphs as well as the set of closed graphs. A frequent graph g is closed if it has no proper supergraph that carries the same support count as g. Graph pattern mining also targets different kinds of patterns, such as frequent, coherent, and dense graphs. We can improve mining efficiency by applying user constraints to the graph patterns. Graphs are of two types. In homogeneous graphs, the nodes and links are all of the same type and share similar features. In heterogeneous graphs, the nodes and links are of different types.
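The definition of a closed graph can be checked with a deliberately simplified sketch. It assumes that every node carries a unique label, so a graph is just a set of labeled edges and "subgraph" reduces to "subset of edges"; general graph miners must test subgraph isomorphism instead. The helper names and the three sample graphs are hypothetical.

```python
def edge(u, v):
    """Undirected edge between two uniquely labeled nodes."""
    return frozenset((u, v))

def support(pattern, graphs):
    """Number of graphs that contain every edge of `pattern`."""
    return sum(pattern <= g for g in graphs)

def is_closed(pattern, graphs):
    """True if no proper supergraph of `pattern` has the same support.

    Checking one-edge extensions is enough, because adding edges can
    never increase support.
    """
    sup = support(pattern, graphs)
    containing = [g for g in graphs if pattern <= g]
    candidates = set().union(*containing) - pattern
    return all(support(pattern | {e}, graphs) < sup for e in candidates)

graphs = [
    frozenset({edge("A", "B"), edge("B", "C"), edge("C", "D")}),
    frozenset({edge("A", "B"), edge("B", "C")}),
    frozenset({edge("A", "B"), edge("B", "C"), edge("A", "D")}),
]
small = frozenset({edge("A", "B")})
larger = frozenset({edge("A", "B"), edge("B", "C")})
print(support(small, graphs), is_closed(small, graphs))
print(support(larger, graphs), is_closed(larger, graphs))
```

Both patterns appear in all three graphs, but only the larger one is closed: the smaller one can be extended by the edge B-C without losing support.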
5. Statistical Modeling of Networks:
A network is a collection of nodes, where each node represents a data object, linked through edges that represent relationships between the data objects. If all the nodes and the links connecting them are of the same type, the network is homogeneous, such as a friend network or a web page network. If the nodes and links are of different types, the network is heterogeneous, such as a health-care network that links different kinds of entities (doctors, nurses, patients, and diseases). Graph pattern mining can then be applied to the network to derive knowledge and useful patterns from it.
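The distinction between the two kinds of network can be expressed as a small data-structure sketch. The representation (a dictionary of node types plus a list of typed links) and the function name `network_kind` are assumptions made for illustration.

```python
def network_kind(node_types, links):
    """Classify a network as homogeneous or heterogeneous.

    node_types: dict mapping node name -> node type
    links: list of (source, target, link_type) tuples
    """
    kinds_of_nodes = set(node_types.values())
    kinds_of_links = {link_type for _, _, link_type in links}
    if len(kinds_of_nodes) <= 1 and len(kinds_of_links) <= 1:
        return "homogeneous"
    return "heterogeneous"

# Hypothetical friend network: one node type, one link type
friends = {"ann": "person", "bob": "person", "cat": "person"}
friend_links = [("ann", "bob", "friend"), ("bob", "cat", "friend")]

# Hypothetical health-care network: several node and link types
care = {"dr_lee": "doctor", "nurse_kim": "nurse", "pat_1": "patient", "flu": "disease"}
care_links = [
    ("dr_lee", "pat_1", "treats"),
    ("nurse_kim", "pat_1", "cares_for"),
    ("pat_1", "flu", "diagnosed_with"),
]

print(network_kind(friends, friend_links))
print(network_kind(care, care_links))
```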
6. Mining Spatial Data:
Spatial data is geospace-related data stored in large data repositories. It can be represented in vector or raster format, or as imagery and geo-referenced multimedia. A spatial data warehouse can be constructed by integrating geographic data from multiple sources. From it, we can construct spatial data cubes that contain spatial dimensions and measures, and perform OLAP operations on them for spatial data analysis. Spatial data mining is performed on spatial data warehouses, spatial databases, and other geospatial data repositories, and it discovers knowledge about geographic areas. Typical tasks include spatial clustering, spatial classification, spatial modeling, and outlier detection in spatial data.
7. Mining Cyber-Physical System Data:
Cyber-physical system data can be mined by constructing a graph or network of the data. A cyber-physical system (CPS) is a heterogeneous network that consists of a large number of interconnected nodes, for example a health-care system whose nodes store patient or medical information. The links in the CPS network represent the relationships between the nodes. Cyber-physical systems store dynamic, inconsistent, and interdependent data that contains spatiotemporal information. Mining cyber-physical data means treating the current situation as a query against a large information base, and it involves real-time calculation and analysis so that the CPS can respond promptly. CPS analysis requires rare-event detection and anomaly analysis in cyber-physical data streams and networks, and processing cyber-physical data involves integrating stream data with real-time automated control processes.
8. Mining Multimedia Data:
Multimedia data objects include image data, video data, audio data, website hyperlinks, and linkages. Multimedia data mining tries to find interesting patterns in multimedia databases. It involves processing digital data and performing tasks such as image processing, image classification, video and audio data mining, and pattern recognition. Multimedia data mining is an increasingly active research area because data from social media platforms such as Twitter and Facebook can be analyzed with it to derive interesting trends and patterns.
9. Mining Web Data:
Web mining discovers patterns and knowledge from the Web. Web content mining analyzes the data of websites, including the web pages and the multimedia data, such as images, embedded in them. Web mining is used to understand the content of web pages, the unique users of a website, its hypertext links, web page relevance and ranking, web page content summaries, the time that users spend on a particular website, and user search patterns. Web mining can also be used to compare search engines and to study the ranking algorithms they use, which helps improve search efficiency and helps users find the search engine that suits them best.
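Web page ranking from hypertext links can be sketched with PageRank computed by power iteration. PageRank is one well-known link-analysis technique and is not named in the text above; the damping factor of 0.85, the iteration count, and the tiny link graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links: dict mapping page -> list of pages it links to
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical site with four pages
links = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Pages that receive links from many well-ranked pages end up with a higher score, which is why "home" leads in this example and "orphan", which nothing links to, comes last.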
10. Mining Text Data:
Text mining is an interdisciplinary field that draws on data mining, machine learning, natural language processing, and statistics. Most of the information in our daily life is stored as text, such as news articles, technical papers, books, email messages, and blogs. Text mining helps us retrieve high-quality information from text through tasks such as sentiment analysis, document summarization, text categorization, and text clustering. We apply machine learning models and NLP techniques to derive useful information from the text. This is done by finding hidden patterns and trends by means such as statistical pattern learning and statistical language modeling. To perform text mining, we first preprocess the text with techniques such as stemming and lemmatization, and then convert the normalized text into data vectors.
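The preprocessing path from raw text to data vectors can be sketched as follows. The stemmer here is a crude suffix stripper written only for illustration (it is not the Porter stemmer or any library implementation), and the bag-of-words counts are one simple choice of vector representation among many.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Strip a few common suffixes. A toy rule set, assumed for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def vectorize(documents):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    counts = [Counter(crude_stem(t) for t in tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

documents = ["Mining text streams", "Text mining mined texts"]
vocabulary, vectors = vectorize(documents)
print(vocabulary)
print(vectors)
```

After stemming, "mining" and "mined" map to the same term, so both documents can be compared on a shared vocabulary.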
11. Mining Spatiotemporal Data:
Data that relates to both space and time is spatiotemporal data. Spatiotemporal data mining retrieves interesting patterns and knowledge from such data. It helps us estimate the value of land, determine the age of rocks and precious stones, and predict weather patterns. It has many practical applications and data sources, such as GPS in mobile phones, timers, Internet-based map services, weather services, satellites, RFID, and sensors.
12. Mining Data Streams:
Stream data changes dynamically, is often noisy and inconsistent, and contains multidimensional features of different data types. For this reason it is often stored in NoSQL database systems. The volume of stream data is very high, which is the main challenge for mining it effectively. Mining data streams involves tasks such as clustering, outlier analysis, and the online detection of rare events.
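Online outlier detection on a stream can be sketched with running statistics that are updated one value at a time, so the stream never has to be stored. The sketch assumes a z-score rule with Welford's online mean and variance update; the threshold of 3.0, the warm-up length, and the sensor readings are hypothetical.

```python
import math

class RunningStats:
    """Mean and standard deviation maintained incrementally (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0

def detect_outliers(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for values far from the running mean."""
    stats = RunningStats()
    for index, value in enumerate(stream):
        if stats.count >= warmup and stats.std > 0:
            if abs(value - stats.mean) / stats.std > threshold:
                yield index, value
                continue  # keep the outlier out of the running statistics
        stats.update(value)

# Hypothetical sensor readings with one rare event
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0, 10.1, 9.9]
print(list(detect_outliers(readings)))
```

Each value is examined once and then discarded, which suits the high volume of stream data; the trade-off is that the detector adapts slowly if the underlying distribution drifts.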