Data Mining in Science and Engineering
Data mining is an automated process of uncovering implicit patterns, correlations, anomalies, and statistical information within large amounts of data stored in repositories. This information can be interpreted in light of a hypothesis or theory and used to make forecasts. It is an interdisciplinary area that incorporates ideas from a range of mathematical and computational disciplines, including statistics, machine learning, database retrieval, optimization, and visualization. Data mining can help discover relationships and trends that basic query and reporting techniques cannot reveal. The term data mining is often used synonymously with KDD (knowledge discovery in databases), which in fact refers to a more general process of which mining is one component.
Much of science is now becoming data intensive. The transformative capability that data science has brought to scientific research has been referred to as ‘The Fourth Paradigm’.
The volume of available data is growing exponentially, and so are its variety, velocity, and veracity. This proliferation has made data too large in size and dimensionality to be analyzed directly by humans, which makes data mining an indispensable tool for research projects across many domains, from astronomy and bioinformatics to finance and the social sciences. Data mining can be used to draw pertinent conclusions and predictions from the colossal volume of otherwise impenetrable scientific data that is collected and stored every day.
Applications of Data Mining in Science and Engineering:
- Data reduction: Scientific instruments like satellites and microscopes can easily acquire millions of data points and generate terabytes of data at high speeds. A methodical, automated approach can simplify the observations without corrupting the quality of information. Data mining techniques can serve as an effective interface between scientists and massive datasets.
- Research: Web data mining simplifies the process of extracting relevant, user-queried information from inconsistent and unstructured data on the internet. Text data mining uses tools such as natural language processing (NLP) to extract structured information from text specifically. These applications enable researchers to find existing scientific data in literature databases faster and more accurately.
- Pattern recognition: Intelligent algorithms can detect patterns in datasets that humans cannot, because of the data's high dimensionality. The same algorithms can also help discover anomalies.
- Remote sensing: Data mining techniques can be applied to aerial remote sensing imagery for automatic land-cover classification, and nighttime light imagery is mined to study socioeconomic questions.
- Opinion mining: A subfield of natural language processing, information retrieval, and text mining, opinion mining is the process of extracting human thoughts and perceptions from unstructured texts, which can be used to analyze the sentiments of social media users.
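The anomaly discovery mentioned under pattern recognition above can be illustrated with a small sketch. It is only one simple approach, assuming a single numeric measurement per observation: it scores each value by its distance from the median, scaled by the median absolute deviation (MAD), and flags values above a cutoff. The function name `mad_outliers`, the sample `readings`, and the cutoff of 3.5 (a commonly quoted convention for this score) are illustrative choices, not something the applications above prescribe.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # No spread around the median: the score is undefined, so flag nothing.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normally distributed.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical instrument readings with one obviously aberrant value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.4, 10.2, 10.1]
outliers = mad_outliers(readings)
print(outliers)  # index of the reading 25.4
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value distorts them far less. Real scientific datasets are usually multivariate, where methods designed for high-dimensional data would be needed.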
Application Areas of Data Mining Techniques:
- High Energy Physics: Experiments at the Large Hadron Collider, in which particle collisions inside the accelerator are recorded by detectors, generate petabytes of data that must be stored, calibrated, and reconstructed before they can be analyzed. The Worldwide LHC Computing Grid copes with this volume by distributing storage and processing across many computing centres and by employing data reduction algorithms. ROOT, an open-source, high-performance data analysis framework, facilitates scientific analysis and visualization of large amounts of data.
- Astronomy: Classifying cosmological objects with completeness and efficiency relies on data mining algorithms, which are used for star-galaxy separation, galaxy morphology, and other types of classification. Redshifts of galaxies and quasars are estimated from photometric data using either the template approach or the empirical training-set method. Apart from these applications, data mining has also been used to analyze the cosmic microwave background, forecast solar flares, and perform astronomical simulations.
- Bioinformatics: Bioinformatics is a science at the intersection of biology and information technology. Data generated in genomics and proteomics research can be mined for finding motifs in sequences, predicting protein structures, genomic annotation, analyzing gene/protein expression, modeling biological systems, and exploring genetic mechanisms to understand diseases at a deeper level.
- Healthcare: The data generated by the healthcare industry includes useful information on patient demographics, treatment plans, payment, and insurance coverage. Existing studies have recorded applications of data mining in clinical medicine and adverse drug reaction signal detection, as well as work focused on diabetes and skin diseases. The most frequently used techniques in this category are regression, classification, sequential pattern mining, association, clustering, and data warehousing.
- Geo-Spatial Analysis: Data mining algorithms have been used to generate spatial maps of storm dust provenance in order to mitigate its effects in arid environments. Locations susceptible to gully erosion, which triggers land degradation, have been spatially modeled using GIS and R programming.
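Classification recurs across the areas listed above, for example in star-galaxy separation. As a rough illustration only, the sketch below uses a k-nearest-neighbours vote, one of many possible classifiers and not necessarily the one used in any survey pipeline. The two features (a concentration index and a colour index), all numeric values, and the function name `knn_predict` are invented for the example.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort labelled points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (concentration index, colour index) -> class label.
training_set = [
    ((0.90, 0.20), "star"),
    ((0.80, 0.30), "star"),
    ((0.85, 0.25), "star"),
    ((0.30, 0.70), "galaxy"),
    ((0.40, 0.80), "galaxy"),
    ((0.35, 0.65), "galaxy"),
]

first = knn_predict(training_set, (0.82, 0.28))
second = knn_predict(training_set, (0.38, 0.72))
print(first, second)
```

With real catalogue data the features would normally be rescaled to comparable ranges first, since Euclidean distance is otherwise dominated by whichever feature has the largest numeric spread.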
For more application areas of data mining please refer to the article Applications of Data Mining.