Types of Sources of Data in Data Mining
In this post, we discuss the different sources of data used in the data mining process. Data from multiple sources is integrated into a common store known as a data warehouse. Let's look at the types of data that can be mined:
- Flat Files
- Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
- Data stored in flat files has no relationships or paths between records. For example, if a relational database is exported to flat files, the relations between its tables are lost.
- Flat files are described by a data dictionary. Example: a CSV file.
- Flat files are a type of structured data, typically stored in plain text format. They are called "flat" because they have no hierarchy and no links between records, unlike a relational database, where tables can reference one another. Flat files typically consist of rows and columns of data, with each row representing a single record and each column representing a field or attribute within that record. They can be stored in various formats such as CSV, tab-separated values (TSV), and fixed-width format.
- Flat files are often used as a simple and efficient way to transfer data between different systems or applications. They are also used for storing small to medium-sized data sets. Flat files are easy to create, read, and edit, and can be processed using simple programs such as text editors, spreadsheet programs, and basic programming languages.
- Some disadvantages of flat files include the lack of data integrity checks and the inability to handle complex relationships between data. Flat files are also less efficient for handling large data sets, as they can take up a lot of space on disk and require a lot of memory to process.
In summary, flat files are a simple and efficient way to store and transfer small to medium-sized data sets, but they are not well-suited for large data sets or complex data relationships.
- Application: Storing data in data warehousing, carrying data to and from a server, etc.
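As a rough illustration of the row-and-column layout described above, the sketch below parses a small CSV flat file with Python's standard `csv` module. The file content, the column names, and the `read_flat_file` helper are all made up for this example.

```python
import csv
import io

# Hypothetical flat-file content; in practice this would come from a file on disk.
CSV_TEXT = """customer_id,name,city
1,Asha,Pune
2,Ben,Leeds
3,Chen,Taipei
"""

def read_flat_file(text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

records = read_flat_file(CSV_TEXT)
for record in records:
    # Every field arrives as a plain string: the file itself carries no types.
    print(record["customer_id"], record["name"], record["city"])
```

Note that `customer_id` is read as the string `"1"`, not the number 1. This is one face of the missing integrity checks mentioned above: any typing or validation has to be done by the program that reads the file.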
- Relational Databases
- A relational database is a collection of data organized in tables with rows and columns.
- The physical schema of a relational database describes how the data is physically stored (files, indexes, and storage layout).
- The logical schema of a relational database defines the tables, their columns, and the relationships among tables.
- The standard query language for relational databases is SQL.
- A relational database is a type of structured data that organizes data into one or more tables, with each table consisting of rows and columns. The rows represent individual records, and the columns represent fields or attributes within those records.
- The main feature of a relational database is the ability to establish relationships between different tables using common fields: a primary key uniquely identifies each row in one table, and a foreign key in another table refers to it. This allows data to be linked and queried across multiple tables, enabling more efficient data retrieval and manipulation.
- Relational databases are widely used in many different industries, such as finance, healthcare, retail and e-commerce. They are also used to support transactional systems, data warehousing, and business intelligence.
- Relational databases are typically managed by a database management system (DBMS) such as MySQL, Oracle, SQL Server, and PostgreSQL. The DBMS provides tools for creating, modifying, and querying the database, as well as for managing access and security.
- Some advantages of relational databases include:
- Data Integrity: Relational databases have built-in mechanisms for maintaining data integrity, such as constraints and triggers.
- Data Consistency: Relational databases ensure that the data is consistent across the entire system.
- Data Security: Relational databases provide various levels of access control and security features to protect the data.
- Efficient Data Retrieval: Relational databases provide a powerful query language (SQL) to retrieve data efficiently.
- Scalability: Relational databases can be easily scaled to support large data sets and high-performance requirements.
- Some disadvantages of relational databases include:
- Complexity: Relational databases can be complex to set up and manage, especially for large and complex data sets.
- Latency: Relational databases may not be well-suited for real-time, high-throughput data processing.
- Application: Data Mining, ROLAP model, etc.
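To make the ideas of linked tables, SQL queries, and integrity constraints more concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their contents are invented for illustration; a production system would more likely use one of the DBMSs listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

# Query across both tables by joining on the shared customer_id field.
rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id), SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
""").fetchall()
print(rows)

# An order that points at a non-existent customer violates the constraint.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("invalid order rejected:", rejected)
```

The join returns one summary row per customer, and the final insert is refused by the database itself rather than by application code, which is the kind of built-in integrity check a flat file cannot offer.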
- Data Warehouse
- A data warehouse is a collection of data integrated from multiple sources that supports queries and decision making.
- There are three types of data warehouse: Enterprise Data Warehouse, Data Mart, and Virtual Warehouse.
- Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
- Application: Business decision making, Data mining, etc.
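The sketch below illustrates the update-driven idea under the usual textbook reading: data from heterogeneous sources is copied and normalized into the warehouse ahead of time, and queries then run against that integrated copy. In the query-driven alternative, each query would instead be translated and sent to the sources on demand. The two source systems, their field names, and all helper functions are hypothetical.

```python
# Two hypothetical source systems that describe the same thing with different field names.
crm_source = [{"cust": "Asha", "spend": 65.0}, {"cust": "Ben", "spend": 15.5}]
shop_source = [{"customer_name": "Asha", "total": 12.0}]

def from_crm(row):
    """Map a CRM row onto the common warehouse schema."""
    return {"customer": row["cust"], "amount": row["spend"], "source": "crm"}

def from_shop(row):
    """Map a shop row onto the common warehouse schema."""
    return {"customer": row["customer_name"], "amount": row["total"], "source": "shop"}

def load_warehouse():
    """Update-driven load: integrate all sources in advance into one collection."""
    return [from_crm(r) for r in crm_source] + [from_shop(r) for r in shop_source]

def total_spend(warehouse, customer):
    """Answer a decision-support query directly from the integrated data."""
    return sum(r["amount"] for r in warehouse if r["customer"] == customer)

warehouse = load_warehouse()
print(total_spend(warehouse, "Asha"))
```

The trade-off in this sketch is that the warehouse can go stale if the sources change after loading, which is why a real update-driven warehouse is refreshed periodically.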
- Transactional Databases
- A transactional database is a collection of data organized by timestamp, date, etc. to represent transactions.
- This type of database can roll back (undo) its operations when a transaction is not completed or committed.
- It is a highly flexible system in which users can modify information without altering sensitive data.
- It follows the ACID properties of a DBMS (atomicity, consistency, isolation, durability).
- Application: Banking, Distributed systems, Object databases, etc.
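The rollback behavior described above can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are assumptions made for this example. The credit is deliberately applied before the debit so that a failed debit leaves something to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True if committed."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the earlier credit was rolled back too.
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ok = transfer(conn, "alice", "bob", 30)        # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice
print(ok, failed, balances(conn))
```

After the second call, bob's balance does not keep the 500 credit: either both updates take effect or neither does, which is the atomicity part of ACID.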
- Multimedia Databases
- Multimedia databases consist of audio, video, image, and text media.
- They can be stored in object-oriented databases.
- They are used to store complex information in pre-specified formats.
- Application: Digital libraries, video on demand, news on demand, music databases, etc.
- Spatial Databases
- Spatial databases store geographical information.
- They store data in the form of coordinates, topology, lines, polygons, etc.
- Application: Maps, Global positioning, etc.
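A typical question asked of coordinate and polygon data is whether a point lies inside a region. The sketch below answers it with the classic ray-casting test in plain Python. The function name and the sample square are made up for illustration, and real spatial databases offer such predicates as built-in, index-backed operations.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: return True if point lies inside polygon.

    polygon is a list of (x, y) vertices in order. Points exactly on the
    boundary are not treated specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))
print(point_in_polygon((5, 2), square))
```

The first call reports a point inside the square and the second a point outside it: a ray drawn to the right from an inside point crosses the boundary an odd number of times.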
- Time-series Databases
- Time-series databases hold data such as stock exchange data and logs of user activity.
- They handle arrays of numbers indexed by time, date, etc.
- They often require real-time analysis.
- Examples of systems used for time-series data: eXtremeDB, Graphite, InfluxDB, etc.
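As a small illustration of numbers indexed by time, the sketch below computes a moving average over a series of (date, value) pairs in a single pass, which is the style of computation that suits data arriving continuously. The prices and the `moving_average` helper are invented for this example.

```python
from collections import deque
from datetime import date

# Hypothetical daily closing prices, ordered by date.
prices = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 2), 12.0),
    (date(2024, 1, 3), 11.0),
    (date(2024, 1, 4), 15.0),
]

def moving_average(series, window):
    """Yield (timestamp, mean of the last `window` values) once the window is full."""
    recent = deque(maxlen=window)  # old values drop out automatically
    for timestamp, value in series:
        recent.append(value)
        if len(recent) == window:
            yield timestamp, sum(recent) / window

smoothed = list(moving_average(prices, window=2))
for day, average in smoothed:
    print(day.isoformat(), average)
```

Because the function is a generator that keeps only the last `window` values, it could be fed from a live stream just as well as from a stored list.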
- World Wide Web (WWW)
- The World Wide Web (WWW) is a collection of documents and resources such as audio, video, and text, which are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessed through web browsers over the Internet.
- It is the most heterogeneous repository, as it collects data from multiple resources.
- It is dynamic in nature, as the volume of data is continuously increasing and changing.
- Application: Online shopping, Job search, Research, studying, etc.
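Since web resources are identified by URLs and linked through HTML pages, a first step in mining web data is often extracting those links. The sketch below does this with Python's standard `html.parser` and `urllib.parse` modules. The page content, the URLs, and the `LinkExtractor` class are hypothetical, and nothing is actually downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page itself.
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page content standing in for a fetched document.
PAGE = (
    '<html><body>'
    '<a href="/jobs">Jobs</a> '
    '<a href="https://example.org/shop">Shop</a>'
    '</body></html>'
)

extractor = LinkExtractor("https://example.com/index.html")
extractor.feed(PAGE)
print(extractor.links)
```

The relative link `/jobs` is resolved against the page's own address, while the absolute link to a different site is kept as it is.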
- Structured Data: This type of data is organized into a specific format, such as a database table or spreadsheet. Examples include transaction data, customer data, and inventory data.
- Semi-Structured Data: This type of data has some structure, but not as much as structured data. Examples include XML and JSON files, and email messages.
- Unstructured Data: This type of data does not have a specific format, and can include text, images, audio, and video. Examples include social media posts, customer reviews, and news articles.
- External Data: This type of data is obtained from external sources such as government agencies, industry reports, weather data, satellite images, GPS data, etc.
- Time-Series Data: This type of data is collected over time, such as stock prices, weather data, and website visitor logs.
- Streaming Data: This type of data is generated continuously, such as sensor data, social media feeds, and log files.
- Relational Data: This type of data is stored in a relational database, and can be accessed through SQL queries.
- NoSQL Data: This type of data is stored in a NoSQL database, which may use a key-value, document, column-based, or graph-based data model.
- Cloud Data: This type of data is stored and processed in cloud computing environments such as AWS, Azure, and GCP.
- Big Data: This type of data is characterized by its huge volume, high velocity, and high variety, and can be stored and processed using big data technologies such as Hadoop and Spark.
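To show the difference between semi-structured and structured data in practice, the sketch below takes JSON records whose fields vary from one record to the next and flattens them into a fixed set of columns, as a table would require. The sample records and the `flatten` helper are assumptions made for this example.

```python
import json

# Hypothetical semi-structured input: the two records do not share the same fields.
RAW = """[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ben", "tags": ["vip"]}
]"""

def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

rows = [flatten(record) for record in json.loads(RAW)]
columns = sorted({key for row in rows for key in row})
# Structured form: every row has every column, with None where a field is missing.
table = [[row.get(column) for column in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

The missing values that appear in the table are the price of forcing a fixed schema onto records that were never required to have one.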