Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so a computer program cannot readily use it. It is not organised in a pre-defined manner and has no pre-defined data model, which makes it a poor fit for a mainstream relational database.
Characteristics of Unstructured Data:
- Data neither conforms to a data model nor has any structure.
- Data cannot be stored in rows and columns, as it would be in a relational database
- Data does not follow any semantics or rules
- Data lacks any particular format or sequence
- Data has no easily identifiable structure
- Due to the lack of identifiable structure, it cannot be used by computer programs easily
Sources of Unstructured Data:
- Web pages
- Images (JPEG, GIF, PNG, etc.)
- Word documents and PowerPoint presentations
Advantages of Unstructured Data:
- It supports data that lacks a proper format or sequence
- The data is not constrained by a fixed schema
- It is very flexible due to the absence of a schema
- Data is portable
- It is very scalable
- It can deal easily with the heterogeneity of sources.
- This type of data has a variety of business intelligence and analytics applications.
Disadvantages of Unstructured Data:
- It is difficult to store and manage unstructured data due to the lack of schema and structure
- Indexing the data is difficult and error-prone because the structure is unclear and there are no pre-defined attributes. As a result, search results are not very accurate.
- Ensuring the security of the data is a difficult task.
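To illustrate the indexing point above, here is a minimal sketch of a naive keyword index over free text. The sample memos and the helper names `build_index` and `search` are made up for illustration. Because the text has no pre-defined attributes, the index can only match literal words, so a search for "car" misses a memo that talks about an "automobile".

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical free-text memos with no schema or attributes.
docs = {
    "memo1": "The car was serviced on Monday.",
    "memo2": "Customer reports the automobile will not start.",
    "memo3": "Carpet cleaning scheduled for the lobby.",
}
index = build_index(docs)
print(search(index, "car"))         # {'memo1'}: memo2 is relevant but missed
print(search(index, "automobile"))  # {'memo2'}
```

A structured table with a `vehicle_type` column would not have this problem, which is one way to read the claim that search over unstructured data is less accurate.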
Problems faced in storing unstructured data:
- It requires a lot of storage space to store unstructured data.
- It is difficult to store videos, images, audio, etc.
- Due to the unclear structure, operations like update, delete and search are very difficult.
- Storage cost is high as compared to structured data
- Indexing the unstructured data is difficult
Possible solutions for storing unstructured data:
- Unstructured data can be converted to easily manageable formats
- Unstructured data can be stored in a content-addressable storage (CAS) system.
A CAS stores data based on its metadata and assigns a unique name to every object stored in it. An object is retrieved based on its content, not its location.
- Unstructured data can be stored in XML format.
- Unstructured data can be stored in an RDBMS that supports BLOBs
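Two of the options above, content-addressable storage and BLOB columns, can be sketched together. The following is a toy example, not a production CAS: it assumes SHA-256 as the source of the unique name and uses Python's built-in `sqlite3` module as the BLOB-capable database. The class name `ContentStore`, the table layout, and the sample bytes are all hypothetical.

```python
import hashlib
import sqlite3

class ContentStore:
    """Toy content-addressable store backed by an SQLite BLOB column."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS objects ("
            "address TEXT PRIMARY KEY, media_type TEXT, content BLOB)"
        )

    def put(self, content: bytes, media_type: str = "application/octet-stream") -> str:
        # The address is derived from the content itself (assumption: SHA-256).
        address = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same address, so it is stored only once.
        self.conn.execute(
            "INSERT OR IGNORE INTO objects VALUES (?, ?, ?)",
            (address, media_type, content),
        )
        self.conn.commit()
        return address

    def get(self, address: str) -> bytes:
        row = self.conn.execute(
            "SELECT content FROM objects WHERE address = ?", (address,)
        ).fetchone()
        if row is None:
            raise KeyError(address)
        return row[0]

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM objects").fetchone()[0]

store = ContentStore()
addr = store.put(b"\x89PNG fake image bytes", "image/png")
same = store.put(b"\x89PNG fake image bytes", "image/png")
print(addr == same)    # True: same content, same address
print(store.count())   # 1: the duplicate was not stored again
```

The caller never chooses a file name or location; it keeps only the returned address and passes it to `get` to retrieve the bytes.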
Extracting information from unstructured data:
Unstructured data has no structure, so it cannot easily be interpreted by conventional algorithms. It is also difficult to tag and index, which makes extracting information from it a tough job. Here are some possible solutions:
- Taxonomies, or classifications of data, help organise data in a hierarchical structure, which makes searching easier.
- Data can be stored in a virtual repository and tagged automatically, for example with Documentum.
- Use of application platforms like XOLAP.
XOLAP helps in extracting information from e-mails and XML-based documents
- Use of various data mining tools
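The taxonomy and automatic tagging ideas above can be sketched in a few lines. This is only an illustration under stated assumptions: the two-level `TAXONOMY`, its keywords, and the sample e-mail are invented, and real tools such as Documentum or XOLAP work differently and are not modelled here.

```python
import re

# Hypothetical two-level taxonomy: category -> subcategory -> trigger keywords.
TAXONOMY = {
    "finance": {
        "invoicing": {"invoice", "payment", "refund"},
        "payroll": {"salary", "payslip"},
    },
    "support": {
        "outage": {"down", "outage", "unreachable"},
        "account": {"password", "login"},
    },
}

def tag(text):
    """Return sorted 'category/subcategory' tags whose keywords occur in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    tags = []
    for category, subcategories in TAXONOMY.items():
        for subcategory, keywords in subcategories.items():
            if words & keywords:
                tags.append(f"{category}/{subcategory}")
    return sorted(tags)

email = "Hi, I cannot login and I still need a refund for the last invoice."
print(tag(email))  # ['finance/invoicing', 'support/account']
```

Once each document carries hierarchical tags like these, a search can be narrowed to a branch of the taxonomy instead of scanning all the raw text.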