What is Unstructured Data?
Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so a computer program cannot use it easily. It is not organised in a pre-defined manner and has no pre-defined data model, which makes it a poor fit for a mainstream relational database.
Characteristics of Unstructured Data:
- Data neither conforms to a data model nor has any structure.
- Data cannot be stored in rows and columns as in a relational database
- Data does not follow any semantics or rules
- Data lacks any particular format or sequence
- Data has no easily identifiable structure
- Because it lacks an identifiable structure, it cannot be used by computer programs easily
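The difference these characteristics make can be seen in a small sketch. It assumes two made-up inputs (the names, cities and the regular expression are illustrative, not from any real dataset): the same facts once as CSV with a header row, and once as free text.

```python
import csv
import io
import re

# Structured: the header row names every field, so a program can ask
# for a column by name without guessing.
structured = io.StringIO("name,city\nAsha,Pune\nRavi,Delhi\n")
rows = list(csv.DictReader(structured))
cities = [row["city"] for row in rows]

# Unstructured: the same facts in free text. Nothing declares where the
# city is, so the program falls back on a hand-written pattern
# (an assumption about how the sentences happen to be phrased).
note = "Asha moved to Pune last year. Ravi, meanwhile, still lives in Delhi."
guessed_cities = re.findall(r"\b(?:to|in) ([A-Z][a-z]+)", note)

print(cities)
print(guessed_cities)
```

Both lists contain the same two cities here, but only the first lookup is reliable. The pattern in the second one stops working as soon as the wording of the text changes.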
Sources of Unstructured Data:
- Web pages
- Images (JPEG, GIF, PNG, etc.)
- Word documents and PowerPoint presentations
Advantages of Unstructured Data:
- It supports data that lacks a proper format or sequence
- The data is not constrained by a fixed schema
- It is very flexible due to the absence of a schema
- Data is portable
- It is very scalable
- It can deal easily with the heterogeneity of sources.
- This type of data has a variety of business intelligence and analytics applications.
Disadvantages of Unstructured Data:
- It is difficult to store and manage unstructured data due to lack of schema and structure
- Indexing the data is difficult and error-prone because the structure is unclear and there are no pre-defined attributes, so search results are not very accurate.
- Ensuring the security of the data is a difficult task.
Problems faced in storing unstructured data:
- It requires a lot of storage space to store unstructured data.
- It is difficult to store videos, images, audio, etc.
- Because the structure is unclear, operations like update, delete and search are very difficult.
- Storage cost is high as compared to structured data
- Indexing the unstructured data is difficult
Possible solutions for storing unstructured data:
- Unstructured data can be converted to easily manageable formats
- A content-addressable storage (CAS) system can be used to store unstructured data.
It stores data based on its metadata and assigns a unique name to every stored object. An object is retrieved by its content, not its location.
- Unstructured data can be stored in XML format.
- Unstructured data can be stored in an RDBMS that supports BLOBs (binary large objects)
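Two of the options above, content-addressable storage and BLOB columns, can be combined in a minimal sketch. It assumes the unique name of an object is the SHA-256 digest of its bytes, which is a common choice but not the only one. The `BlobStore` class, the table layout and the sample bytes are hypothetical and exist only for illustration.

```python
import hashlib
import sqlite3


class BlobStore:
    """Toy content-addressable store backed by a SQLite BLOB column."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS objects ("
            "digest TEXT PRIMARY KEY, media_type TEXT, body BLOB)"
        )

    def put(self, body: bytes, media_type: str) -> str:
        # The name is computed from the content, not chosen by the caller.
        digest = hashlib.sha256(body).hexdigest()
        # INSERT OR IGNORE: identical content is stored only once.
        self.db.execute(
            "INSERT OR IGNORE INTO objects VALUES (?, ?, ?)",
            (digest, media_type, body),
        )
        self.db.commit()
        return digest

    def get(self, digest: str) -> bytes:
        row = self.db.execute(
            "SELECT body FROM objects WHERE digest = ?", (digest,)
        ).fetchone()
        if row is None:
            raise KeyError(digest)
        return row[0]

    def count(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM objects").fetchone()[0]


store = BlobStore()
payload = b"\x89PNG fake image bytes"  # stand-in for a real image file
key = store.put(payload, "image/png")
same_key = store.put(payload, "image/png")  # same bytes, same name
```

Because the name depends only on the bytes, storing the same file twice yields the same key and a single row. The `media_type` column shows where metadata could sit next to the object.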
Extracting information from unstructured data:
Unstructured data has no structure, so it cannot be interpreted easily by conventional algorithms. It is also difficult to tag and index, which makes extracting information from it a tough job. Here are some possible solutions:
- Taxonomies, or classifications of data, help organise data in a hierarchical structure, which makes searching easier.
- Data can be stored in a virtual repository and tagged automatically, for example with Documentum.
- Application platforms like XOLAP can be used.
XOLAP helps in extracting information from e-mails and XML-based documents.
- Various data mining tools can be used.
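The taxonomy and automatic tagging ideas can be sketched with a simple keyword matcher. This is only an illustration under stated assumptions: the two-level `TAXONOMY`, its keywords and the sample messages are invented here, and real tools use far richer methods than matching single words.

```python
import re

# Hypothetical two-level taxonomy: category -> subcategory -> trigger keywords.
TAXONOMY = {
    "finance": {
        "invoice": {"invoice", "payment", "due"},
        "payroll": {"salary", "payslip"},
    },
    "support": {
        "outage": {"down", "outage", "unreachable"},
    },
}


def tag(text):
    """Return every 'category/subcategory' whose keywords occur in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    tags = []
    for category, subcategories in TAXONOMY.items():
        for subcategory, keywords in subcategories.items():
            if words & keywords:  # at least one keyword is present
                tags.append(f"{category}/{subcategory}")
    return tags


messages = [
    "Invoice 1042 is due, please send payment.",
    "The portal has been down since 9 am.",
    "Lunch on Friday?",
]

# Build a small index from tag to message positions so that a search
# can go through the hierarchy instead of scanning every text.
index = {}
for position, message in enumerate(messages):
    for label in tag(message):
        index.setdefault(label, []).append(position)
```

Once the messages are tagged, a query such as "everything under finance" becomes a lookup on the index keys. The third message matches no keyword and stays untagged, which shows the limit of this kind of rule.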