Project Idea | Availability Aware Distributed Data Deduplication

Last Updated : 31 Jul, 2018

Project Title: Availability Aware Distributed Data Deduplication
Problem Statement:
In this project, we aim to reduce the resources, such as storage space and disk I/O operations, that cloud vendors use to store and manage large volumes of data. We also aim to provide an environment that is highly available and reliable.

Idea/Abstract:
The number of cloud storage users is increasing day by day, and the amount of stored data is growing rapidly with it. Much of this data is duplicated, since two or more users may upload the same data (for example, files or videos shared by people on social networking apps). In addition, to make the storage system reliable and highly available, cloud storage vendors create redundant copies of the uploaded data through replication. All of this data has to be stored in a distributed environment made up of a group of servers. To address these issues efficiently, we propose a deduplication strategy for the distributed environment that provides reliability through replication and removes duplicate data through duplicate detection. We present a versatile and practical primary storage deduplication platform suitable for both replication and deduplication. To achieve this, we have developed a new in-memory data structure that efficiently detects duplicate data and also takes care of replication.

Data Structure Used in this Project:
A linked list combined with hashing serves as our in-memory data structure, and the SHA algorithm is used for deduplication and replication.
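
To make the combination of hashing and linked lists concrete, here is a minimal sketch of one possible layout: a hash table whose buckets are singly linked lists, keyed by the SHA digest of the content. It is an illustration under stated assumptions, not the project's actual implementation. The names `Node`, `DedupIndex`, `put` and `get`, the reference counter, and the choice of SHA-256 as the SHA variant are all hypothetical.

```python
import hashlib


class Node:
    """One entry in a bucket's singly linked list (hypothetical layout)."""

    def __init__(self, digest, data, next_node=None):
        self.digest = digest      # SHA-256 hex digest of the data
        self.data = data          # the single stored copy
        self.ref_count = 1        # how many uploads refer to this copy
        self.next = next_node


class DedupIndex:
    """Hash table with separate chaining, keyed by content digest."""

    def __init__(self, num_buckets=1024):
        self.buckets = [None] * num_buckets

    def _bucket(self, digest):
        # Map the hex digest to a bucket position.
        return int(digest, 16) % len(self.buckets)

    def put(self, data):
        """Store data once; return (digest, was_duplicate)."""
        digest = hashlib.sha256(data).hexdigest()
        i = self._bucket(digest)
        node = self.buckets[i]
        while node is not None:
            if node.digest == digest:
                node.ref_count += 1   # duplicate: count it, store nothing new
                return digest, True
            node = node.next
        # First time this content is seen: prepend a new node to the chain.
        self.buckets[i] = Node(digest, data, self.buckets[i])
        return digest, False

    def get(self, digest):
        """Return the stored data for a digest, or None if unknown."""
        node = self.buckets[self._bucket(digest)]
        while node is not None:
            if node.digest == digest:
                return node.data
            node = node.next
        return None


index = DedupIndex(num_buckets=8)
d1, dup1 = index.put(b"holiday video bytes")   # first upload
d2, dup2 = index.put(b"holiday video bytes")   # same content from another user
```

In this sketch the second upload only increments a counter on the existing node, so the content is held once no matter how many users upload it.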

What is data de-duplication?
Data deduplication refers to a technique for eliminating redundant data in a data set. In the process of deduplication, extra copies of the same data are deleted, leaving only one copy to be stored. The data is analysed to identify duplicate byte patterns so that only a single instance of each duplicated part is stored on the server.
Why data de-duplication?

It reduces the amount of storage needed for the given set of files.
It reduces costs and increases space efficiency in the storage environment.
It reduces disk I/O operations.

Why Replication?
Replication keeps the service reliable and highly available: because redundant copies of the data are kept on different servers, the system should be able to survive a failure without losing data. It also has a low execution overhead.
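
As a rough illustration of how replication lets data survive a server failure, the sketch below places each chunk on several servers chosen from its digest and then checks which copies remain reachable when one server is down. The placement rule, the function names `replica_servers` and `readable_from`, and the server names are assumptions made for this example; the project may place replicas differently.

```python
import hashlib


def replica_servers(digest, servers, copies=2):
    """Pick `copies` distinct servers for a chunk, starting from a position
    derived from its digest and walking the server list in order."""
    if copies > len(servers):
        raise ValueError("not enough servers for the requested copies")
    start = int(digest, 16) % len(servers)
    return [servers[(start + k) % len(servers)] for k in range(copies)]


def readable_from(digest, servers, failed, copies=2):
    """Return the replicas of a chunk that are still reachable."""
    return [s for s in replica_servers(digest, servers, copies)
            if s not in failed]


servers = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical names
digest = hashlib.sha256(b"some chunk").hexdigest()

placement = replica_servers(digest, servers, copies=2)
# Simulate the failure of the first replica holder.
survivors = readable_from(digest, servers, failed={placement[0]}, copies=2)
```

With two copies per chunk, losing any single server still leaves one readable replica, which is the property replication is meant to provide.
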
De-duplication Types and Levels:
Two types:

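Post-process: data is first written to storage and deduplicated afterwards.
Inline process: data is deduplicated as it arrives, before it is written to storage.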

Two Levels:

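File-level Deduplication: whole files are compared, and identical files are stored once.
Block-level Deduplication: files are split into blocks, and identical blocks are stored once.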
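
The difference between the two levels can be seen in a small sketch. It assumes fixed-size blocks and SHA-256 digests, and uses a deliberately tiny block size; the helper names `file_level_store`, `block_level_store` and `stored_bytes` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


def file_level_store(files):
    """Keep one copy per distinct whole-file digest."""
    store = {}
    for data in files:
        store.setdefault(hashlib.sha256(data).hexdigest(), data)
    return store


def block_level_store(files, block_size=BLOCK_SIZE):
    """Split each file into fixed-size blocks; keep one copy per distinct block."""
    store = {}
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            store.setdefault(hashlib.sha256(block).hexdigest(), block)
    return store


def stored_bytes(store):
    """Total number of bytes actually kept in a store."""
    return sum(len(value) for value in store.values())


file_a = b"AAAABBBBCCCC"
file_b = b"AAAABBBBDDDD"           # differs from file_a only in the last block
files = [file_a, file_b, file_a]   # file_a is uploaded twice

raw_bytes = sum(len(f) for f in files)                       # 36 bytes uploaded
file_level_bytes = stored_bytes(file_level_store(files))     # 24: two distinct files
block_level_bytes = stored_bytes(block_level_store(files))   # 16: four distinct blocks
```

File-level deduplication only removes the repeated upload of `file_a`, while block-level deduplication also shares the blocks that `file_a` and `file_b` have in common, at the cost of tracking more index entries.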

Hash Algorithm:
In this project, we use a hash algorithm to identify “chunks” of data. A hash algorithm is a function that converts an input of variable length into an output of fixed size, usually written as a hexadecimal string. The output of a cryptographic hash algorithm is practically irreversible, i.e. it is computationally infeasible to recover the input from the output.
Commonly used hash algorithms are:

MD5
SHA-256
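
The fixed-size property is easy to check with Python's standard `hashlib` module. This is only a small demonstration; the variable names are arbitrary.

```python
import hashlib

short_input = b"a"
long_input = b"a" * 1_000_000   # one million bytes

md5_short = hashlib.md5(short_input).hexdigest()
sha_short = hashlib.sha256(short_input).hexdigest()
sha_long = hashlib.sha256(long_input).hexdigest()

# Identical content always yields the same digest, which is what
# makes the digest usable as a duplicate detector.
sha_short_again = hashlib.sha256(b"a").hexdigest()
```

Here the MD5 digest is 32 hexadecimal characters (128 bits) and both SHA-256 digests are 64 hexadecimal characters (256 bits), regardless of whether the input was one byte or one million bytes.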

Conclusion:
We have successfully created an in-memory data structure that detects duplicate data and stores only one instance of it, thereby reducing the resources, such as storage space and disk I/O operations, used by cloud vendors. We are also able to provide the user with an environment that is highly available and more reliable. Hence, we have successfully implemented an availability-aware distributed data deduplication system.

Future Work:
De-dup Server Bottlenecks:

Load Balancing: To address this, we have to create multiple main servers to balance the network traffic.
In-Memory Hash Table: If the main server fails or reboots, the whole system goes down, since the index is held only in memory. To solve this, we have to add persistent storage that takes a snapshot of the whole in-memory data structure immediately after every update.
Support for file system commands such as ls, chmod, and chown.
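
The snapshot idea for the in-memory hash table could look roughly like the sketch below. It assumes the index can be represented as a JSON-serialisable dictionary, and the names `save_snapshot` and `load_snapshot`, the file name, and the entry layout are hypothetical. The snapshot is written to a temporary file and then renamed over the old one, so a crash during the write does not leave a half-written snapshot behind.

```python
import json
import os
import tempfile


def save_snapshot(index, path):
    """Write the index to `path` via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(index, f)
            f.flush()
            os.fsync(f.fileno())     # push the bytes to disk before renaming
        os.replace(tmp_path, path)   # swap the new snapshot into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path):
    """Rebuild the index after a restart; empty if no snapshot exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "dedup_index.json")   # hypothetical file name
    before = load_snapshot(path)                  # nothing saved yet
    entry = {"abc123": {"ref_count": 2, "replicas": ["node-a", "node-b"]}}
    save_snapshot(entry, path)
    after = load_snapshot(path)                   # what a rebooted server would see
```

Writing the whole structure after every update becomes expensive as the index grows, so how often to snapshot is a trade-off between durability and write cost.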

GitHub Link of Project: https://github.com/andh001/project_deduplication

Team Members: