Project Title: Availability Aware Distributed Data Deduplication
In this project, we aim to reduce the resources, such as storage space and disk I/O operations, that cloud vendors use to store and manage large volumes of data. We also aim to provide an environment that is highly available and reliable.
The number of cloud storage users is increasing day by day, and the volume of stored data is growing rapidly with it. Much of this data is duplicate, since two or more users may upload the same data (for example, files or videos shared by people on social networking apps). In addition, to keep the storage system reliable and highly available, cloud storage vendors create redundant copies of the uploaded data through replication. All of this data has to be stored in a distributed environment made up of a group of servers.
To address these issues efficiently, we propose a deduplication strategy for the distributed environment that provides reliability through replication and removes duplicate data through duplicate detection. We present a versatile and practical primary storage deduplication platform suitable for both replication and deduplication. To achieve this, we have developed a new in-memory data structure that efficiently detects duplicate data and also takes care of replication.
Data Structure Used in this Project:
We use a linked list combined with hashing as our in-memory data structure, and the SHA algorithm for deduplication and replication.
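As a rough illustration of how a linked list and hashing can work together, the sketch below builds a small hash table whose buckets are singly linked lists, keyed by the SHA-256 digest of the data. The names `Node`, `ChainedIndex`, and the location string are hypothetical and chosen for this example only; the data structure in the project repository may be organised differently.

```python
import hashlib


class Node:
    """One entry in a bucket's singly linked list (hypothetical layout)."""

    def __init__(self, digest, location, next_node=None):
        self.digest = digest        # SHA-256 digest of the stored data
        self.location = location    # where the single stored copy lives
        self.next = next_node       # next entry in the same bucket


class ChainedIndex:
    """In-memory hash table that resolves collisions with linked lists."""

    def __init__(self, num_buckets=1024):
        self.buckets = [None] * num_buckets

    def _bucket(self, digest):
        # Use the first 8 bytes of the digest to pick a bucket.
        return int.from_bytes(digest[:8], "big") % len(self.buckets)

    def lookup(self, digest):
        """Return the stored location for this digest, or None if unseen."""
        node = self.buckets[self._bucket(digest)]
        while node is not None:
            if node.digest == digest:
                return node.location
            node = node.next
        return None

    def insert(self, digest, location):
        """Add a new entry at the head of its bucket's list."""
        i = self._bucket(digest)
        self.buckets[i] = Node(digest, location, self.buckets[i])


index = ChainedIndex(num_buckets=8)
digest = hashlib.sha256(b"hello").digest()
index.insert(digest, "server-1:/chunks/0")  # assumed location format
```

A lookup that returns a location means the data is already stored, so the new upload can simply reference the existing copy.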
What is data de-duplication?
Data deduplication refers to a technique for eliminating redundant data in a data set. In the process of deduplication, extra copies of the same data are deleted, leaving only one copy to be stored. The data is analysed to identify duplicate byte patterns, so that only a single instance of each duplicated part is stored on the server.
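The idea can be sketched in a few lines. The example below is a minimal, assumed model and not the project's implementation: a hypothetical `DedupStore` class keeps one copy of each distinct piece of data, identified by its SHA-256 digest, and counts how many uploads refer to it.

```python
import hashlib


class DedupStore:
    """Toy store that keeps a single copy of each distinct byte string."""

    def __init__(self):
        self.blobs = {}     # digest -> the one stored copy
        self.refcount = {}  # digest -> number of uploads referring to it

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            # First time we see this content: store it.
            self.blobs[digest] = data
        # Duplicate or not, record one more reference.
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]


store = DedupStore()
first = store.put(b"holiday video")
second = store.put(b"holiday video")   # same content from another user
third = store.put(b"lecture notes")
```

Here the two uploads of the same content return the same digest and occupy storage only once, while the reference count records that two users depend on that copy.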
Why data de-duplication?
- It reduces the amount of storage needed for the given set of files.
- It reduces costs and increases space efficiency in the distributed storage environment.
- It reduces disk I/O operations.
Replication keeps the service reliable and highly available: the system should be able to survive a failure without losing data, while adding only a low execution overhead.
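One simple way to combine replication with deduplication is to store each unique piece of data on a small, fixed number of servers. The sketch below assumes a hypothetical `replica_servers` helper that derives the placement from the data's SHA-256 digest; the server names and the placement rule are illustrative assumptions, not the scheme used in the project.

```python
import hashlib


def replica_servers(digest_hex, servers, copies=2):
    """Pick `copies` distinct servers for a chunk, starting from a
    position derived from its digest and walking the server list."""
    if copies > len(servers):
        raise ValueError("not enough servers for the requested copies")
    start = int(digest_hex, 16) % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(copies)]


servers = ["s0", "s1", "s2", "s3"]          # hypothetical server names
digest = hashlib.sha256(b"chunk").hexdigest()
placed = replica_servers(digest, servers, copies=2)

# If one of the chosen servers fails, the other still holds the chunk.
survivors = [s for s in placed if s != placed[0]]
```

Because the placement depends only on the digest, every upload of the same data maps to the same servers, so duplicates are detected where the copies already live.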
De-duplication Types and Levels:
- Inline process
- File-level Deduplication
- Block-level Deduplication
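The difference between file-level and block-level deduplication can be seen with two files that share most of their content. The sketch below uses fixed-size blocks of 4 bytes purely for illustration; the helper names and the block size are assumptions made for this example.

```python
import hashlib


def file_level_keys(files):
    """One digest per whole file: only identical files are deduplicated."""
    return {hashlib.sha256(f).hexdigest() for f in files}


def block_level_keys(files, block_size=4):
    """One digest per fixed-size block: identical blocks are shared
    even when the files as a whole differ."""
    keys = set()
    for f in files:
        for i in range(0, len(f), block_size):
            keys.add(hashlib.sha256(f[i:i + block_size]).hexdigest())
    return keys


file_a = b"AAAABBBBCCCC"
file_b = b"AAAABBBBDDDD"   # differs from file_a only in the last block

unique_files = file_level_keys([file_a, file_b])
unique_blocks = block_level_keys([file_a, file_b])
```

With file-level deduplication both files must be stored in full (24 bytes), because they are not identical. With block-level deduplication only four distinct blocks are stored (16 bytes), since the first two blocks are shared.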
In this project, we use a hash algorithm to identify “chunks” of data. A hash algorithm is a function that converts input data of variable length into an output of fixed length. The output is irreversible in practice, i.e. we cannot feasibly reconstruct the input from the output of the hash algorithm. Since identical chunks always produce the same hash value, two chunks can be compared by comparing their hashes instead of their full contents.
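These properties are easy to check with SHA-256 from Python's standard `hashlib` module. The snippet is only a demonstration of the fixed-length and determinism properties; the choice of SHA-256 here is an assumption, since the text above only says "SHA".

```python
import hashlib

# Inputs of very different sizes...
short_digest = hashlib.sha256(b"a").hexdigest()
long_digest = hashlib.sha256(b"a" * 1_000_000).hexdigest()

# ...and the same input hashed a second time.
repeat_digest = hashlib.sha256(b"a").hexdigest()

# SHA-256 always yields 256 bits, i.e. 64 hexadecimal characters.
digest_lengths = (len(short_digest), len(long_digest))
```

Both digests have the same length regardless of input size, and hashing the same input again gives the same digest, which is what makes the hash usable as a chunk identifier.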
Commonly used hash algorithms are-
We have created an in-memory data structure that detects duplicate data and stores only one instance of it, reducing the cloud vendor's resource usage in terms of storage space and disk I/O operations. We are also able to provide the user with an environment that is highly available and more reliable. Hence, we have implemented an availability-aware distributed data deduplication system.
De-dup Server Bottlenecks:
- Load Balancing: To solve this problem, we have to create multiple main servers to balance the network traffic.
- In-Memory Hash Table: If the main server fails or reboots, the in-memory data is lost and the whole system crashes. To solve this issue, we have to add persistent storage that takes a snapshot of the whole in-memory data structure immediately after every update.
- Support for file system commands like ls, chmod, chown.
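The snapshot idea from the in-memory hash table item could look roughly like the sketch below. It assumes the index can be represented as a plain dictionary and serialised as JSON; the function names and file layout are hypothetical. The snapshot is written to a temporary file first and then renamed, so that a crash in the middle of a write does not leave a half-written snapshot behind.

```python
import json
import os
import tempfile


def save_snapshot(index: dict, path: str) -> None:
    """Write the whole index to disk, replacing the previous snapshot."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(index, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes reach the disk
    os.replace(tmp_path, path)    # swap the new snapshot into place


def load_snapshot(path: str) -> dict:
    """Rebuild the index after a restart; empty if no snapshot exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)
```

Calling `save_snapshot` after every update matches the description above, but rewriting the full structure each time is expensive for a large index; this trade-off would need to be measured in a real deployment.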
GitHub Link of Project: https://github.com/andh001/project_deduplication
- Prashant Sonsale (7276176311, firstname.lastname@example.org)
- Anand Fakatkar (8237516939, email@example.com)
- Nishant Agrawal (9921822904, firstname.lastname@example.org)
- Aditya Khowla (9762977289, email@example.com)
Note: This project idea is contributed for ProGeek Cup 2.0- A project competition by GeeksforGeeks.