Preprocessing refers to performing a series of operations that transform or change data before it is fed to an algorithm. Data processing, more generally, refers to operations that retrieve, transform, or change data, especially by computer. Preprocessing is a technique used to convert raw data into a clean data set.
In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis. Data processing then converts the raw format into a readable format (graphs, documents, etc.) so that it can be interpreted by computers and used by employees throughout an organization.
Need for Data Preprocessing :
- It transforms raw data into meaningful information. Data processing services require skilled professionals who apply different technologies to analyze and process data.
- New technologies such as ML (Machine Learning) are highly dependent on data. Because data is the core of these technologies, it has to be presented in a format that the technologies can easily work with.
- It is used to achieve better results from the applied model. In ML, data has to be in the proper format, and some models need a specific format. For example, many implementations of the random forest algorithm do not accept NULL values. Therefore, to run such an implementation, NULL values have to be handled in the raw data set first.
- The dataset should be formatted in such a way that more than one ML or deep learning algorithm can be executed on the same dataset, and the best of them is then selected.
- It increases the accuracy and efficiency of an ML model, because data preprocessing includes tasks that clean the data and make it suitable for the model.
- It improves the generalizability of an ML model. For any ML application, data is collected through “sensors”. The sensors used can be physical devices, instruments, software programs such as web crawlers, manual surveys, etc.
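As an illustration of handling NULL values before training, here is a minimal sketch of mean imputation in plain Python. The function name `impute_mean` and the sample rows are hypothetical, and replacing missing entries with the column mean is only one possible strategy (dropping rows or using the median are others).

```python
from statistics import mean


def impute_mean(rows):
    """Replace None entries in each column with the mean of that column.

    Assumes every column has at least one non-missing value;
    otherwise statistics.mean raises StatisticsError.
    """
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical raw data with missing values marked as None.
raw = [
    [1.0, 20.0],
    [None, 40.0],
    [3.0, None],
]
clean = impute_mean(raw)
print(clean)  # [[1.0, 20.0], [2.0, 40.0], [3.0, 30.0]]
```

After this step the data set contains no NULL values and can be passed to a model that does not accept them.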
Types of Data Preprocessing Techniques :
- Rescale Data –
When our data consists of attributes with different scales, many ML algorithms can benefit from rescaling the attributes. Rescaling brings all attributes of the dataset to the same scale, so that measurements across the dataset are uniform. It is also useful for optimization algorithms such as gradient descent, which tend to behave better when all attributes have a similar range.
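One common way to rescale is min-max scaling, which maps each attribute to the range 0 to 1. The sketch below assumes this particular method (the text above does not prescribe one), and the function name `min_max_rescale` and the sample columns are hypothetical.

```python
def min_max_rescale(column):
    """Map the values of one attribute linearly onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no scale information; map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Two attributes on very different scales.
ages = [20, 30, 40]
incomes = [20000, 50000, 80000]

print(min_max_rescale(ages))     # [0.0, 0.5, 1.0]
print(min_max_rescale(incomes))  # [0.0, 0.5, 1.0]
```

After rescaling, both attributes share the same range, so neither dominates purely because of its units.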
- Binarize data –
Binarization is the process of transforming the data features of an entity into binary numbers. It is done so that classification algorithms can work more efficiently. To convert to binary, we transform the data using a binary threshold: all values above the threshold are marked as 1, and all values equal to or below the threshold are marked as 0. This is called binarizing your data. It can be helpful when you have values, such as probabilities, that you want to turn into crisp values.
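The threshold rule described above can be sketched in a few lines. The function name `binarize` and the sample values are hypothetical; the threshold of 0.5 is chosen only for illustration.

```python
def binarize(values, threshold):
    """Mark values above the threshold as 1 and all others as 0."""
    return [1 if v > threshold else 0 for v in values]


probabilities = [0.1, 0.5, 0.51, 0.9]
print(binarize(probabilities, threshold=0.5))  # [0, 0, 1, 1]
```

Note that a value exactly equal to the threshold (0.5 here) is marked as 0, matching the rule stated above.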
- Data Augmentation –
Data augmentation is a strategy that allows practitioners to increase the diversity of the data available for training models without collecting new data. It simply means increasing the amount of data using information already available in the training data. Sometimes we need data that covers as many variations as possible to get better generalization, but the dataset is not big enough to capture that variation. In such cases, data augmentation is very helpful.
Common types of image data augmentation are given below:
- Flip :
We can flip images horizontally or vertically. Some frameworks do not provide a function for vertical flips, but we can perform a vertical flip by rotating the image by 180 degrees and then performing a horizontal flip.
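The following sketch represents an image as a list of pixel rows (an assumption made to keep the example dependency-free) and checks that rotating by 180 degrees and then flipping horizontally gives the same result as a direct vertical flip. All function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom by reversing the order of rows."""
    return [row[:] for row in image[::-1]]


def rotate_180(image):
    """Rotate by 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in image[::-1]]


image = [
    [1, 2, 3],
    [4, 5, 6],
]

print(flip_horizontal(image))  # [[3, 2, 1], [6, 5, 4]]
print(flip_vertical(image))    # [[4, 5, 6], [1, 2, 3]]

# Vertical flip built from a rotation followed by a horizontal flip.
via_rotation = flip_horizontal(rotate_180(image))
print(via_rotation == flip_vertical(image))  # True
```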
- Scale :
The image can be scaled outward or inward. When scaling outward, the final image is larger than the original one. When scaling inward, the final image is smaller than the original one.
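A minimal sketch of scaling, assuming nearest-neighbour interpolation on an image stored as a list of pixel rows. The interpolation method and the function name `resize_nearest` are assumptions for illustration; image libraries usually offer smoother methods such as bilinear interpolation.

```python
def resize_nearest(image, new_h, new_w):
    """Resize an image to new_h x new_w using nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


image = [
    [1, 2],
    [3, 4],
]

# Scaling outward: the result is larger than the original.
larger = resize_nearest(image, 4, 4)
print(larger)
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]

# Scaling inward: the result is smaller than its input.
smaller = resize_nearest(larger, 2, 2)
print(smaller)  # [[1, 2], [3, 4]]
```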
- Crop :
Unlike scaling, we just randomly select a section of the original image. We then resize this selected section to the original image size. This method is also called random cropping.
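Random cropping can be sketched as two steps: pick a random window, then resize it back to the original size. The example below again stores the image as a list of pixel rows and uses nearest-neighbour resizing; both choices, the function names, and the fixed seed are assumptions made for illustration.

```python
import random


def random_crop(image, crop_h, crop_w, rng=random):
    """Select a random crop_h x crop_w window from the image."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint includes both endpoints
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def resize_nearest(image, new_h, new_w):
    """Resize with nearest-neighbour sampling (simplest possible method)."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


# A 4x4 image whose pixel values are 0..15, row by row.
image = [[4 * r + c for c in range(4)] for r in range(4)]

rng = random.Random(0)  # seeded only so that runs are repeatable
crop = random_crop(image, 2, 2, rng)

# Resize the selected section back to the original image size.
augmented = resize_nearest(crop, len(image), len(image[0]))
print(crop)
print(augmented)
```

Calling `random_crop` repeatedly with an unseeded generator produces a different window each time, which is what makes the technique useful for augmentation.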
- Translation :
It involves moving the image along the x-axis, the y-axis, or both. This method of augmentation is very useful because objects can be located almost anywhere in an image.
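A minimal sketch of translation on an image stored as a list of pixel rows. Pixels shifted outside the frame are dropped and the uncovered area is filled with a constant; that fill policy, the function name `translate`, and its parameters are assumptions for illustration.

```python
def translate(image, dx, dy, fill=0):
    """Shift the image dx pixels right and dy pixels down.

    Negative values shift left or up. Pixels that leave the frame are
    discarded, and uncovered positions are set to `fill`.
    """
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = image[r][c]
    return out


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

print(translate(image, dx=1, dy=0))   # [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
print(translate(image, dx=0, dy=-1))  # [[4, 5, 6], [7, 8, 9], [0, 0, 0]]
```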