ML | Binning or Discretization
Real-world data tend to be noisy. Noisy data contains a large amount of additional, meaningless information, called noise. Data cleaning (or data cleansing) routines attempt to smooth out noise while identifying outliers in the data.
Three common data smoothing techniques are:
- Binning : Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.
- Regression : It conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
- Outlier analysis : Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters”. Intuitively, values that fall outside of the set of clusters may be considered as outliers.
Binning method for data smoothing –
Here, we are concerned with the Binning method for data smoothing. In this method the data is first sorted and then the sorted values are distributed into a number of buckets or bins. As binning methods consult the neighborhood of values, they perform local smoothing.
There are two basic binning approaches:
- Equal width (or distance) binning : The simplest binning approach is to partition the range [A, B] of the variable into k equal-width intervals. The interval width is the length of the range divided by k:
w = (B - A) / k
Thus, the i-th interval is
[A + (i-1)w, A + iw], where i = 1, 2, 3, ..., k
This method handles skewed data poorly, because most values can fall into a few bins while the remaining bins stay nearly empty.
- Equal depth (or frequency) binning : In equal-frequency binning we divide the range [A, B] of the variable into intervals that each contain an approximately equal number of points. Exactly equal frequencies may not be possible when values are repeated.
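The two partitioning schemes above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names `equal_width_bins` and `equal_frequency_bins` are made up for this example, the maximum value B is assumed to belong to the last interval, and the equal-frequency version simply slices the sorted list, so repeated values may end up in different bins.

```python
def equal_width_bins(values, k):
    """Partition values into k equal-width intervals over [min, max]."""
    a, b = min(values), max(values)
    w = (b - a) / k  # interval width, w = (B - A) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # Index of the interval containing v. The maximum value B is
        # clamped into the last bin (an assumption of this sketch).
        i = min(int((v - a) / w), k - 1) if w > 0 else 0
        bins[i].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Partition sorted values into k bins of approximately equal size."""
    data = sorted(values)
    n = len(data)
    # Bin i takes the slice [i*n//k, (i+1)*n//k), so bin sizes differ
    # by at most one element.
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]


prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]
print(equal_width_bins(prices, 3))
print(equal_frequency_bins(prices, 3))
```

With the same data, the equal-width bins hold 4, 2 and 3 values, while the equal-frequency bins hold 3 values each. This shows how the two approaches can group the same values differently.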
How to perform smoothing on the data?
There are three approaches to smoothing the values within each bin:
- Smoothing by bin means : In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
- Smoothing by bin median : In smoothing by bin medians, each value in a bin is replaced by the median value of the bin.
- Smoothing by bin boundary : In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Sorted data for price (in dollars) : 2, 6, 7, 9, 13, 20, 21, 24, 30
Partition using equal frequency approach:
- Bin 1 : 2, 6, 7
- Bin 2 : 9, 13, 20
- Bin 3 : 21, 24, 30
Smoothing by bin mean :
- Bin 1 : 5, 5, 5
- Bin 2 : 14, 14, 14
- Bin 3 : 25, 25, 25
Smoothing by bin median :
- Bin 1 : 6, 6, 6
- Bin 2 : 13, 13, 13
- Bin 3 : 24, 24, 24
Smoothing by bin boundary :
- Bin 1 : 2, 7, 7
- Bin 2 : 9, 9, 20
- Bin 3 : 21, 21, 30
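The three smoothing approaches can be checked against the worked example with a short sketch like the one below. The helper names are hypothetical and exist only for this illustration. For smoothing by bin boundaries, a value that is equally close to both boundaries is sent to the lower one; that tie rule is an assumption, since the text does not specify one.

```python
from statistics import mean, median


def smooth_by_mean(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundary(bins):
    """Replace every value in a bin with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equal-frequency bins from the price example.
bins = [[2, 6, 7], [9, 13, 20], [21, 24, 30]]
print(smooth_by_mean(bins))
print(smooth_by_median(bins))
print(smooth_by_boundary(bins))
```

Running this on the three price bins should give the same smoothed values as listed in the example.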
Binning can also be used as a discretization technique. Here, discretization refers to the process of converting continuous attributes (also called features or variables) into discrete or nominal ones by partitioning their range into intervals.
For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each value in a bin by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. Each continuous value is thereby converted to a discrete or nominal value that is the same for every member of its bin.
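As a rough sketch of discretization, the snippet below assigns each price to an equal-width bin and returns both a bin label and the bin mean as the discretized value. The function name `discretize` and the use of integer labels 0 to k-1 are choices made for this illustration; it also assumes the data contain at least two distinct values and that the maximum value falls into the last bin.

```python
from statistics import mean


def discretize(values, k):
    """Return (labels, representatives) using equal-width binning into k bins."""
    a, b = min(values), max(values)
    w = (b - a) / k  # assumes max(values) > min(values)
    # Bin index (0 .. k-1) for each value, kept in the original order.
    labels = [min(int((v - a) / w), k - 1) for v in values]
    # Collect the members of each non-empty bin and compute the bin mean.
    members = {}
    for v, i in zip(values, labels):
        members.setdefault(i, []).append(v)
    reps = {i: mean(vs) for i, vs in members.items()}
    # Each value is replaced by the mean of the bin it fell into.
    return labels, [reps[i] for i in labels]


prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]
labels, discretized = discretize(prices, 3)
print(labels)
print(discretized)
```

The labels can serve as nominal categories, while the bin means keep a numeric representative for each interval. Which of the two is more useful depends on the downstream model.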
Below is the Python implementation: