Simple 1D Kernel Density Estimation in Scikit Learn

Last Updated : 08 Jun, 2023

In this article, we will learn how to use Scikit learn for generating simple 1D kernel density estimation. We will first understand what is kernel density estimation and then we will look into its implementation in Python using KernelDensity class of sklearn.neighbors in scikit learn library.

Kernel density estimation

A non-parametric method for estimating the probability density function of a continuous random variable using kernels as weights is known as kernel density estimation (or KDE). It is used in a variety of tasks, like data visualization, data analysis, and machine learning. The idea behind KDE is to treat each observation point as a small probability distribution, and the density estimate is obtained by aggregating these distributions.

A kernel function is employed in KDE to estimate the density at each data point, and the separate kernel estimates are then added together to provide the final density estimate. The predicted density’s shape is determined by the kernel function, a symmetric, smooth function that is typically bell-shaped.

The formula for KDE can be expressed as follows:

$\text{KDE}(x) = \frac{1}{nh} \underset{i\rightarrow n}{\sum} K(\frac{x - x_i}{h})$

Where:

KDE(x) is the estimated density at point x.
n is the number of data points.
h is the bandwidth, which controls the smoothness of the estimated density.
x_i represents each data point in the dataset.
K(x;h) is the kernel function, which determines the contribution of each data point to the estimated density.

Some of the Commonly used kernel functions include:

Gaussian (or Normal) Kernel: K(x,h) = \frac{1}{\sqrt{2\pi}}\exp\left[\frac{(x-x_i)^2}{2h^2}\right]
Tophat Kernel: K(x;h) = \frac{1}{\sqrt{2\pi}}\exp\left(\frac{1}{2}\left|\frac{x-x_i}{h}\right|\right)
Epanechnikov Kernel: K(x;h) = \left[\frac{3}{4}\left(1 – (\frac{x – x_i}{h})^2 (|\frac{x – x_i}{h}| \leq 1\right)\right]
Exponential Kernel: K(x;h) = 0.5 \exp(-|\frac{x-x_i}{h}|)
Linear Kernel: $K(x;h) = (1 - |\frac{x-x_i}{h}|)(|\frac{x-x_i}{h}| \leq 1)$
Cosine Kernel: K(x;h) = \frac{\pi}{4}\cos( \frac{\pi}{2}\left(\frac{x-x_i}{h}\right)) (|\frac{x-x_i}{h}| \leq 1)

The kernels used in KDE are smooth, symmetric, and usually, bell-shaped functions, and the type of kernel and width of the kernel (bandwidth) determines the smoothness and accuracy of the estimated density. The estimated density at a particular point is obtained by calculating the weighted average of the kernel functions centered at each data point, with the weights determined by the distance between the point of interest and the data points.

Code Implementation

To implement this we will first import the required libraries.

Python3

import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.neighbors import KernelDensity

Now, we will look into different kernels available to use in this library.

Python3

# Different types of available kernels 
kernels = ["gaussian", "tophat", "epanechnikov", "exponential", "linear", "cosine"] 
  
# Create a figure with 3 rows and 2 columns 
fig, ax = plt.subplots(3, 2) 
  
# Set the size of the figure 
fig.set_figheight(15) 
fig.set_figwidth(10) 
  
# Set the title of the figure 
fig.suptitle("Different types of available kernels") 
  
  
# Create a 1D array of x values to plot the distribution curve 
x_plot = np.linspace(-6, 6, 1000)[:, np.newaxis] 
x_src = np.zeros((1, 1)) 
  
# Plot the distribution curve of each kernel 
for i, kernel in enumerate(kernels): 
    # Calculating the log of the probability density function 
    log_dens = KernelDensity(kernel=kernel).fit(x_src).score_samples(x_plot) 
      
    # Plot the distribution curve 
    ax[i // 2, i % 2].fill(x_plot[:, 0], np.exp(log_dens)) 
  
    # Set the title, x and y labels of the plot 
    ax[i // 2, i % 2].set_title(kernel) 
    ax[i // 2, i % 2].set_xlim(-3, 3) 
    ax[i // 2, i % 2].set_ylim(0, 1) 
    ax[i // 2, i % 2].set_ylabel("Density") 
    ax[i // 2, i % 2].set_xlabel("x") 
  
# Display the plot 
plt.show()

Output:

Probability distribution plot of different types of available kernels-Geeksforgeeks

Now, let us implement kernel density estimation on random data using a Gaussian kernel with a bandwidth of 0.5.

Python3

# Plot the 1D density curve for the gaussian kernel 
  
# Create a sample distribution 
N = 100
X = np.concatenate((np.random.normal(0, 1, int(0.6 * N)),  
                    np.random.normal(10, 1, int(0.4 * N))) 
                )[:, np.newaxis] 
X_plot = np.linspace(-5, 15, 1000)[:, np.newaxis] 
  
# Calculate the true density 
true_density = 0.6 * norm(0, 1).pdf(X_plot[:, 0]) + \ 
               0.4 * norm(10, 1).pdf(X_plot[:, 0]) 
  
# Creating a figure 
fig, ax = plt.subplots() 
  
# Plotting the true density 
ax.fill( 
    X_plot[:, 0], true_density,  
    fc='black', alpha=0.2,  
    label='Sample distribution'
) 
  
# Calculating the density using the gaussian kernel with bandwidth 0.5 
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X) 
  
# Calculating the log of the probability density function 
log_dens = kde.score_samples(X_plot) 
  
# Plotting the density curve 
ax.plot( 
    X_plot[:, 0], 
    np.exp(log_dens), 
    color="cornflowerblue", 
    linestyle="-", 
    label="Gaussian kernel density"
) 
  
# Set the title, x and y labels of the plot 
ax.set_title("Gaussian Kernel Density") 
ax.set_xlim(-4, 15) 
ax.set_ylim(0, 0.3) 
ax.grid(True) 
ax.legend(loc='upper right') 
  
# Display the plot 
plt.show()