
Feature Agglomeration vs Univariate Selection in Scikit Learn

Last Updated : 25 Jan, 2024

Feature selection is a crucial stage in machine learning whose aim is to keep the features most relevant to a given task. Feature Agglomeration and Univariate Selection are two popular feature-reduction methods in Scikit-Learn. Both reduce dimensionality, make models more efficient, and can improve model performance.

What is Feature Agglomeration?

Feature Agglomeration is a dimensionality-reduction technique. It merges similar features of the dataset, reducing the number of features while retaining the most important information. It is particularly helpful when working with high-dimensional data that has a large number of features.

Example:

Suppose you have a dataset with many attributes describing customer behavior, such as purchase frequency, average transaction value, and time spent on the website. Because these attributes are correlated, Feature Agglomeration can combine them into a single feature representing overall customer engagement, as the sketch below illustrates.
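
A minimal sketch of this idea, assuming made-up synthetic data (the customer-behaviour columns and numbers below are purely hypothetical):

Python3

import numpy as np
from sklearn.cluster import FeatureAgglomeration

# Hypothetical customer-behaviour data: purchase frequency, average
# transaction value and time on site, all driven by one underlying
# "engagement" signal plus noise, so the three columns are correlated.
rng = np.random.default_rng(42)
engagement = rng.normal(size=100)
X_customers = np.column_stack([
    engagement + 0.1 * rng.normal(size=100),  # purchase frequency
    engagement + 0.1 * rng.normal(size=100),  # average transaction value
    engagement + 0.1 * rng.normal(size=100),  # time spent on the website
])

# Merge the three correlated columns into one aggregated feature
# (each cluster is replaced by the mean of its features by default).
agglo = FeatureAgglomeration(n_clusters=1)
X_engagement = agglo.fit_transform(X_customers)
print(X_engagement.shape)  # (100, 1)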

Differences from Univariate Selection:

Feature Agglomeration takes the correlations between features into account when combining them, whereas Univariate Selection assesses each feature separately against a statistical criterion.

Advantages/Disadvantages of Feature Agglomeration

Advantages:

  • Preserves the underlying structure of correlated features.
  • Can improve model performance when features are strongly correlated.

Disadvantages:

  • May not work well if the features are not correlated.
  • The transformed features can be difficult to interpret.

Applications of Feature Agglomeration

  1. Image processing, where pixel values can be agglomerated according to their spatial relationships.
  2. Natural language processing, where word embeddings can be agglomerated by semantic similarity.

What is Univariate Selection?

Univariate Selection is a feature-selection technique that assesses each feature independently. It ranks the features according to a statistical criterion and keeps the highest-ranked ones for further analysis or modeling.

Example:

When applying Univariate Selection to a dataset with many features, the features might, for example, be ranked by their variance so that the highest-variance features, which potentially carry more information, are kept. A small sketch of this follows.
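
A small illustration of the idea, as a sketch only (it uses the Iris data purely for concreteness, and raw variance is just one possible univariate criterion):

Python3

import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Score every feature independently by its variance and keep the top two.
variances = X.var(axis=0)
top_two = np.argsort(variances)[-2:]       # indices of the two largest variances
X_selected = X[:, top_two]

print("Variances:", variances)
print("Selected feature indices:", top_two)
print("Reduced shape:", X_selected.shape)  # (150, 2)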

Differences from Feature Agglomeration:

While Feature Agglomeration takes correlations between features into account and merges them, Univariate Selection handles each feature individually and selects or ranks it on its own statistics.

Advantages/Disadvantages of Univariate Selection

Advantages:

  • Simple to use and computationally efficient.
  • Provides information about the importance of individual features.

Disadvantages:

  • Ignores possible interactions between features.
  • May not work well when features are strongly correlated with one another.

Applications of Univariate Selection

  • Gene expression analysis, where it is critical to pinpoint individual genes showing notable shifts in expression.
  • Finance, to choose individual financial metrics that have a strong influence on a model.

Feature Agglomeration vs. Univariate Selection using Scikit-Learn

1. Import Libraries:

The required libraries are imported here:

  • load_iris: a function that loads the Iris dataset.
  • SelectKBest: a class for univariate feature selection.
  • f_classif: a function that computes the ANOVA F-value between each feature and the target.
  • FeatureAgglomeration: a class for feature agglomeration.

Python3




from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cluster import FeatureAgglomeration


2. Load the Iris Dataset:

After the Iris dataset is loaded, its features are stored in X and the target labels in y.

Python3




iris = load_iris()
X, y = iris.data, iris.target


3. Feature Agglomeration:

Feature Agglomeration is applied to lower the dataset’s dimensionality. With n_clusters set to 2, the algorithm groups the four features into two clusters; the transformed data is stored in X_reduced.

Python3




agglomeration = FeatureAgglomeration(n_clusters=2)
X_reduced = agglomeration.fit_transform(X)
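
To see how the four original features were grouped, the fitted object's labels_ attribute can be inspected; by default each cluster is replaced by the mean of the features assigned to it:

Python3

# One cluster label per original feature; features sharing a label
# were merged into the same aggregated column.
print("Feature cluster labels:", agglomeration.labels_)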


4. Univariate Selection:

Univariate feature selection is applied using the ANOVA F-value. With k=2, only the top two features are kept; the transformed data is stored in X_k_best.

Python3




k_best = SelectKBest(f_classif, k=2)
X_k_best = k_best.fit_transform(X, y)
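
To check which two of the original features the ANOVA F-test kept, and how every feature scored, the fitted selector exposes get_support() and scores_:

Python3

# Indices of the selected features and the ANOVA F-score of every feature.
print("Selected feature indices:", k_best.get_support(indices=True))
print("F-scores:", k_best.scores_)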


5. Display the Results:

Python3




print("Original Shape:", X.shape)
print("Agglomerated Shape:", X_reduced.shape)
print("Univariate Selection Shape:", X_k_best.shape)


Output:

Original Shape: (150, 4)
Agglomerated Shape: (150, 2)
Univariate Selection Shape: (150, 2)

6. Train a model on the agglomerated dataset

Python3




from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
# Train a shallow decision tree on the agglomerated features
tree_clf = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=2)
tree_clf.fit(X_reduced, y)
pred = tree_clf.predict(X_reduced)
 
print(classification_report(y, pred, target_names=iris.target_names))


Output:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.96      0.88      0.92        50
   virginica       0.89      0.96      0.92        50

    accuracy                           0.95       150
   macro avg       0.95      0.95      0.95       150
weighted avg       0.95      0.95      0.95       150

7. Train a model on the univariate-selected dataset

Python3




# Train the same decision tree on the features chosen by SelectKBest
from sklearn.metrics import classification_report
tree_clf = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=2)
tree_clf.fit(X_k_best, y)
pred = tree_clf.predict(X_k_best)
 
print(classification_report(y, pred, target_names=iris.target_names))


Output:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.91      0.98      0.94        50
   virginica       0.98      0.90      0.94        50

    accuracy                           0.96       150
   macro avg       0.96      0.96      0.96       150
weighted avg       0.96      0.96      0.96       150

As we can see from the results above, univariate feature selection performed slightly better than Feature Agglomeration on this dataset.


