Adenovirus Disease Prediction for Child Healthcare Using Machine Learning

Last Updated : 23 Aug, 2023

The key idea of this proposed work is to recognise Adenovirus disease from bodily symptom data, so that people can stay health conscious and take precautions against Adenovirus infection. Using this data and Machine Learning methods, we build a model that detects Adenovirus disease from a person's reported symptoms. Our goal is to raise health awareness and help people, particularly children, guard against Adenovirus infections in order to avoid widespread outbreaks like Covid-19. The model's distinguishing feature is its capacity to identify a likely Adenovirus infection as soon as an individual enters their body parameters, potentially reducing the need for physical examinations in hospitals. This is especially useful in rural locations where access to doctors may be limited.

DNA viruses known as adenoviruses commonly cause mild infections of the upper or lower respiratory tract, gastrointestinal tract, or conjunctiva. Hepatitis, hemorrhagic colitis, hemorrhagic cystitis, pancreatitis, nephritis, and meningoencephalitis are uncommon manifestations of adenovirus infection. Because they lack humoral immunity, young children are more likely to contract adenovirus infections. In closed or crowded environments, epidemics of adenovirus infection may affect healthy children or adults (particularly military recruits). In patients with compromised immunity, the disease is more severe and spread is more likely.

Adenovirus Disease

Adenovirus disease refers to a range of infections caused by adenoviruses, a family of viruses that can infect humans. Adenoviruses are classified into approximately 50 different types, each of which can cause a distinct set of symptoms. Adenovirus infections are most common in children, but they can occur in people of all ages. The most common symptoms of adenovirus infection include:

  • Breathing Problem
  • Pink Eye
  • Pneumonia
  • Fever
  • Acute Gastroenteritis (inflammation of the stomach and intestines)
  • Dry cough
  • Sore throat
  • Bladder infection

About the Dataset

The dataset contains 5434 samples, each described by 8 body parameters. Of all the collected samples, 4484 are infected (Adenovirus) and 950 are healthy (non-Adenovirus). The Machine Learning algorithms are trained on this dataset so that they can work as an alternative way to diagnose and predict Adenovirus and non-Adenovirus cases accurately.

Dataset Link: https://github.com/AmartaKundu/Machine_Learning/blob/main/Adenoviruses/Adenoviruses_Dataset.csv
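
To verify the class split described above, the target column can be inspected directly after loading the CSV. This is a minimal sketch, assuming the file has been downloaded locally as Adenoviruses_Dataset.csv; the expected counts come from the description above.

python3

import pandas as pd

adenoviruses = pd.read_csv('Adenoviruses_Dataset.csv')

print(adenoviruses.shape)                           # expected: (5434, 9)
print(adenoviruses['Adenoviruses'].value_counts())  # expected: 4484 Yes, 950 No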

The dataset contains 9 features:

Sl. No.   Feature Name
1.        Breathing Problem
2.        Pink Eye
3.        Pneumonia
4.        Fever
5.        Acute Gastroenteritis
6.        Dry Cough
7.        Sore throat
8.        Bladder Infection
9.        Adenoviruses (to be predicted)

Adenovirus Disease Prediction

Prerequisite

Python 3

  • Pandas – to load the DataFrame
  • NumPy – for scientific computing with Python
  • Matplotlib – to visualize the data features, e.g. as a bar plot
  • Label Encoding – to convert categorical columns into numerical ones
  • Scikit-learn – for model building and data analysis

Step 1: Importing Libraries and Dataset

Here we are using the following libraries:

python3




import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
  
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
  
  
adenoviruses = pd.read_csv('Adenoviruses_Dataset.csv')
  
pd.set_option('display.max_columns', None)
  
print(adenoviruses.head())


Output:

  Breathing Problem Pink Eye Pneumonia  Fever Acute Gastroenteritis   \
0               Yes      Yes       Yes    Yes                   Yes
1               Yes      Yes       Yes    Yes                   Yes
2               Yes      Yes       Yes    Yes                    No
3               Yes      Yes       Yes    Yes                   Yes
4               Yes      Yes       Yes    Yes                   Yes

  Dry Cough Sore throat Bladder Infection Adenoviruses
0       Yes         Yes                No          Yes
1       Yes         Yes               Yes          Yes
2       Yes         Yes               Yes          Yes
3       Yes          No               Yes          Yes
4       Yes         Yes                No          Yes

Step 2: Check the Data Info

In the dataset, samples from patients already infected with the virus are labelled Adenovirus, while samples from healthy patients are labelled non-Adenovirus. The dataset used in this work is dominated by Adenovirus samples (approximately four-fifths), with the remaining samples being non-Adenovirus.

python3




adenoviruses.info()


Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5434 entries, 0 to 5433
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Breathing Problem       5434 non-null   int64
 1   Pink Eye                5434 non-null   int64
 2   Pneumonia               5434 non-null   int64
 3   Fever                   5434 non-null   int64
 4   Acute Gastroenteritis   5434 non-null   int64
 5   Dry Cough               5434 non-null   int64
 6   Sore throat             5434 non-null   int64
 7   Bladder Infection       5434 non-null   int64
 8   Adenoviruses            5434 non-null   int64
dtypes: int64(9)
memory usage: 382.2 KB
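
The prerequisites mention Matplotlib for visualising the data features as a bar plot; a minimal sketch of such a plot for the class imbalance described above, reusing the adenoviruses DataFrame loaded in Step 1, could look like this.

python3

import matplotlib.pyplot as plt

# Bar plot of the target distribution (Adenovirus vs non-Adenovirus)
counts = adenoviruses['Adenoviruses'].value_counts()
counts.plot(kind='bar')
plt.title('Class distribution of the Adenoviruses target')
plt.xlabel('Adenoviruses')
plt.ylabel('Number of samples')
plt.tight_layout()
plt.show()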

Step 3: Check the Descriptive Statistics of the Data

python3




print(adenoviruses.describe(include='all'))


Output:

       Breathing Problem Pink Eye Pneumonia  Fever Acute Gastroenteritis   \
count               5434     5434      5434   5434                  5434
unique                 2        2         2      2                     2
top                  Yes      Yes       Yes    Yes                   Yes
freq                3620     3620      3620   4273                  2820

       Dry Cough Sore throat Bladder Infection Adenoviruses
count       5434        5434              5434         5434
unique         2           2                 2            2
top          Yes         Yes                No          Yes
freq        4307        3953              2920         4383

Step 4: Feature transformation

Feature transformation, also known as data transformation, is the process of modifying the original features or variables in a dataset to make them more suitable for machine learning algorithms. Scaling, encoding, normalisation, and dimensionality reduction are some of the techniques used in this transformation. The purpose of feature transformation is to improve data quality and relevance, model performance, and the accuracy of the predictions or insights drawn from the data.

python3




from sklearn.preprocessing import LabelEncoder

# Encode each Yes/No column as 0/1 ('No' -> 0, 'Yes' -> 1)
e = LabelEncoder()
  
adenoviruses['Breathing Problem']=e.fit_transform(adenoviruses['Breathing Problem'])
adenoviruses['Pink Eye']=e.fit_transform(adenoviruses['Pink Eye'])
adenoviruses['Pneumonia ']=e.fit_transform(adenoviruses['Pneumonia '])
adenoviruses['Fever']=e.fit_transform(adenoviruses['Fever'])
adenoviruses['Acute Gastroenteritis ']=e.fit_transform(adenoviruses['Acute Gastroenteritis '])
adenoviruses['Dry Cough']=e.fit_transform(adenoviruses['Dry Cough'])
adenoviruses['Sore throat']=e.fit_transform(adenoviruses['Sore throat'])
adenoviruses['Bladder Infection']=e.fit_transform(adenoviruses['Bladder Infection'])
adenoviruses['Adenoviruses']=e.fit_transform(adenoviruses['Adenoviruses'])
  
print(adenoviruses.head())


Output:

   Breathing Problem  Pink Eye  Pneumonia   Fever  Acute Gastroenteritis   \
0                  1         1          1       1                       1
1                  1         1          1       1                       1
2                  1         1          1       1                       0
3                  1         1          1       1                       1
4                  1         1          1       1                       1

   Dry Cough  Sore throat  Bladder Infection  Adenoviruses
0          1            1                  0             1
1          1            1                  1             1
2          1            1                  1             1
3          1            0                  1             1
4          1            1                  0             1
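
Because LabelEncoder assigns codes in alphabetical order ('No' -> 0, 'Yes' -> 1), the result above is equivalent to an explicit mapping. As an alternative sketch, applied to the raw Yes/No columns instead of the LabelEncoder loop above, the mapping can be written out directly; the trailing spaces in 'Pneumonia ' and 'Acute Gastroenteritis ' are assumed to match the CSV headers, as in the code above.

python3

# Explicit Yes/No -> 1/0 mapping for every column (alternative to LabelEncoder)
yes_no_map = {'Yes': 1, 'No': 0}
for col in ['Breathing Problem', 'Pink Eye', 'Pneumonia ', 'Fever',
            'Acute Gastroenteritis ', 'Dry Cough', 'Sore throat',
            'Bladder Infection', 'Adenoviruses']:
    adenoviruses[col] = adenoviruses[col].map(yes_no_map)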

Step 5: Correlation between features

python3




# Pairwise correlation matrix of all encoded features
corr = adenoviruses.corr()
# The gradient styling below renders as a heatmap in a Jupyter notebook
corr.style.background_gradient(cmap='coolwarm', axis=None)


Output:

Correlation heatmap of the features
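
Outside a notebook, the same information can be read off numerically; a small sketch that ranks how strongly each symptom correlates with the target:

python3

# Correlation of each feature with the Adenoviruses target, strongest first
corr = adenoviruses.corr()
print(corr['Adenoviruses'].drop('Adenoviruses').sort_values(ascending=False))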

Step 6: Split the dataset

python3




from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
  
# Split into features (x) and target (y)
x = adenoviruses.drop('Adenoviruses', axis=1)
y = adenoviruses['Adenoviruses']

# 80% training / 20% testing split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
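
Because the classes are imbalanced (roughly 4:1), a stratified split with a fixed random_state is a common variant: it keeps the class ratio in both halves and makes the scores reproducible. The accuracies reported below were obtained with the plain split above, so this is only an optional sketch.

python3

# Optional: stratified, reproducible split (random_state=42 is an arbitrary choice)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=42, stratify=y)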


Step 7: Build the model

A. Logistic Regression

python3




model = LogisticRegression()
# Fit the model
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
# Score/Accuracy
acc_logreg = model.score(x_test, y_test)*100
print(acc_logreg)


Output:

91.53633854645814

B. RandomForestRegressor

Note that RandomForestRegressor is a regression model, so model.score() here returns the R² value rather than classification accuracy, which is why the figure below is much lower than the classifiers' scores.

python3




model = RandomForestRegressor(n_estimators=1000)
#Fit
model.fit(x_train, y_train)
#Score/Accuracy
acc_randomforest=model.score(x_test, y_test)*100
print(acc_randomforest)


Output:

68.89109966932078

C. GradientBoostingRegressor

As with the random forest above, this is a regressor, so its score is an R² value rather than classification accuracy.

python3




GBR = GradientBoostingRegressor(n_estimators=100, max_depth=4)
#Fit
GBR.fit(x_train, y_train)
acc_gbk=GBR.score(x_test, y_test)*100
print(acc_gbk)


Output:

68.24771110553792
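
Since the target here is a binary label, classifier counterparts of these two models would report true classification accuracy instead of R². This sketch is not part of the original experiments, so the scores it produces are not the ones reported in this article.

python3

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Classification counterparts of the two regressors above
rfc = RandomForestClassifier(n_estimators=1000)
rfc.fit(x_train, y_train)
print(rfc.score(x_test, y_test) * 100)   # classification accuracy, not R^2

gbc = GradientBoostingClassifier(n_estimators=100, max_depth=4)
gbc.fit(x_train, y_train)
print(gbc.score(x_test, y_test) * 100)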

D. KNeighborsClassifier

python3




knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
#Score/Accuracy
acc_knn=knn.score(x_test, y_test)*100
print(acc_knn)


Output:

92.91628334866606

E. DecisionTreeClassifier

python3




t = tree.DecisionTreeClassifier()
t.fit(x_train,y_train)
y_pred = t.predict(x_test)
#Score/Accuracy
acc_decisiontree=t.score(x_test, y_test)*100
print(acc_decisiontree)


Output:

93.56025758969642
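
Because the dataset is imbalanced, per-class metrics are more informative than accuracy alone; a minimal sketch using the decision tree predictions (y_pred) from the block above:

python3

from sklearn.metrics import classification_report, confusion_matrix

# Rows of the confusion matrix are true classes, columns are predictions
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))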

F. Gaussian Naive Bayes (GaussianNB)

python3




model = GaussianNB()
model.fit(x_train,y_train)
#Score/Accuracy
acc_gaussian= model.score(x_test, y_test)*100
print(acc_gaussian)


Output:

85.28058877644894

G. Support Vector Machines(SVM)

python3




#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(x_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(x_test)
#Score/Accuracy
acc_svc=clf.score(x_test, y_test)*100
print(acc_svc)


Output:

91.90432382704692

Step 8: Model Accuracy

python3




models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Decision Tree', 'Gradient Boosting Classifier'],
    'Score': [acc_svc, acc_knn, acc_logreg,
              acc_randomforest, acc_gaussian, acc_decisiontree, acc_gbk]})
print(models.sort_values(by='Score', ascending=False))


Output:

                          Model      Score
5                 Decision Tree  93.560258
1                           KNN  92.916283
0       Support Vector Machines  91.904324
2           Logistic Regression  91.536339
4                   Naive Bayes  85.280589
3                 Random Forest  68.891100
6  Gradient Boosting Classifier  68.247711
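
The comparison can also be visualised; a small sketch that plots the scores from the models DataFrame built above:

python3

import matplotlib.pyplot as plt

# Horizontal bar chart of the model scores, best model at the top
models.sort_values(by='Score').plot(kind='barh', x='Model', y='Score', legend=False)
plt.xlabel('Score (%)')
plt.tight_layout()
plt.show()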

Step 9: Result Analysis

To estimate accuracy, we divided the dataset into a training set (80% of the data) and a testing set (20% of the data) and trained the Machine Learning models on the training set. When evaluated on the remaining 20% of the dataset, the decision tree classifier attained an accuracy of 93.5%.

We also tested K-Nearest Neighbours (KNN), Support Vector Machines (SVM), Logistic Regression, Naive Bayes, Random Forest, and gradient boosting. While these algorithms performed well, their scores varied: KNN had an accuracy of 92.91%, SVM 91.90%, Logistic Regression 91.53%, and Naive Bayes 85.28%. The Random Forest and gradient boosting models were fitted as regressors, so their scores of 68.89 and 68.24 are R² values rather than classification accuracies and are not directly comparable.

Overall, the decision tree classifier outperformed all other algorithms examined in this proposed work, providing the highest accuracy. Compared to the other methods, its capacity to efficiently analyse and classify the data resulted in greater predictive accuracy.
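
A single 80/20 split can over- or under-state accuracy depending on which rows land in the test set. As a check, k-fold cross-validation gives a less split-dependent estimate; a minimal sketch for the decision tree (the exact numbers will differ from the 93.5% above):

python3

from sklearn.model_selection import cross_val_score
from sklearn import tree

# 5-fold cross-validation on the full encoded dataset
scores = cross_val_score(tree.DecisionTreeClassifier(), x, y, cv=5)
print(scores.mean() * 100, scores.std() * 100)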

Step 10: Prediction

After applying all the models to the dataset, the Decision Tree gives the most accurate outcome, so we use it for the final prediction in this proposed work. Human adenovirus (HAdV) is a major cause of acute respiratory infections (ARIs) in children.

Test Case-1:

python3




# Use the trained decision tree for prediction
model = t

# Real sample input, feature order:
# [Breathing Problem, Pink Eye, Pneumonia, Fever, Acute Gastroenteritis,
#  Dry Cough, Sore throat, Bladder Infection] with 1 = Yes, 0 = No
result = model.predict([[1, 1, 1, 1, 1, 1, 1, 0]])

# Final Prediction
if result[0] == 1:
    print("The Patient is Adenovirus Positive(+ve)")
else:
    print("The Patient is Adenovirus Negative(-ve)")


Output:

The Patient is Adenovirus Positive(+ve)

Test Case-2:

python3




# Real sample input in the same feature order as above (1 = Yes, 0 = No)
result = model.predict([[0, 0, 1, 0, 0, 1, 0, 0]])

# Final Prediction
if result[0] == 1:
    print("The Patient is Adenovirus Positive(+ve)")
else:
    print("The Patient is Adenovirus Negative(-ve)")


Output:

The Patient is Adenovirus Negative(-ve)
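
Passing a bare Python list works, but recent scikit-learn versions warn that the input has no feature names because the model was fitted on a DataFrame. A small sketch that avoids the warning by building a one-row DataFrame with the training columns:

python3

# Wrap the sample in a DataFrame with the same column names used for training
sample = pd.DataFrame([[0, 0, 1, 0, 0, 1, 0, 0]], columns=x.columns)
print(model.predict(sample)[0])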

Conclusion

Currently, this model can only predict Adenovirus or non-Adenovirus, but in the future we will enhance it using Deep Learning algorithms so that it will be able to predict various diseases besides Adenovirus. This ML model is highly recommended to the health ministry and to scientists for further research, and it is available as open source free of cost so that every individual can benefit from it.


