
Adenovirus Disease Prediction for Child Healthcare Using Machine Learning

The main contribution of this proposed work is recognizing Adenovirus disease from body parameters with the help of data and Machine Learning, so that people can stay health conscious and take precautions against infection. Our goal is to raise health awareness and empower people to take safeguards against Adenovirus infections, particularly in children, in order to avoid widespread outbreaks like Covid-19. The model's distinguishing feature is its capacity to identify an Adenovirus infection once an individual enters their body parameters, potentially reducing the need for physical examinations in hospitals. This is especially useful in rural locations where access to doctors may be limited.

Adenoviruses are DNA viruses that commonly cause minor infections of the upper or lower respiratory tract, gastrointestinal system, or conjunctiva. Hepatitis, hemorrhagic colitis, hemorrhagic cystitis, pancreatitis, nephritis, and meningoencephalitis are less common manifestations of adenovirus infection. Because they lack humoral immunity, young children are more likely to contract adenovirus infections. In closed or crowded environments, epidemics of adenovirus infection may affect healthy children or adults (particularly military recruits). In patients with compromised immunity, the disease is more severe and spread is more likely.



Adenovirus Disease

Adenovirus disease refers to a range of infections produced by adenoviruses, a family of viruses that can infect humans. More than 50 types of adenovirus are recognized, and each can induce a different set of symptoms. Adenovirus infections are most common in children, but they can occur in people of all ages. Common symptoms of adenovirus infection include fever, sore throat, pink eye (conjunctivitis), cough, breathing problems, and acute gastroenteritis, which are the kinds of signs recorded in the dataset used below.

About the Dataset

The dataset contains 5434 physical samples with 8 body parameters. Of all the collected samples, 4484 are infected (Adenovirus) and 950 are healthy (non-Adenovirus). The Machine Learning algorithms are trained on this dataset so that they can work as an alternative way to diagnose and predict Adenovirus and non-Adenovirus cases accurately.



Dataset Link: https://github.com/AmartaKundu/Machine_Learning/blob/main/Adenoviruses/Adenoviruses_Dataset.csv

The dataset contains 9 columns (8 symptom features and the target to be predicted):

Sl. No. Features Name
1. Breathing Problem
2. Pink Eye
3. Pneumonia
4. Fever
5. Acute Gastroenteritis
6. Dry Cough
7. Sore throat
8. Bladder Infection
9. Adenoviruses (To be predicted)

Adenovirus Disease Prediction

Prerequisite

Python 3

Step 1: Importing Libraries and Dataset

Here we are using the following libraries:




import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
  
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
  
  
adenoviruses = pd.read_csv('Adenoviruses_Dataset.csv')
  
pd.set_option('display.max_columns', None)
  
print(adenoviruses.head())

Output:

  Breathing Problem Pink Eye Pneumonia  Fever Acute Gastroenteritis   \
0               Yes      Yes       Yes   Yes                   Yes
1               Yes      Yes       Yes   Yes                   Yes
2               Yes      Yes       Yes   Yes                    No
3               Yes      Yes       Yes   Yes                   Yes
4               Yes      Yes       Yes   Yes                   Yes

  Dry Cough Sore throat Bladder Infection Adenoviruses
0       Yes         Yes                No          Yes
1       Yes         Yes               Yes          Yes
2       Yes         Yes               Yes          Yes
3       Yes          No               Yes          Yes
4       Yes         Yes                No          Yes

Step 2: Check the Data Info

The dataset labels patients who are infected with the virus as Adenovirus (positive) and patients who are not infected as non-Adenovirus (negative). The dataset used in this research is mostly populated with adenovirus samples (approximately 4/5), and the remaining samples are non-adenovirus.




adenoviruses.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5434 entries, 0 to 5433
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Breathing Problem       5434 non-null   object
 1   Pink Eye                5434 non-null   object
 2   Pneumonia               5434 non-null   object
 3   Fever                   5434 non-null   object
 4   Acute Gastroenteritis   5434 non-null   object
 5   Dry Cough               5434 non-null   object
 6   Sore throat             5434 non-null   object
 7   Bladder Infection       5434 non-null   object
 8   Adenoviruses            5434 non-null   object
dtypes: object(9)
memory usage: 382.2+ KB
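
As a quick sanity check on the class balance mentioned above, the target column can be counted directly. This is a small optional sketch; it uses the same 'Adenoviruses' column name as the rest of the code:

# Count how many samples fall into each class of the target column
print(adenoviruses['Adenoviruses'].value_counts())
# According to the dataset description this should show roughly 4484 infected (Yes)
# and 950 healthy (No) samples, i.e. about a 4:1 split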

Step 3: Check the Descriptive Statistics of the Data




print(adenoviruses.describe(include='all'))

Output:

       Breathing Problem Pink Eye Pneumonia  Fever Acute Gastroenteritis   \
count               5434     5434      5434   5434                  5434
unique                 2        2         2      2                     2
top                  Yes      Yes       Yes    Yes                   Yes
freq                3620     3620      3620   4273                  2820

       Dry Cough Sore throat Bladder Infection Adenoviruses
count       5434        5434              5434         5434
unique         2           2                 2            2
top          Yes         Yes                No          Yes
freq        4307        3953              2920         4383

Step 4: Feature transformation

Feature transformation, also known as data transformation, is the process of modifying the original features or variables in a dataset to make them more suitable for analysis by machine learning algorithms. Scaling, encoding, normalisation, and dimensionality reduction are some of the techniques that can be used in this transformation. The purpose of feature transformation is to improve data quality and relevance, boost model performance, and allow more accurate predictions or insights to be drawn from the data. Here, the Yes/No symptom values are label-encoded into 1/0 integers.




from sklearn.preprocessing import LabelEncoder
e=LabelEncoder()
  
# LabelEncoder converts the Yes/No strings of each column into 1/0 integers
# ('No' -> 0, 'Yes' -> 1). Note that 'Pneumonia ' and 'Acute Gastroenteritis '
# keep the trailing space used in the CSV column headers.
adenoviruses['Breathing Problem']=e.fit_transform(adenoviruses['Breathing Problem'])
adenoviruses['Pink Eye']=e.fit_transform(adenoviruses['Pink Eye'])
adenoviruses['Pneumonia ']=e.fit_transform(adenoviruses['Pneumonia '])
adenoviruses['Fever']=e.fit_transform(adenoviruses['Fever'])
adenoviruses['Acute Gastroenteritis ']=e.fit_transform(adenoviruses['Acute Gastroenteritis '])
adenoviruses['Dry Cough']=e.fit_transform(adenoviruses['Dry Cough'])
adenoviruses['Sore throat']=e.fit_transform(adenoviruses['Sore throat'])
adenoviruses['Bladder Infection']=e.fit_transform(adenoviruses['Bladder Infection'])
adenoviruses['Adenoviruses']=e.fit_transform(adenoviruses['Adenoviruses'])
  
print(adenoviruses.head())

Output:

   Breathing Problem  Pink Eye  Pneumonia   Fever  Acute Gastroenteritis   \
0                  1         1           1      1                       1
1                  1         1           1      1                       1
2                  1         1           1      1                       0
3                  1         1           1      1                       1
4                  1         1           1      1                       1

   Dry Cough  Sore throat  Bladder Infection  Adenoviruses
0          1            1                  0             1
1          1            1                  1             1
2          1            1                  1             1
3          1            0                  1             1
4          1            1                  0             1
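
The column-by-column encoding above can also be written as a loop. This is only an equivalent sketch; it assumes every column in the frame is a binary Yes/No string, which the describe() output earlier indicates:

from sklearn.preprocessing import LabelEncoder

# Encode every Yes/No column in one pass; LabelEncoder maps 'No' -> 0 and 'Yes' -> 1
encoder = LabelEncoder()
for col in adenoviruses.columns:
    adenoviruses[col] = encoder.fit_transform(adenoviruses[col])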

Step 5: Correlation between features




corr=adenoviruses.corr()
# In a Jupyter notebook, this styled output renders as a colour-graded correlation table
corr.style.background_gradient(cmap='coolwarm',axis=None)

Output:

[Correlation heatmap between features]
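
If the styled DataFrame above is not convenient (for example, outside a Jupyter notebook), a seaborn heatmap gives a similar view. This is an optional sketch that only uses the libraries already imported in Step 1:

# Draw the correlation matrix as a colour-coded heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between features')
plt.tight_layout()
plt.show()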

Step 6: Split the dataset




from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
  
x=adenoviruses.drop('Adenoviruses',axis=1)
y=adenoviruses['Adenoviruses']
  
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20)
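
The scores reported below come from a random, unseeded split, so rerunning the code will give slightly different numbers. An optional variant (a sketch, not part of the original workflow) uses a stratified, seeded split to keep the roughly 4:1 class ratio in both sets and make the results reproducible:

# Stratified split with a fixed seed: same class ratio in train and test,
# and the same split on every run
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, stratify=y, random_state=42)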

Step 7: Build the model

A. Logistic Regression




model = LogisticRegression()
# Fit the model
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
# Score/Accuracy
acc_logreg = model.score(x_test, y_test)*100
print(acc_logreg)

Output:

91.53633854645814
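
Because roughly four-fifths of the samples are Adenovirus-positive, accuracy alone can look optimistic. A confusion matrix and per-class precision/recall give a fuller picture; a small optional sketch for the logistic regression predictions above:

from sklearn.metrics import confusion_matrix, classification_report

# Per-class breakdown of the logistic regression predictions
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))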

B. RandomForestRegressor




model = RandomForestRegressor(n_estimators=1000)
#Fit
model.fit(x_train, y_train)
#Score: for a regressor, .score() returns R-squared, not classification accuracy
acc_randomforest=model.score(x_test, y_test)*100
print(acc_randomforest)

Output:

68.89109966932078

C. GradientBoostingRegressor




GBR = GradientBoostingRegressor(n_estimators=100, max_depth=4)
#Fit
GBR.fit(x_train, y_train)
#Score: as with the random forest regressor above, this is R-squared, not accuracy
acc_gbk=GBR.score(x_test, y_test)*100
print(acc_gbk)

Output:

68.24771110553792
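
Note that RandomForestRegressor and GradientBoostingRegressor are regression models, so their .score() values above are R-squared rather than classification accuracy, which is why they look much lower than the classifiers. To compare them on the same footing, their continuous predictions can be thresholded at 0.5 and scored with accuracy_score; an optional sketch (model still refers to the random forest regressor fitted above):

# Turn continuous regressor outputs into 0/1 labels and measure accuracy
rf_acc = accuracy_score(y_test, (model.predict(x_test) >= 0.5).astype(int)) * 100
gbr_acc = accuracy_score(y_test, (GBR.predict(x_test) >= 0.5).astype(int)) * 100
print(rf_acc, gbr_acc)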

D. KNeighborsClassifier




knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
#Score/Accuracy
acc_knn=knn.score(x_test, y_test)*100
print(acc_knn)

Output:

92.91628334866606
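
The choice of n_neighbors affects the KNN score; a quick optional sketch for trying a few values of k (the exact numbers will depend on the random split):

# Compare test accuracy for several neighbourhood sizes
for k in (3, 5, 10, 15):
    score = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train).score(x_test, y_test)
    print(k, round(score * 100, 2))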

E. DecisionTreeClassifier




t = tree.DecisionTreeClassifier()
t.fit(x_train,y_train)
y_pred = t.predict(x_test)
#Score/Accuracy
acc_decisiontree=t.score(x_test, y_test)*100
print(acc_decisiontree)

Output:

93.56025758969642
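
Since the decision tree turns out to be the best performer, it can be useful to see which symptoms it relies on most. A small optional sketch using the fitted tree t:

# Feature importances of the fitted decision tree, highest first
importances = pd.Series(t.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))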

F. Gaussian Naive Bayes




model = GaussianNB()
model.fit(x_train,y_train)
#Score/Accuracy
acc_gaussian= model.score(x_test, y_test)*100
print(acc_gaussian)

Output:

85.28058877644894

G. Support Vector Machines(SVM)




#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(x_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(x_test)
#Score/Accuracy
acc_svc=clf.score(x_test, y_test)*100
print(acc_svc)

Output:

91.90432382704692

Step 8: Model Accuracy




models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Decision Tree', 'Gradient Boosting Classifier'],
    'Score': [acc_svc, acc_knn, acc_logreg, 
              acc_randomforest, acc_gaussian, acc_decisiontree, acc_gbk]})
print(models.sort_values(by='Score', ascending=False))

Output:

                          Model      Score
5                 Decision Tree  93.560258
1                           KNN  92.916283
0       Support Vector Machines  91.904324
2           Logistic Regression  91.536339
4                   Naive Bayes  85.280589
3                 Random Forest  68.891100
6  Gradient Boosting Classifier  68.247711
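
The same comparison can be visualised with the plotting libraries imported in Step 1; an optional sketch:

# Bar chart of the model scores, best first
models.sort_values(by='Score', ascending=False).plot.bar(x='Model', y='Score', legend=False)
plt.ylabel('Score (%)')
plt.tight_layout()
plt.show()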

Step 9: Result Analysis

To attain the best accuracy, we divided the dataset into a training set (80% of the data) and a testing set (20% of the data). We then used the decision tree approach to train the Machine Learning model. When evaluated on the remaining 20% of the dataset, the decision tree method attained an accuracy of 93.5%.

We also tested K-Nearest Neighbours (KNN), Support Vector Machines (SVM), Logistic Regression, Naive Bayes, Random Forest, and gradient boosting methods. While these algorithms performed well, their accuracy varied. KNN had an accuracy of 92.91%, SVM had an accuracy of 91.9%, Logistic Regression had an accuracy of 91.53%, and Naive Bayes had an accuracy of 85.28%. Random Forest scored 68.89% and gradient boosting 68.24%, but these two were trained as regressors here, so their scores are R-squared values rather than true classification accuracies.

Overall, the decision tree method outperformed all other algorithms examined in our proposed work, providing the highest accuracy. When compared to other methodologies, its capacity to efficiently analyse and classify data resulted in greater predicted accuracy.
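
Because a single 80/20 split can be lucky or unlucky, an optional cross-validation check (a sketch; the exact mean score is not claimed here) gives a less split-dependent estimate for the decision tree:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of a fresh decision tree on the full dataset
cv_scores = cross_val_score(tree.DecisionTreeClassifier(), x, y, cv=5)
print(cv_scores.mean() * 100)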

Step 10: Prediction

After applying all the models to the dataset, the Decision Tree gives the most efficient outcome, so it is used for the final predictions in this proposed work. Human adenovirus (HAdV) is a major cause of acute respiratory infections (ARIs) in children.

Test Case-1:




model = t
# Real sample input, given in the training column order:
# Breathing Problem, Pink Eye, Pneumonia, Fever, Acute Gastroenteritis,
# Dry Cough, Sore throat, Bladder Infection (1 = Yes, 0 = No)
result = model.predict([[1,1,1,1,1,1,1,0]])
  
# Final Prediction
if result==1:
    print("The Patient is Adenovirus Positive(+ve)")
else:
    print("The Patient is Adenovirus Negative(-ve)")

Output:

The Patient is Adenovirus Positive(+ve)

Test Case-2:




# Real Sample Input
result = model.predict([[0,0,1,0,0,1,0,0]])
  
# Final Prediction
if result==1:
    print("The Patient is Adenovirus Positive(+ve)")
else:
    print("The Patient is Adenovirus Negative(-ve)")

Output:

The Patient is Adenovirus Negative(-ve)
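
Because the model was fitted on a DataFrame, newer scikit-learn versions emit a feature-name warning when predicting on a bare list as above. Wrapping the sample in a DataFrame with the training columns avoids that; a small optional sketch:

# Build the sample with the same column names (and order) used during training
sample = pd.DataFrame([[0, 0, 1, 0, 0, 1, 0, 0]], columns=x.columns)
print(model.predict(sample))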

Conclusion

Currently, this model can only predict Adenovirus or non-Adenovirus, but in the future we will enhance the model using a Deep Learning algorithm so that it will be able to predict various diseases besides Adenovirus. This ML model is recommended to the health ministry and to scientists for further research. The model is available as open source, free of cost, so that every individual can benefit from it.

