
Adenovirus Disease Prediction for Child Healthcare Using Machine Learning

The main contribution of this proposed work is recognizing Adenovirus disease from body parameters with the help of data and Machine Learning, so that people can stay health conscious and take precautions against infection. Our goal is to raise health awareness and empower people to take safeguards against Adenovirus infections, particularly in children, in order to avoid widespread outbreaks like Covid-19. The model's distinguishing feature is its capacity to identify an Adenovirus infection once an individual enters their body parameters, potentially reducing the need for physical examinations in hospitals. This is especially useful in rural locations where access to doctors may be limited.

Adenoviruses are DNA viruses that commonly cause minor infections of the upper or lower respiratory tract, gastrointestinal system, or conjunctiva. Hepatitis, hemorrhagic colitis, hemorrhagic cystitis, pancreatitis, nephritis, and meningoencephalitis are less common manifestations of adenovirus infection. Because they lack humoral immunity, young children are more likely to contract adenovirus infections. In closed or crowded environments, epidemics of adenovirus infection may affect healthy children or adults (particularly military recruits). In patients with compromised immunity, the disease is more severe and spread is more likely.



Adenovirus Disease

Adenovirus disease refers to a range of infections produced by adenoviruses, a family of viruses that can infect humans. More than 50 types of adenovirus are recognized, and each can induce a different set of symptoms. Adenovirus infections are most common in children, but they can occur in people of all ages. Common symptoms of adenovirus infection include fever, sore throat, pink eye (conjunctivitis), cough, breathing problems, and acute gastroenteritis, which are the kinds of signs recorded in the dataset used below.

About the Dataset

The dataset contains 5434 physical samples with 8 body parameters. Of all the collected samples, 4484 are infected (Adenovirus) and 950 are healthy (non-Adenovirus). The Machine Learning algorithms are trained on this dataset so that they can work as an alternative way to diagnose and predict Adenovirus and non-Adenovirus cases accurately.



Dataset Link: https://github.com/AmartaKundu/Machine_Learning/blob/main/Adenoviruses/Adenoviruses_Dataset.csv

The dataset contains 9 columns (8 symptom features and the target to be predicted):

Sl. No. Features Name
1. Breathing Problem
2. Pink Eye
3. Pneumonia
4. Fever
5. Acute Gastroenteritis
6. Dry Cough
7. Sore throat
8. Bladder Infection
9. Adenoviruses (To be predicted)

Adenovirus Disease Prediction

Prerequisite

Python 3

Step 1: Importing Libraries and Dataset

Here we are using the following libraries:




import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
  
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
  
  
adenoviruses = pd.read_csv('Adenoviruses_Dataset.csv')
  
pd.set_option('display.max_columns', None)
  
print(adenoviruses.head())

Output:

  Breathing Problem Pink Eye Pneumonia  Fever Acute Gastroenteritis   \
0               Yes      Yes       Yes   Yes                   Yes
1               Yes      Yes       Yes   Yes                   Yes
2               Yes      Yes       Yes   Yes                    No
3               Yes      Yes       Yes   Yes                   Yes
4               Yes      Yes       Yes   Yes                   Yes

  Dry Cough Sore throat Bladder Infection Adenoviruses
0       Yes         Yes                No          Yes
1       Yes         Yes               Yes          Yes
2       Yes         Yes               Yes          Yes
3       Yes          No               Yes          Yes
4       Yes         Yes                No          Yes

Step 2: Check the Data Info

The dataset labels patients who are infected with the virus as Adenovirus (positive) and patients who are not infected as non-Adenovirus (negative). The dataset used in this research is mostly populated with adenovirus samples (approximately 4/5), and the remaining samples are non-adenovirus.




adenoviruses.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5434 entries, 0 to 5433
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Breathing Problem       5434 non-null   object
 1   Pink Eye                5434 non-null   object
 2   Pneumonia               5434 non-null   object
 3   Fever                   5434 non-null   object
 4   Acute Gastroenteritis   5434 non-null   object
 5   Dry Cough               5434 non-null   object
 6   Sore throat             5434 non-null   object
 7   Bladder Infection       5434 non-null   object
 8   Adenoviruses            5434 non-null   object
dtypes: object(9)
memory usage: 382.2+ KB
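
As a quick sanity check on the class balance mentioned above, the target column can be counted directly. This is a small optional sketch; it uses the same 'Adenoviruses' column name as the rest of the code:

# Count how many samples fall into each class of the target column
print(adenoviruses['Adenoviruses'].value_counts())
# According to the dataset description this should show roughly 4484 infected (Yes)
# and 950 healthy (No) samples, i.e. about a 4:1 split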

Step 3: Check the Descriptive Statistics of the Data




print(adenoviruses.describe(include='all'))

Output:

       Breathing Problem Pink Eye Pneumonia  Fever Acute Gastroenteritis   \
count               5434     5434      5434   5434                  5434
unique                 2        2         2      2                     2
top                  Yes      Yes       Yes    Yes                   Yes
freq                3620     3620      3620   4273                  2820

       Dry Cough Sore throat Bladder Infection Adenoviruses
count       5434        5434              5434         5434
unique         2           2                 2            2
top          Yes         Yes                No          Yes
freq        4307        3953              2920         4383

Step 4: Feature transformation

Feature transformation, also known as data transformation, is the process of modifying the original features or variables in a dataset to make them more suitable for analysis by machine learning algorithms. Scaling, encoding, normalisation, and dimensionality reduction are some of the techniques that can be used in this transformation. The purpose of feature transformation is to improve data quality and relevance, boost model performance, and allow more accurate predictions or insights to be drawn from the data. Here, the Yes/No symptom values are label-encoded into 1/0 integers.




from sklearn.preprocessing import LabelEncoder
e=LabelEncoder()
  
# LabelEncoder converts the Yes/No strings of each column into 1/0 integers
# ('No' -> 0, 'Yes' -> 1). Note that 'Pneumonia ' and 'Acute Gastroenteritis '
# keep the trailing space used in the CSV column headers.
adenoviruses['Breathing Problem']=e.fit_transform(adenoviruses['Breathing Problem'])
adenoviruses['Pink Eye']=e.fit_transform(adenoviruses['Pink Eye'])
adenoviruses['Pneumonia ']=e.fit_transform(adenoviruses['Pneumonia '])
adenoviruses['Fever']=e.fit_transform(adenoviruses['Fever'])
adenoviruses['Acute Gastroenteritis ']=e.fit_transform(adenoviruses['Acute Gastroenteritis '])
adenoviruses['Dry Cough']=e.fit_transform(adenoviruses['Dry Cough'])
adenoviruses['Sore throat']=e.fit_transform(adenoviruses['Sore throat'])
adenoviruses['Bladder Infection']=e.fit_transform(adenoviruses['Bladder Infection'])
adenoviruses['Adenoviruses']=e.fit_transform(adenoviruses['Adenoviruses'])
  
print(adenoviruses.head())

Output:

   Breathing Problem  Pink Eye  Pneumonia   Fever  Acute Gastroenteritis   \
0                  1         1           1      1                       1
1                  1         1           1      1                       1
2                  1         1           1      1                       0
3                  1         1           1      1                       1
4                  1         1           1      1                       1

   Dry Cough  Sore throat  Bladder Infection  Adenoviruses
0          1            1                  0             1
1          1            1                  1             1
2          1            1                  1             1
3          1            0                  1             1
4          1            1                  0             1
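
The column-by-column encoding above can also be written as a loop. This is only an equivalent sketch; it assumes every column in the frame is a binary Yes/No string, which the describe() output earlier indicates:

from sklearn.preprocessing import LabelEncoder

# Encode every Yes/No column in one pass; LabelEncoder maps 'No' -> 0 and 'Yes' -> 1
encoder = LabelEncoder()
for col in adenoviruses.columns:
    adenoviruses[col] = encoder.fit_transform(adenoviruses[col])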

Step 5: Correlation between features




corr=adenoviruses.corr()
# In a Jupyter notebook, this styled output renders as a colour-graded correlation table
corr.style.background_gradient(cmap='coolwarm',axis=None)

Output:

[Correlation heatmap between features]
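
If the styled DataFrame above is not convenient (for example, outside a Jupyter notebook), a seaborn heatmap gives a similar view. This is an optional sketch that only uses the libraries already imported in Step 1:

# Draw the correlation matrix as a colour-coded heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between features')
plt.tight_layout()
plt.show()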

Step 6: Split the dataset




from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
  
x=adenoviruses.drop('Adenoviruses',axis=1)
y=adenoviruses['Adenoviruses']
  
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20)
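
The scores reported below come from a random, unseeded split, so rerunning the code will give slightly different numbers. An optional variant (a sketch, not part of the original workflow) uses a stratified, seeded split to keep the roughly 4:1 class ratio in both sets and make the results reproducible:

# Stratified split with a fixed seed: same class ratio in train and test,
# and the same split on every run
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, stratify=y, random_state=42)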

Step 7: Build the model

A. Logistic Regression




model = LogisticRegression()
# Fit the model
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
# Score/Accuracy
acc_logreg = model.score(x_test, y_test)*100
print(acc_logreg)

Output:

91.53633854645814
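
Because roughly four-fifths of the samples are Adenovirus-positive, accuracy alone can look optimistic. A confusion matrix and per-class precision/recall give a fuller picture; a small optional sketch for the logistic regression predictions above:

from sklearn.metrics import confusion_matrix, classification_report

# Per-class breakdown of the logistic regression predictions
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))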

B. RandomForestRegressor




model = RandomForestRegressor(n_estimators=1000)
#Fit
model.fit(x_train, y_train)
#Score: for a regressor, .score() returns R-squared, not classification accuracy
acc_randomforest=model.score(x_test, y_test)*100
print(acc_randomforest)

Output:

68.89109966932078

C. GradientBoostingRegressor




GBR = GradientBoostingRegressor(n_estimators=100, max_depth=4)
#Fit
GBR.fit(x_train, y_train)
#Score: as with the random forest regressor above, this is R-squared, not accuracy
acc_gbk=GBR.score(x_test, y_test)*100
print(acc_gbk)

Output:

68.24771110553792
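
Note that RandomForestRegressor and GradientBoostingRegressor are regression models, so their .score() values above are R-squared rather than classification accuracy, which is why they look much lower than the classifiers. To compare them on the same footing, their continuous predictions can be thresholded at 0.5 and scored with accuracy_score; an optional sketch (model still refers to the random forest regressor fitted above):

# Turn continuous regressor outputs into 0/1 labels and measure accuracy
rf_acc = accuracy_score(y_test, (model.predict(x_test) >= 0.5).astype(int)) * 100
gbr_acc = accuracy_score(y_test, (GBR.predict(x_test) >= 0.5).astype(int)) * 100
print(rf_acc, gbr_acc)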

D. KNeighborsClassifier




knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
#Score/Accuracy
acc_knn=knn.score(x_test, y_test)*100
print(acc_knn)

Output:

92.91628334866606
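
The choice of n_neighbors affects the KNN score; a quick optional sketch for trying a few values of k (the exact numbers will depend on the random split):

# Compare test accuracy for several neighbourhood sizes
for k in (3, 5, 10, 15):
    score = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train).score(x_test, y_test)
    print(k, round(score * 100, 2))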

E. DecisionTreeClassifier




t = tree.DecisionTreeClassifier()
t.fit(x_train,y_train)
y_pred = t.predict(x_test)
#Score/Accuracy
acc_decisiontree=t.score(x_test, y_test)*100
print(acc_decisiontree)

Output:

93.56025758969642
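
Since the decision tree turns out to be the best performer, it can be useful to see which symptoms it relies on most. A small optional sketch using the fitted tree t:

# Feature importances of the fitted decision tree, highest first
importances = pd.Series(t.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))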

F. Gaussian Naive Bayes




model = GaussianNB()
model.fit(x_train,y_train)
#Score/Accuracy
acc_gaussian= model.score(x_test, y_test)*100
print(acc_gaussian)

Output:

85.28058877644894

G. Support Vector Machines(SVM)




#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(x_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(x_test)
#Score/Accuracy
acc_svc=clf.score(x_test, y_test)*100
print(acc_svc)

Output:

91.90432382704692

Step 8: Model Accuracy




models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Decision Tree', 'Gradient Boosting Classifier'],
    'Score': [acc_svc, acc_knn, acc_logreg, 
              acc_randomforest, acc_gaussian, acc_decisiontree, acc_gbk]})
print(models.sort_values(by='Score', ascending=False))

Output:

                          Model      Score
5                 Decision Tree  93.560258
1                           KNN  92.916283
0       Support Vector Machines  91.904324
2           Logistic Regression  91.536339
4                   Naive Bayes  85.280589
3                 Random Forest  68.891100
6  Gradient Boosting Classifier  68.247711
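
The same comparison can be visualised with the plotting libraries imported in Step 1; an optional sketch:

# Bar chart of the model scores, best first
models.sort_values(by='Score', ascending=False).plot.bar(x='Model', y='Score', legend=False)
plt.ylabel('Score (%)')
plt.tight_layout()
plt.show()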

Step 9: Result Analysis

To attain the best accuracy, we divided the dataset into a training set (80% of the data) and a testing set (20% of the data). We then used the decision tree approach to train the Machine Learning model. When evaluated on the remaining 20% of the dataset, the decision tree method attained an accuracy of 93.5%.

We also tested K-Nearest Neighbours (KNN), Support Vector Machines (SVM), Logistic Regression, Naive Bayes, Random Forest, and gradient boosting methods. While these algorithms performed well, their accuracy varied. KNN had an accuracy of 92.91%, SVM had an accuracy of 91.9%, Logistic Regression had an accuracy of 91.53%, and Naive Bayes had an accuracy of 85.28%. Random Forest scored 68.89% and gradient boosting 68.24%, but these two were trained as regressors here, so their scores are R-squared values rather than true classification accuracies.

Overall, the decision tree method outperformed all other algorithms examined in our proposed work, providing the highest accuracy. When compared to other methodologies, its capacity to efficiently analyse and classify data resulted in greater predicted accuracy.
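
Because a single 80/20 split can be lucky or unlucky, an optional cross-validation check (a sketch; the exact mean score is not claimed here) gives a less split-dependent estimate for the decision tree:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of a fresh decision tree on the full dataset
cv_scores = cross_val_score(tree.DecisionTreeClassifier(), x, y, cv=5)
print(cv_scores.mean() * 100)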

Step 10: Prediction

After applying all the models to the dataset, the Decision Tree gives the most efficient outcome, so it is used for the final predictions in this proposed work. Human adenovirus (HAdV) is a major cause of acute respiratory infections (ARIs) in children.

Test Case-1:




model = t
# Real sample input, given in the training column order:
# Breathing Problem, Pink Eye, Pneumonia, Fever, Acute Gastroenteritis,
# Dry Cough, Sore throat, Bladder Infection (1 = Yes, 0 = No)
result = model.predict([[1,1,1,1,1,1,1,0]])
  
# Final Prediction
if result==1:
    print("The Patient is Adenovirus Positive(+ve)")
else:
    print("The Patient is Adenovirus Negative(-ve)")

Output:

The Patient is Adenovirus Positive(+ve)

Test Case-2:




# Real Sample Input
result = model.predict([[0,0,1,0,0,1,0,0]])
  
# Final Prediction
if result==1:
    print("The Patient is Adenovirus Positive(+ve)")
else:
    print("The Patient is Adenovirus Negative(-ve)")

Output:

The Patient is Adenovirus Negative(-ve)
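
Because the model was fitted on a DataFrame, newer scikit-learn versions emit a feature-name warning when predicting on a bare list as above. Wrapping the sample in a DataFrame with the training columns avoids that; a small optional sketch:

# Build the sample with the same column names (and order) used during training
sample = pd.DataFrame([[0, 0, 1, 0, 0, 1, 0, 0]], columns=x.columns)
print(model.predict(sample))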

Conclusion

Currently, this model can only predict Adenovirus or non-Adenovirus, but in the future we will enhance the model using a Deep Learning algorithm so that it will be able to predict various diseases besides Adenovirus. This ML model is recommended to the health ministry and to scientists for further research. The model is available as open source, free of cost, so that every individual can benefit from it.

