
Data analysis using R

Last Updated : 09 Dec, 2022

Data analysis is a subset of data analytics. It is a process in which you define a clear objective, collect the relevant data, preprocess it, analyze it (understand the data and explore insights), and then visualize the results. The final step, visualization, is important because it helps people understand what is happening in the firm.

Steps involved in data analysis:

 

The process of data analysis includes all of these steps for a given problem statement. Example: analyze which products sell out quickly and identify the frequent customers of a retail shop.

  • Defining the problem statement – Understand the goal and what needs to be done. In this case, our problem statement is: “Find the products that sell out most often and list the customers who visit the store frequently.”
  • Collection of data – Not all of the company’s data is necessary; identify the data that is relevant to the problem. Here the required columns are product ID, customer ID, and date visited.
  • Preprocessing – The data must be cleaned and put into a structured format before analysis:
  1. Remove outliers (noisy data).
  2. Remove null or irrelevant values from the columns (null values can instead be replaced with the mean value of that column).
  3. If any data is missing, either ignore the tuple (row) or fill the gap with the mean value of the column.
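The preprocessing steps above can be sketched in R on a small made-up data frame. The data frame `visits` and its columns are hypothetical, and the 1.5 × IQR rule used to flag outliers is one common convention that is assumed here, not something the steps above prescribe.

```r
# Hypothetical toy data: one missing amount and one extreme amount
visits <- data.frame(
  customer_id = c(1, 2, 3, 4, 5, 6),
  amount      = c(20, 25, NA, 22, 24, 500)
)

# Option A: drop rows (tuples) that contain any missing value
dropped <- visits[complete.cases(visits), ]

# Option B: replace missing values with the mean of the column
imputed <- visits
imputed$amount[is.na(imputed$amount)] <- mean(visits$amount, na.rm = TRUE)

# Flag outliers with the 1.5 * IQR rule (an assumed convention) and drop them
q     <- quantile(dropped$amount, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
clean <- dropped[dropped$amount >= q[1] - fence &
                 dropped$amount <= q[2] + fence, ]
```

In this toy example the extreme value 500 pulls the imputed mean up to 118.2, which suggests handling outliers before filling gaps with a mean.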

Data Analysis using the Titanic dataset

You can download the Titanic dataset (it contains data from real passengers of the Titanic) from here. Save the dataset as train.csv in the current working directory; we can then start the analysis by getting to know the data.

R




# Load the dataset; later snippets refer to the same data frame as `train`
titanic <- read.csv("train.csv")
train <- titanic
head(titanic)


Output:

  PassengerId Survived Pclass                                         Name    Sex
1         892        0      3                             Kelly, Mr. James   male
2         893        1      3             Wilkes, Mrs. James (Ellen Needs) female
3         894        0      2                    Myles, Mr. Thomas Francis   male
4         895        0      3                             Wirz, Mr. Albert   male
5         896        1      3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female
6         897        0      3                   Svensson, Mr. Johan Cervin   male
   Age SibSp Parch  Ticket    Fare Cabin Embarked
1 34.5     0     0  330911  7.8292              Q
2 47.0     1     0  363272  7.0000              S
3 62.0     0     0  240276  9.6875              Q
4 27.0     0     0  315154  8.6625              S
5 22.0     1     1 3101298 12.2875              S
6 14.0     0     0    7538  9.2250              S

Our dataset contains columns such as the passenger’s name, age, and sex, the class they traveled in, and whether they survived. To find the class (data type) of each column, use the sapply() function.

R




sapply(train, class)


Output:

PassengerId    Survived      Pclass        Name         Sex         Age 
  "integer"   "integer"   "integer" "character" "character"   "numeric" 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
  "integer"   "integer" "character"   "numeric" "character" "character" 

Survived is coded as 0 (dead) and 1 (alive). We can convert it, along with Sex, into a categorical variable using the as.factor() function.

R




train$Survived=as.factor(train$Survived)
train$Sex=as.factor(train$Sex)
sapply(train, class)


Output:

PassengerId    Survived      Pclass        Name         Sex         Age 
  "integer"    "factor"   "integer" "character"    "factor"   "numeric" 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
  "integer"   "integer" "character"   "numeric" "character" "character" 

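If readable labels are preferred over the raw 0/1 codes, factor() accepts levels and labels arguments. The sketch below uses a hypothetical toy vector in place of train$Survived; the labels "dead" and "alive" are illustrative choices.

```r
# Hypothetical toy vector standing in for train$Survived
survived_raw <- c(0, 1, 1, 0, 0)

# Map 0 -> "dead" and 1 -> "alive" while converting to a factor
survived <- factor(survived_raw,
                   levels = c(0, 1),
                   labels = c("dead", "alive"))

levels(survived)   # the two category names
table(survived)    # number of observations per category
```

Note that after such relabeling, comparisons have to use the labels (for example `survived == "alive"`) instead of 0 and 1.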
Next, we get an overview of every column, including its range or value counts and its data type. The summary() function can be used for this purpose.

R




summary(train)


Output:

  PassengerId     Survived     Pclass          Name               Sex     
 Min.   : 892.0   0:266    Min.   :1.000   Length:418         female:152  
 1st Qu.: 996.2   1:152    1st Qu.:1.000   Class :character   male  :266  
 Median :1100.5            Median :3.000   Mode  :character               
 Mean   :1100.5            Mean   :2.266                                  
 3rd Qu.:1204.8            3rd Qu.:3.000                                  
 Max.   :1309.0            Max.   :3.000                                  
                                                                          
      Age            SibSp            Parch           Ticket         
 Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418        
 1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
 Median :27.00   Median :0.0000   Median :0.0000   Mode  :character  
 Mean   :30.27   Mean   :0.4474   Mean   :0.3923                     
 3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000                     
 Max.   :76.00   Max.   :8.0000   Max.   :9.0000                     
 NA's   :86                                                          
      Fare            Cabin             Embarked        
 Min.   :  0.000   Length:418         Length:418        
 1st Qu.:  7.896   Class :character   Class :character  
 Median : 14.454   Mode  :character   Mode  :character  
 Mean   : 35.627                                        
 3rd Qu.: 31.500                                        
 Max.   :512.329                                        
 NA's   :1

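From the summary of the full training set (891 rows) we can extract the observations below. The sample output above appears to come from a 418-row file, so its figures differ.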

  • Total passengers:  891
  • The number of total people who survived:  342
  • Number of total people dead:  549
  • Number of males in the titanic:  577
  • Number of females in the titanic:  314
  • Maximum age among all people in titanic:  80
  • Median age:  28
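Each of these observations can also be computed directly instead of being read off the summary. A minimal sketch on a hypothetical four-row data frame that reuses the Titanic column names:

```r
# Hypothetical toy data with the same column names as the Titanic file
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, NA, 35)
)

total_rows  <- nrow(toy)                      # total passengers
by_survival <- table(toy$Survived)            # counts of 0 (dead) and 1 (survived)
by_sex      <- table(toy$Sex)                 # counts per sex
max_age     <- max(toy$Age, na.rm = TRUE)     # maximum age, ignoring missing values
median_age  <- median(toy$Age, na.rm = TRUE)  # median age, ignoring missing values
```

The `na.rm = TRUE` argument matters here: without it, max() and median() return NA as soon as the column contains a missing value.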

Preprocessing the data is important before analysis, so missing (NA) values have to be counted and then removed.

R




sum(is.na(train))


Output:

177
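The total alone does not show which columns contain the missing values. A per-column count can be obtained with colSums(), sketched here on a hypothetical toy data frame:

```r
# Hypothetical toy data: NA values in numeric columns, empty strings in a text column
toy <- data.frame(
  Age   = c(22, NA, 26, NA),
  Fare  = c(7.25, 71.28, NA, 8.05),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))

# Empty strings are not NA, so they have to be counted separately
empty_cabins <- sum(toy$Cabin == "")
```

If empty strings should be treated as missing when reading a file, read.csv() accepts an argument such as `na.strings = c("NA", "")`.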

R




dropnull_train=train[rowSums(is.na(train))<=0,]


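  • dropnull_train contains only 714 rows, because total rows in the dataset (891) – rows with null values (177) = remaining rows (714). This works because each of the 177 missing values falls in a different row.
  • Now we will split these 714 rows into two separate data frames: passengers who survived and passengers who died.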

R




survivedlist=dropnull_train[dropnull_train$Survived == 1,]
notsurvivedlist=dropnull_train[dropnull_train$Survived == 0,]
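The same filter-then-divide pattern can be written with na.omit() and split(). This is an alternative sketch on hypothetical toy data, not the approach used above:

```r
# Hypothetical toy data with one incomplete row
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, 38, 26, 35, NA)
)

# Drop every row that contains at least one NA
complete <- na.omit(toy)

# Split the remaining rows into one data frame per value of Survived
groups <- split(complete, complete$Survived)

died     <- groups[["0"]]
survived <- groups[["1"]]
```

split() returns a named list, so the groups are looked up by the character names "0" and "1".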


Now we can visualize the data, including how many passengers died and survived, using pie charts, histograms, and bar plots.

R




mytable <- table(titanic$Survived)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable,
    labels = lbls,
    main="Pie Chart of Survived column data\n (with sample sizes)")


Output:

 

The pie chart shows that the target column, Survived, is imbalanced: more passengers died than survived.
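The imbalance can also be expressed as percentages with prop.table(). A small sketch on a hypothetical toy vector:

```r
# Hypothetical toy vector standing in for the Survived column
survived <- c(0, 0, 0, 1, 1)

counts <- table(survived)                     # absolute counts per class
shares <- round(100 * prop.table(counts), 1)  # share of each class in percent
shares
```

For this toy vector the shares are 60% for class 0 and 40% for class 1.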

R




# Age distribution of the passengers who survived
hist(survivedlist$Age,
     main="Age of surviving passengers",
     xlab="age",
     ylab="frequency")


Output:

 

Now let’s draw a bar plot to visualize the number of males and females among the passengers who did not survive.

R




barplot(table(notsurvivedlist$Sex),
        xlab="gender",
        ylab="frequency")


Output:

 

From the bar plot above, we can see that nearly 350 males and 50 females did not survive the Titanic.
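To compare both outcomes for both sexes at once, a two-way table can be built with table(). The sketch below uses hypothetical toy data in place of the Titanic data frame:

```r
# Hypothetical toy data with the same column names as the Titanic file
toy <- data.frame(
  Sex      = c("male", "male", "female", "female", "male"),
  Survived = c(0, 0, 1, 0, 1)
)

# Rows are sexes, columns are survival outcomes
counts <- table(toy$Sex, toy$Survived, dnn = c("Sex", "Survived"))

# Number of males who did not survive
males_dead <- counts["male", "0"]
```

Passing such a table to barplot() with `beside = TRUE` would draw the groups as side-by-side bars.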

R




# Estimate the density of the fares themselves, not of their frequency table
temp<-density(titanic$Fare, na.rm=TRUE)
plot(temp, type="n",
     main="Fare charged from Passengers")
polygon(temp, col="lightgray",
        border="gray")


Output:

 

Here we can observe that some passengers were charged extremely high fares. These values are outliers and can affect our analysis. Let’s confirm their presence using a boxplot.

R




boxplot(titanic$Fare,
        main="Fare charged from passengers")


Output:

 

The boxplot confirms that the Fare column contains some extreme outliers.
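The points that a boxplot draws as outliers can be listed without plotting by calling boxplot.stats(). A minimal sketch with hypothetical fare values:

```r
# Hypothetical fares: six ordinary values and one extreme value
fare <- c(7.25, 7.9, 8.05, 13, 26, 31, 512.33)

# boxplot.stats() applies the same rule as boxplot():
# values beyond 1.5 times the hinge spread are reported in $out
stats    <- boxplot.stats(fare)
outliers <- stats$out

# Keep only the values that are not flagged as outliers
fare_clean <- fare[!fare %in% outliers]
```

Whether such values should be removed, capped, or kept depends on the analysis; the sketch only shows how to identify them.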


