
A Complete Guide to the Built-in Datasets in R

Last Updated : 16 Apr, 2024

R is a widely used open-source programming language for statistical computing, data analysis, data visualization, and machine learning, and it is also used in fields such as data mining and bioinformatics. R packages provide additional functions and tools, and R also ships with a set of built-in datasets. These datasets cover a wide range of fields, from biology to social records. If you are new to R programming, you can use them to practice, since they are ready for data operations and visualizations without any download or import step.

Check the article on R Tutorial | Learn R Programming Language for a better understanding of R programming.

Built-in Datasets in R

R includes several built-in datasets. They are useful for beginners who want to practice model building, visualization, and other data analysis tasks. To list the built-in datasets, run the following command in the R console.

R
data()

Output:

Data sets in package ‘datasets’:

AirPassengers Monthly Airline Passenger Numbers
1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different
diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European
Stock Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson
Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in
Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US
Superior Court
USPersonalExpenditure
Personal Expenditure Data
UScitiesD Distances Between European Cities and
Between US Cities
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines,
1937-1960
airquality New York Air Quality Measurements
anscombe Anscombe's Quartet of 'Identical' Simple
Linear Regressions
attenu The Joyner-Boore Attenuation Data
attitude The Chatterjee-Price Attitude Data
austres Quarterly Time Series of the Number of
Australian Residents
beaver1 (beavers) Body Temperature Series of Two Beavers
beaver2 (beavers) Body Temperature Series of Two Beavers
cars Speed and Stopping Distances of Cars
chickwts Chicken Weights by Feed Type
co2 Mauna Loa Atmospheric CO2 Concentration
crimtab Student's 3000 Criminals Data
discoveries Yearly Numbers of Important Discoveries
esoph Smoking, Alcohol and (O)esophageal Cancer
euro Conversion Rates of Euro Currencies
euro.cross (euro) Conversion Rates of Euro Currencies
eurodist Distances Between European Cities and
Between US Cities
faithful Old Faithful Geyser Data
fdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the
UK
freeny Freeny's Revenue Data
freeny.x (freeny) Freeny's Revenue Data
freeny.y (freeny) Freeny's Revenue Data
infert Infertility after Spontaneous and Induced
Abortion
iris Edgar Anderson's Iris Data
iris3 Edgar Anderson's Iris Data
islands Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the
UK
lh Luteinizing Hormone in Blood Samples
longley Longley's Economic Regression Data
lynx Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)
Monthly Deaths from Lung Diseases in the
UK
morley Michelson Speed of Light Data
mtcars Motor Trend Car Road Tests
nhtemp Average Yearly Temperatures in New Haven
nottem Average Monthly Temperatures at
Nottingham, 1920-1939
...

Use ‘data(package = .packages(all.available = TRUE))’
to list the data sets in all *available* packages.

These datasets belong to the datasets package and are commonly referred to as the built-in datasets in R. The package contains several popular datasets that we will discuss later. To list the datasets available in all installed packages of the R environment, run the following command.

R
data(package = .packages(all.available = TRUE))

Output:

Data sets in package ‘ade4’:

abouheif.eg Phylogenies and quantitative traits from
Abouheif
acacia Spatial pattern analysis in plant
communities
aminoacyl Codon usage
apis108 Allelic frequencies in ten honeybees
populations at eight microsatellites loci
aravo Distribution of Alpine plants in Aravo
(Valloire, France)
ardeche Fauna Table with double (row and column)
partitioning
arrival Arrivals at an intensive care unit
atlas Small Ecological Dataset
atya Genetic variability of Cacadors
avijons Bird species distribution
avimedi Fauna Table for Constrained Ordinations
aviurba Ecological Tables Triplet
bacteria Genomes of 43 Bacteria
banque Table of Factors
baran95 African Estuary Fishes
bf88 Cubic Ecological Data
bordeaux Wine Tasting
bsetal97 Ecological and Biological Traits
buech Buech basin
butterfly Genetics-Ecology-Environment Triple
capitales Road Distances
carni19 Phylogeny and quantative trait of
carnivora
carni70 Phylogeny and quantitative traits of
carnivora

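As you can see, the output now lists datasets from every installed package that provides data. The excerpt above starts with ‘ade4’; other packages such as ‘ape’, ‘bit64’, and ‘boot’ appear as well, depending on what is installed. The listing also includes the datasets in the package ‘datasets’.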
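The object returned by data() can also be inspected programmatically instead of being read from the console. The following is a minimal sketch that assumes only base R; the variable name ds is chosen here for illustration. It converts the results matrix to a data frame so that dataset names and titles can be filtered like any other table.

```r
# data() returns an object whose $results component is a character matrix
# with the columns "Package", "LibPath", "Item", and "Title".
ds <- as.data.frame(data(package = "datasets")$results,
                    stringsAsFactors = FALSE)

# Show the name and title of the first few datasets
head(ds[, c("Item", "Title")])

# Find datasets whose title mentions "Deaths"
ds[grepl("Deaths", ds$Title), c("Item", "Title")]
```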

Count number of Datasets

There is no dedicated function that returns the number of datasets available in R. We can either count the datasets manually or do the following:

  1. Get the list of datasets.
  2. Store the List in a variable
  3. Get the variable length and print it.

Let’s check the number of datasets available under the datasets package.

R
# List the datasets in the 'datasets' package
listofdata <- data(package = "datasets")$results[, "Item"]

# Count the number of datasets
len <- length(listofdata)
print(len)

Output:

[1] 104

The output is 104, which means there are 104 entries listed for the datasets package. The exact number can differ between R versions.
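The same results matrix can also be used to tally datasets per package. This is a minimal sketch rather than a standard recipe; the names all_data and counts are illustrative, and the numbers depend on which packages are installed.

```r
# Collect the dataset listing for every installed package
all_data <- suppressWarnings(
  data(package = .packages(all.available = TRUE))$results
)

# Count how many entries each package contributes, largest first
counts <- sort(table(all_data[, "Package"]), decreasing = TRUE)
head(counts)
```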

Popular built-in Datasets in R

Several built-in datasets in R are popular among R programmers for learning and testing purposes. The following are a few commonly used examples.

  1. iris: Probably the best-known built-in dataset in R. This classic dataset contains measurements of flowers from 3 species of iris. The data were collected by Edgar Anderson and made famous by Sir Ronald Fisher, the statistician and biologist, who used them in a 1936 paper. It is commonly used for data analysis and classification.
  2. mtcars: This dataset contains data on 32 car models from 1973-74. It has 32 rows, one per car, and 11 variables such as the number of cylinders and horsepower.
  3. airquality: This dataset contains daily air quality measurements for New York City from May to September 1973. It has 153 observations and 6 columns: Ozone, Solar.R, Wind, Temp, Month, and Day.
  4. USArrests: Another well-known dataset in R. It contains arrest rates for murder, assault, and rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas. It is commonly used for descriptive statistics.
  5. AirPassengers: A classic time series dataset containing the monthly totals of international airline passengers from 1949 to 1960. It is widely used for time series analysis, forecasting, and modeling.
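A quick way to check the size and structure of these datasets is to inspect the objects directly. The following sketch uses only base R functions. Note that AirPassengers is a time series object rather than a data frame, so it is examined with different functions.

```r
# Rows and columns of the data-frame datasets
dim(mtcars)       # 32 rows, 11 columns
dim(airquality)   # 153 rows, 6 columns
dim(USArrests)    # 50 rows, 4 columns

# Column names of airquality
names(airquality)

# AirPassengers is stored as a monthly time series ("ts" object)
class(AirPassengers)
start(AirPassengers)      # first year and month
end(AirPassengers)        # last year and month
frequency(AirPassengers)  # observations per year
```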

Let’s pick one of the datasets above and perform some basic data operations on it. In this example, we will see how to access a built-in dataset and run some simple analysis and visualizations.

Load the Dataset

For this example, let’s use the iris dataset. To load it, run the following command:

R
data("iris")
head(iris)

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Passing the dataset name to the data() function loads the built-in dataset. The head() function takes the dataset and shows its first 6 rows. Similarly, we can use the tail() function to get the last 6 rows.

R
tail(iris)

Output:

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

As you can see, tail() returns the last 6 rows of the iris dataset.
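Both head() and tail() accept an n argument that controls how many rows are returned. A short sketch follows; the variable names first_rows and last_rows are illustrative.

```r
# First three rows instead of the default six
first_rows <- head(iris, n = 3)
print(first_rows)

# Last three rows
last_rows <- tail(iris, n = 3)
print(last_rows)

# Rows can also be selected by position,
# here the first row of each species block
iris[c(1, 51, 101), ]
```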

Analyze the dataset

Let’s check the number of rows and columns available in the dataset.

R
dim(iris)

Output:

[1] 150   5

We can see from the output that the iris dataset contains 150 observations on 5 attributes.
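Beyond dim(), a few other base R functions give a compact view of a dataset's structure. The sketch below is one possible way to do this; col_types is an illustrative variable name.

```r
# Compact overview: dimensions, column types, and first values
str(iris)

# Row and column counts separately
nrow(iris)
ncol(iris)

# Class of every column
col_types <- sapply(iris, class)
print(col_types)
```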

Check the attribute names (column names) of the dataset.

R
names(iris)

Output:

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"

The output lists the attribute names of the iris dataset: “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”. The “Species” column holds the species of each flower, which takes one of three values.

To check the species names we can use

R
unique(iris$Species)

Output:

[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

This shows the 3 different species in the dataset which are setosa, versicolor, and virginica.
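Because Species is stored as a factor, its categories and their frequencies can be queried directly. A minimal sketch, with species_counts as an illustrative name:

```r
# Species is a factor; levels() lists its categories
levels(iris$Species)
nlevels(iris$Species)

# Number of observations per species
species_counts <- table(iris$Species)
print(species_counts)
```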

Let’s get a summary of the whole dataset

R
summary(iris)

Output:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

This summary gives a quick overview of the iris dataset: for each numeric column we get the minimum, first quartile, median, mean, third quartile, and maximum. For the Species attribute, we can see that each of the three species has 50 observations, so the groups are equal in size.
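summary() describes the dataset as a whole. To compare the species with each other, the same statistics can be computed per group. The following is one possible sketch using base R's aggregate() and tapply(); species_means is an illustrative name.

```r
# Mean of every numeric column, computed separately for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# The same idea for a single column with tapply()
tapply(iris$Petal.Length, iris$Species, median)
```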

Visualize the Dataset

A scatterplot displays individual data points on a two-dimensional plane. Let’s use the plot() function of R to build a scatterplot to better understand the relationship between Sepal Length and Sepal Width.

R
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal Length vs. Sepal Width",
     xlab = "Sepal Length", ylab = "Sepal Width", col = iris$Species)

Output:

[Figure: scatterplot of Sepal Length vs. Sepal Width, colored by species]

The plot above shows how the points are distributed. The three species are drawn in three different colors.
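The scatterplot does not say which color belongs to which species. A legend can be added with the legend() function, as in the sketch below. The helper names species_levels and species_cols are illustrative, and the legend position "topright" is just one reasonable choice.

```r
# col = iris$Species uses the factor's integer codes (1, 2, 3) as colors,
# so the legend uses the same codes to match.
species_levels <- levels(iris$Species)
species_cols <- seq_along(species_levels)

plot(iris$Sepal.Length, iris$Sepal.Width,
     main = "Sepal Length vs. Sepal Width",
     xlab = "Sepal Length", ylab = "Sepal Width",
     col = iris$Species, pch = 19)

# Map each species name to the color used for its points
legend("topright", legend = species_levels, col = species_cols, pch = 19)
```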

Let’s create a histogram to see the distribution of Petal Length. We will use the hist() function of R. Inside the function we specify the data attribute, the plot title, the axis labels, and the fill color.
R
hist(iris$Petal.Length, 
     main = "Histogram of Petal Length", 
     xlab = "Petal Length", 
     ylab = "Frequency",
     col = "lightblue")

Output:

[Figure: histogram of Petal Length]
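hist() also returns the binned data, which is useful for reading the exact counts behind the bars. The sketch below computes the bins without drawing and then shows a boxplot per species as another common view of the same column. The variable name h is illustrative.

```r
# plot = FALSE computes the bins without drawing anything
h <- hist(iris$Petal.Length, plot = FALSE)

h$breaks  # bin boundaries
h$counts  # number of observations in each bin

# A boxplot per species is another view of the same column
boxplot(Petal.Length ~ Species, data = iris,
        main = "Petal Length by Species",
        xlab = "Species", ylab = "Petal Length",
        col = "lightblue")
```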


Conclusion

The built-in datasets give beginners a convenient way to learn R programming and to try out different formulas and models. In this article, we looked at the popular built-in datasets available in R, and then learned how to access a dataset and perform basic analysis, operations, and visualizations on it.


