According to the definition given in Wikipedia, Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
Once Francis John “Frank” Anscombe who was a statistician of great repute found 4 sets of 11 data-points in his dream and requested the council as his last wish to plot those points. Those 4 sets of 11 data-points are given below.
After that, the council analyzed them using only descriptive statistics and found the mean, standard deviation, and correlation between x and y.
Please download the csv file here.
Code: Python program to find mean, standard deviation, and the correlation between x and y
9.0 3.32 7.5 2.03 0.816
So let me show you the result in a tabular fashion for better understanding.
Code: Python program to plot scatter plot
For regression line refer this.
Note: It is mentioned in the definition that Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed.
Explanation of this output:
- In the first one(top left) if you look at the scatter plot you will see that there seems to be a linear relationship between x and y.
- In the second one(top right) if you look at this figure you can conclude that there is a non-linear relationship between x and y.
- In the third one(bottom left) you can say when there is a perfect linear relationship for all the data points except one which seems to be an outlier which is indicated be far away from that line.
- Finally, the fourth one(bottom right) shows an example when one high-leverage point is enough to produce a high correlation coefficient.
The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets.