
Statistics Cheat Sheet

Statistics is like a toolkit we use to understand and make sense of information. It helps us collect, organize, analyze, and interpret data to find patterns, trends, and relationships in the world around us.

In this Statistics cheat sheet, you will find complex statistical concepts simplified, with clear explanations, practical examples, and essential formulas. It is useful when getting ready for an interview or just starting with data science. It explains topics like mean, median, and hypothesis testing with examples, so you can pick them up quickly and apply them in interviews and real-life data jobs.



What is Statistics?

Statistics is the branch of mathematics that deals with collecting, analyzing, interpreting, presenting, and organizing data. It involves the study of methods for gathering, summarizing, and interpreting data to make informed decisions and draw meaningful conclusions.



Statistics is widely used in fields such as science, economics, social sciences, business, engineering, and even sports to provide insights, make predictions, and guide decision-making. Whether it’s counting how many people like pizza or figuring out the average score on a test, statistics helps us make decisions based on data.

Types of Statistics

There are commonly two types of statistics, which are discussed below:

  1. Descriptive Statistics: Descriptive statistics help us summarize and organize large amounts of data so that they are easier to understand.
  2. Inferential Statistics: Inferential statistics use a smaller sample of data to draw conclusions and make predictions about a larger population.

Basics of Statistics

Basic parameters of statistics, with their definitions and formulas, are:

  • Population Mean (μ): The average of all values in the population, i.e., the entire group for which information is required. Formula: μ = ∑x/N
  • Sample Mean (x̄): The average of the values in a sample, a subset used because the entire population is too large to handle. Formula: x̄ = ∑x/n
  • Sample/Population Standard Deviation (s/σ): A measure that shows how much variation from the mean exists. Formulas: σ = √(∑(x – μ)²/N) for a population, s = √(∑(x – x̄)²/(n – 1)) for a sample
  • Sample/Population Variance (s²/σ²): A measure of the spread of the data around its central value; it is the square of the standard deviation. Formulas: σ² = ∑(x – μ)²/N for a population, s² = ∑(x – x̄)²/(n – 1) for a sample
  • Class Interval (CI): The range of values assigned to a group of data points. Formula: Class Interval = Upper Limit – Lower Limit
  • Frequency (f): The number of times a particular value appears in a data set. Formula: f = number of times the value occurs in the data set
  • Range (R): The difference between the largest and smallest values of the data set. Formula: Range = (Largest Data Value – Smallest Data Value)
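
As a quick illustration of these basic quantities, here is a minimal sketch using Python's standard `statistics` module. The data set is hypothetical and chosen only so that the numbers are easy to check by hand.

```python
import statistics

# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)                 # sum of values / number of values
pop_variance = statistics.pvariance(data)    # divides by N (population)
sample_variance = statistics.variance(data)  # divides by n - 1 (sample)
pop_std = statistics.pstdev(data)            # square root of population variance
sample_std = statistics.stdev(data)          # square root of sample variance
data_range = max(data) - min(data)           # largest value - smallest value
frequency_of_4 = data.count(4)               # how often the value 4 occurs

print(mean, pop_variance, pop_std, data_range, frequency_of_4)
```

Note that the sample versions divide by n – 1 instead of N, so they come out slightly larger than the population versions for the same data.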

What is Data in Statistics?

Data is a collection of observations. It can be in the form of numbers, words, measurements, or statements.

Types of Data

  1. Qualitative Data: This data is descriptive. For example – She is beautiful, He is tall, etc.
  2. Quantitative Data: This is numerical information. For example- A horse has four legs.

Types of Quantitative Data

  1. Discrete Data: It takes distinct, separate values and can be counted, for example the number of students in a class.
  2. Continuous Data: It can take any value within a range and is measured rather than counted, for example height or weight.

Measure of Central Tendency

Measure of Dispersion

Measure of Shape

Kurtosis

Kurtosis quantifies the degree to which a probability distribution deviates from the normal distribution. It assesses the “tailedness” of the distribution, indicating whether it has heavier or lighter tails than a normal distribution. High kurtosis implies more extreme values in the distribution, while low kurtosis indicates a flatter distribution.

Types of Kurtosis

  1. Mesokurtic:
    • A mesokurtic distribution has kurtosis equal to 3. This is considered the baseline or normal level of kurtosis. The distribution has tails and a peak similar to the normal distribution (bell curve).
  2. Leptokurtic:
    • A leptokurtic distribution has kurtosis greater than 3. This indicates that the distribution has fatter tails and a sharper peak compared to the normal distribution. It implies that the data has more extreme values or outliers.
  3. Platykurtic:
    • A platykurtic distribution has kurtosis less than 3. In this case, the distribution has thinner tails and a flatter peak compared to the normal distribution. It suggests that the data has fewer extreme values or outliers.
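
The three cases above can be sketched in code. This is a minimal illustration that assumes the population (Pearson) definition of kurtosis, the fourth central moment divided by the squared variance, for which the normal distribution gives 3. The function names `kurtosis` and `classify`, the tolerance, and the data sets are hypothetical.

```python
import math

def kurtosis(values):
    """Population (Pearson) kurtosis: fourth central moment / variance squared."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

def classify(k, tol=0.1):
    """Label a kurtosis value relative to the normal baseline of 3."""
    if math.isclose(k, 3.0, abs_tol=tol):
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

flat = [1, 2, 3, 4, 5]                       # evenly spread, no extreme values
heavy = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]    # mostly central, two extremes

print(kurtosis(flat), classify(kurtosis(flat)))
print(kurtosis(heavy), classify(kurtosis(heavy)))
```

Some libraries report excess kurtosis (kurtosis minus 3) instead, in which case the baseline is 0 rather than 3.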

Skewness

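Skewness is a measure of the asymmetry of a probability distribution about its mean.

Right Skew: The right tail is longer than the left (positive skewness); the mean is typically greater than the median.

Left Skew: The left tail is longer than the right (negative skewness); the mean is typically less than the median.

Zero Skew: The distribution is symmetric about its mean, as in the normal distribution.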
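
To make the sign of skewness concrete, the following sketch computes the moment-based skewness coefficient (third central moment divided by the variance raised to the power 1.5). The function name `skewness` and the three small data sets are hypothetical, picked only to show one case of each sign.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]        # mirror image around the mean
right_tailed = [1, 1, 1, 2, 10]    # one large value stretches the right tail
left_tailed = [-10, -2, -1, -1, -1]  # mirror image of the data above

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(name, round(skewness(data), 3))
```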

Types of Skewed data

Measure of Relationship

Probability Theory

Here are some basic concepts or terminologies used in probability:

  • Sample Space: The set of all possible outcomes in a probability experiment. For instance, in a coin toss, it’s “head” and “tail”.
  • Sample Point: One of the possible results in an experiment. For example, in rolling a fair six-sided die, the sample points are 1 to 6.
  • Experiment: A process or trial with uncertain results. Examples include coin tossing, card selection, or rolling a die.
  • Event: A subset of the sample space representing certain outcomes. Example: getting “1” when rolling a die.
  • Favorable Outcome: An outcome that produces the desired or expected consequence.

Various other probability formulas are:

  • Joint Probability (Intersection of Events): Probability that events A and B both occur. P(A and B) = P(A) × P(B) when A and B are independent; in general, P(A and B) = P(A) × P(B | A)
  • Union of Events: Probability that event A or event B occurs. P(A or B) = P(A) + P(B) – P(A and B)
  • Conditional Probability: Probability that event A occurs given that event B has occurred. P(A | B) = P(A and B)/P(B)
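
These formulas can be checked by enumerating a small sample space. The sketch below assumes a single roll of a fair six-sided die; the events A ("even number") and B ("greater than 3") and the helper `prob` are hypothetical choices for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))                 # outcomes of one die roll
A = {x for x in sample_space if x % 2 == 0}     # even number: {2, 4, 6}
B = {x for x in sample_space if x > 3}          # greater than 3: {4, 5, 6}

def prob(event):
    """Favorable outcomes / total possible outcomes, as an exact fraction."""
    return Fraction(len(event), len(sample_space))

p_a_and_b = prob(A & B)              # intersection: {4, 6}
p_a_or_b = prob(A | B)               # union: {2, 4, 5, 6}
p_a_given_b = p_a_and_b / prob(B)    # conditional probability

# The union formula holds exactly.
assert p_a_or_b == prob(A) + prob(B) - p_a_and_b
print(p_a_and_b, p_a_or_b, p_a_given_b)
```

Here P(A) × P(B) = 1/4 while P(A and B) = 1/3, so these two events are not independent and the simple product rule does not apply to them.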

Bayes Theorem

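Bayes’ Theorem is a fundamental concept in probability theory that relates conditional probabilities. It is named after the Reverend Thomas Bayes, who first introduced the theorem. Bayes’ Theorem is a mathematical formula that provides a way to update probabilities based on new evidence. The formula is as follows:

P(A | B) = P(B | A) × P(A) / P(B)

where

  • P(A | B) is the probability of event A given that event B has occurred (the posterior)
  • P(B | A) is the probability of event B given that event A has occurred (the likelihood)
  • P(A) is the probability of event A before seeing the evidence (the prior)
  • P(B) is the probability of event B (the evidence), with P(B) > 0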
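
As a rough numerical sketch of Bayes’ Theorem, consider a diagnostic test. All numbers below (a 1% prior, 95% sensitivity, 5% false-positive rate) and the function name `posterior` are hypothetical and only serve to show the arithmetic.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) via Bayes' Theorem, where B is a positive test result.

    prior               = P(A)
    likelihood          = P(B | A)
    false_positive_rate = P(B | not A)
    """
    # Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {p:.3f}")
```

Even with an accurate test, the posterior stays fairly low here because the prior is small, which is exactly the kind of update the theorem describes.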

Types of Probability Functions

Probability Distribution Functions

Normal or Gaussian Distribution

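The normal distribution is a continuous probability distribution characterized by its bell-shaped curve and can be described by its mean (μ) and standard deviation (σ).

Formula: f(x) = (1 / (σ√(2π))) × e^(–(x – μ)² / (2σ²))

There is an empirical rule for the normal distribution, which states that:

  • About 68% of the data falls within one standard deviation of the mean (μ ± σ).
  • About 95% of the data falls within two standard deviations of the mean (μ ± 2σ).
  • About 99.7% of the data falls within three standard deviations of the mean (μ ± 3σ).

This rule can be used to detect outliers: values that lie more than three standard deviations from the mean are commonly flagged as potential outliers.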
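
The 68–95–99.7 percentages of the empirical rule can be reproduced with `statistics.NormalDist` from the Python standard library. This is a small sketch; the helper name `within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {within(k):.4f}")
```

Because the rule depends only on distance from the mean in units of σ, the same percentages hold for any μ and σ.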

Central Limit Theorem

The Central Limit Theorem (CLT) states that, regardless of the shape of the original population distribution (provided it has a finite variance), the sampling distribution of the sample mean becomes approximately normal as the sample size grows. A common rule of thumb is that a sample size of about 30 or more is sufficient.
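
A small simulation can make the theorem tangible. The sketch below assumes an exponential population with mean 1 (strongly right-skewed), a sample size of 50, and 2000 repeated samples; all of these are hypothetical choices. If the CLT holds, the sample means should center near 1 with a spread near 1/√50.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

n = 50
means = [sample_mean(n) for _ in range(2000)]

mean_of_means = statistics.fmean(means)
spread_of_means = statistics.stdev(means)

print("mean of sample means:", round(mean_of_means, 3))
print("std of sample means: ", round(spread_of_means, 3))
print("theory (1 / sqrt(n)):", round(1 / math.sqrt(n), 3))
```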

Student t-distribution

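The t-distribution, also known as Student’s t-distribution, is a probability distribution used in statistics when the sample size is small or the population standard deviation is unknown. For a sample drawn from a normally distributed population, the statistic

t = (x̄ – μ) / (s/√n)

follows a t-distribution with n – 1 degrees of freedom,

where,

  • x̄ is the sample mean
  • μ is the population mean
  • s is the sample standard deviation
  • n is the sample size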

Chi-square Distribution

The chi-squared distribution, denoted as χ², is a probability distribution used in statistics. It is the distribution of the sum of the squares of k independent standard normal variables, where k is the number of degrees of freedom.

Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success (p).

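Formula: P(X = k) = C(n, k) × p^k × (1 – p)^(n – k), where C(n, k) = n! / (k! × (n – k)!)

Assuming each trial is an independent event with a success probability of p = 0.5, the probability of getting 3 successes in 6 trials is: P(X = 3) = C(6, 3) × (0.5)^3 × (0.5)^3 = 20/64 = 0.3125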
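
The binomial example with n = 6, k = 3, and p = 0.5 can be computed directly. This is a minimal sketch using `math.comb`; the function name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_three_of_six = binomial_pmf(k=3, n=6, p=0.5)
print(p_three_of_six)  # 0.3125

# Sanity check: the probabilities over all possible k sum to 1.
total = sum(binomial_pmf(k, 6, 0.5) for k in range(7))
```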

Poisson Distribution

The Poisson distribution models the number of events that occur in a fixed interval of time or space. It’s characterized by a single parameter (λ), the average rate of occurrence.

Formula: P(X = k) = (λ^k × e^(–λ)) / k!

For example, assuming events occur at an average rate of λ = 10 per interval, the probability of observing exactly 12 events in an interval is: P(X = 12) = (10^12 × e^(–10)) / 12! ≈ 0.0948
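
Here is a short sketch of the Poisson probability for λ = 10 and a count of 12, using only the `math` module. The function name `poisson_pmf` is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k): probability of exactly k events when the average rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)

p_twelve = poisson_pmf(k=12, lam=10)
print(round(p_twelve, 4))  # 0.0948
```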

Uniform Distribution

The uniform distribution represents a constant probability for all outcomes in a given range.

Formula: f(x) = 1/(b – a) for a ≤ x ≤ b, and 0 otherwise

For example, assuming a bus arrives uniformly between 5 and 18 minutes, the probability of waiting less than 15 minutes is: P(X < 15) = (15 – 5)/(18 – 5) = 10/13 ≈ 0.769
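
The bus example (arrival uniform between 5 and 18 minutes) reduces to a ratio of interval lengths, which the following sketch computes. The function name `uniform_cdf` is hypothetical.

```python
def uniform_cdf(x, a, b):
    """P(X <= x) for a continuous uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

p_less_than_15 = uniform_cdf(15, a=5, b=18)
print(round(p_less_than_15, 3))  # 0.769
```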

Parameter estimation for Statistical Inference

Hypothesis Testing

Hypothesis testing makes inferences about a population parameter based on a sample statistic.

Null Hypothesis (H₀) and Alternative Hypothesis (H₁)

Degrees of freedom

Degrees of freedom (df) in statistics represent the number of values or quantities in the final calculation of a statistic that are free to vary. For many common tests, such as the one-sample t-test, it equals the sample size minus one (n – 1).

Level of Significance (α)

This is the threshold used to determine statistical significance. It is the probability of rejecting the null hypothesis when it is actually true. Common values are 0.05, 0.01, or 0.10.

p-value

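The p-value, short for probability value, quantifies the evidence against a null hypothesis. It is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. A p-value smaller than the level of significance leads to rejecting the null hypothesis.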

Type I Error and Type II Error

A Type I Error occurs when the null hypothesis is true, but the statistical test incorrectly rejects it. It is often referred to as a “false positive” or “alpha error.”

A Type II Error occurs when the null hypothesis is false, but the statistical test fails to reject it. It is often referred to as a “false negative” or “beta error.”

Confidence Intervals

A confidence interval is a range of values that is used to estimate the true value of a population parameter with a certain level of confidence. It provides a measure of the uncertainty or margin of error associated with a sample statistic, such as the sample mean or proportion.
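
As a sketch of how such an interval might be computed for a mean, the code below uses the large-sample normal approximation: sample mean ± z × s/√n. The summary statistics (mean 50, standard deviation 10, n = 100) and the function name `mean_confidence_interval` are hypothetical. For small samples, the z value would usually be replaced by a critical value from the t-distribution.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample (z-based) confidence interval for a population mean."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sample_sd / sqrt(n)  # margin of error
    return sample_mean - margin, sample_mean + margin

low, high = mean_confidence_interval(sample_mean=50.0, sample_sd=10.0, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Raising the confidence level widens the interval, while a larger sample size narrows it.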

Example of Hypothesis testing:

Let us consider an e-commerce company that wants to assess whether a recent website redesign has a significant impact on the average time users spend on its website.

The company collects the following data:

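The hypotheses are defined as:

    • H₀: The website redesign has no impact on the average time users spend on the website.
    • H₁: The website redesign has an impact on the average time users spend on the website.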

Significance Level:

Choose a significance level, α = 0.05 (commonly used).

Test Statistic and P-Value:

Result:

Interpretations:

Based on the analysis, the company draws conclusions about whether the website redesign has a statistically significant impact on user session duration.
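
The steps of this example can be sketched end to end. The summary statistics below are entirely hypothetical (they are not the company's data), and the sketch assumes a two-sample z-test as a large-sample approximation; with smaller samples, a t-test would be the usual choice.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics (minutes per session); illustration only.
before = {"n": 200, "mean": 20.0, "sd": 5.0}   # sessions before the redesign
after = {"n": 200, "mean": 21.2, "sd": 5.5}    # sessions after the redesign
alpha = 0.05                                    # significance level

# Standard error of the difference between the two sample means.
standard_error = sqrt(before["sd"] ** 2 / before["n"] + after["sd"] ** 2 / after["n"])

# Test statistic and two-sided p-value under H0 (no difference in means).
z = (after["mean"] - before["mean"]) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_null = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

With these made-up numbers the p-value falls below α, so the null hypothesis of no change would be rejected; different data could just as easily lead to the opposite conclusion.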

Statistical Tests:

Parametric tests are statistical methods that assume the data follows a particular distribution, usually the normal distribution.

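  1. Z-test
    • Purpose: Testing if the mean of a sample is significantly different from a known population mean.
    • When to use: Used when the population standard deviation is known and the sample size is sufficiently large.
    • One-Sample Test: Z = (x̄ – μ) / (σ/√n)
    • Two-Sample Test: Z = (x̄₁ – x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
  2. t-test
    • Purpose: Comparing the means of two independent samples, or testing if the mean of a sample is significantly different from a known or hypothesized population mean.
    • When to use: Used when the population standard deviation is unknown or when dealing with small sample sizes.
    • One-Sample Test: t = (x̄ – μ) / (s/√n)
    • Two-Sample Test: t = (x̄₁ – x̄₂) / √(s₁²/n₁ + s₂²/n₂) (unequal-variance form)
    • Paired t-Test: t = d̄ / (s_d/√n), where d = difference between paired observations, d̄ is the mean of the differences, and s_d is their standard deviation
  3. F-test
    • Purpose: Comparing the variances of multiple groups to assess if they are significantly different.
    • When to use: Used to compare variances between two or more groups.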
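
As an illustration of the paired case, the sketch below computes the paired t statistic, the mean difference divided by its standard error, on a small set of before/after measurements. The data and the function name `paired_t_statistic` are hypothetical. Turning the statistic into a p-value requires the t-distribution, which is outside the standard library, so only the statistic and degrees of freedom are computed.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired observations."""
    differences = [a - b for a, b in zip(after, before)]  # d = difference
    n = len(differences)
    d_bar = mean(differences)   # mean of the differences
    s_d = stdev(differences)    # sample standard deviation of the differences
    t = d_bar / (s_d / sqrt(n))
    return t, n - 1

# Hypothetical paired measurements, for illustration only.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]

t_stat, df = paired_t_statistic(before, after)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```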

 

ANOVA (Analysis Of Variance)

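The one-way ANOVA table lists, for each source of variation, the sum of squares, degrees of freedom, mean square, and F-value. Here k is the number of groups, N is the total number of observations, nᵢ and x̄ᵢ are the size and mean of group i, xᵢⱼ is observation j in group i, and x̄ is the overall mean:

  • Between Groups: Sum of Squares SSB = Σ nᵢ(x̄ᵢ – x̄)²; Degrees of Freedom df1 = k – 1; Mean Square MSB = SSB/(k – 1); F-Value F = MSB/MSE
  • Error (Within Groups): Sum of Squares SSE = ΣΣ(xᵢⱼ – x̄ᵢ)²; Degrees of Freedom df2 = N – k; Mean Square MSE = SSE/(N – k)
  • Total: Sum of Squares SST = SSB + SSE; Degrees of Freedom df3 = N – 1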
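
The entries of the ANOVA table can be computed step by step. The following sketch assumes three small hypothetical groups and uses the standard one-way ANOVA quantities: the between-group sum of squares on k – 1 degrees of freedom and the error sum of squares on N – k degrees of freedom. The function name `one_way_anova` is hypothetical.

```python
def one_way_anova(groups):
    """Return a dict with the one-way ANOVA table entries for a list of groups."""
    k = len(groups)                              # number of groups
    N = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / N
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group (error) sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    msb = ssb / (k - 1)
    mse = sse / (N - k)
    return {"SSB": ssb, "SSE": sse, "SST": ssb + sse,
            "df1": k - 1, "df2": N - k, "df3": N - 1,
            "MSB": msb, "MSE": mse, "F": msb / mse}

# Hypothetical data, chosen so the arithmetic is easy to follow.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

A large F-value means the variation between group means is large relative to the variation within groups, which is evidence against equal means.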



There are mainly two types of ANOVA:

  1. One-way ANOVA: Used to compare the means of three or more groups to determine if there are statistically significant differences among them. Here,
    • H0: The means of all groups are equal.
    • H1: At least one group mean is different.
  2. Two-way ANOVA: It assesses the influence of two categorical independent variables on a dependent variable, examining the main effects of each variable and their interaction effect.

Chi-Squared Test

The chi-squared test is a statistical test used to determine if there is a significant association between two categorical variables. It compares the observed frequencies (O) in a contingency table with the frequencies that would be expected (E) if the variables were independent. Formula:
χ² = Σ (O – E)² / E

This test can also be applied to large data sets with many observations.
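
Below is a minimal sketch of the chi-squared statistic for a contingency table. The 2×2 table of counts and the function name `chi_squared_statistic` are hypothetical. Expected counts are computed as row total × column total / grand total. The p-value shortcut with `erfc` is valid only for 1 degree of freedom, which is the case for a 2×2 table.

```python
from math import erfc, sqrt

def chi_squared_statistic(observed):
    """Sum of (O - E)^2 / E over all cells of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (o - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 table of observed counts.
observed = [[30, 20],
            [20, 30]]

chi2 = chi_squared_statistic(observed)
# With 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
```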

Non-Parametric Test

Non-parametric tests do not make assumptions about the distribution of the data. They are useful when the data does not meet the assumptions required for parametric tests.

A/B Testing or Split Testing

A/B testing, also known as split testing, is a method used to compare two versions (A and B) of a webpage, app, or marketing asset to determine which one performs better.

Example: A product manager changes a website’s “Shop Now” button color from green to blue to improve the click-through rate (CTR). After formulating null and alternative hypotheses, users are divided into groups A and B, and the CTR of each group is recorded. A statistical test such as the chi-square test or t-test is applied at a 5% significance level (95% confidence level). If the p-value is below 0.05, the manager may conclude that changing the button color significantly affects CTR, which informs the decision on permanent implementation.
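
The button-color example might be analyzed as follows. This sketch assumes a two-proportion z-test (one common choice for comparing click-through rates; the chi-square test on the 2×2 table of clicks is closely related). The click counts and the function name `two_proportion_z_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Return (z, two-sided p-value) for H0: both groups have the same CTR."""
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled click-through rate under the null hypothesis.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical results: green button (A) vs. blue button (B).
z, p_value = two_proportion_z_test(clicks_a=100, users_a=1000,
                                   clicks_b=130, users_b=1000)
significant = p_value < 0.05
print(f"z = {z:.3f}, p-value = {p_value:.4f}, significant at 5%: {significant}")
```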

Regression

Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables.

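The equation for regression (simple linear case): y = β₀ + β₁x + ε

Where,

  • y is the dependent (response) variable
  • x is the independent (predictor) variable
  • β₀ is the intercept
  • β₁ is the regression coefficient (slope)
  • ε is the error term

With more than one independent variable, the equation extends to y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε.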

Regression coefficient is a measure of the strength and direction of the relationship between a predictor variable (independent variable) and the response variable (dependent variable).
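
For a single predictor, the intercept and regression coefficient can be estimated by ordinary least squares. The sketch below is one possible implementation, assuming the usual closed-form solution (slope = Sxy / Sxx); the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy: co-variation of x and y; Sxx: variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f}x")
```

A positive slope indicates that y tends to increase with x, and its magnitude gives the expected change in y per unit change in x.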

Conclusion

In summary, statistics is a vital tool for understanding and utilizing data across various fields. Descriptive statistics simplify and organize data, while inferential statistics allow us to draw conclusions and make predictions based on samples. Measures like central tendency, dispersion, and shape offer insights into data characteristics. Hypothesis testing, confidence intervals, and probability distributions help make informed decisions and analyze relationships between variables. Whether you’re preparing for an interview, exploring data science, or making business choices, a solid grasp of statistics is essential for success in navigating and interpreting the complexities of data.

Statistics Cheat Sheet – FAQs

Is this cheat sheet suitable for Class 10 students?

Yes, this cheat sheet simplifies statistics concepts for easy understanding, suitable for Class 10 students.

Can this Statistics cheat sheet help in machine learning?

Yes, absolutely! Statistics is foundational to machine learning.

What are the top 5 fundamental statistics formulas?

Top 5 Fundamental Stats Formulas:

  1. Mean (average): Σxᵢ / n (numerical data)
  2. Median: Middle value (ordered data)
  3. Standard deviation: √(Σ(xᵢ – mean)² / (n – 1)) (numerical data)
  4. Probability: Favorable outcomes / Total possible outcomes
  5. Sample proportion: p = x / n (categorical data)
