# Understanding Hypothesis Testing

Statistics is an important part of data science: we use statistical assumptions to draw conclusions about a population from data. To do so, we form hypotheses about population parameters. **A hypothesis is** a testable statement about a given problem.

## What is Hypothesis Testing

Hypothesis testing is a statistical method for making decisions from experimental data. It starts from an assumption about a population parameter and evaluates two mutually exclusive statements about the population to determine which one is better supported by the sample data.

**Example:** You claim that the class average is 30, or that boys are taller than girls. Each of these is an assumption, and we need a statistical way to check it: a mathematical basis for deciding whether what we assume holds.

**Need for Hypothesis Testing**

Hypothesis testing is an important procedure in statistics. It evaluates two mutually exclusive statements about a population to determine which is better supported by the sample data. When we say that a finding is statistically significant, that claim rests on a hypothesis test.

**Parameters of hypothesis testing**

**Null hypothesis (H0):** In statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured cases or no difference among groups. In other words, it is a baseline assumption made from knowledge of the problem.

Example: A company's production is equal to 50 units per day.

**Alternative hypothesis (H1):** The alternative hypothesis is the statement used in hypothesis testing that contradicts the null hypothesis.

Example: A company's production is not equal to 50 units per day.

**Level of significance:** This is the threshold at which we reject the null hypothesis. Complete certainty is not possible when working from a sample, so we choose a level of significance, usually 5%. It is denoted by α and is generally 0.05, which means we accept a 5% chance of rejecting the null hypothesis when it is actually true.

**P-value:** The p-value, or calculated probability, is the probability of observing results at least as extreme as those in the sample when the null hypothesis (H0) is true. If the p-value is less than the chosen significance level, you reject the null hypothesis, that is, you conclude that the sample supports the alternative hypothesis.

### Steps in Hypothesis Testing

**Step 1** – We identify the problem and state the two hypotheses, keeping in mind that they must contradict one another.

**Step 2** – We consider the statistical assumptions, such as whether the data are normally distributed and whether the observations are statistically independent.

**Step 3** – We decide on the test data against which we will check our hypothesis.

**Step 4** – We evaluate the test data. In this step we compute quantities such as the mean and the z-score.

**Step 5** – We decide whether to reject or fail to reject the null hypothesis.

**Example:** Given a coin that may be fair or tricky, let's set up the null and alternative hypotheses.

- Null Hypothesis (H0): the coin is fair.
- Alternative Hypothesis (H1): the coin is tricky.
- Level of significance: α = 0.05
- Toss the coin once and assume the result is heads: p-value = 0.5 (heads and tails have equal probability).
- Toss the coin a second time and assume the result is heads again: p-value = 0.5 × 0.5 = 0.25.

Similarly, if we toss the coin 6 consecutive times and get heads every time, the p-value is 0.5^6 ≈ 0.016. The significance level is the error rate we allow, and the p-value is now below it, so the null hypothesis does not hold up: we reject it and conclude that the coin is tricky.
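The coin example can be sketched in a few lines of Python. This is only an illustration: the name `prob_all_heads` and the constant `ALPHA` are introduced here, and the p-value is the one-sided probability of seeing an unbroken run of heads from a fair coin.

```python
ALPHA = 0.05  # assumed significance level for this illustration


def prob_all_heads(k, p_head=0.5):
    """Probability of k consecutive heads if the coin is fair (H0 true)."""
    return p_head ** k


for k in range(1, 7):
    p_value = prob_all_heads(k)
    decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
    print(f"{k} heads in a row: p-value = {p_value:.4f} -> {decision}")
```

With this threshold the p-value already drops below 0.05 at the fifth consecutive head (0.5^5 ≈ 0.031), so six heads in a row is comfortably inside the rejection region.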

### Formula For Hypothesis Testing

To validate a hypothesis about a population parameter we use statistical functions. We use the z-score, the p-value, and the level of significance (alpha) to weigh the evidence for our hypothesis.

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

where,

$\bar{x}$ is the sample mean,

μ represents the population mean,

σ is the standard deviation and

n is the size of the sample.
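As a worked illustration of the z-score, the sketch below plugs in made-up numbers loosely based on the production example (a hypothesized mean of 50 units per day). The sample mean, standard deviation, and sample size are hypothetical, and the helper names `z_score` and `two_sided_p_value` are introduced here. It uses only the standard library, relying on the identity 2·(1 − Φ(|z|)) = erfc(|z|/√2) for the two-sided p-value.

```python
import math


def z_score(sample_mean, pop_mean, sigma, n):
    """z = (sample mean - population mean) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (sigma / math.sqrt(n))


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers: 36 days observed, mean 52 units, sigma 6 units
z = z_score(sample_mean=52.0, pop_mean=50.0, sigma=6.0, n=36)
p = two_sided_p_value(z)

print(f"z = {z:.2f}")
print(f"two-sided p-value = {p:.4f}")
```

Here the standard error is 6 / √36 = 1, so the sample mean sits exactly 2 standard errors above the hypothesized mean and z = 2.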

## Python Implementation of Hypothesis Testing

We will use the SciPy library to compute the p-value and z-score for our sample dataset. SciPy is a Python library for mathematical and scientific computation. In this code, we create a function `hypothesis_test` that takes `pop_mean` (the population parameter we are testing our hypothesis against), the sample dataset, the level of significance (alpha value), and the type of test (one-tailed or two-tailed).

The settings for this hypothesis test are:

Level of significance (alpha): 0.05

Null hypothesis: population mean = 5.0

Alternative hypothesis: population mean != 5.0

```python
import numpy as np
from scipy.stats import norm


def hypothesis_test(sample, pop_mean, alpha=0.05, two_tailed=True):
    # Size of the sample dataset
    n = len(sample)

    # Mean and standard deviation of the dataset
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)

    # Calculate the test statistic
    z = (sample_mean - pop_mean) / (sample_std / np.sqrt(n))

    # Calculate the p-value based on the test type
    if two_tailed:
        p_value = 2 * (1 - norm.cdf(abs(z)))
    else:
        if z < 0:
            p_value = norm.cdf(z)
        else:
            p_value = 1 - norm.cdf(z)

    # Determine whether to reject or fail to
    # reject the null hypothesis
    if p_value < alpha:
        result = "reject"
    else:
        result = "fail to reject"

    return z, p_value, result
```

### Evaluate Hypothesis Function on Sample Dataset

To evaluate our hypothesis test function, we create a sample dataset of 20 points drawn from a normal distribution with mean 4.5 and standard deviation 2. We then test the hypothesis that the population mean equals 5.

```python
np.random.seed(0)
sample = np.random.normal(loc=4.5, scale=2, size=20)
pop_mean = 5.0

# Test the null hypothesis that
# the population mean is equal to 5.0
z, p_value, result = hypothesis_test(sample, pop_mean)
print(f"Test statistic: {z:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Result: {result} null hypothesis at alpha=0.05")
```

**Output:**

```
Test statistic: 1.6372
P-value: 0.1016
Result: fail to reject null hypothesis at alpha=0.05
```

In the above example, we get a p-value of 0.1016 from the dataset, which is greater than our level of significance (alpha value) of 0.05. Hence, in this case we fail to reject the null hypothesis that the population mean is 5.0.

What if our decision is wrong? A test can reject a null hypothesis that is actually true, or fail to reject one that is actually false. Based on which mistake we make, errors fall into two types.
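To see how the choice between a one-tailed and a two-tailed test changes the p-value, the sketch below starts from the test statistic reported in the output above (z = 1.6372) and computes both. It is an illustration only and uses `statistics.NormalDist` from the standard library rather than SciPy; the variable names are introduced here.

```python
from statistics import NormalDist

z = 1.6372  # test statistic reported in the output above

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-tailed: extreme values in either direction count against H0
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))

# One-tailed (upper): only values above the hypothesized mean count
p_one_tailed = 1 - std_normal.cdf(z)

print(f"Two-tailed p-value: {p_two_tailed:.4f}")
print(f"One-tailed p-value: {p_one_tailed:.4f}")
```

For a positive z the one-tailed p-value is exactly half the two-tailed one, so a result that is not significant in a two-tailed test can sit much closer to the threshold in a one-tailed test. The type of test should therefore be chosen before looking at the data.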

**Error in Hypothesis Testing**

**Type I error:** We reject the null hypothesis although it is true. The probability of a Type I error is denoted by alpha.

**Type II error:** We fail to reject the null hypothesis although it is false. The probability of a Type II error is denoted by beta.
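Both error rates can be estimated by simulation. The sketch below is a rough illustration under assumed settings: samples of 100 points with standard deviation 2, a hypothesized mean of 5.0, and a true mean of 5.5 for the Type II case. The helper names `z_test_rejects` and `rejection_rate` are hypothetical, and the code uses only the standard library rather than SciPy.

```python
import random
import statistics
from statistics import NormalDist


def z_test_rejects(sample, pop_mean, alpha=0.05):
    """Two-tailed z-test; returns True when H0 is rejected."""
    n = len(sample)
    std_err = statistics.stdev(sample) / n ** 0.5
    z = (statistics.mean(sample) - pop_mean) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def rejection_rate(true_mean, pop_mean, n=100, trials=2000, seed=0):
    """Fraction of simulated samples for which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 2.0) for _ in range(n)]
        rejections += z_test_rejects(sample, pop_mean)
    return rejections / trials


# H0 is true: every rejection is a Type I error
type_1_rate = rejection_rate(true_mean=5.0, pop_mean=5.0)

# H0 is false: every failure to reject is a Type II error
type_2_rate = 1 - rejection_rate(true_mean=5.5, pop_mean=5.0)

print(f"Estimated Type I error rate:  {type_1_rate:.3f}")
print(f"Estimated Type II error rate: {type_2_rate:.3f}")
```

When the null hypothesis is true, the estimated Type I rate should land near the chosen alpha of 0.05. The Type II rate depends on how far the true mean is from the hypothesized one and on the sample size.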
