Guide to parametric tests in Python

Different statistical tests answer different questions about data. When the data fall into groups and we want to know whether the group means differ, the Student’s t-test is a common choice. In this article, we discuss the Student’s t-test in detail, starting with its fundamentals. To see how it works in practice, we take some sample data and run the tests in Python, building each one from scratch. The major points covered in this article are listed below.

Table of contents

  1. What is a Student’s T-test?
  2. Types of T-test
  3. Where and how do we use the T-test?
  4. Implementing T-test using python

Let’s start with a brief introduction to the Student’s t-test.

What is a Student’s T-test?

The t-test is a test of difference between means. It compares groups of continuous data, summarised by their mean and standard deviation, that have certain features in common. Generally, it tests whether two samples belong to the same population, under the null hypothesis that their means are equal. It is a statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis.

Types of T-test?

There are three types of t-test:

  1. One sample t-test: We use this test when there is a single sample and we want to compare its mean with a known or hypothesised population mean, to test whether the sample could belong to that population.
  2. Two sample t-test: We use this test when there are two independent groups of samples and we want to find out whether they belong to the same population.
  3. Paired t-test: We use this test when the samples are paired. For example, suppose we are trying out a new exercise and measure the same people before and after it. To find out whether the exercise is beneficial, we use a paired t-test, because the two sets of measurements are connected.

The mathematical formula of the one sample t-test is given as:

                                         t = ( x̄ – μ) / (s / √n)

Where, 

t  = t-statistic

x̄  = mean of the sample

μ  = theoretical mean of the population

s  = standard deviation of the sample

n  = sample size

As observed above, two kinds of mean appear in the formula: the population mean and the sample mean. Let’s understand their significance.

  • The population mean is the mean of the population, i.e. the data from which the samples are drawn.
  • The sample mean is the mean of the sample on which the test is being conducted.
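The one-sample formula can be written out directly in Python. The following is a minimal sketch rather than a reference implementation: the function name one_sample_t and the heights list are made up for illustration, and the sample standard deviation is taken with ddof=1.

```python
import numpy as np
from scipy import stats


def one_sample_t(sample, mu):
    """Return t = (x_bar - mu) / (s / sqrt(n)) for a one-sample t-test."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    x_bar = x.mean()        # sample mean
    s = x.std(ddof=1)       # sample standard deviation (divides by n - 1)
    return (x_bar - mu) / (s / np.sqrt(n))


# Hypothetical data, used only to exercise the formula
heights = [170.0, 168.5, 174.2, 169.9, 171.3, 172.8]
t_manual = one_sample_t(heights, mu=170.0)

# SciPy's built-in one-sample test should give the same statistic
t_scipy = stats.ttest_1samp(heights, popmean=170.0).statistic
print(t_manual, t_scipy)
```

Comparing the hand-written value with scipy.stats.ttest_1samp is a quick way to check that the formula has been transcribed correctly.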

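In the case of the two-sample test, the following formula is applied. This form uses each sample’s own variance; the implementation section later in this article uses the pooled-variance variant, which additionally assumes that the two variances are equal.

                              t = (x̄₁ – x̄₂) / √[(s₁² / n₁) + (s₂² / n₂)]

Where,

x̄₁  = observed mean of the 1st sample

x̄₂  = observed mean of the 2nd sample

s₁  = standard deviation of the 1st sample

s₂  = standard deviation of the 2nd sample

n₁  = size of the 1st sample

n₂  = size of the 2nd sample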
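As a rough illustration, the two-sample formula above translates into a few lines of NumPy. The function name two_sample_t and both data lists are hypothetical. Because this form keeps the two variances separate, it corresponds to what SciPy computes with equal_var=False (Welch’s t-test), which the sketch uses as a cross-check.

```python
import numpy as np
from scipy import stats


def two_sample_t(sample_a, sample_b):
    """Return t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    var_a = a.var(ddof=1)   # sample variance of the first group
    var_b = b.var(ddof=1)   # sample variance of the second group
    return (a.mean() - b.mean()) / np.sqrt(var_a / a.size + var_b / b.size)


# Hypothetical measurements for two independent groups
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [4.7, 5.0, 4.4, 5.2, 4.8]

t_manual = two_sample_t(group_a, group_b)
# equal_var=False keeps the variances unpooled, matching the formula above
t_scipy = stats.ttest_ind(group_a, group_b, equal_var=False).statistic
print(t_manual, t_scipy)
```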

Where and how do we use the T-test?

As the t-test is a parametric test of difference, we use it when we need to compare the means of groups of continuous data. It can compare at most two groups. To compare more than two groups, use an ANOVA test, followed by post-hoc tests if needed. Before moving on to how we use it, let’s first discuss some assumptions.

As it is a parametric test, it shares the assumptions of other parametric tests. The t-test assumes your data:

  1. Are independent,
  2. Have a normal distribution (approx), and
  3. Have homogeneous variance (the amount of variance is the same in each group)
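Two of these assumptions can be checked numerically before running a t-test. The sketch below is one possible approach, not a prescribed procedure: it uses the Shapiro-Wilk test for normality and Levene’s test for equal variances, both from scipy.stats, on synthetic data generated just for the example. Independence depends on how the data were collected and cannot be checked this way.

```python
import numpy as np
from scipy import stats

# Synthetic groups, generated only for this illustration
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50.0, scale=5.0, size=30)
group_b = rng.normal(loc=52.0, scale=5.0, size=30)

# Shapiro-Wilk: the null hypothesis is that the data are normally distributed
_, p_normal_a = stats.shapiro(group_a)
_, p_normal_b = stats.shapiro(group_b)

# Levene: the null hypothesis is that the groups have equal variances
_, p_equal_var = stats.levene(group_a, group_b)

alpha = 0.05
print("normality not rejected:", p_normal_a > alpha and p_normal_b > alpha)
print("equal variance not rejected:", p_equal_var > alpha)
```

A large p-value here only means the assumption was not rejected; it does not prove that the assumption holds.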

How to use the t-test?

Since the t-test is a statistical test of a hypothesis, we need to define that hypothesis. The null (H0) and alternate (H1) hypotheses vary according to the type of t-test.

  1. Hypothesis for One sample t-test:

Problem: Compare population mean and sample mean.

  • H0 : There is no significant difference between the population mean and the sample mean (μ = x̄).
  • H1 : There is a significant difference between the population mean and the sample mean (μ ≠ x̄).
  2. Hypothesis for Two-sample t-test:

Problem: Compare the means of two groups.

  • H0: There is no significant mean difference between the groups (μ₁ = μ₂)
  • H1: There is a significant mean difference between the groups (μ₁ ≠ μ₂)
  3. Hypothesis for paired sample t-test:

Problem: Compare means between two groups that are paired.

  • H0: There is no significant difference between the two sample means (x̄₁ = x̄₂)
  • H1: There is a significant difference between the two sample means (x̄₁ ≠ x̄₂)
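Whichever pair of hypotheses applies, the decision step is the same: compare the p-value of the test with the chosen significance level alpha. A minimal sketch, in which the helper name decide and the default alpha of 0.05 are assumptions made for illustration:

```python
def decide(p_value, alpha=0.05):
    """Return the outcome of a hypothesis test at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    # A small p-value means the observed data would be unlikely under H0
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(0.03))   # reject H0
print(decide(0.22))   # fail to reject H0
```

Note that failing to reject H0 is not the same as proving H0; it only means the data do not provide enough evidence against it.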

Implementation in python

The goal of this section is to show how to implement different types of t-tests. As we know there are three types of t-test, so let us start these tests one by one.

  1. One sample t-test

Determine the hypothesis

  • H0 : population mean is greater than or equal to sample mean (μ ≥ x̄)
  • H1 : population mean is less than sample mean (μ < x̄)

First, we import the libraries we will use.

import numpy as np
from scipy import stats 

Next, we will create a random sample or we can read it from a data frame.

sample = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
pop_mean = 165

The sample is stored in the variable sample and the population mean in the variable pop_mean. Let’s calculate the mean and the standard error of the sample.

mean = np.mean(sample)
std_error = np.std(sample) / np.sqrt(len(sample))
print("sample mean :", mean)
print("standard error:", std_error)

sample mean : 157.21428571428572
standard error: 6.034914208534632

According to the formula we need sample mean and standard error. The formula for the standard error is:

                                     standard error= (s / √n)

Where, 

s  =standard deviation of the sample

n  = sample size

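np.std() calculates the standard deviation. By default it divides by n (ddof=0, the population formula); pass ddof=1 to get the sample standard deviation s, which divides by n - 1.

np.sqrt(len(sample)) calculates the square root of the sample size.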
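The choice of ddof changes the standard error slightly, which in turn changes the t-statistic. The short sketch below compares the two conventions on the sample defined above and checks the ddof=1 version against scipy.stats.sem, which uses ddof=1 by default. The variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

sample = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
n = len(sample)

# NumPy's default (ddof=0) divides by n: the population formula
se_population = np.std(sample) / np.sqrt(n)

# ddof=1 divides by n - 1: the sample standard deviation s
se_sample = np.std(sample, ddof=1) / np.sqrt(n)

# SciPy's helper for the standard error of the mean (ddof=1 by default)
se_scipy = stats.sem(sample)

print("standard error, ddof=0:", se_population)
print("standard error, ddof=1:", se_sample)
print("scipy.stats.sem       :", se_scipy)
```

The ddof=1 value is always the larger of the two, and the gap shrinks as the sample size grows.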

Let’s calculate the t-statistic, the critical t-values and the p-value for the comparison.

# calculate t statistics
t = abs(mean - pop_mean) / std_error
print('t static:',t)
# two-tailed critical value at alpha = 0.05
t_crit = stats.t.ppf(q=0.975, df=13)
print("Critical value for t two tailed:",t_crit)
 
# one-tailed critical value at alpha = 0.05
t_crit = stats.t.ppf(q=0.95, df=13)
print("Critical value for t one tailed:",t_crit)
 
 
# get two-tailed p value
p_value = 2*(1-stats.t.cdf(x=t, df=13))
print("p-value:",p_value)

t static: 1.2901118419717794

Critical value for t two tailed: 2.1603686564610127

Critical value for t one tailed: 1.7709333959867988

p-value: 0.21948866305060344

The code above gives us the t-statistic, the critical t-values and the p-value. Since the p-value (about 0.22) is greater than the alpha value (0.05), and the t-statistic (about 1.29) is below both critical values, we fail to reject the null hypothesis. The data therefore give no evidence against the claim that the population mean is greater than or equal to the sample mean.
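The same test is available as a single call in SciPy, which is a convenient cross-check for a hand calculation. The sketch below reuses the sample and pop_mean values from above. Keep in mind that scipy.stats.ttest_1samp uses the sample standard deviation (ddof=1), so its statistic is slightly smaller in magnitude than one computed with np.std’s default ddof=0, and its sign is negative because the sample mean lies below the population mean.

```python
import numpy as np
from scipy import stats

sample = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
pop_mean = 165

# SciPy's one-sample t-test; the reported p-value is two-tailed by default
result = stats.ttest_1samp(sample, popmean=pop_mean)

# The same statistic by hand, using the sample standard deviation (ddof=1)
se = np.std(sample, ddof=1) / np.sqrt(len(sample))
t_manual = (np.mean(sample) - pop_mean) / se

print("t statistic (scipy):", result.statistic)
print("t statistic (manual):", t_manual)
print("two-tailed p-value:", result.pvalue)
```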

  2. Two sample t-test

First, we will specify the hypothesis:

  • H0 : There is no significant mean difference (μ₁ = μ₂)
  • H1 : There is a significant mean difference (μ₁ ≠ μ₂)

Again, we import the libraries we will use.

import numpy as np
from scipy import stats 

Let’s create a random sample or we can read it from the data frame.

sample_1=[13.4,10.9,11.2,11.8,14,15.3,14.2,12.6,17,16.2,16.5,15.7]
sample_2=[12,11.7,10.7,11.2,14.8,14.4,13.9,13.7,16.9,16,15.6,16]

These two samples are lists of floating-point numbers. Next, we calculate the mean and the variance of each sample, then the pooled variance and the standard error.

sample1_bar, sample2_bar = np.mean(sample_1), np.mean(sample_2)
n1, n2 = len(sample_1), len(sample_2)
var_sample1, var_sample2= np.var(sample_1, ddof=1), np.var(sample_2, ddof=1)
# pooled sample variance
var = ( ((n1-1)*var_sample1) + ((n2-1)*var_sample2) ) / (n1+n2-2)
# standard error
std_error = np.sqrt(var * (1.0 / n1 + 1.0 / n2))
 
print("sample_1 mean:",np.round(sample1_bar,4))
print("sample_2 mean:",np.round(sample2_bar,4))
print("variance of sample_1:",np.round(var_sample1,4))
print("variance of sample_2:",np.round(var_sample2,4))
print("pooled sample variance:",var)
print("standard error:",std_error)

sample_1 mean: 14.0667

sample_2 mean: 13.9083

variance of sample_1: 4.4788

variance of sample_2: 4.3445

pooled sample variance: 4.411628787878788

standard error: 0.8574797167551339

In the code above, we calculated the mean and variance of each sample. The output shows that the two sample variances differ only slightly, which is consistent with the equal-variance assumption, so we pool them into a single variance estimate and use it to calculate the standard error.

Let’s calculate the t-statistic, the critical t-values and the p-values for the comparison. The pooled two-sample test has n1 + n2 - 2 = 22 degrees of freedom.

# calculate t statistics
t = abs(sample1_bar - sample2_bar) / std_error
print('t static:',np.round(t,3))
# degrees of freedom for the pooled two-sample test
df = n1 + n2 - 2
# two-tailed critical value at alpha = 0.05
t_c = stats.t.ppf(q=0.975, df=df)
print("Critical value for t two tailed:",np.round(t_c,3))
 
# one-tailed critical value at alpha = 0.05
t_c = stats.t.ppf(q=0.95, df=df)
print("Critical value for t one tailed:",np.round(t_c,3))
 
# get two-tailed p value
p_two = 2*(1-stats.t.cdf(x=t, df=df))
print("p-value for two tailed:",np.round(p_two,3))
 
# get one-tailed p value
p_one = 1-stats.t.cdf(x=t, df=df)
print("p-value for one tailed:",np.round(p_one,3))

t static: 0.185

Critical value for t two tailed: 2.074

Critical value for t one tailed: 1.717

p-value for two tailed: 0.855

p-value for one tailed: 0.428

As observed in the output, the p-value is greater than the alpha value (0.05), so we fail to reject the null hypothesis. Therefore, there is no significant mean difference between the two samples.
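For comparison, SciPy offers the pooled-variance test as a single call. The sketch below reuses sample_1 and sample_2 from above and checks scipy.stats.ttest_ind with equal_var=True against the pooled formula: the difference in means divided by the pooled standard error, with n1 + n2 - 2 degrees of freedom. The variable names pooled_var and t_manual are chosen for this example.

```python
import numpy as np
from scipy import stats

sample_1 = [13.4, 10.9, 11.2, 11.8, 14, 15.3, 14.2, 12.6, 17, 16.2, 16.5, 15.7]
sample_2 = [12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16]

# equal_var=True selects the pooled-variance (Student's) two-sample test
result = stats.ttest_ind(sample_1, sample_2, equal_var=True)

# Pooled variance and t statistic computed by hand, for comparison
n1, n2 = len(sample_1), len(sample_2)
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * np.var(sample_1, ddof=1)
              + (n2 - 1) * np.var(sample_2, ddof=1)) / df
t_manual = (np.mean(sample_1) - np.mean(sample_2)) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

print("t statistic (scipy):", result.statistic)
print("t statistic (manual):", t_manual)
print("degrees of freedom:", df)
print("two-tailed p-value:", result.pvalue)
```

If Levene’s test or a look at the variances suggests they are not equal, passing equal_var=False switches to Welch’s test, which does not pool the variances.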

  3. Paired t-test

First, we will specify the hypothesis:

  • H0 : There is no change after the tuition (x̄₁ = x̄₂)
  • H1 : There is a change after the tuition (x̄₁ ≠ x̄₂)

Let’s create a random sample or we can read it from the data frame.

result_1 = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
result_2=[ 24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]

The samples above are the records of students’ marks before and after the tuition. Next, we calculate the means, the standard error, the t-statistic, the critical t-value and the p-value.

mean1, mean2 = np.mean(result_1), np.mean(result_2)
n = len(result_1)
# significance level
alpha = 0.05
# sum squared difference between observations
d1 = sum([(result_1[i]-result_2[i])**2 for i in range(n)])
# sum difference between observations
d2 = sum([result_1[i]-result_2[i] for i in range(n)])
# standard deviation of the differences
std_dev = np.sqrt((d1 - (d2**2 / n)) / (n - 1))
# standard error of the mean difference
se = std_dev / np.sqrt(n)
t_stat = (mean1 - mean2) / se
df = n - 1
# one-tailed critical value at alpha (use 1.0 - alpha / 2 for a two-tailed one)
critical =stats.t.ppf(1.0 - alpha, df)
# two-tailed p value
p = (1.0 - stats.t.cdf(abs(t_stat), df)) * 2.0
print(t_stat,critical,p)

-1.7073311796734205 1.8124611228107335 0.11856467647601066

Here we calculated the mean of each sample, then the sum of the differences and of the squared differences. From these we found the standard deviation and standard error of the differences, and then the t-statistic, the critical t-value and the p-value. Since the p-value (about 0.12) is greater than alpha (0.05), we fail to reject the null hypothesis. Therefore, we conclude that there is no statistically significant change in the results after the tuition.
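As a cross-check, the paired test can also be run with scipy.stats.ttest_rel, which works on the pairwise differences in the same way as the hand calculation. This sketch reuses result_1 and result_2 from above and should reproduce the t-statistic and the two-tailed p-value.

```python
from scipy import stats

result_1 = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
result_2 = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]

# Paired (related-samples) t-test; the reported p-value is two-tailed
result = stats.ttest_rel(result_1, result_2)

print("t statistic:", result.statistic)
print("two-tailed p-value:", result.pvalue)
```

The two lists must have the same length and the same ordering, since each position is treated as one before/after pair.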

Conclusion

In this article, we learned the Student’s t-test from the basics. We covered the different types of t-tests along with their formulae and assumptions, and saw how to implement each of them in Python from scratch.
