A guide to parametric tests in Python
There are different types of statistical tests used with data, each designed to extract a different insight. When data is divided into groups and we need to compare properties of those groups, the Student's t-test is used. This test is generally used to compare the similarities and differences between two groups. In this article, we will discuss the Student's t-test in detail, starting with its fundamentals. To learn how it can be implemented practically, we will take random data and perform the tests with Python, building up the concept from scratch.
Let's start with a brief introduction to the Student's t-test.

What is a Student's t-test?

The t-test is a test of difference between two groups of continuous data, summarized by their means and standard deviations, that share certain features in common. Generally, it tests whether two samples belong to the same population by considering the null hypothesis that their means are equal. It is a statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

Types of t-test

There are three types of t-test:
The mathematical formula of the one-sample t-test is:

t = (x̄ – μ) / (s / √n)

Where,
t = Student's t-statistic
x̄ = mean of the sample
μ = theoretical mean of the population
s = standard deviation of the sample
n = sample size

As observed above, two kinds of mean appear in the formula: the population mean and the sample mean. Let's understand their significance.
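As a quick numeric sketch of the formula above (the numbers here are made up purely for illustration and are unrelated to the worked example later in the article):

```python
import math

# Hypothetical sample summary, for illustration only
x_bar = 52.0   # sample mean
mu = 50.0      # hypothesized population mean
s = 4.0        # sample standard deviation
n = 16         # sample size

# t = (x̄ - μ) / (s / √n)
t = (x_bar - mu) / (s / math.sqrt(n))
print(t)  # → 2.0
```

Here s / √n = 4 / 4 = 1, so the statistic is simply the difference of the means, 2.0.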
In the case of a two-sample test with unequal variances, the following formula is applied:

t = (x̄1 – x̄2) / √[(s1² / n1) + (s2² / n2)]

Where,
x̄1 = observed mean of the 1st sample
x̄2 = observed mean of the 2nd sample
s1 = standard deviation of the 1st sample
s2 = standard deviation of the 2nd sample
n1 = size of the 1st sample
n2 = size of the 2nd sample

Where and how do we use the t-test?

As the t-test is a parametric test of difference, we use it when we need to check whether the means of two groups of continuous data differ. It can only be used for two groups of data; if you want to compare more than two groups, use the ANOVA test together with a post-hoc test. Before moving to how we use it, let's first discuss some assumptions. As it is a parametric test, the assumptions are the same as those of other parametric tests. The t-test assumes your data:
How to use the t-test?

Since the t-test is a statistical hypothesis test, we need to define the hypotheses. The null (H0) and alternate (H1) hypotheses vary according to the type of t-test.
Problem: Compare population mean and sample mean.
Problem: Compare mean of two groups.
Problem: Compare means between two groups that are paired.
Implementation in Python

The goal of this section is to show how to implement the different types of t-test. As we know, there are three types of t-test, so let us go through them one by one.
Determine the hypothesis
We will import the libraries to be used.

import numpy as np
from scipy import stats

Next, we will create a random sample (or it could be read from a data frame).

sample = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
pop_mean = 165

I have created a random sample stored in the variable sample and defined the population mean in the variable pop_mean. Let's calculate the mean and the standard error of the sample.

mean = np.mean(sample)
# ddof=1 gives the unbiased sample standard deviation
std_error = np.std(sample, ddof=1) / np.sqrt(len(sample))

sample mean: 157.2143
standard error: 6.2627

According to the formula, we need the sample mean and the standard error. The formula for the standard error is:

standard error = s / √n

Where,
s = standard deviation of the sample
n = sample size

np.std(sample, ddof=1) calculates the sample standard deviation and np.sqrt(len(sample)) calculates the square root of the sample size. Let's calculate the t-statistic, the t-critical values, and the p-value for the comparison.

# calculate t statistic
t = abs(mean - pop_mean) / std_error
print('t statistic:', t)
# two-tailed critical value at alpha = 0.05
t_crit_two = stats.t.ppf(q=0.975, df=13)
print("Critical value for t two-tailed:", t_crit_two)
# one-tailed critical value at alpha = 0.05
t_crit_one = stats.t.ppf(q=0.95, df=13)
print("Critical value for t one-tailed:", t_crit_one)
# two-tailed p-value
p_value = 2 * (1 - stats.t.cdf(x=t, df=13))
print("p-value:", p_value)

t statistic: 1.2432
Critical value for t two-tailed: 2.1604
Critical value for t one-tailed: 1.7709
p-value: 0.2357

(Values are rounded to four decimal places.) With the above lines of code, we got the t-statistic, the critical values, and the p-value. Since the p-value (0.2357) is greater than alpha (0.05), we fail to reject the null hypothesis: there is no significant difference between the sample mean and the population mean.
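As a sanity check (an addition, not part of the original walkthrough), the same one-sample comparison can be done in a single call with scipy.stats.ttest_1samp. Note that SciPy computes the sample standard deviation with ddof=1, so its statistic can differ slightly from a manual version that uses np.std's default ddof=0:

```python
from scipy import stats

sample = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
pop_mean = 165

# One-sample t-test in one call; uses the unbiased (ddof=1) sample SD
t_stat, p_value = stats.ttest_1samp(sample, pop_mean)
print(t_stat, p_value)  # t ≈ -1.243, p ≈ 0.236
```

The sign is negative because SciPy computes (sample mean − population mean); the magnitude and p-value match the manual computation, and p > 0.05 leads to the same fail-to-reject conclusion.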
First, we will specify the hypothesis:
We will import the libraries to be used.

import numpy as np
from scipy import stats

Let's create two random samples (or they could be read from a data frame).

sample_1 = [13.4, 10.9, 11.2, 11.8, 14, 15.3, 14.2, 12.6, 17, 16.2, 16.5, 15.7]
sample_2 = [12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16]

I have created two samples containing random float numbers. Calculate the mean and variance of both samples.

sample1_bar, sample2_bar = np.mean(sample_1), np.mean(sample_2)
n1, n2 = len(sample_1), len(sample_2)
var_sample1, var_sample2 = np.var(sample_1, ddof=1), np.var(sample_2, ddof=1)
# pooled sample variance
var = (((n1 - 1) * var_sample1) + ((n2 - 1) * var_sample2)) / (n1 + n2 - 2)
# standard error
std_error = np.sqrt(var * (1.0 / n1 + 1.0 / n2))
print("sample_1 mean:", np.round(sample1_bar, 4))
print("sample_2 mean:", np.round(sample2_bar, 4))
print("variance of sample_1:", np.round(var_sample1, 4))
print("variance of sample_2:", np.round(var_sample2, 4))
print("pooled sample variance:", var)
print("standard error:", std_error)

sample_1 mean: 14.0667
sample_2 mean: 13.9083
variance of sample_1: 4.4788
variance of sample_2: 4.3445
pooled sample variance: 4.411628787878788
standard error: 0.8574797167551339

In the above lines of code, we calculated the mean and variance of the two samples. The two sample variances differ only slightly, so under the equal-variance assumption we pool them into a single variance estimate and then calculate the standard error. Let's calculate the t-statistic, the t-critical values, and the p-values for the comparison.
# calculate t statistic
t = abs(sample1_bar - sample2_bar) / std_error
print('t statistic:', t)
# degrees of freedom for the pooled two-sample test: n1 + n2 - 2 = 22
dof = n1 + n2 - 2
# two-tailed critical value at alpha = 0.05
t_c_two = stats.t.ppf(q=0.975, df=dof)
print("Critical value for t two-tailed:", t_c_two)
# one-tailed critical value at alpha = 0.05
t_c_one = stats.t.ppf(q=0.95, df=dof)
print("Critical value for t one-tailed:", t_c_one)
# two-tailed p-value
p_two = 2 * (1 - stats.t.cdf(x=t, df=dof))
print("p-value for two-tailed:", p_two)
# one-tailed p-value
p_one = 1 - stats.t.cdf(x=t, df=dof)
print("p-value for one-tailed:", p_one)

t statistic: 0.1846
Critical value for t two-tailed: 2.0739
Critical value for t one-tailed: 1.7171
p-value for two-tailed: 0.8552
p-value for one-tailed: 0.4276

(Values are rounded to four decimal places. Note that the degrees of freedom for a pooled two-sample test are n1 + n2 − 2 = 22, not n − 1 of a single sample.) As observed in the output, the p-value is greater than the alpha value, so we fail to reject the null hypothesis. Therefore, there is no significant mean difference between the two samples.
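As a sanity check (an addition, not in the original walkthrough), scipy.stats.ttest_ind performs the same pooled-variance test in one call; with these samples the pooled statistic is |x̄1 − x̄2| / SE ≈ 0.185. Passing equal_var=False gives Welch's version, which matches the unequal-variance formula shown earlier in the article:

```python
from scipy import stats

sample_1 = [13.4, 10.9, 11.2, 11.8, 14, 15.3, 14.2, 12.6, 17, 16.2, 16.5, 15.7]
sample_2 = [12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16]

# Pooled-variance two-sample t-test (equal_var=True is the default)
t_stat, p_value = stats.ttest_ind(sample_1, sample_2)
print(t_stat, p_value)

# Welch's t-test (no equal-variance assumption), for comparison
t_welch, p_welch = stats.ttest_ind(sample_1, sample_2, equal_var=False)
print(t_welch, p_welch)
```

Both versions agree closely here because the two sample variances are nearly equal, and both give p > 0.05, the same fail-to-reject conclusion.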
First, we will specify the hypothesis:
Let's create two samples (or they could be read from a data frame).

result_1 = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
result_2 = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]

The samples above record students' marks before and after tuition. Calculate the mean, standard error, t-statistic, t-critical value, and p-value.

mean1, mean2 = np.mean(result_1), np.mean(result_2)
n = len(result_1)
alpha = 0.05
# sum of squared differences between paired observations
d1 = sum([(result_1[i] - result_2[i])**2 for i in range(n)])
# sum of differences between paired observations
d2 = sum([result_1[i] - result_2[i] for i in range(n)])
# standard deviation of the differences
std_dev = np.sqrt((d1 - (d2**2 / n)) / (n - 1))
# standard error of the mean difference
se = std_dev / np.sqrt(n)
t_stat = (mean1 - mean2) / se
df = n - 1
# calculate the critical value
critical = stats.t.ppf(1.0 - alpha, df)
# two-tailed p-value
p = (1.0 - stats.t.cdf(abs(t_stat), df)) * 2.0
print(t_stat, critical, p)

-1.7073311796734205 1.8124611228107335 0.11856467647601066

Here we calculated the means of the samples, then the sum of squared differences, and from these the standard deviation and standard error of the differences, the t-statistic, the t-critical value, and the p-value. On the basis of the p-value (0.119) > alpha (0.05), we fail to reject the null hypothesis. Therefore, we can conclude that there is no significant change in the results after the tuition.

Conclusion

With the help of this article, we could learn the Student's t-test from the basics. We also got a fair idea of the different types of t-test with their formulae and assumptions. Along with these, we gained a good understanding of how to implement the different types of t-test in Python from scratch.
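As a final sanity check on the paired example above (an addition, not in the original article), scipy.stats.ttest_rel reproduces the same statistic and p-value in a single call:

```python
from scipy import stats

result_1 = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
result_2 = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]

# Paired (dependent) t-test on before/after measurements
t_stat, p_value = stats.ttest_rel(result_1, result_2)
print(t_stat, p_value)  # t ≈ -1.707, p ≈ 0.119
```

The result matches the manual computation, and p > 0.05 confirms the fail-to-reject conclusion.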