Guide to Welch's t-test in Python
Welch’s t-test is a parametric univariate test for a significant difference between the means of two unrelated groups. It is an alternative to the independent t-test when the assumption of equal variances is violated. The hypothesis being tested is: the null hypothesis (H0) states that the two group means are equal, and the alternative hypothesis (HA) states that they are not. If the p-value is less than the chosen significance level, most commonly 0.05, one can reject the null hypothesis.

Like every test, this inferential test has assumptions. The assumptions that the data must meet in order for the test results to be valid are: the two groups are independent of each other, the dependent variable is continuous, and the dependent variable is approximately normally distributed within each group. If any of these assumptions are violated then another test should be used.

The data used in this example is from Kaggle.com and was posted by the user Web IR. The data set contains the sepal and petal length and width of various floral species. We will test whether there is a significant difference in petal length between the species Iris-setosa and Iris-virginica; the measurement is stored in the variable “petal_length” and the group label in “species”. Let’s import pandas as pd, load the data, and then take a look at what we will be working with!

    import pandas as pd

    df = pd.read_csv("Iris_Data.csv")
    df.groupby("species")['petal_length'].describe()
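Because Welch’s t-test is motivated by unequal variances, it can be useful to check that assumption directly once the group standard deviations are on screen. The following is a minimal sketch using scipy.stats.levene on synthetic stand-in data, not the Iris file: the arrays group_a and group_b, their means and spreads, and the 0.05 threshold are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in samples (hypothetical, NOT the Iris measurements):
# two groups with clearly different spreads.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.5, scale=0.2, size=50)
group_b = rng.normal(loc=5.5, scale=0.6, size=50)

# center="mean" gives the classic Levene test; SciPy's default,
# center="median", is the more robust Brown-Forsythe variant.
stat, p_value = stats.levene(group_a, group_b, center="mean")
print(f"Levene statistic = {stat:.3f}, p-value = {p_value:.4g}")

# A small p-value suggests the variances differ, which favors Welch's test
# over the equal-variance independent t-test.
unequal = p_value < 0.05
print("Variances look unequal:", unequal)
```

With real data, the two arrays would be replaced by the petal length column of each species subset.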
To make the code in the next steps a bit cleaner to read, I will create 2 data frames that are subsets of the original data, where each data frame only contains data for one flower species.

    setosa = df[(df['species'] == 'Iris-setosa')]
    virginica = df[(df['species'] == 'Iris-virginica')]

Welch’s t-test Example

The first thing we need to do is import scipy.stats as stats and then test our assumptions. We can test the assumption of normality using stats.shapiro(). Unfortunately, the output shown here is not labeled. The first value in the tuple is the W test statistic, and the second value is the p-value.

    from scipy import stats

    stats.shapiro(setosa['petal_length'])
    (0.9549458622932434, 0.05464918911457062)

    stats.shapiro(virginica['petal_length'])
    (0.9621862769126892, 0.10977369546890259)

Both p-values are above 0.05, so neither of the variables of interest violates the assumption of normality and we can continue with our analysis plan. To conduct a Welch’s t-test, one needs to use the stats.ttest_ind() function while passing False to the equal_var argument.

    stats.ttest_ind(setosa['petal_length'], virginica['petal_length'], equal_var = False)
    Ttest_indResult(statistic=-49.965703359355636, pvalue=9.7138670616970964e-50)

The p-value is well below 0.05, therefore one can reject the null hypothesis in support of the alternative.

Another piece of information you will need to report is the degrees of freedom (DoF). However, the result shown above does not include it (SciPy 1.11 and later also report it as the df attribute of the result object). Below are 2 functions that will give you what you need. The first only calculates the DoF, using the Welch-Satterthwaite equation, and reports it. The second conducts the Welch’s test, calculates the DoF the same way, and reports all the needed information.
    def welch_dof(x, y):
        # x and y are pandas Series, so .var() is the sample variance (ddof=1)
        dof = (x.var()/x.size + y.var()/y.size)**2 / ((x.var()/x.size)**2 / (x.size-1) + (y.var()/y.size)**2 / (y.size-1))
        print(f"Welch-Satterthwaite Degrees of Freedom= {dof:.4f}")

    welch_dof(setosa['petal_length'], virginica['petal_length'])
    Welch-Satterthwaite Degrees of Freedom= 58.5928

    def welch_ttest(x, y):
        ## Welch-Satterthwaite Degrees of Freedom ##
        dof = (x.var()/x.size + y.var()/y.size)**2 / ((x.var()/x.size)**2 / (x.size-1) + (y.var()/y.size)**2 / (y.size-1))

        # equal_var=False selects Welch's test instead of the pooled-variance t-test
        t, p = stats.ttest_ind(x, y, equal_var = False)

        print("\n",
              f"Welch's t-test= {t:.4f}", "\n",
              f"p-value = {p:.4f}", "\n",
              f"Welch-Satterthwaite Degrees of Freedom= {dof:.4f}")

    welch_ttest(setosa['petal_length'], virginica['petal_length'])

     Welch's t-test= -49.9657
     p-value = 0.0000
     Welch-Satterthwaite Degrees of Freedom= 58.5928

Welch’s t-test Interpretation

The current study aimed to test whether there was a significant difference in petal length between the floral species Setosa and Virginica. Setosa has a shorter petal length (M= 1.464 units, SD= 0.174 units) compared to Virginica (M= 5.552 units, SD= 0.552 units). Welch’s t-test was selected to analyze the data because Levene’s test for homogeneity of variances indicated unequal variances between groups (F= 39.977, p< 0.0001). The difference in petal length between the two species is statistically significant (Welch's t(58.593)= -49.966, p< 0.0001).
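As a sanity check on the reported numbers, the t statistic and the degrees of freedom can be approximately recomputed from the summary statistics alone. The sketch below uses only the standard library; the helper name welch_from_summary is hypothetical, and the group size of 50 flowers per species is an assumption, since the write-up does not state it.

```python
import math

def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, dof) for Welch's t-test from group means, SDs, and sizes."""
    v1 = sd1 ** 2 / n1  # variance of the mean for group 1
    v2 = sd2 ** 2 / n2  # variance of the mean for group 2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof

# Means and SDs are the rounded values from the interpretation;
# n = 50 per species is an assumption.
t_stat, dof = welch_from_summary(1.464, 0.174, 50, 5.552, 0.552, 50)
print(f"Welch's t({dof:.2f}) = {t_stat:.2f}")
```

Under these assumptions the result lands near t = -49.94 with about 58.64 degrees of freedom. The small gap from the full-data values (-49.966 and 58.593) comes from using rounded means and standard deviations.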