Correlation coefficients quantify the association between variables or features of a dataset. These statistics are of high importance for science and technology, and Python has great tools that you can use to calculate them. SciPy, NumPy, and Pandas correlation methods are fast, comprehensive, and well-documented.
In this tutorial, you’ll learn:

- What Pearson, Spearman, and Kendall correlation coefficients are
- How to use SciPy, NumPy, and Pandas correlation functions
- How to visualize data, regression lines, and correlation matrices with Matplotlib

You’ll start with an explanation of correlation, then see three quick introductory examples, and finally dive into the details of NumPy, SciPy, and Pandas correlation.

Correlation

Statistics and data science are often concerned with the relationships between two or more variables (or features) of a dataset. Each data point in the dataset is an observation, and the features are the properties or attributes of those observations. Every dataset you work with uses variables and observations. For example, you might be interested in understanding the following:

- How the height of basketball players is correlated with their shooting accuracy
- Whether there’s a relationship between employee experience and salary
- What mathematical dependence exists between the population density and the gross domestic product of different countries
In the examples above, the height, shooting accuracy, years of experience, salary, population density, and gross domestic product are the features or variables. The data related to each player, each employee, and each country are the observations. When data is represented in the form of a table, the rows of that table are usually the observations, while the columns are the features. Take a look at this employee table:
In this table, each row represents one observation, or the data about one employee (either Ann, Rob, Tom, or Ivy). Each column shows one property or feature (name, experience, or salary) for all the employees. If you analyze any two features of a dataset, then you’ll find some type of correlation between those two features. A plot of two features will show one of three different forms of correlation:

- Negative correlation: larger x values correspond to smaller y values, and vice versa
- Weak or no correlation: there’s no obvious trend
- Positive correlation: larger x values correspond to larger y values, and vice versa
For the employee table above, the correlation between experience and salary is positive, because higher experience corresponds to a larger salary and vice versa. Correlation is tightly connected to other statistical quantities like the mean, standard deviation, variance, and covariance. If you want to learn more about these quantities and how to calculate them with Python, then check out Descriptive Statistics with Python. There are several statistics that you can use to quantify correlation. In this tutorial, you’ll learn about three correlation coefficients:

- Pearson’s r
- Spearman’s rho
- Kendall’s tau
Pearson’s coefficient measures linear correlation, while the Spearman and Kendall coefficients compare the ranks of data. There are several NumPy, SciPy, and Pandas correlation functions and methods that you can use to calculate these coefficients. You can also use Matplotlib to conveniently illustrate the results.

Example: NumPy Correlation Calculation

NumPy has many statistics routines, including np.corrcoef(), which returns a matrix of Pearson correlation coefficients.
Here, you use np.arange() to create one array of consecutive values and np.array() to create a second array of the same length. Once you have two arrays of the same length, you can call np.corrcoef() with both arrays as arguments.
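A minimal sketch of this calculation might look like the following; the array values here are only illustrative, not anything you’re required to use:

```python
import numpy as np

# Illustrative data: any two equal-length numeric arrays will do
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

r = np.corrcoef(x, y)  # 2x2 Pearson correlation matrix
print(r)
```

The entries r[0, 1] and r[1, 0] hold the Pearson coefficient for the two arrays, while the diagonal entries are 1.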
np.corrcoef() returns the correlation matrix, a two-dimensional array whose values on the main diagonal (upper left and lower right) are equal to 1. The upper left value corresponds to the correlation coefficient for the first array with itself, and the lower right value for the second array with itself. However, what you usually need are the lower left and upper right values of the correlation matrix. These values are equal, and both represent the Pearson correlation coefficient for the two arrays. A scatter plot of this example would show the data points as red squares, annotated with the values of the three correlation coefficients.

Example: SciPy Correlation Calculation

SciPy also has many statistics routines, contained in scipy.stats. You can use the following methods to calculate the three correlation coefficients:

- pearsonr()
- spearmanr()
- kendalltau()
Each of these functions takes your two arrays as arguments and calculates the corresponding statistic.
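Here’s a sketch of how you might call them, reusing the same illustrative arrays as before:

```python
import numpy as np
import scipy.stats

# Illustrative data (hypothetical values)
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

print(scipy.stats.pearsonr(x, y))    # Pearson's r and its p-value
print(scipy.stats.spearmanr(x, y))   # Spearman's rho and its p-value
print(scipy.stats.kendalltau(x, y))  # Kendall's tau and its p-value
```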
Note that these functions return objects that contain two values:

- The correlation coefficient
- The p-value
You use the p-value in statistical methods when you’re testing a hypothesis. The p-value is an important measure that requires in-depth knowledge of probability and statistics to interpret. To learn more about p-values, you can read about the basics or check out a data scientist’s explanation of p-values. You can extract the p-values and the correlation coefficients with their indices, as the items of tuples.
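For example, a sketch of index-based access (with the same illustrative data as above):

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

r = scipy.stats.pearsonr(x, y)[0]  # item 0: correlation coefficient
p = scipy.stats.pearsonr(x, y)[1]  # item 1: p-value
```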
You could also use dot notation for the Spearman and Kendall coefficients, accessing the attributes of the result objects.
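A sketch of the attribute-based access; note that the exact attribute names can vary across SciPy versions (newer releases also expose .statistic):

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho = scipy.stats.spearmanr(x, y)
tau = scipy.stats.kendalltau(x, y)

print(rho.correlation, rho.pvalue)  # Spearman's rho and its p-value
print(tau.correlation, tau.pvalue)  # Kendall's tau and its p-value
```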
The dot notation is longer, but it’s also more readable and more self-explanatory. If you want to get the Pearson correlation coefficient and p-value at the same time, then you can unpack the return value.
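For example, a minimal unpacking sketch:

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Unpack the two-item result into the coefficient and the p-value
r, p = scipy.stats.pearsonr(x, y)
```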
This approach exploits Python unpacking and the fact that pearsonr() returns a tuple-like object with exactly two values.

Example: Pandas Correlation Calculation

Pandas is, in some cases, more convenient than NumPy and SciPy for calculating statistics. It offers statistical methods for Series and DataFrame instances. For example, given two Series objects with the same number of items, you can call .corr() on one of them with the other as the first argument.
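Here’s a sketch with two illustrative Series:

```python
import pandas as pd

# Illustrative Series (hypothetical values)
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

print(x.corr(y))  # Pearson's r by default
print(y.corr(x))  # correlation is symmetric, so this is the same value
```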
Here, you call .corr() on one Series and pass the other one as the first argument. .corr() has an optional parameter, method, which can take the value 'pearson', 'spearman', 'kendall', or a callable. The callable can be any function, method, or object with .__call__() that accepts two one-dimensional arrays and returns a floating-point number.

Linear Correlation

Linear correlation measures the proximity of the mathematical relationship between variables or dataset features to a linear function. If the relationship between the two features is closer to some linear function, then their linear correlation is stronger and the absolute value of the correlation coefficient is higher.

Pearson Correlation Coefficient

Consider a dataset with two features: x and y. Each feature has n values, so x and y are n-tuples. Say that the first value x₁ from x corresponds to the first value y₁ from y, the second value x₂ from x to the second value y₂ from y, and so on. Then, there are n pairs of corresponding values: (x₁, y₁), (x₂, y₂), and so on. Each of these x-y pairs represents a single observation.

The Pearson (product-moment) correlation coefficient is a measure of the linear relationship between two features. It’s the ratio of the covariance of x and y to the product of their standard deviations. It’s often denoted with the letter r and called Pearson’s r. You can express this value mathematically with this equation:

r = Σᵢ((xᵢ − mean(x))(yᵢ − mean(y))) / (√(Σᵢ(xᵢ − mean(x))²) √(Σᵢ(yᵢ − mean(y))²))

Here, i takes on the values 1, 2, …, n. The mean values of x and y are denoted with mean(x) and mean(y). This formula shows that if larger x values tend to correspond to larger y values and vice versa, then r is positive. On the other hand, if larger x values are mostly associated with smaller y values and vice versa, then r is negative. Here are some important facts about the Pearson correlation coefficient:

- It can take any real value in the range −1 ≤ r ≤ 1.
- The maximum value r = 1 corresponds to a perfect positive linear relationship between x and y.
- The value r = 0 corresponds to the case in which there’s no linear relationship between x and y.
- The minimum value r = −1 corresponds to a perfect negative linear relationship.
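The formula above translates directly into code. The following sketch computes r by hand for some illustrative data and lets you compare it with the value NumPy produces:

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Numerator: sum of products of deviations from the means
num = ((x - x.mean()) * (y - y.mean())).sum()
# Denominator: product of the square roots of the sums of squared deviations
den = np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
r = num / den
```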
The above facts can be summed up in the following table:

r value  | Correlation between x and y
r = 1    | perfect positive linear relationship
r > 0    | positive correlation
r = 0    | no linear relationship
r < 0    | negative correlation
r = −1   | perfect negative linear relationship
In short, a larger absolute value of r indicates stronger correlation, closer to a linear function. A smaller absolute value of r indicates weaker correlation.

Linear Regression: SciPy Implementation

Linear regression is the process of finding the linear function that is as close as possible to the actual relationship between features. In other words, you determine the linear function that best describes the association between the features. This linear function is also called the regression line.

You can implement linear regression with SciPy. You’ll get the linear function that best approximates the relationship between two arrays, as well as the Pearson correlation coefficient. To get started, you first need to import the libraries and prepare some data to work with.
Here, you import numpy and scipy.stats and define the arrays that hold your data. With the data ready, you can use scipy.stats.linregress() to perform the linear regression, passing both arrays as arguments.
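A minimal sketch of the call, with illustrative data:

```python
import numpy as np
import scipy.stats

# Illustrative data (hypothetical values)
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

result = scipy.stats.linregress(x, y)
print(result.slope, result.intercept, result.rvalue)
```

The rvalue attribute is the same Pearson correlation coefficient you’d get from np.corrcoef().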
That’s it! You’ve completed the linear regression and gotten the following results:

- slope: the slope of the regression line
- intercept: the intercept of the regression line
- rvalue: the Pearson correlation coefficient
- pvalue: the p-value for a hypothesis test whose null hypothesis is that the slope is zero
- stderr: the standard error of the estimated slope
You’ll learn how to visualize these results in a later section. You can also provide a single argument to linregress(): a two-dimensional array in which one dimension has length two.
The result is exactly the same as in the previous example, because linregress() splits a two-dimensional argument along its length-two dimension and treats the two resulting rows as the x and y data.
linregress() also accepts the transposed data: a two-dimensional array with two columns instead of two rows. In NumPy, you can transpose an array with the property .T or with np.transpose(). Now that you know how to get the transpose, you can pass one to linregress(): the first column becomes one feature and the second column the other.
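A sketch of both steps, stacking two illustrative arrays into a two-row array and then transposing it:

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
xy = np.array([x, y])  # shape (2, 10): rows are features

# xy.T has shape (10, 2): columns are features
result = scipy.stats.linregress(xy.T)
```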
Here, you use the transpose of the original array as the argument and get the same results. You should also be careful to note whether or not your dataset contains missing values. In data science and machine learning, you’ll often find some missing or corrupted data. The usual way to represent it in Python, NumPy, SciPy, and Pandas is with NaN, or Not a Number, values. But if your data contains nan values, then you won’t get a useful result with linregress(). In this case, your resulting object holds nan for all of its statistics.
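A small sketch of the failure mode, with one hypothetical missing value:

```python
import math
import numpy as np
import scipy.stats

x = np.arange(5.0)
y = np.array([2.0, 1.0, np.nan, 5.0, 8.0])  # one missing value

result = scipy.stats.linregress(x, y)
print(math.isnan(result.slope))  # the nan propagates through the statistics
```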
You can also check whether a variable corresponds to nan with math.isnan() or np.isnan().

Pearson Correlation: NumPy and SciPy Implementation

You’ve already seen how to get the Pearson correlation coefficient with np.corrcoef() and scipy.stats.pearsonr().
Note that if you provide an array containing nan values to pearsonr(), the result won’t be meaningful either. There are a few additional details worth considering. First, recall that np.corrcoef() returns the whole correlation matrix rather than a single coefficient.
The results are the same in this and previous examples. Again, the first row and column of the correlation matrix correspond to the first feature, and the second row and column to the second feature.
If you want to get the correlation coefficients for three features, then you just provide a numeric two-dimensional array with three rows as the argument. You’ll obtain the correlation matrix again, but this one will be larger than previous ones: three rows by three columns.
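A sketch with three hypothetical features as rows:

```python
import numpy as np

# Three features as rows (hypothetical data)
xyz = np.array([
    [10, 11, 12, 13, 14],
    [2, 1, 4, 5, 8],
    [5, 3, 2, 1, 0],
])

matrix = np.corrcoef(xyz)  # 3x3 correlation matrix
print(matrix)
```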
This is because np.corrcoef() considers each row of its argument as a separate feature. Here’s an interesting example of what happens when you pass data containing nan values to np.corrcoef().
In this example, the first two rows (or features) of the array contain valid numbers, so the corresponding entries of the correlation matrix are ordinary values, while every entry that involves the feature with nan is itself nan. By default, np.corrcoef() considers the rows as features and the columns as observations. If your features are in the columns instead, which is the convention widely used in machine learning, then pass the optional parameter rowvar=False.
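A sketch of the rowvar parameter, reusing the hypothetical three-feature array from above:

```python
import numpy as np

xyz = np.array([
    [10, 11, 12, 13, 14],
    [2, 1, 4, 5, 8],
    [5, 3, 2, 1, 0],
])

# Transpose so the columns are features, then tell corrcoef about it
matrix = np.corrcoef(xyz.T, rowvar=False)
```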
This array is identical to the one you saw earlier. Here, you apply a different convention, but the result is the same.

Pearson Correlation: Pandas Implementation

So far, you’ve used np.corrcoef() and scipy.stats.pearsonr() to calculate Pearson correlation coefficients. With Pandas, you can do the same with the .corr() method. First, import pandas and create some Series and DataFrame instances.
You now have the Pandas objects you’ll need. You’ve already learned how to use .corr() with Series objects, calling it on one object and passing the other as the argument.
Here, you call .corr() on one Series object and pass the other object as the first argument. If you provide a Series that contains nan values, .corr() will still return a useful number.
You get the same value of the correlation coefficient in these two examples. That’s because .corr() ignores the pairs of values that contain missing data. You can also use .corr() with DataFrame objects to get the correlation matrix for their columns.
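A sketch with a hypothetical two-column DataFrame (the column labels here are just illustrative):

```python
import pandas as pd

# Hypothetical column names and values
xy = pd.DataFrame({
    'x-values': range(10, 20),
    'y-values': [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
})

corr_matrix = xy.corr()  # DataFrame holding the Pearson coefficients
print(corr_matrix)
```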
The resulting correlation matrix is a new instance of DataFrame that holds the correlation coefficients for the columns. Its row and column labels match the column labels of the original DataFrame, so you can access individual coefficients in two ways:

- .at[] accepts the row and column labels
- .iat[] accepts the row and column positions as zero-based integers
You can apply .corr() the same way to DataFrame objects with three or more columns. You’ll get a larger correlation matrix, with one correlation coefficient for each pair of columns.
Another useful method is .corrwith(), which allows you to calculate the correlation coefficients between the rows or columns of one DataFrame object and another Series or DataFrame object passed as the first argument.
In this case, the result is a new Series with one correlation coefficient for each column of the calling DataFrame.
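A sketch of .corrwith() with hypothetical data:

```python
import pandas as pd

xy = pd.DataFrame({
    'x-values': range(10, 20),
    'y-values': [2, 1, 4, 5, 8, 12, 18, 25, 96, 48],
})
z = pd.Series([5, 3, 2, 1, 0, -2, -8, -11, -15, -16])

result = xy.corrwith(z)  # one coefficient per column of xy
print(result)
```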
Both .corr() and .corrwith() accept the optional method parameter, so you’re not limited to the Pearson coefficient.

Rank Correlation

Rank correlation compares the ranks, or the orderings, of the data related to two variables or dataset features. If the orderings are similar, then the correlation is strong, positive, and high. However, if the orderings are close to reversed, then the correlation is strong, negative, and low. In other words, rank correlation is concerned only with the order of values, not with the particular values from the dataset.

To illustrate the difference between linear and rank correlation, consider three relationships. The first is a perfect positive linear relationship between x and y, so r = 1. The second shows positive correlation and the third shows negative correlation; however, neither of these two is a linear function, so r is different from −1 or 1. When you look only at the orderings, or ranks, all three relationships are perfect! In the first two, larger x values always correspond to larger y values, which is perfect positive rank correlation. The third illustrates the opposite case: perfect negative rank correlation.

Spearman Correlation Coefficient

The Spearman correlation coefficient between two features is the Pearson correlation coefficient between their rank values. It’s calculated the same way as the Pearson correlation coefficient but takes into account their ranks instead of their values. It’s often denoted with the Greek letter rho (ρ) and called Spearman’s rho. Say you have two n-tuples, x and y, where (x₁, y₁), (x₂, y₂), … are the observations as pairs of corresponding values. Here are some important facts about the Spearman correlation coefficient:

- It can take a real value in the range −1 ≤ ρ ≤ 1.
- Its maximum value ρ = 1 corresponds to the case when the ranks are identical, and its minimum value ρ = −1 to the case when the ranks are exactly reversed.
- It measures monotonic relationships, which may be linear or not.
You can calculate Spearman’s rho in Python in a very similar way as you would Pearson’s r.

Kendall Correlation Coefficient

Let’s start again by considering two n-tuples, x and y. Each of the x-y pairs (x₁, y₁), (x₂, y₂), … is a single observation. A pair of observations (xᵢ, yᵢ) and (xⱼ, yⱼ), where i < j, will be one of the following:

- concordant if either (xᵢ > xⱼ and yᵢ > yⱼ) or (xᵢ < xⱼ and yᵢ < yⱼ)
- discordant if either (xᵢ < xⱼ and yᵢ > yⱼ) or (xᵢ > xⱼ and yᵢ < yⱼ)
- neither if there’s a tie in x (xᵢ = xⱼ) or a tie in y (yᵢ = yⱼ)
The Kendall correlation coefficient compares the number of concordant and discordant pairs of data. This coefficient is based on the difference in the counts of concordant and discordant pairs relative to the number of x-y pairs. It’s often denoted with the Greek letter tau (τ) and called Kendall’s tau. According to the scipy.stats documentation, you can calculate it as τ = (nᶜ − nᵈ) / √((nᶜ + nᵈ + nˣ)(nᶜ + nᵈ + nʸ)), where nᶜ is the number of concordant pairs, nᵈ is the number of discordant pairs, nˣ is the number of ties only in x, and nʸ is the number of ties only in y.
If a tie occurs in both x and y, then it’s not included in either nˣ or nʸ. The Wikipedia page on the Kendall rank correlation coefficient gives the following expression: τ = (2 / (n(n − 1))) Σᵢⱼ(sign(xᵢ − xⱼ) sign(yᵢ − yⱼ)) for i < j, where i = 1, 2, …, n − 1 and j = 2, 3, …, n. The sign function sign(z) is −1 if z < 0, 0 if z = 0, and 1 if z > 0. n(n − 1) / 2 is the total number of x-y pairs. Some important facts about the Kendall correlation coefficient are as follows:

- It can take a real value in the range −1 ≤ τ ≤ 1.
- Its maximum value τ = 1 corresponds to the case when the ranks of the corresponding values in x and y are the same, while its minimum value τ = −1 corresponds to the case when the rankings are reversed.
- A value near 0 indicates that the orderings of x and y are unrelated.
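The sign-based expression above is easy to implement directly. The following sketch uses a small tie-free sample (so nˣ = nʸ = 0 and the two formulas agree) and compares the result with SciPy:

```python
import numpy as np
import scipy.stats

# Small tie-free sample so the simple sign formula applies
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 5.0, 8.0])

n = len(x)
# Sum sign(x_i - x_j) * sign(y_i - y_j) over all pairs with i < j
s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        for i in range(n - 1) for j in range(i + 1, n))
tau = 2 * s / (n * (n - 1))
```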
You can calculate Kendall’s tau in Python similarly to how you would calculate Pearson’s r.

Rank: SciPy Implementation

You can use scipy.stats.rankdata() to determine the rank of each value in an array. First, prepare some data.
Now that you’ve prepared the data, you can determine the rank of each value in a NumPy array with rankdata(). The function returns an array of floating-point ranks, with 1 for the smallest value. If an array is monotonically increasing, then its ranks are simply 1, 2, and so on, in order. rankdata() also nicely handles ties: when two elements share the same value, each of them is, by default, assigned the average of the ranks they would otherwise occupy.
rankdata() treats nan values as if they were large and assigns them the highest ranks.
Rank Correlation: NumPy and SciPy Implementation

You can calculate the Spearman correlation coefficient with scipy.stats.spearmanr().
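A sketch of the call, which also verifies the definition of Spearman’s rho as Pearson’s r applied to the ranks:

```python
import numpy as np
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho, p = scipy.stats.spearmanr(x, y)

# Spearman's rho equals Pearson's r computed on the ranks
ranks_r = np.corrcoef(scipy.stats.rankdata(x), scipy.stats.rankdata(y))[0, 1]
```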
You can get the same result if you provide a single two-dimensional array that contains the same data as the two features. If the first row holds one feature and the second row the other, pass the optional parameter axis=1 to indicate that the features lie along the rows; the default, axis=0, treats each column as a feature. Another optional parameter, nan_policy, controls how nan values are handled: 'propagate' (the default) returns nan, 'raise' raises an error, and 'omit' ignores the observations with nan values.
If you provide a two-dimensional array with more than two features, then you’ll get the correlation matrix and the matrix of the p-values. Each element of the correlation matrix is the Spearman coefficient for the corresponding pair of features, and the values on the main diagonal are equal to 1. You can obtain the Kendall correlation coefficient with scipy.stats.kendalltau().
However, kendalltau() expects two one-dimensional arrays. If you provide only one two-dimensional array as the argument, then you’ll get a TypeError, because the second array is required.

Rank Correlation: Pandas Implementation

You can calculate the Spearman and Kendall correlation coefficients with Pandas. Just like before, you start by importing pandas and creating some Series and DataFrame instances.
Now that you have these Pandas objects, you can use .corr() and .corrwith() just as you did when you calculated the Pearson correlation coefficient. You just need to specify the desired statistic with the optional parameter method. To calculate Spearman’s rho, you pass method='spearman'. If you want Kendall’s tau, then you use method='kendall'.
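A sketch of both calls on illustrative Series:

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

rho = x.corr(y, method='spearman')  # Spearman's rho
tau = x.corr(y, method='kendall')   # Kendall's tau
print(rho, tau)
```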
As you can see, unlike with SciPy, you can use a single two-dimensional data structure (a DataFrame).

Visualization of Correlation

Data visualization is very important in statistics and data science. It can help you better understand your data and give you better insight into the relationships between features. In this section, you’ll learn how to visually represent the relationship between two features with an x-y plot. You’ll also use heatmaps to visualize a correlation matrix.

You’ll learn how to prepare data and get certain visual representations, but you won’t cover many other explanations. To learn more about Matplotlib in depth, check out Python Plotting With Matplotlib (Guide). You can also take a look at the official documentation and Anatomy of Matplotlib. To get started, first import matplotlib.pyplot.
Here, you use plt.style.use('ggplot') to set the style of the plots. You’ll use the arrays with the same data as in the previous sections, so re-create them if needed.
Now that you’ve got your data, you’re ready to plot.

X-Y Plots With a Regression Line

First, you’ll see how to create an x-y plot with the regression line, its equation, and the Pearson correlation coefficient. You can get the slope and the intercept of the regression line, as well as the correlation coefficient, with linregress().
Now you have all the values you need. You can also build a string with the equation of the regression line and the value of the correlation coefficient. f-strings are very convenient for this purpose.
Now, create the x-y plot that shows both the data points and the regression line, together with a legend that contains the line’s equation and the correlation coefficient.
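A sketch of the whole figure; the data, the marker choices, and the output filename 'xy-plot.png' are all illustrative, and the Agg backend is selected only so the script also runs without a display:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
import scipy.stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

res = scipy.stats.linregress(x, y)
line = f'Regression line: y={res.intercept:.2f}+{res.slope:.2f}x, r={res.rvalue:.2f}'

fig, ax = plt.subplots()
ax.plot(x, y, linewidth=0, marker='s', label='Data points')  # square markers only
ax.plot(x, res.intercept + res.slope * x, label=line)        # the regression line
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend(facecolor='white')
fig.savefig('xy-plot.png')
```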
In the resulting figure, the markers represent the observations, while the straight line is the regression line. Its equation is listed in the legend, together with the correlation coefficient.

Heatmaps of Correlation Matrices

The correlation matrix can become really big and confusing when you have a lot of features! Fortunately, you can present it visually as a heatmap where each field has a color that corresponds to its value. You’ll need the correlation matrix first.
It can be convenient for you to round the numbers in the correlation matrix with np.round() or the array method .round(), since the figure would otherwise show values with many decimal places. Finally, create your heatmap with .imshow(), passing the correlation matrix as the argument.
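A sketch of the heatmap; the three-feature data and the filename 'heatmap.png' are illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical three-feature data, one feature per row
xyz = np.array([
    [10, 11, 12, 13, 14],
    [2, 1, 4, 5, 8],
    [5, 3, 2, 1, 0],
])
corr = np.corrcoef(xyz).round(decimals=2)  # rounding keeps the cell labels short

fig, ax = plt.subplots()
im = ax.imshow(corr)
im.set_clim(-1, 1)  # fix the color scale to the full range of r
fig.colorbar(im)
# Write each coefficient in its cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, corr[i, j], ha='center', va='center', color='r')
fig.savefig('heatmap.png')
```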
The result is a table with the coefficients, similar to a Pandas output but with colored backgrounds. The colors help you interpret the output. In this example, the yellow color represents the number 1, green corresponds to 0.76, and purple is used for the negative numbers.

Conclusion

You now know that correlation coefficients are statistics that measure the association between variables or features of datasets. They’re very important in data science and machine learning. You can now use Python to calculate:

- Pearson’s product-moment correlation coefficient
- Spearman’s rank correlation coefficient
- Kendall’s rank correlation coefficient
Now you can use NumPy, SciPy, and Pandas correlation functions and methods to effectively calculate these (and other) statistics, even when you work with large datasets. You also know how to visualize data, regression lines, and correlation matrices with Matplotlib plots and heatmaps. If you have any questions or comments, please put them in the comments section below!