Calculate p value logistic regression python

$\begingroup$

I am building a multinomial logistic regression with sklearn (LogisticRegression). After fitting it, how can I get p-values and confidence intervals for my model? It appears that sklearn only provides the coefficients and the intercept.

Thank you a lot.


asked Nov 28, 2016 at 17:10

$\endgroup$

$\begingroup$

answered Nov 28, 2016 at 17:23

Hobbes

$\endgroup$


$\begingroup$

One way to get confidence intervals is to bootstrap your data, say, $B$ times and fit logistic regression models $m_i$ to the dataset $B_i$ for $i = 1, 2, ..., B$. This gives you a distribution for the parameters you are estimating, from which you can find the confidence intervals.
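The bootstrap procedure described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the data comes from make_classification as a stand-in for a real dataset, the problem is binary rather than multinomial, and B = 200 and the percentile interval are arbitrary choices made for the example. Note that LogisticRegression applies an L2 penalty by default, so the intervals describe the penalized estimates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data standing in for the real dataset
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

rng = np.random.default_rng(0)
B = 200  # number of bootstrap resamples (arbitrary for this sketch)
boot_coefs = np.empty((B, X.shape[1]))

for b in range(B):
    # Resample rows with replacement to form the bootstrap dataset B_i
    idx = rng.integers(0, len(y), size=len(y))
    m_i = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boot_coefs[b] = m_i.coef_[0]

# Percentile 95% confidence interval for each coefficient
ci_lower, ci_upper = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
for j, (lo, hi) in enumerate(zip(ci_lower, ci_upper)):
    print(f"coef {j}: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero suggests the corresponding coefficient is distinguishable from zero at roughly the 5% level.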

answered Nov 28, 2016 at 19:00

darXider

$\endgroup$

$\begingroup$

This is still not implemented and not planned, as it seems out of scope for sklearn, as per GitHub discussions #6773 and #13048.

However, the documentation on linear models now mentions that (P-value estimation note):

  • It is theoretically possible to get p-values and confidence intervals for coefficients in cases of regression without penalization.
  • The statsmodels package natively supports this.
  • Within sklearn, one could use bootstrapping.

It appears that it is possible to modify the LinearRegression class to calculate p-values from linear algebra, as per this Github code.
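The linear-algebra approach for LinearRegression can be sketched as below. This is not the GitHub code referred to above, only a minimal version of the standard ordinary least squares formulas, run on simulated data whose coefficients are assumptions of the example.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)  # second feature has no effect

model = LinearRegression().fit(X, y)

# Design matrix with an intercept column, matching fit_intercept=True
X_design = np.column_stack([np.ones(n), X])
beta = np.concatenate([[model.intercept_], model.coef_])

residuals = y - model.predict(X)
dof = n - X_design.shape[1]            # residual degrees of freedom
sigma2 = residuals @ residuals / dof   # unbiased estimate of the noise variance
cov = sigma2 * np.linalg.inv(X_design.T @ X_design)
se = np.sqrt(np.diag(cov))             # standard errors, intercept first

t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values
print(p_values)
```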

answered Mar 7, 2020 at 19:14

lcrmorin

$\endgroup$

In this Python tutorial, we will learn about scikit-learn logistic regression and work through examples related to it. We will cover these topics:

  • Scikit-learn logistic regression
  • Scikit-learn logistic regression standard errors
  • Scikit-learn logistic regression coefficients
  • Scikit-learn logistic regression p value
  • Scikit-learn logistic regression feature importance
  • Scikit-learn logistic regression categorical variables
  • Scikit-learn logistic regression cross-validation
  • Scikit-learn logistic regression threshold

In this section, we will learn how to work with logistic regression in scikit-learn.

  • Logistic regression is a statistical method for predicting binary classes; it is used when the dependent variable is dichotomous.
  • Dichotomous means there are two possible classes, such as the binary classes 0 and 1.
  • Despite its name, logistic regression is used for classification. It computes the probability that an event occurs and assigns a class based on that probability.
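To make the probability computation concrete, here is a small sketch of the logistic (sigmoid) function applied to a linear score. The intercept and coefficient are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficient, chosen only for illustration
intercept, coef = -1.0, 0.8
x = np.array([0.0, 2.0, 5.0])

probabilities = sigmoid(intercept + coef * x)
labels = (probabilities >= 0.5).astype(int)  # class 1 when the probability reaches 0.5
print(probabilities)
print(labels)
```

The first observation gets a probability below 0.5 and is labeled 0; the other two are labeled 1.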

Code:

In this code, we load the digits dataset with load_digits from the sklearn library. The data ships with sklearn, so we do not need to upload anything.

from sklearn.datasets import load_digits
digits = load_digits()

With the data loaded, the commands below show that there are 1797 images and 1797 labels in the dataset.

print('Image Data Shape', digits.data.shape)

print('Label Data Shape', digits.target.shape)

In the following output, we can see that the Image Data Shape and Label Data Shape values are printed on the screen.

Importing dataset value

In this part, we look at some of the images and their labels to get a feel for the data.

  • plot.figure(figsize=(30,4)) creates the figure and sets its size.
  • for index, (image, label) in enumerate(zip(digits.data[5:10], digits.target[5:10])): loops over five images together with their labels.
  • plot.subplot(1, 5, index + 1) selects the subplot for the current image.
  • plot.imshow(np.reshape(image, (8,8)), cmap=plot.cm.gray) reshapes the flat 64-value row into an 8×8 image and displays it in grayscale.
  • plot.title('Set: %i\n' % label, fontsize = 30) sets the label as the title of the image.

import numpy as np
import matplotlib.pyplot as plot

plot.figure(figsize=(30, 4))
for index, (image, label) in enumerate(zip(digits.data[5:10], digits.target[5:10])):
    plot.subplot(1, 5, index + 1)
    # pyplot is imported as plot, so the colormap is plot.cm.gray
    plot.imshow(np.reshape(image, (8, 8)), cmap=plot.cm.gray)
    plot.title('Set: %i\n' % label, fontsize=30)
plot.show()

After running the above code, we get the following output, in which the images are plotted with the titles Set: 5, Set: 6, Set: 7, Set: 8, Set: 9.

Enumerate digits target set

In the following code, we split our data into two sets: training data and testing data.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)

Here we import LogisticRegression from sklearn, which lets us focus on modeling the dataset.

from sklearn.linear_model import LogisticRegression

In the code below, we make an instance of the model. All parameters that are not specified are set to their defaults.

logisticRegression = LogisticRegression()

Above, we split the data into training and testing sets. We now train the model on the training data; afterwards we will test it on the testing data.

logisticRegression.fit(x_train, y_train)

Once trained, the model can predict the label for a single observation; the prediction is returned as an array.

logisticRegression.predict(x_test[0].reshape(1, -1))

In the following output, we see that a NumPy array is returned after predicting for one observation.

Return the array

With the code below, we can predict multiple observations at once.

logisticRegression.predict(x_test[0:10])

With this code, we can predict for the entire test set.

logisticRegression.predict(x_test)

After training, we want to know how well the model performs. We can measure its accuracy on the test data with the score method.

predictions = logisticRegression.predict(x_test)
score = logisticRegression.score(x_test, y_test)
print(score)

In this output, we can see the accuracy of the model returned by the score method.

Predict the accuracy of a model
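For a classifier, the score method reports accuracy, that is, the fraction of test labels predicted correctly. The sketch below checks this by computing the same number by hand. It repeats the split used above so that it runs on its own; raising max_iter is an assumption added here to help the solver converge and is not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# max_iter is raised here (an assumption) to help the solver converge
model = LogisticRegression(max_iter=5000).fit(x_train, y_train)

predictions = model.predict(x_test)
manual_accuracy = np.mean(predictions == y_test)  # fraction of correct labels
print(manual_accuracy, model.score(x_test, y_test))
```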

Also, check: Scikit learn Decision Tree

Scikit-learn logistic regression standard errors

As we know, logistic regression is a statistical method for predicting binary classes, used when the dependent variable is dichotomous.

Here we work on the standard error. The standard errors of the model coefficients are the square roots of the diagonal entries of the covariance matrix.

Code:

The following code uses mean_squared_error from sklearn.metrics. Note that it measures the error of predictions, which is a different quantity from the standard errors of the logistic regression coefficients defined above.

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [4, -0.6, 3, 8]
y_pred = [3.5, 0.1, 3, 9]
print(mean_squared_error(y_true, y_pred))           # about 0.435
# Root mean squared error via np.sqrt, which does not rely on the `squared` argument
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # about 0.660

y_true = [[0.6, 2], [-2, 2], [8, -7]]
y_pred = [[1, 3], [-1, 3], [7, -6]]
print(mean_squared_error(y_true, y_pred))                            # about 0.86
print(mean_squared_error(y_true, y_pred, multioutput='raw_values'))  # about [0.72, 1.0]
print(mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7]))    # about 0.916

Output:

After running the above code we get the following output in which we can see that the error value is generated and seen on the screen.

Scikit-learn logistic regression standard error
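The definition given in this section (square roots of the diagonal entries of the covariance matrix) can be sketched for a fitted LogisticRegression as follows. This is a minimal illustration under stated assumptions: the data is simulated, the target is binary, and a very large C is used so that the default L2 penalty is negligible, since the formula applies to unpenalized maximum likelihood estimates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
logits = 0.5 + X @ np.array([1.0, -1.0])   # illustrative true model
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# A very large C makes the default L2 penalty negligible
model = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)

X_design = np.column_stack([np.ones(n), X])  # intercept column first
p = model.predict_proba(X)[:, 1]
W = p * (1 - p)                              # per-observation weights

# Fisher information X^T W X; its inverse estimates the covariance matrix
fisher_info = X_design.T @ (X_design * W[:, None])
cov = np.linalg.inv(fisher_info)
std_errors = np.sqrt(np.diag(cov))
print(std_errors)                            # intercept first, then coefficients
```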

Read: Scikit learn Random Forest

Scikit-learn logistic regression coefficients

In this section, we will learn how to work with logistic regression coefficients in scikit-learn.

A coefficient is the number by which the value of a feature is multiplied in the model. In logistic regression, each coefficient expresses the size and direction of a feature's effect on the log-odds of the outcome.

Code:

In the following code, we import the libraries pandas as pd, numpy as np, and sklearn as sl.

  • The pandas library is used for data manipulation and numpy is used for working with arrays.
  • The sklearn library is used for modeling the data rather than manipulating it.
  • x = np.random.randint(0, 7, size=n) generates n random integers from 0 to 6.
  • res_sd = sd.Logit(y, x).fit(method="ncg", maxiter=max_iter) fits a logit model with statsmodels using the Newton conjugate gradient method.
  • print(res_sl.coef_) prints the sklearn coefficients on the screen.

import pandas as pd
import numpy as np
import sklearn as sl
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sd

n = 250

x = np.random.randint(0, 7, size=n)
y = (x > (0.10 + np.random.normal(0, 0.10, n))).astype(int)

# print works outside notebooks, where display is not defined
print(pd.crosstab(y, x))

max_iter = 150

# statsmodels fit (no intercept column is added here)
res_sd = sd.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sd.params)

# sklearn fit; a large C makes the penalty negligible.
# multi_class is omitted: the target is binary, and recent scikit-learn releases deprecate the argument.
res_sl = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, C=1e8)
res_sl.fit(x.reshape(n, 1), y)
print(res_sl.coef_)

Output:

After running the above code we get the following output in which we can see that the scikit learn logistic regression coefficient is printed on the screen.

scikit learn logistic regression coefficient

Read: Scikit learn Feature Selection

Scikit-learn logistic regression p value

In this section, we will learn how to calculate the p-values of a logistic regression in scikit-learn.

The p-value of a logistic regression coefficient is used to test the null hypothesis that the coefficient is equal to zero. A p-value below 0.05 is conventionally taken to indicate that you can reject the null hypothesis.
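As a small worked example of this test, the sketch below converts a coefficient and its standard error into a Wald z-statistic and a two-sided p-value. The two numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from scipy.stats import norm

# Hypothetical coefficient estimate and standard error
coef, std_error = 0.9, 0.45

z = coef / std_error                  # Wald z-statistic
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
print(round(z, 2), round(p_value, 4))
```

With z equal to 2, the p-value is about 0.0455, just under the usual 0.05 cutoff.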

Code:

In the following code, we import numpy as np for working with arrays, norm from scipy.stats, and LogisticRegression from sklearn.

  • First we calculate the z-score of each coefficient of the scikit-learn logistic regression, then convert it to a p-value.
  • def logit_pvalue(model, x): takes two parameters, model and x.
  • model: a fitted sklearn.linear_model.LogisticRegression with an intercept and a large C.
  • x: the matrix on which the model was fit.
  • model = LogisticRegression(C=1e30).fit(x, y) fits the model with an effectively unpenalized likelihood.
  • print(logit_pvalue(model, x)) prints the computed p-values on the screen.
  • sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0) fits the same model with statsmodels so the p-values can be compared.

import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """Two-sided p-values for a fitted binary LogisticRegression, intercept first."""
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    # Design matrix with a leading column of ones for the intercept
    x_full = np.insert(np.array(x), 0, 1, axis=1)
    # Accumulate the Fisher information: sum of x_i x_i^T * p_i * (1 - p_i)
    answ = np.zeros((m, m))
    for i in range(n):
        answ = answ + np.outer(x_full[i], x_full[i]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(answ)
    se = np.sqrt(np.diag(vcov))
    z = coefs / se
    return (1 - norm.cdf(abs(z))) * 2

x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))

# Compare with the p-values reported by statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
print(sm_model.summary())

Output:

After running the above code we get the following output in which we can see that logistic regression p-value is created on the screen.

scikit learn logistic regression p value

Scikit-learn logistic regression feature importance

In this section, we will learn about the feature importance of logistic regression in scikit learn.

Feature importance is a method that assigns a score to each input feature based on how helpful it is in predicting the target variable.

Code:

In the following code, we import LogisticRegression from sklearn.linear_model and pyplot for plotting the graph on the screen.

  • x, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1) defines the dataset.
  • model = LogisticRegression() defines the model.
  • model.fit(x, y) fits the model.
  • importance = model.coef_[0] takes the fitted coefficients as the importance of each feature.
  • pyplot.bar([X for X in range(len(importance))], importance) plots the feature importance.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

x, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)

model = LogisticRegression()

model.fit(x, y)

# Use one variable name throughout; the coefficients serve as importance scores
importance = model.coef_[0]

for i, j in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, j))

pyplot.bar([X for X in range(len(importance))], importance)
pyplot.show()

Output:

After running the above code we get the following output in which we can see that logistic regression feature importance is shown on the screen.

scikit learn logistic regression feature importance
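Raw coefficients are only comparable as importance scores when the features share a common scale. One possible variation, sketched below, standardizes the features in a pipeline before fitting and then ranks them by absolute coefficient. The use of StandardScaler and the ranking step are additions for illustration, not part of the example above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)

# Scale every feature to zero mean and unit variance before fitting
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(x, y)

coefs = pipeline.named_steps['logisticregression'].coef_[0]
ranking = np.argsort(-np.abs(coefs))  # largest absolute coefficient first
for i in ranking:
    print('Feature: %d, Score: %.5f' % (i, coefs[i]))
```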

Also, read: Scikit-learn Vs Tensorflow – Detailed Comparison

Scikit-learn logistic regression categorical variables

In this section, we will learn about handling categorical variables for logistic regression in scikit-learn.

As the name suggests, a categorical variable divides the data into categories: it assigns each observation to a particular group on the basis of some qualitative property.

Code:

In the following code, we import some libraries: pandas as pd, numpy as np, and copy. Pandas is used for manipulating and analyzing the data, and NumPy is used for working with multidimensional arrays.

import pandas as pd
import numpy as np
import copy
%matplotlib inline

Here we load the CSV data file into a data frame.

df_data.head() shows the first five rows of the data in the file.

df_data = pd.read_csv('data.csv')

df_data.head()

In the following output, we can see the first five rows of the dataset.

scikit learn logistic regression categorical variable data

print(df_data.info()) is used for printing the data information on the screen.

print(df_data.info())

Printing the data info

A boxplot is produced to display a summary of the distribution of dep_time for each origin.

df_data.boxplot('dep_time','origin',rot = 30,figsize=(5,6))

Boxplot

Here select_dtypes keeps only the object (text) columns, and the .copy() method is used so that any change made to the new data frame does not affect the original data.

cat_df_data = df_data.select_dtypes(include=['object']).copy()

.head() is used to inspect the first rows of the filtered columns.

cat_df_data.head()

Filter the columns

Here we use this command to check for null values in the dataset. It gives the total number of missing values.

print(cat_df_data.isnull().values.sum())

Missing values

This checks the column-wise distribution of the null value.

print(cat_df_data.isnull().sum())

column-wise distribution

.value_counts() returns the frequency of each category, sorted from most to least frequent. Here every missing value is filled with the most frequent value of the tailnum column.

cat_df_data = cat_df_data.fillna(cat_df_data['tailnum'].value_counts().index[0])

Now we can check the null values again; after filling, the count is zero.

print(cat_df_data.isnull().values.sum())

Result of null value

.value_counts() is used here to get the frequency distribution of the categories of the carrier feature.

print(cat_df_data['carrier'].value_counts())

Frequency distribution

This counts the number of distinct categories of the feature.

print(cat_df_data['carrier'].value_counts().count())

Feature of different category

  • sns.barplot(x=carrier_count.index, y=carrier_count.values, alpha=0.9) plots the bar graph.
  • plt.title('Frequency Distribution of Carriers') gives the bar plot its title.
  • plt.ylabel('Number of Occurrences', fontsize=12) labels the y-axis.
  • plt.xlabel('Carrier', fontsize=12) labels the x-axis.

import seaborn as sns
import matplotlib.pyplot as plt
carrier_count = cat_df_data['carrier'].value_counts()
sns.set(style="darkgrid")
# Pass x and y as keywords; recent seaborn versions do not accept them positionally
sns.barplot(x=carrier_count.index, y=carrier_count.values, alpha=0.9)
plt.title('Frequency Distribution of Carriers')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Carrier', fontsize=12)
plt.show()

In this picture, we can see that the bar chart is plotted on the screen.

Bar graph of a categorical variable

  • labels = cat_df_data['carrier'].astype('category').cat.categories.tolist() builds the list of labels for the chart.
  • sizes = [counts[var_cat] for var_cat in labels] gives the size of each slice of the pie chart.
  • fig1, ax1 = plt.subplots() creates the figure and axes for the chart.

labels = cat_df_data['carrier'].astype('category').cat.categories.tolist()
counts = cat_df_data['carrier'].value_counts()
sizes = [counts[var_cat] for var_cat in labels]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True)  # autopct shows the percentage on each slice
ax1.axis('equal')
plt.show()

In the following output, we can see that a pie chart is plotted on the screen in which the values are divided into categories.

Plotting the pie chart
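Before a categorical column can be passed to LogisticRegression, it has to be turned into numbers. One common option is one-hot encoding, sketched below with pandas get_dummies. The small table, including the binary delayed target, is made up for illustration; only the carrier and origin column names echo the data used above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Small made-up table; 'delayed' is a hypothetical binary target
df = pd.DataFrame({
    'carrier': ['AA', 'UA', 'DL', 'AA', 'UA', 'DL', 'AA', 'UA'],
    'origin':  ['JFK', 'EWR', 'LGA', 'JFK', 'LGA', 'EWR', 'EWR', 'JFK'],
    'delayed': [1, 0, 0, 1, 0, 1, 1, 0],
})

# One-hot encode the categorical columns; drop_first removes one redundant column per variable
X = pd.get_dummies(df[['carrier', 'origin']], drop_first=True, dtype=float)
y = df['delayed']

model = LogisticRegression().fit(X, y)
print(X.columns.tolist())
print(model.coef_)
```

Each remaining column gets its own coefficient, interpreted relative to the dropped category of that variable.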

Read: Scikit learn Sentiment Analysis

Scikit-learn logistic regression cross-validation

In this section, we will learn about logistic regression cross-validation in scikit learn.

  • As we know, the scikit-learn library focuses on modeling data rather than loading or manipulating it.
  • Here we use scikit-learn to compute the result of logistic regression cross-validation.
  • Cross-validation is a method that trains and tests the model on different portions of the data across several iterations.

Code:

In the following code, we import the functions needed to estimate the accuracy of logistic regression with cross-validation.

  • x, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) creates the dataset.
  • CV = KFold(n_splits=10, random_state=1, shuffle=True) prepares the cross-validation procedure.
  • model = LogisticRegression() creates the model.
  • score = cross_val_score(model, x, y, scoring='accuracy', cv=CV, n_jobs=-1) evaluates the model on each fold.
  • print('Accuracy: %.3f (%.3f)' % (mean(score), std(score))) reports the mean and standard deviation of the fold accuracies.

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression


x, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)

CV = KFold(n_splits=10, random_state=1, shuffle=True)


model = LogisticRegression()


score = cross_val_score(model, x, y, scoring='accuracy', cv=CV, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (mean(score), std(score)))

Output:

After running the above code we get the following output in which we can see that the accuracy of cross-validation is shown on the screen.

scikit learn logistic regression cross-validation
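To show what cross_val_score does on each iteration, the sketch below runs the folds by hand and compares the result with the library call. A smaller dataset, 5 folds, and a raised max_iter are assumptions made here to keep the example quick; they differ from the settings used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

x, y = make_classification(n_samples=200, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
cv = KFold(n_splits=5, random_state=1, shuffle=True)

manual_scores = []
for train_idx, test_idx in cv.split(x):
    # Fit on the training folds, then score on the held-out fold
    fold_model = LogisticRegression(max_iter=1000).fit(x[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(x[test_idx], y[test_idx]))

library_scores = cross_val_score(LogisticRegression(max_iter=1000), x, y,
                                 scoring='accuracy', cv=cv)
print(np.mean(manual_scores), np.mean(library_scores))
```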

Scikit-learn logistic regression threshold

In this section, we will learn how to get the logistic regression threshold value in scikit-learn.

  • As we know, logistic regression is a statistical method for predicting binary classes. Binary classes are defined as 0 or 1, or equivalently false or true.
  • Logistic regression assigns each row a probability of being true and makes a prediction from it: if the probability is less than 0.5, the predicted value is 0.
  • The default value of the threshold is 0.5.

Code:

In the following code, we import the functions needed to find a threshold for logistic regression. The default threshold is 0.5: if the predicted probability is less than 0.5, the prediction is 0. Here we pick the threshold that maximizes tpr - fpr on the ROC curve.

  • X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4) generates the dataset.
  • trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y) splits the data into train and test sets.
  • model.fit(trainX, trainy) fits the model.
  • yhat = model.predict_proba(testX) predicts the class probabilities.
  • yhat = yhat[:, 1] keeps the probability of the positive outcome only.
  • fpr, tpr, thresholds = roc_curve(testy, yhat) calculates the ROC curve.

from numpy import argmax
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)

trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# The fitted object must have the same name that is used for prediction below
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)

yhat = model.predict_proba(testX)

yhat = yhat[:, 1]

fpr, tpr, thresholds = roc_curve(testy, yhat)
Jt = tpr - fpr
ix = argmax(Jt)
best_threshold = thresholds[ix]
print('Best Threshold=%f' % (best_threshold))

Output:

After running the above code we get the following output in which we can see the value of the threshold is printed on the screen.

scikit learn logistic regression threshold
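Once a threshold has been chosen, it can be applied to the predicted probabilities directly. The sketch below uses a hypothetical helper, predict_with_threshold, and an arbitrary example threshold of 0.1; neither is part of scikit-learn or of the code above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)
model = LogisticRegression().fit(X, y)
positive_proba = model.predict_proba(X)[:, 1]

def predict_with_threshold(proba, threshold):
    """Label an observation 1 when its positive-class probability reaches the threshold."""
    return (proba >= threshold).astype(int)

default_labels = predict_with_threshold(positive_proba, 0.5)
lower_labels = predict_with_threshold(positive_proba, 0.1)  # 0.1 is an arbitrary example
print(default_labels.sum(), lower_labels.sum())
```

Lowering the threshold can only increase the number of observations labeled positive, which matters for imbalanced data like this.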

So, in this tutorial, we discussed scikit learn logistic regression and we have also covered different examples related to its implementation. Here is the list of examples that we have covered.

  • Scikit-learn logistic regression
  • Scikit-learn logistic regression standard errors
  • Scikit-learn logistic regression coefficients
  • Scikit-learn logistic regression p value
  • Scikit-learn logistic regression feature importance
  • Scikit-learn logistic regression categorical variables
  • Scikit-learn logistic regression cross-validation
  • Scikit-learn logistic regression threshold


What is p

A p-value is the probability of obtaining results at least as extreme as those observed in a statistical hypothesis test, assuming the null hypothesis is correct. It is often used as an alternative to rejection points, since it gives the smallest level of significance at which the null hypothesis would be rejected.

Does logistic regression have P

For binary logistic regression, the format of the data affects the p-value because it changes the number of trials per row. Deviance: The p-value for the deviance test tends to be lower for data that are in the Binary Response/Frequency format compared to data in the Event/Trial format.

How P

For simple regression, the p-value is determined using a t distribution with n − 2 degrees of freedom (df), written as $t_{n-2}$, and is calculated as 2 × the area past |t| under the $t_{n-2}$ curve. In this example, df = 30 − 2 = 28.
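This calculation can be sketched with scipy as follows. The sample size of 30 comes from the text; the t-statistic of 2.048 is a hypothetical value chosen for illustration.

```python
from scipy import stats

n = 30
df = n - 2       # 28 degrees of freedom, as in the text
t_stat = 2.048   # hypothetical t-statistic for the slope

# Two-sided p-value: twice the area beyond |t| under the t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(round(p_value, 3))
```

With 28 degrees of freedom, a t-statistic of about 2.048 sits right at the 0.05 boundary.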

How do you determine significant variables in regression Python?

So, finding the p-value for each coefficient tells us whether the variable is statistically significant for predicting the target. As a general rule of thumb, if the p-value is less than 0.05, the relationship between the variable and the target is considered statistically significant, which is not the same as a strong effect.