How do I fit a GLM model in Python?
In ordinary linear regression, we treat our outcome variable as a linear combination of several input variables plus some random noise, typically assumed to be Normally distributed. While this modeling approach is easily interpreted, efficiently implemented, and capable of accurately capturing many linear relationships, it does come with several significant limitations. A Generalized Linear Model (GLM) is a simple extension of the linear model that relaxes some of linear regression's strictest assumptions. These assumptions include:

- The outcome variable is Normally distributed.
- The mean of the outcome is a linear function of the inputs.
- The variance of the outcome is constant across observations.
Relaxing these assumptions allows us to fit much more flexible models to much broader data types. GLMs can be easily fit with a few lines of code in languages like R or Python, but to understand how a model works, it's always helpful to get under the hood and code it up yourself. This article shows how to implement GLMs from scratch using only Python's NumPy package. For more on the basics and intuition behind GLMs, check out this article or this book.

GLM Structure

Fitting a GLM first requires specifying two components: a random distribution for our outcome variable and a link function between the distribution's mean parameter and its "linear predictor".

The Random Component

The first step to building our GLM is identifying the distribution of the outcome variable. If the data has a binary response, we might want to use the Bernoulli or Binomial distributions. If we are working with count data, a Poisson model might be more useful. The distribution is typically assumed to come from the Exponential Family of distributions, which includes the Binomial, Poisson, Negative Binomial, Gamma, and Normal. Each of these models can be expressed in terms of its mean parameter, 𝜇 = E(Y). For instance, we specify a binomial model as Y ~ Bin(n, p), which can also be written as Y ~ Bin(n, 𝜇/n). Once we estimate 𝜇, we model Y as coming from a distribution indexed by 𝜇̂, and our predicted value of Y is simply 𝜇̂.

The Link Function

In a GLM, we estimate 𝜇 as a non-linear function of a "linear predictor" 𝜂, which itself is a linear function of the data. The non-linear function connecting 𝜇 to 𝜂 is called the link function, and we determine it before model-fitting. The link function is written as a function of 𝜇, e.g. 𝜂 = g(𝜇).
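To make the link-function idea concrete, here is a small sketch of two common links and their inverses. The function names are my own choices for illustration, not from any particular library:

```python
import numpy as np

def logit(mu):
    """Log-odds link: maps a mean in (0, 1) to the whole real line."""
    return np.log(mu / (1 - mu))

def sigmoid(eta):
    """Inverse of the logit link: maps a real eta back to (0, 1)."""
    return 1 / (1 + np.exp(-eta))

def log_link(lam):
    """Log link: maps a positive rate to the whole real line."""
    return np.log(lam)

# Applying a link and then its inverse recovers the original mean.
eta = logit(0.8)
mu = sigmoid(eta)   # back to 0.8
```

The identity link g(𝜇) = 𝜇, used below for linear regression, needs no code at all.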
Suppose we have the following training data, where each xₙ is a D-dimensional vector:

(x₁, y₁), (x₂, y₂), … , (x_N, y_N)

We first write 𝜂 as a linear function of x for each observation n = 1, … , N:

𝜂ₙ = βᵀxₙ

Then we connect 𝜂 to 𝜇 with the inverse of the link function:

𝜇ₙ = g⁻¹(𝜂ₙ)

Fitting a GLM

To fit the GLM, we are actually just finding estimates for the βs: from these, we obtain estimates of 𝜂, which leads immediately to an estimate for 𝜇, which then gives us an estimated distribution for Y! To estimate the βs, follow these steps:

1. Specify the distribution of Y as a function of 𝜇.
2. Specify the link function, 𝜂 = g(𝜇).
3. Identify a loss function. Here, we use the negative log-likelihood.
4. Find the β values that minimize the loss function, either through a closed-form solution or with gradient descent.
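The prediction side of this recipe is just two lines of linear algebra. Here is a minimal sketch, assuming a coefficient vector `beta` is already in hand and using the inverse log link as an arbitrary example (all sizes and values below are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))          # N observations, D features
beta = np.array([0.5, -1.0, 2.0])    # hypothetical fitted coefficients

eta = X @ beta                       # linear predictor, one value per observation
mu = np.exp(eta)                     # inverse log link, e.g. for a Poisson model
```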
Linear Regression — A Special Case of the GLM

To reinforce our understanding of this structure, let's first write out a typical linear regression model in GLM format. As step 1, let's specify the distribution of Y. Recall that a typical linear model assumes

Yₙ ~ N(βᵀxₙ, σ²),

where β is a length-D vector of coefficients (this assumes we've added a 1 to each x so the first element in β is the intercept term). Note that the mean of this distribution is a linear combination of the data, meaning we could write this model in terms of our linear predictor by letting

𝜂ₙ = βᵀxₙ.

Then for step 2, we need to find the function linking 𝜇 and 𝜂. In the case of linear regression, it's simple. Since E(Y) = 𝜇 and the mean of our modeled Y is 𝜂, we have 𝜂 = g(𝜇) = 𝜇 — the identity link! Step 3: let's find the negative log-likelihood,

-log L(β) = (1/2σ²) Σₙ (yₙ - βᵀxₙ)² + const.

Note that our loss function is proportional to the sum of the squared errors. Finally, for step 4, let's see if we can minimize this loss function analytically. Taking the derivative with respect to β and setting it equal to 0 gives the closed-form solution we know and love from ordinary linear regression:

β̂ = (XᵀX)⁻¹Xᵀy.

Now for the simple coding. Let's randomly generate some normally-distributed Y values and fit the model; plotting the fitted values for β against the true values shows they are quite close.

Logistic Regression — A GLM for Binary Data

In logistic regression, we model our outputs as independent Bernoulli trials, i.e. we assume

Yₙ ~ Bernoulli(pₙ).

That completes step 1. For step 2, we must find a way to relate our linear predictor 𝜂 to our parameter p. Since p is between 0 and 1 and 𝜂 can be any real number, a natural choice is the log-odds, i.e.

𝜂 = g(p) = log(p / (1 - p)).

Inversely, we use the sigmoid function (which I will call S) to get from 𝜂 to p:

p = S(𝜂) = 1 / (1 + e⁻𝜂).

This wraps up step 2. For step 3, find the negative log-likelihood:

-log L(β) = -Σₙ [ yₙ log S(𝜂ₙ) + (1 - yₙ) log(1 - S(𝜂ₙ)) ].

This gives us our loss function and finishes step 3. For step 4, we find the values of β that minimize this loss.
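Before turning to the logistic optimization, here is a sketch reconstructing the linear-regression fit described earlier: simulate Normal data, then apply the closed-form solution. The data sizes and seed are my own choices:

```python
import numpy as np

np.random.seed(123)

## GENERATE Xs AND Ys ##
N, D = 1000, 10
X = np.random.normal(size=(N, D))
X[:, 0] = 1                                   # column of 1s for the intercept
beta_true = np.random.normal(size=D)
y = X @ beta_true + np.random.normal(size=N)  # add standard Normal noise

## FIT THE MODEL ##
# Closed-form solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
```

In production code, `np.linalg.lstsq` is preferred over forming the inverse explicitly, but the line above mirrors the derivation.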
Unfortunately, in the logistic regression case, there is no closed-form solution, so we must use gradient descent. So, let's find the derivative of the loss function with respect to β. First, note that the sigmoid satisfies S′(x) = S(x)(1 - S(x)). Using this, we can derive the entire gradient:

∇β = Σₙ (S(𝜂ₙ) - yₙ) xₙ.

To speed up calculations in Python, we can also write this in matrix form as

∇β = Xᵀ(S(Xβ) - y),

where X is the N × D matrix of inputs and S is applied element-wise. Now let's fit the model using gradient descent; again, plotting the fitted values for β against the true values shows they are quite close.

Poisson Regression — A GLM for Count Data

The Poisson is a great way to model data that occurs in counts, such as accidents on a highway or deaths by horse kick. Step 1: suppose we have

Yₙ ~ Poisson(λₙ).

Step 2: we specify the link function. The link function must convert a non-negative rate parameter λ to the linear predictor 𝜂 ∈ ℝ. A common choice is the log link,

𝜂 = g(λ) = log(λ),

which of course has inverse λ = e^𝜂. Now for step 3, find the negative log-likelihood:

-log L(β) = Σₙ [ λₙ - yₙ log(λₙ) ] + const = Σₙ [ e^𝜂ₙ - yₙ 𝜂ₙ ] + const.

And step 4: optimize for β. Sadly, there is no closed-form solution here either, so again we turn to gradient descent, using the gradient

∇β = Xᵀ(e^(Xβ) - y).

Once again, plotting the estimated parameters against the true parameters shows the model does pretty well. When building GLMs in practice, R's glm function and Python's statsmodels package make these models quick and reliable to fit, but implementing them from scratch is the best way to understand what those tools are doing.
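Here is a sketch of the logistic-regression gradient-descent fit described above. The data sizes, true coefficients, learning rate, and iteration count are my own choices:

```python
import numpy as np

np.random.seed(123)

## GENERATE Ys ##
# Generate a Bernoulli response as a function of X and beta.
N, D = 1000, 5
X = np.random.normal(size=(N, D))
X[:, 0] = 1                                    # intercept column
beta_true = np.array([0.5, -1.0, 0.25, 2.0, -0.3])

def sigmoid(eta):
    return 1 / (1 + np.exp(-eta))

y = np.random.binomial(1, sigmoid(X @ beta_true))

## FIT WITH GRADIENT DESCENT ##
beta_hat = np.zeros(D)
learning_rate = 0.001
for _ in range(10_000):
    # Gradient of the negative log-likelihood: X^T (S(X beta) - y)
    grad = X.T @ (sigmoid(X @ beta_hat) - y)
    beta_hat -= learning_rate * grad
```

A fixed learning rate is the simplest choice; in practice, Newton's method (iteratively reweighted least squares) converges much faster for logistic regression.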
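Here is the corresponding sketch of the Poisson-regression gradient-descent fit described above. Again, the sizes, coefficients, and learning rate are my own choices; the true coefficients are kept small so the simulated rates stay moderate:

```python
import numpy as np

np.random.seed(123)

## GENERATE Ys ##
# Generate a Poisson response as a function of X and beta.
N, D = 1000, 5
X = np.random.normal(size=(N, D))
X[:, 0] = 1                                    # intercept column
beta_true = np.array([0.2, -0.5, 0.1, 0.3, -0.2])

lam = np.exp(X @ beta_true)                    # inverse log link gives the rate
y = np.random.poisson(lam)

## FIT WITH GRADIENT DESCENT ##
beta_hat = np.zeros(D)
learning_rate = 1e-4
for _ in range(20_000):
    # Gradient of the negative log-likelihood: X^T (exp(X beta) - y)
    grad = X.T @ (np.exp(X @ beta_hat) - y)
    beta_hat -= learning_rate * grad
```

Note the smaller learning rate than in the logistic case: the exponential inverse link makes the gradient blow up quickly if the steps are too large.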