A cool property of ML is that we can interpret the logits, outputs of the model, as anything we want. This leads to all sorts of rich interpretations.

Well, almost anything. The interpretations of logits don’t actually come from thin air, but are derived from a few simple assumptions about the assumed distribution of the output and a linear relationship. So our interpretation of logits, and our subsequent choice of estimator function, actually make sense!

Here, we’ll derive the equations for generalized linear models (GLMs), which form the foundation of classical ML/supervised learning.

Exponential family

To derive the family of GLMs, we first need to define the exponential family. A distribution is part of the exponential family if it can be written in the following form:

\[p(y;\eta)=b(y)\exp(\eta^{T}T(y)-a(\eta))\]

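This equation makes sense if you think about it in terms of the Gaussian/Normal distribution (verify for yourself that it looks similar). \(\eta\) is the natural parameter, \(T(y)\) is the sufficient statistic (normally just \(y\)), and \(b(y)\) is a term that doesn’t depend on \(\eta\). \(a(\eta)\) is the log-partition function: the factor \(e^{-a(\eta)}\) normalizes the distribution so the area under the curve is 1. For a unit-variance Gaussian, \(\eta=\mu\), \(a(\eta)=\frac{\eta^{2}}{2}\), and the familiar \(\frac{1}{\sqrt{ 2\pi }}\) lives in \(b(y)=\frac{1}{\sqrt{ 2\pi }}e^{-y^{2}/2}\).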
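To make the "verify for yourself" step concrete, here is a minimal sketch that checks numerically that a unit-variance Gaussian fits the exponential family form. The function names are made up for illustration, and the choices \(b(y)=\frac{1}{\sqrt{2\pi}}e^{-y^{2}/2}\), \(T(y)=y\), \(a(\eta)=\frac{\eta^{2}}{2}\) assume the variance is fixed at 1.

```python
import math

def gaussian_pdf(y, mu):
    """Textbook density of N(mu, 1)."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def gaussian_exp_family(y, eta):
    """The same density written as b(y) * exp(eta * T(y) - a(eta))."""
    b = math.exp(-0.5 * y ** 2) / math.sqrt(2 * math.pi)  # b(y): no eta in it
    T = y                                                  # sufficient statistic
    a = 0.5 * eta ** 2                                     # log-partition a(eta)
    return b * math.exp(eta * T - a)

# For the unit-variance Gaussian the natural parameter is the mean: eta = mu.
for y, mu in [(0.0, 0.0), (1.3, -0.7), (-2.0, 2.5)]:
    assert math.isclose(gaussian_pdf(y, mu), gaussian_exp_family(y, eta=mu))
```

The two functions agree because expanding \(-\frac{1}{2}(y-\mu)^{2}\) gives \(-\frac{1}{2}y^{2}+\mu y-\frac{1}{2}\mu^{2}\), which splits cleanly into the \(b(y)\), \(\eta T(y)\), and \(a(\eta)\) pieces.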

Constructing GLMs

Assume a distribution for the output, then ask: what is the estimator? Give me any exponential family distribution (ex. Gaussian) and I should be able to give you the respective model which operates under this assumption. A GLM rests on two assumptions:

Distribution: \(y\vert x;\theta\), the conditional distribution of \(y\) given \(x\), is part of the exponential family (ex. Gaussian)
Linear relationship: The natural parameter \(\eta\) is linearly related to \(x\), \(\eta=\theta^{\top}x\)

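Recall that in regression we want to predict the mean \(\mu\) (more precisely \(E[y\vert x]\)). In classification we want to minimize the negative log-likelihood (NLL), which is equivalent to maximum likelihood estimation (MLE). In both cases the model outputs \(\mu\) as a function of \(\eta\), and we can find that function in three steps: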

Write the probability distribution in exponential family form \(p(y;\eta)\)
Find \(\eta\) (and optionally other parameters) in terms of \(\mu\), the mean of the probability distribution
Invert to solve for \(\mu\) in terms of \(\eta\). This is your canonical response function
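As a worked sketch of these three steps, take the Bernoulli distribution with mean \(\phi\) (so \(\mu=\phi\)). First, write it in exponential family form:

\[p(y;\phi)=\phi^{y}(1-\phi)^{1-y}=\exp\left(y\log\frac{\phi}{1-\phi}+\log(1-\phi)\right)\]

Matching terms gives \(b(y)=1\), \(T(y)=y\), and \(\eta=\log\frac{\phi}{1-\phi}\), with \(a(\eta)=-\log(1-\phi)=\log(1+e^{\eta})\). Inverting the expression for \(\eta\) gives the canonical response function:

\[\phi=\frac{1}{1+e^{-\eta}}\]

which is the sigmoid.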

For example, say I give you a dataset X and want you to predict the expected count y (e.g. the features \(x\) are weather/day of week…, y is how many customers you’ll see. Lots of event counts over a fixed interval are well modeled as Poisson!). The Poisson distribution is \(p(y; \lambda) = \frac{e^{-\lambda}\lambda^y}{y!}\). Its natural parameter is \(\eta=\log(\lambda)\), so the mean we are predicting is \(\lambda=e^\eta\) (verify this by putting it in exponential family form). So given \(x\), you output the expected count \(\hat{y}=e^{\theta^{\top}x}\) and optimize for \(\theta\). We’ve just discovered Poisson regression.

This has a nice interpretation. Recall our familiar softmax \(\frac{e^{z_{i}}}{\sum_{j}e^{z_{j}}}\): if we read each \(e^{z_{i}}\) as a count, the softmax says probability of class \(i\) = counts for class \(i\) / total counts. Poisson regression is just the numerator of the softmax, which is the expected count!
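Here is a minimal sketch of what "optimize for \(\theta\)" could look like, using plain gradient ascent on the Poisson log-likelihood. The gradient of the log-likelihood with respect to \(\theta\) is \(\sum_{i}(y_{i}-e^{\theta^{\top}x_{i}})x_{i}\). The function names, the toy data, the learning rate, and the step count are all assumptions made for illustration, not a recommended fitting procedure.

```python
import math

def fit_poisson_regression(X, y, lr=0.1, steps=2000):
    """Gradient ascent on the Poisson log-likelihood with eta = theta^T x."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x_i, y_i in zip(X, y):
            eta = sum(t * v for t, v in zip(theta, x_i))
            mu = math.exp(eta)  # canonical response: predicted mean count
            for j in range(d):
                # averaged gradient of the log-likelihood: (y - mu) * x
                grad[j] += (y_i - mu) * x_i[j] / n
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

def predict_count(theta, x):
    """Expected count e^(theta^T x) for one feature vector."""
    return math.exp(sum(t * v for t, v in zip(theta, x)))

# Hypothetical toy data: columns are [intercept, is_weekend], y is customers seen.
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
y = [2, 4, 5, 7]

theta = fit_poisson_regression(X, y)
print(predict_count(theta, [1.0, 0.0]), predict_count(theta, [1.0, 1.0]))
```

With an intercept and a single binary feature, the maximum likelihood fit reproduces the average count in each group, so the two predictions should land very close to 3 and 6.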

Summary

You can do this for each distribution¹ to get the following mapping from distribution to estimator model. Here are a few common ones:

Bernoulli => logistic/sigmoid \(\mu=\phi=\sigma(\eta)=\frac{1}{1+e^{-\eta}}\)
Multinomial => softmax \(\phi_{i}=\frac{e^{\eta_{i}}}{\sum_{j}e^{\eta_{j}}}\)
Gaussian => ordinary least squares \(\mu=\eta\) so predict \(\hat{y}=\theta^{\top}x\)

Note: it is called ordinary least squares (OLS) because maximizing the Gaussian likelihood is the same as minimizing the squared error \((y-\hat{y})^2\)
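The mapping above can be sketched as a small lookup from distribution to canonical response function. The dictionary name `RESPONSE` and its keys are made up for illustration. The Poisson entry comes from the example earlier in the post.

```python
import math

def sigmoid(eta):
    """Bernoulli response: maps a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def softmax(etas):
    """Multinomial response: maps a list of logits to class probabilities."""
    m = max(etas)  # subtract the max so exp() does not overflow
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

def identity(eta):
    """Gaussian response: the mean is the natural parameter itself."""
    return eta

# Hypothetical lookup: assumed distribution -> canonical response function
RESPONSE = {
    "bernoulli": sigmoid,
    "multinomial": softmax,
    "gaussian": identity,
    "poisson": math.exp,
}

# A two-class softmax with logits (eta, 0) reduces to the sigmoid.
assert math.isclose(softmax([0.7, 0.0])[0], sigmoid(0.7))
```

In each case the model computes \(\eta=\theta^{\top}x\) first and then applies the response function for the distribution you assumed.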

GLMs have quite a few other nice properties. The negative log-likelihood is convex in the model parameters \(\theta\) (so any local minimum is the global minimum, and gradient-based fitting has convergence guarantees). The mean and variance are also easy to calculate from derivatives of \(a(\eta)\): with \(T(y)=y\), the mean is \(a'(\eta)\) and the variance is \(a''(\eta)\). This deserves another discussion on its own², but it’s cool to see how a general family of functions with these properties can solve all sorts of problems in ML.
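As a quick sanity check of the mean and variance property, this sketch differentiates \(a(\eta)\) numerically for two distributions and compares against the known answers. It assumes \(a(\eta)=\log(1+e^{\eta})\) for the Bernoulli and \(a(\eta)=e^{\eta}\) for the Poisson. The helper names and step sizes are arbitrary choices for illustration.

```python
import math

def derivative(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def a_bernoulli(eta):
    return math.log(1.0 + math.exp(eta))

def a_poisson(eta):
    return math.exp(eta)

eta = 0.4
phi = 1.0 / (1.0 + math.exp(-eta))  # Bernoulli mean
lam = math.exp(eta)                  # Poisson mean (and variance)

# Bernoulli: mean phi, variance phi * (1 - phi)
assert math.isclose(derivative(a_bernoulli, eta), phi, rel_tol=1e-6)
assert math.isclose(second_derivative(a_bernoulli, eta), phi * (1 - phi), rel_tol=1e-4)

# Poisson: mean and variance are both lambda
assert math.isclose(derivative(a_poisson, eta), lam, rel_tol=1e-6)
assert math.isclose(second_derivative(a_poisson, eta), lam, rel_tol=1e-4)
```

The tolerances are loose because finite differences are only approximate; the exact identities are the ones stated in the text.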

https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function ↩
See Andrew Ng’s CS229 lecture notes. ↩