Damjan Vukcevic
University of Melbourne
Transform outcome probabilities onto the log odds scale: \[\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
Such a transformation is called a link function
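As a quick illustration, here is a minimal sketch of the logit link and its inverse in Python (numpy only; the helper names `logit` and `inv_logit` are ours, not from any particular library):

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to the log odds scale."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Inverse link: map a log odds value back to a probability."""
    return 1 / (1 + np.exp(-x))

print(logit(0.5))      # 0.0 -- even odds
print(inv_logit(2.0))  # ~0.88
```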
\(Y_i\) is a binary outcome variable (taking values 0 or 1)
\(X_i\) is a predictor variable
Logistic regression model: \[p_i = \Pr(Y_i = 1 \mid X_i)\] \[\mathrm{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = \mu + \beta X_i\]
The \(Y_i\) are independent, conditional on the \(X_i\)
Let \(\mu = -6\) and \(\beta = 0.05\), \[\mathrm{logit}(p_i) = -6 + 0.05 \, X_i\]
Re-arrange, \[p_i = \frac{e^{-6 + 0.05 X_i}}{1 + e^{-6 + 0.05 X_i}}\]
Plug in some values: \[X_i = 100 \quad \mapsto \quad p_i = 0.27\] \[X_i = 200 \quad \mapsto \quad p_i = 0.98\]
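These numbers are easy to verify directly; a quick check in Python, using the rearranged formula above:

```python
import numpy as np

mu, beta = -6, 0.05
for x in (100, 200):
    p = np.exp(mu + beta * x) / (1 + np.exp(mu + beta * x))
    print(x, round(p, 2))  # 100 -> 0.27, 200 -> 0.98
```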
Re-write the model: \[\begin{aligned} \mathrm{logit}(p_i) &= \mu + \beta X_i \\ \Rightarrow \quad \frac{p_i}{1 - p_i} &= e^{\mu + \beta X_i} \end{aligned}\]
Consider binary \(X_i \in \{0, 1\}\) \[X_i = 0 \quad \Rightarrow \quad \frac{p_i}{1 - p_i} = e^{\mu} \phantom{e^{\beta}}\] \[X_i = 1 \quad \Rightarrow \quad \frac{p_i}{1 - p_i} = e^{\mu} e^{\beta}\]
\(e^{\mu}\) is the baseline odds
\(e^{\beta}\) is the odds ratio (OR), and \(\beta\) is the log odds ratio
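A numeric illustration of this interpretation, using made-up values \(\mu = -1\) and \(\beta = 0.7\):

```python
import numpy as np

mu, beta = -1.0, 0.7          # illustrative values only

def odds(x):
    """Odds of Y = 1 under the model, for predictor value x."""
    return np.exp(mu + beta * x)

print(odds(0))                # baseline odds, e^mu ~ 0.37
print(odds(1) / odds(0))      # odds ratio, equals e^beta ~ 2.01
```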
Logistic regression: \[\mathrm{logit}\left(\Pr\left(Y_i = 1\right)\right) = \mu + \beta X_i\]
Linear regression: \[Y_i = \mu + \beta X_i + \epsilon_i\] \[\epsilon_i \sim \mathrm{N}(0, \sigma^2)\]
Differences:
The outcome \(Y_i\) is binary rather than continuous
There is no separate error term; the randomness comes from the Bernoulli distribution of \(Y_i\)
The model is linear on the log odds scale, not on the scale of \(Y_i\) itself
The likelihood is the probability of the observed outcome variables: \[L(\mu, \beta) = \prod_i \Pr(Y_i = y_i \mid X_i = x_i)\]
In this case, Bernoulli random variables: \[L(\mu, \beta) = \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}\]
The parameters are hidden inside the \(p_i\) terms, \[p_i = \frac{e^{\mu + \beta x_i}}{1 + e^{\mu + \beta x_i}}\]
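In computations it is the log likelihood that gets evaluated; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def log_likelihood(mu, beta, x, y):
    """Bernoulli log likelihood for the logistic regression model."""
    p = 1 / (1 + np.exp(-(mu + beta * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: the likelihood can now be explored as a function of (mu, beta)
x = np.array([100.0, 150.0, 200.0])
y = np.array([0, 1, 1])
print(log_likelihood(-6, 0.05, x, y))
```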
Maximise the likelihood using the iteratively reweighted least squares (IRLS) method. Estimation and testing then proceed much as for linear regression.
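A bare-bones IRLS sketch for this two-parameter model (numpy only; production code adds convergence checks and numerical safeguards):

```python
import numpy as np

def fit_logistic_irls(x, y, n_iter=25):
    """Fit logit(p) = mu + beta*x by iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
    theta = np.zeros(2)                            # (mu, beta), start at zero
    for _ in range(n_iter):
        eta = X @ theta                            # linear predictor
        p = 1 / (1 + np.exp(-eta))                 # fitted probabilities
        w = p * (1 - p)                            # IRLS weights
        z = eta + (y - p) / w                      # working response
        XtW = X.T * w                              # X' W  (W diagonal)
        theta = np.linalg.solve(XtW @ X, XtW @ z)  # weighted LS step
    return theta
```

At convergence, the inverse of \(X^{\top} W X\) estimates the inverse Fisher information, whose diagonal gives the squared standard errors quoted below.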
Parameter estimates from maximum likelihood
Standard error from the Fisher information matrix
Example: \[\begin{aligned} \hat\mu &= -5.5 &(\textrm{s.e. }\: 0.84) \\ \hat\beta &= 0.038 &(\textrm{s.e. }\: 0.0063) \end{aligned}\]
Easier to interpret as an odds ratio: \[\textrm{OR} = e^{\hat\beta} = 1.04\]
Can also calculate a confidence interval for the OR: \[\textrm{95% CI} \approx e^{\hat\beta \pm 2 \mathrm{se}(\hat\beta)} = (1.03, 1.05)\]
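Reproducing the OR and its interval from the quoted estimate and standard error:

```python
import numpy as np

beta_hat, se = 0.038, 0.0063
print(np.exp(beta_hat))                        # OR ~ 1.04
ci = np.exp(beta_hat + np.array([-2, 2]) * se)
print(ci.round(2))                             # ~ [1.03, 1.05]
```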
Standard likelihood theory: estimates are asymptotically unbiased, efficient and normally distributed
Usually interested in tests of effect parameters, for example: \[\begin{cases} H_0\colon \: \beta = 0 \\ H_1\colon \: \beta \ne 0 \end{cases}\]
Use a likelihood ratio test to carry this out:
Example: \[\textrm{p-value} = 2.5 \times 10^{-12}\]
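A generic sketch of the test: compare the maximised log likelihoods with and without \(\beta\) (the two log likelihood values below are made-up placeholders for outputs of a model fit):

```python
from scipy.stats import chi2

def lr_test(ll_full, ll_null, df=1):
    """Likelihood ratio test from two maximised log likelihoods."""
    stat = 2 * (ll_full - ll_null)   # LR statistic
    return chi2.sf(stat, df)         # asymptotic chi-squared p-value

print(lr_test(-100.0, -104.5))       # made-up numbers, for illustration
```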
Can add more predictor variables, the same as for linear regression:
For example: \[\mathrm{logit}(p_i) = \mu + \beta_1 X_i + \beta_2 X_i^2 + \sum_j \gamma_j \, \mathrm{I}(Z_i = j)\]
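In practice such models are fitted with standard software; a sketch using the statsmodels formula interface on simulated data (the data frame and its column names are ours):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=300),
    "z": rng.choice(["a", "b", "c"], size=300),
})
eta = -0.5 + 0.8 * df["x"] + 0.3 * df["x"] ** 2
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Linear and quadratic terms in x, plus indicators for the levels of z
fit = smf.logit("y ~ x + I(x**2) + C(z)", data=df).fit(disp=0)
print(fit.params)
```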
Case-control studies: only ORs are estimable
Logistic regression: estimates ORs
A perfect match!
Example: genome-wide association studies (GWAS)
Consider a disease study with 1000 cases and 1000 controls
For individual \(i\):
\(Y_i\) is the disease status (1 for cases, 0 for controls)
\(G_i \in \{0, 1, 2\}\) is the genotype: the number of copies of the SNP allele
Example data:
| | \(G = 0\) | \(G = 1\) | \(G = 2\) |
|---|---|---|---|
| \(Y = 0\) (controls) | 360 | 468 | 172 |
| \(Y = 1\) (cases) | 301 | 486 | 213 |
\[\mathrm{logit}(p_i) = \mu + \beta G_i\]
\(G_i\) is numeric
Each copy of the SNP allele multiplies the odds of disease by \(e^{\beta}\)
This model underlies the vast majority of analyses
\[\mathrm{logit}(p_i) = \mu + \beta_1 \mathrm{I}(G_i = 1) + \beta_2 \mathrm{I}(G_i = 2)\]
\(G_i\) is categorical
The heterozygote OR is \(e^{\beta_1}\) and the homozygote OR is \(e^{\beta_2}\)
Parameter estimates: \[\begin{aligned} \hat\mu &= -0.17 &(\textrm{s.e. }\: 0.071) \\ \hat\beta &= 0.20 &(\textrm{s.e. }\: 0.063) \end{aligned}\]
\[\begin{aligned} \textrm{OR} &= e^{\hat\beta} = 1.22 \\ \textrm{95% CI} &\approx (1.08, 1.38) \end{aligned}\]
Association test: \[\begin{cases} H_0\colon \: \beta = 0 \\ H_1\colon \: \beta \ne 0 \end{cases}\]
\[\textrm{p-value} = 0.0017\]
NOT 'genome-wide significant' (conventional threshold: \(\textrm{p} < 5 \times 10^{-8}\))
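These results can be checked by expanding the table into individual-level data and fitting both models; a sketch with statsmodels (the slide's p-value may come from a likelihood ratio test, so the Wald p-value printed here should only be close):

```python
import numpy as np
import statsmodels.api as sm

# Counts from the table: (Y, G) -> count
counts = {(0, 0): 360, (0, 1): 468, (0, 2): 172,
          (1, 0): 301, (1, 1): 486, (1, 2): 213}
y = np.concatenate([np.full(n, yi) for (yi, gi), n in counts.items()])
g = np.concatenate([np.full(n, gi) for (yi, gi), n in counts.items()])

# Additive model: G treated as numeric
fit = sm.Logit(y, sm.add_constant(g.astype(float))).fit(disp=0)
print(fit.params)               # approx (-0.17, 0.20)
print(np.exp(fit.params[1]))    # OR approx 1.22
print(fit.pvalues[1])           # Wald p-value, close to the quoted 0.0017

# Genotype model: separate heterozygote and homozygote effects
D = sm.add_constant(np.column_stack([g == 1, g == 2]).astype(float))
fit2 = sm.Logit(y, D).fit(disp=0)
print(np.exp(fit2.params[1:]))  # heterozygote and homozygote ORs
```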
Can easily add covariates:
\[\mathrm{logit}(p_i) = \mu + \beta X_i + \boldsymbol{\gamma}^{\top} \mathbf{Z}_i\]
For example: sex, ethnicity, principal components, ...
Inference and testing follow similarly
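A sketch of a covariate-adjusted fit with the formula interface, on simulated data (all column names are ours; `pc1` stands in for a principal component):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "g": rng.choice([0, 1, 2], size=n, p=[0.33, 0.48, 0.19]),
    "sex": rng.choice(["F", "M"], size=n),
    "pc1": rng.normal(size=n),
})
eta = -0.2 + 0.2 * df["g"] + 0.1 * df["pc1"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Genotype effect adjusted for sex and a principal component
fit = smf.logit("y ~ g + C(sex) + pc1", data=df).fit(disp=0)
print(fit.params)
```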