Bayesian Inference Notes

1. Bayes’ Theorem

1.1. Bayes’ theorem

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta)$$

1.2. Prior predictive distribution

$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$$

1.3. Posterior predictive distribution

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$$
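The posterior predictive integral can be approximated by simulation: draw θ from the posterior, then draw new data given θ. A stdlib-only sketch with a hypothetical Beta(9, 5) posterior on a coin bias:

```python
import random

# Posterior predictive simulation: theta ~ p(theta | y), then y_new ~ p(y_new | theta).
# Hypothetical Beta(9, 5) posterior on a coin bias; predict heads in 10 new flips.
random.seed(1)

def draw_binomial(n, p):
    """Count successes in n Bernoulli(p) trials."""
    return sum(random.random() < p for _ in range(n))

draws = []
for _ in range(50_000):
    theta = random.betavariate(9, 5)     # draw from the posterior
    draws.append(draw_binomial(10, theta))

pred_mean = sum(draws) / len(draws)      # close to 10 * 9/14
```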

2. Fundamental Distributions

| Name | PDF/PMF | Mean | Variance | Mode |
|---|---|---|---|---|
| Beta$(y \mid \alpha,\beta)$ | $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha-1}(1-y)^{\beta-1}$ | $\frac{\alpha}{\alpha+\beta}$ | $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$ | $\frac{\alpha-1}{\alpha+\beta-2}$ |
| Binomial$(y \mid n,p)$ | $\binom{n}{y} p^y (1-p)^{n-y}$ | $np$ | $np(1-p)$ | |
| Exponential$(y \mid \lambda)$ | $\lambda e^{-\lambda y}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $0$ |
| Erlang$(y \mid \lambda,k)$ | $\frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}$ | $\frac{k}{\lambda}$ | $\frac{k}{\lambda^2}$ | $\frac{k-1}{\lambda}$ |
| ExGauss$(y \mid \mu,\sigma,\lambda)$ | $\frac{\lambda}{2} \exp\!\left(\frac{\lambda}{2}(2\mu + \lambda\sigma^2 - 2y)\right) \operatorname{erfc}\!\left(\frac{\mu + \lambda\sigma^2 - y}{\sqrt{2}\,\sigma}\right)$ | $\mu + \frac{1}{\lambda}$ | $\sigma^2 + \frac{1}{\lambda^2}$ | |
| Gamma$(y \mid \alpha,\beta)$ | $\frac{\beta^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-\beta y}$ | $\frac{\alpha}{\beta}$ | $\frac{\alpha}{\beta^2}$ | $\frac{\alpha-1}{\beta}$ |
| InvGamma$(y \mid \alpha,\beta)$ | $\frac{\beta^\alpha}{\Gamma(\alpha)}\, y^{-\alpha-1} e^{-\beta/y}$ | $\frac{\beta}{\alpha-1}$ | $\frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ | $\frac{\beta}{\alpha+1}$ |
| LogNormal$(y \mid \mu,\sigma)$ | $\frac{1}{y\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln y - \mu)^2}{2\sigma^2}}$ | $e^{\mu + \sigma^2/2}$ | $(e^{\sigma^2}-1)\, e^{2\mu+\sigma^2}$ | $e^{\mu - \sigma^2}$ |
| Poisson$(y \mid \lambda)$ | $\frac{\lambda^y e^{-\lambda}}{y!}$ | $\lambda$ | $\lambda$ | $\lfloor\lambda\rfloor$ |
| NegBinomial$(k \mid r,p)$ | $\binom{k+r-1}{k}(1-p)^k p^r$ | $\frac{r(1-p)}{p}$ | $\frac{r(1-p)}{p^2}$ | |
| Normal$(y \mid \mu,\sigma^2)$ | $\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$ | $\mu$ | $\sigma^2$ | $\mu$ |
| Student$(y \mid \nu)$ | $\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ | $0$ | $\frac{\nu}{\nu-2}$ | $0$ |
| Uniform$(y \mid a,b)$ | $\frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | |

Table 1: Single-Variate Distributions
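The table entries can be spot-checked by simulation; for example the Gamma row (mean $\alpha/\beta$, variance $\alpha/\beta^2$), using only the standard library. Note that `random.gammavariate` takes a *scale* parameter, i.e. $1/\beta$ when $\beta$ is a rate.

```python
import random

# Monte Carlo check of the Gamma(alpha, beta) row of Table 1:
# mean = alpha/beta = 1.5, variance = alpha/beta^2 = 0.75 for alpha=3, beta=2.
random.seed(0)
alpha, beta = 3.0, 2.0
samples = [random.gammavariate(alpha, 1.0 / beta) for _ in range(200_000)]

mean = sum(samples) / len(samples)                          # ~ 1.5
var = sum((x - mean) ** 2 for x in samples) / len(samples)  # ~ 0.75
```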

3. Functions

3.1. Beta Function

$$B(\alpha,\beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\, dt = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

Properties:

$$B(\alpha,\beta) = B(\beta,\alpha), \qquad B(\alpha+1,\beta) = B(\alpha,\beta)\,\frac{\alpha}{\alpha+\beta}, \qquad B(1,\beta) = \frac{1}{\beta}$$

4. Conjugate Prior

The idea of a conjugate prior is that for a given likelihood we choose a prior distribution such that, after observing data and applying Bayes’ theorem, the posterior distribution belongs to the same family as the prior.

That is, if 𝑝(πœƒ) and 𝑝(πœƒ|𝑦) have the same distributional form, then the prior is called a conjugate prior for the likelihood model.

This is useful because it makes Bayesian updating analytically tractable. Instead of performing difficult integration or numerical approximation, we can often derive the posterior parameters in closed form.
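For example, with a Beta prior and a Binomial likelihood the posterior is again Beta, and the update is pure arithmetic (the numbers below are hypothetical):

```python
# Beta-Binomial conjugacy: a Beta(a, b) prior plus y successes in n trials
# gives the posterior Beta(a + y, b + n - y) in closed form.
def update_beta_binomial(a, b, y, n):
    """Return the posterior Beta parameters."""
    return a + y, b + (n - y)

a_post, b_post = update_beta_binomial(2, 2, 7, 10)  # prior Beta(2, 2), data 7/10
posterior_mean = a_post / (a_post + b_post)         # (2+7) / (2+2+10) = 9/14
```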

5. Conjugate Prior for Exponential Families

Note the general exponential-family form, where $T$ is the sufficient statistic, $\eta(\theta)$ the natural parameter, $A(\theta)$ the log-normalizer, and $h$ the base measure:

$$p(y_i \mid \theta) = h(y_i) \exp\!\left(\eta(\theta)^\top T(y_i) - A(\theta)\right)$$

Likelihood of a sequence of i.i.d. samples:

$$p(y \mid \theta) = \left(\prod_{i=1}^{n} h(y_i)\right) \exp\!\left(\eta(\theta)^\top \sum_{i=1}^{n} T(y_i) - n A(\theta)\right)$$

So the conjugate prior for that likelihood is

$$p(\theta \mid \chi, \nu) \propto \exp\!\left(\eta(\theta)^\top \chi - \nu A(\theta)\right)$$

Posterior is

$$p(\theta \mid y) \propto \exp\!\left(\eta(\theta)^\top \left(\chi + \sum_{i=1}^{n} T(y_i)\right) - (\nu + n)\, A(\theta)\right)$$
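A concrete instance of this exponential-family conjugacy is the Poisson likelihood with a Gamma prior on the rate: Gamma$(\alpha, \beta)$ plus observations $y_1,\dots,y_n$ gives Gamma$(\alpha + \sum y_i,\ \beta + n)$. A minimal sketch with hypothetical data:

```python
# Poisson-Gamma conjugate update: shape accumulates the sufficient statistic
# sum(y_i), rate accumulates the sample size n.
def update_poisson_gamma(alpha, beta, data):
    """Return the posterior Gamma(shape, rate) parameters."""
    return alpha + sum(data), beta + len(data)

data = [3, 1, 4, 1, 5]                                 # hypothetical counts
a_post, b_post = update_poisson_gamma(2.0, 1.0, data)  # prior Gamma(2, 1)
posterior_mean = a_post / b_post                       # Gamma mean alpha/beta
```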

6. Proper and Improper Prior Distributions

A prior is called proper if it is a valid probability distribution:

$$\int p(\theta)\, d\theta = 1$$

And improper if

$$\int p(\theta)\, d\theta = \infty$$

In theory, even improper priors are acceptable, as long as the resulting posterior is proper.
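A small stdlib-only illustration: a Normal(0, 1) prior integrates to 1 (proper), while the flat prior $p(\theta) \propto 1$ accumulates unbounded mass as the integration range grows (improper).

```python
import math

# Midpoint-rule quadrature, enough for a rough numerical check.
def integrate(f, lo, hi, n=100_000):
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

normal = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
mass_normal = integrate(normal, -10, 10)               # ~ 1: proper
mass_flat_small = integrate(lambda t: 1.0, -10, 10)    # 20
mass_flat_big = integrate(lambda t: 1.0, -1000, 1000)  # 2000: grows without bound
```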

7. Fisher Information Matrix

$$\mathcal{I}(\theta)_{ij} = \operatorname{E}\!\left[\frac{\partial \log p(y \mid \theta)}{\partial \theta_i}\, \frac{\partial \log p(y \mid \theta)}{\partial \theta_j} \,\middle|\, \theta\right]$$

Under regularity conditions this equals $-\operatorname{E}\!\left[\frac{\partial^2 \log p(y \mid \theta)}{\partial \theta_i\, \partial \theta_j} \,\middle|\, \theta\right]$.

8. Jeffreys’ Prior

$$p(\theta) \propto \sqrt{\det \mathcal{I}(\theta)}$$

This prior is invariant under reparameterization of $\theta$.
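For the Bernoulli model, $\mathcal{I}(p) = \frac{1}{p(1-p)}$, so Jeffreys’ prior is $p^{-1/2}(1-p)^{-1/2}$, i.e. Beta$(1/2, 1/2)$ up to its normalizing constant $B(1/2, 1/2) = \pi$. A rough numerical check of that constant:

```python
import math

# Midpoint-rule integral of the unnormalized Jeffreys density for Bernoulli:
# integral of (p(1-p))^(-1/2) over (0, 1) should be B(1/2, 1/2) = pi.
N = 200_000
total = 0.0
for i in range(N):
    p = (i + 0.5) / N                     # cell midpoints avoid the endpoints
    total += (p * (1 - p)) ** -0.5 / N    # density value times cell width

# total is close to math.pi
```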

9. Pivotal Quantities

For the binomial and other single-parameter models, different principles give (slightly) different noninformative prior distributions. But for two cases—location parameters and scale parameters—all principles seem to agree [1].

9.1. Location Parameter

𝑝(πœƒ)∼1

9.2. Scale Parameter

𝑝(πœƒ)∼1πœƒ

10. Predictive Accuracy

Predictive accuracy matters in two different ways. The first is to take the model as all we know and check its posterior predictions against the data. The second is to compare several candidate models. Even if all of the models being considered have mismatches with the data, it can be informative to evaluate their predictive accuracy, compare them, and consider where to go next [2].

11. KL Divergence

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(y) \log \frac{p(y)}{q(y)}\, dy$$

It is nonnegative, equals zero only when $p = q$ (almost everywhere), and is not symmetric in $p$ and $q$.
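In the discrete case the integral becomes a sum; a short sketch with two hypothetical distributions over three outcomes:

```python
import math

# Discrete KL divergence: KL(p || q) = sum_i p_i * log(p_i / q_i).
# Terms with p_i = 0 contribute 0 by convention and are skipped.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]

d = kl_divergence(p, q)             # > 0 since p != q
assert kl_divergence(p, p) == 0.0   # zero only when the distributions match
```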

12. Linear Algebra

12.1. Convex Combination

A subset $A$ of a vector space $V$ is said to be convex if

$$\lambda u + (1 - \lambda) v \in A$$

for all vectors $u, v \in A$, and all scalars $\lambda$ in $[0, 1]$.

Via induction, this can be seen to be equivalent to the requirement that

$$\lambda_1 u_1 + \lambda_2 u_2 + \dots + \lambda_n u_n \in A$$

for all vectors $u_1, \dots, u_n \in A$, and for all scalars $\lambda_1, \lambda_2, \dots, \lambda_n \ge 0$ such that $\sum_{i=1}^{n} \lambda_i = 1$.
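The $n$-point form is easy to state in code; a small sketch with hypothetical 2-D points:

```python
# Convex combination: nonnegative weights summing to 1 applied to points.
def convex_combination(points, weights):
    assert all(w >= 0 for w in weights)            # lambda_i >= 0
    assert abs(sum(weights) - 1.0) < 1e-9          # sum of lambda_i = 1
    dim = len(points[0])
    return tuple(
        sum(w * p[j] for w, p in zip(weights, points)) for j in range(dim)
    )

# Equal weights on the vertices of a triangle give its centroid.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
centroid = convex_combination(pts, [1 / 3, 1 / 3, 1 / 3])
```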

Bibliography

  • [1] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and others, Bayesian Data Analysis, 3rd ed. Boca Raton, FL: CRC Press, 2013. [Online]. Available: https://stat.columbia.edu/~gelman/book/
  • [2] A. Gelman, J. Hwang, and A. Vehtari, “Understanding predictive information criteria for Bayesian models,” Statistics and Computing, vol. 24, no. 6, pp. 997–1016, Nov. 2014, doi: 10.1007/s11222-013-9416-2.