MCMC and Statistical Inference
Suppose we wish to study the relationship between two observables, $X$ and $Y$, because we suspect that a mathematical rule connects them: our theory tells us that
\begin{align}
Y = h_\theta(X)
\end{align}
where $h_\theta$ is a given function of $X$ (called the *hypothesis*), depending on a set of parameters $\theta$. Our goal is to find the correct values of the parameters $\theta = (\theta_1,..,\theta_n)$.
To do so, we must collect data. We therefore measure the observables $X$ and $Y$ simultaneously, repeating the measurement many times. Let us denote each measurement outcome by $(x^{(a)},y^{(a)})$, with $a=1,..,A$. Since the data suffer from measurement errors, we cannot claim that $h_\theta(x^{(a)}) = y^{(a)}$, even if our hypothesis is perfectly correct and we choose the "correct" values of $\theta$. We thus define the error of each measurement as
\begin{align}
\epsilon_a := y^{(a)} - h_\theta(x^{(a)})
\end{align}
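As a concrete (hypothetical) illustration, suppose the hypothesis is linear, $h_\theta(x) = \theta_1 + \theta_2 x$. A minimal sketch, with placeholder data standing in for real measurements, might look like this:

```python
import numpy as np

# Hypothetical illustration: a linear hypothesis h_theta(x) = theta_1 + theta_2 * x
def h(theta, x):
    return theta[0] + theta[1] * x

# Placeholder data standing in for A = 4 simultaneous measurements (x^(a), y^(a))
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.9, 2.1, 2.8, 4.2])

theta = np.array([1.0, 1.0])   # trial values of the parameters
eps = y - h(theta, x)          # errors eps_a = y^(a) - h_theta(x^(a))
```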
The assumption is now that these errors are normally distributed. More precisely, the $A$-dimensional column-vector $\vec \epsilon$ (whose $a$th element is $\epsilon_a$) is the outcome of a random variable $\vec E$ distributed according to $\mathcal N(\vec 0, \Sigma)$, where $\Sigma$ is a given $A \times A$ covariance matrix. Therefore, we can write the pdf of $\vec \epsilon$ with parameters $\theta$ as
\begin{align}
p_{\vec E}(\vec \epsilon; \theta) = \dfrac{1}{\sqrt{(2\pi)^A|\det\Sigma|}} e^{-\frac{1}{2} \vec \epsilon^T \Sigma^{-1} \vec \epsilon}
\end{align}
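Continuing the sketch above, and assuming for illustration uncorrelated errors with standard deviation $0.1$ on each measurement, this pdf could be evaluated with `scipy.stats.multivariate_normal`:

```python
from scipy.stats import multivariate_normal

A = len(eps)
Sigma = 0.1**2 * np.eye(A)   # illustrative covariance: uncorrelated errors, std 0.1

# pdf of the error vector eps for the trial parameters theta chosen above
p_eps = multivariate_normal(mean=np.zeros(A), cov=Sigma).pdf(eps)
```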
Now, the parameters $\theta$ are unknown, and so -- from a Bayesian point of view -- we can think of $\theta$ as the outcome of a random variable $\Theta$. Then, $p_{\vec E}(\vec \epsilon; \theta)$ can be regarded as a conditional probability $p_{\vec E | \Theta}(\vec \epsilon | \theta)$: the probability of finding data errors $\vec E = \vec \epsilon$ given that the parameters are $\Theta = \theta$. By Bayes' theorem, we can thus write the inverse conditional probability, that is, the pdf of finding parameters $\Theta = \theta$ given the data errors are $\vec E = \vec \epsilon$:
\begin{align}
p_{\Theta | \vec E}(\theta | \vec \epsilon) = \dfrac{p_{\vec E | \Theta}(\vec \epsilon | \theta) p_\Theta(\theta)}{p_{\vec E}(\vec \epsilon)}
\end{align}
This $p_{\Theta | \vec E}$ is the pdf we are actually interested in, since it allows us to estimate the parameters $\Theta$ (via their expected value), as well as the associated error:
\begin{align}
\theta_* = \mathbb E[\Theta], \ \ \ \ \ \sigma := \sqrt{\text{Var}[\Theta]} = \sqrt{\mathbb E[\Theta^2] - \mathbb E[\Theta]^2}
\end{align}
We therefore use MCMC to sample $m$ points $\theta^{(1)},..,\theta^{(m)}$ from $p_{\Theta | \vec E}$ and to estimate the expected values above.
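Each expectation is then approximated by the corresponding sample average over the chain:
\begin{align}
\mathbb E[\Theta] \approx \frac{1}{m} \sum_{k=1}^{m} \theta^{(k)}, \qquad \mathbb E[\Theta^2] \approx \frac{1}{m} \sum_{k=1}^{m} \big(\theta^{(k)}\big)^2
\end{align}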
Now, recall that for the Metropolis algorithm we do not need the actual distribution, since it suffices to have a function proportional to it: we then choose a uniform prior (i.e., $p_\Theta(\theta)$ independent of $\theta$ within some bounds), and find that -- as a function of $\theta$ -- the pdf $p_{\Theta | \vec E}(\theta | \vec \epsilon)$ is proportional to $p_{\vec E | \Theta}(\vec \epsilon | \theta) = p_{\vec E}(\vec \epsilon; \theta)$. In the algorithm we can then set
\begin{align}
L(\theta) = e^{-\frac{1}{2} \vec \epsilon^T \Sigma^{-1} \vec \epsilon}
\end{align}
which, incidentally, is the *likelihood* of $\theta$ given the data and the covariance matrix $\Sigma$ (note that $\vec \epsilon$ depends on $\theta$ through $h_\theta$). We can write this following the notation of the previous example:
\begin{align}
L(\theta) = e^{-S(\theta)}
\end{align}
where
\begin{align}
S(\theta) := \frac{1}{2} (\vec y - h_\theta(\vec x))^T \Sigma^{-1} (\vec y - h_\theta(\vec x))
\end{align}
is (minus) the *log-likelihood*. Comparing with standard regression, we see that $S$ is a generalization of the least-squares cost function, one which also takes into account correlations among the measurement errors (encoded in $\Sigma$).
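For instance, in the special case of uncorrelated errors with a common variance, $\Sigma = \sigma_\epsilon^2\, \mathbb 1$, the cost reduces to the familiar sum of squared residuals:
\begin{align}
S(\theta) = \frac{1}{2\sigma_\epsilon^2} \sum_{a=1}^{A} \left( y^{(a)} - h_\theta(x^{(a)}) \right)^2
\end{align}
With $S(\theta)$ in hand, here is a minimal Metropolis sketch, reusing the hypothetical linear model and placeholder data from above (the acceptance step only involves the difference $S(\theta') - S(\theta)$, so the normalization of the posterior never enters):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(theta, x):                 # hypothetical linear model, as above
    return theta[0] + theta[1] * x

def S(theta, x, y, Sigma_inv):
    """Minus log-likelihood: 1/2 (y - h_theta(x))^T Sigma^{-1} (y - h_theta(x))."""
    r = y - h(theta, x)
    return 0.5 * r @ Sigma_inv @ r

def metropolis(x, y, Sigma_inv, theta0, m=10_000, step=0.05):
    """Sample m points from the posterior, assuming a flat prior
    (the bounds of the uniform prior are ignored for brevity)."""
    theta = np.array(theta0, dtype=float)
    S_cur = S(theta, x, y, Sigma_inv)
    samples = np.empty((m, theta.size))
    for k in range(m):
        prop = theta + step * rng.normal(size=theta.size)   # symmetric proposal
        S_prop = S(prop, x, y, Sigma_inv)
        # Accept with probability min(1, e^{-(S_prop - S_cur)})
        if rng.random() < np.exp(S_cur - S_prop):
            theta, S_cur = prop, S_prop
        samples[k] = theta
    return samples

# Placeholder data and covariance, as in the sketches above
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.9, 2.1, 2.8, 4.2])
Sigma_inv = np.linalg.inv(0.1**2 * np.eye(len(x)))

samples = metropolis(x, y, Sigma_inv, theta0=[0.0, 0.0])
```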
Recall that standard regression aims at finding the $\theta_*$ that minimizes $S$ (e.g., by gradient descent): this procedure only produces a single value $\theta_*$ for $\theta$. MCMC, on the other hand, produces samples from the whole pdf of $\theta$, which in particular allows us to evaluate the error $\sigma_{i}$ on each parameter $\theta_i$ (with $i=1,2,..,n$) and to estimate its relevance for fitting the data by computing the *signal-to-noise* ratio (also called the *t-value*):
\begin{align}
t_i := \dfrac{\theta_{*i}}{\sigma_{i}}
\end{align}
The larger $|t_i|$, the more confident we are that the corresponding parameter $\theta_i$ is relevant in the model: as a rule of thumb, we should retain parameters with $|t_i| > 3$ and discard the other ones.
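Continuing the Metropolis sketch above, these quantities would be computed directly from the array `samples`:

```python
theta_star = samples.mean(axis=0)     # posterior means theta_*i
sigma = samples.std(axis=0, ddof=1)   # posterior standard deviations sigma_i
t = theta_star / sigma                # t-values: keep parameters with |t_i| > 3
```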