Likelihood and Estimation


Statistical Inference is concerned with two inter-related issues: 1) estimation (of an unknown parameter); and 2) hypothesis testing.

This note is concerned with the first of these. We’ll set the scene with a simple example.

Example: Is the coin fair?

Suppose you are given a coin and all that you are told is that it is biased. That is, the chances of landing ‘heads’ or ‘tails’ are not the same. Your problem is to estimate the true probability of observing either a head or a tail.

A classically ‘Frequentist’ approach to this problem is to flip the coin a number of times and record the number of heads and tails. The relative frequency with which heads/tails is observed becomes the estimate of the true probability of heads/tails. This simple example succinctly illustrates the concept of a relative frequency as a surrogate for a probability (and hence the name frequentist statistics). Furthermore, intuition suggests that the precision of this estimate will increase as the number of tosses of the coin increases.

We can demonstrate this quite easily with a small simulation in R. There are many ways of doing this, but we’ll use basic functions rather than intrinsic R probability functions.

set.seed(1)  #  for reproducibility
theta<-0.36  #  arbitrarily set the true probability of a head
N<-1000  #  set the total number of tosses
tosses<-as.numeric(runif(N)<theta)  #  simulate N tosses: record a "1" (head) when the uniform draw falls below theta
phat<-cumsum(tosses)/seq(1,N)  #  this tallies the number of times we obtain a "1" and divides by the number of tosses up to that point
plot(seq(1,N),phat,ylim=c(0,1),type="l",xlab="Number of tosses",ylab="Rel. Freq.")
abline(h=theta,lty=2)  #  mark the true value for reference

As can be seen, the relative frequency improves the more times we toss the coin and indeed, eventually converges to the correct value.

Thus, our intuitive estimator: \(\hat p = \frac{{\sum\limits_{i = 1}^n {{X_i}} }}{n}\) (where \({{X_i}}\) is 1 if the \({i^{th}}\) toss lands heads and zero otherwise) for the true probability, \(\theta\) seems to work satisfactorily. But is it the ‘best’ estimator for \(\theta\)?

Well, it depends on how you define best. Statisticians generally invoke two principles to identify a best estimator: 1) unbiasedness; and 2) minimum variance.

The first of these is used to narrow the search among all candidate estimators to only those that result in the unbiased estimation of the unknown parameter. By this we mean that, in the long run, the estimator will neither over-estimate nor under-estimate the true value. More formally, we say the expected value of the estimator is the true value.

Be careful here: the term ‘expected value’ in statistics has a precise mathematical definition which may not coincide with what we “expect”. The expected value of a random variable is its mean or arithmetic average.

For a discrete random variable, as is our case here, we write:

\[E\left[ X \right] = \sum\limits_i {{x_i}} \,P\left[ {X = {x_i}} \right]\]

where \(E\left[ \cdot \right]\) is the expectation operator and \(P\left[ {X = {x_i}} \right]\) is the probability function for X.

In the coin-tossing example, this probability function is simply: \[P\left[ X_i = x_i \right] = \left\{ \begin{array}{cl} \theta & \mathrm{if}\quad x_i = 1\\ 1 - \theta & \mathrm{if}\quad x_i = 0 \end{array} \right.\]

This can be written more compactly as: \[P\left[ X_i = x_i \right] = \theta^{x_i}\left( 1 - \theta \right)^{1 - x_i}; \qquad x_i \in \left\{ 0,1 \right\}\]
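As a quick sanity check, we can evaluate the compact form at both outcomes and recover the piecewise definition (a sketch, reusing the arbitrary \(\theta = 0.36\) from the simulation above):

```r
theta <- 0.36  # the arbitrary true probability used in the simulation above
pmf <- function(x) theta^x * (1 - theta)^(1 - x)  # compact Bernoulli pmf
pmf(1)  # theta: the probability of a head
pmf(0)  # 1 - theta: the probability of a tail
```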

This is called the Bernoulli distribution.

So, what’s the expected value of this random variable? Well, according to the formula above it’s: \[E\left[ X_i \right] = \sum\limits_i x_i \, P\left[ X_i = x_i \right] = \sum\limits_{x \in \left\{ 0,1 \right\}} x \, \theta^x \left( 1 - \theta \right)^{1 - x}\]

\[\begin{array}{l} = \left\{ 0 \cdot \theta^0 \left( 1 - \theta \right)^{1 - 0} + 1 \cdot \theta^1 \left( 1 - \theta \right)^{1 - 1} \right\}\\ = \left\{ 0 + \theta \right\}\\ = \theta \end{array}\]

So now we can look at the expected value of our estimator \({\hat p}\):\[E\left[ {\hat p} \right] = E\left[ {\frac{{\sum\limits_{i = 1}^n {{X_i}} }}{n}} \right]\] \[\begin{array}{l} = \frac{1}{n}E\left[ {\sum\limits_{i = 1}^n {{X_i}} } \right]\\ = \frac{1}{n}\left\{ {\sum\limits_{i = 1}^n {E\left[ {{X_i}} \right]} } \right\}\\ = \frac{1}{n}\left\{ {\sum\limits_{i = 1}^n \theta } \right\}\\ = \frac{1}{n}\,n \cdot \theta \\ = \theta \end{array}\]

Perfect! This says that \({\hat p}\) is an unbiased estimator of \(\theta\) – which confirms what the empirical plot above was telling us.
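A quick simulation (a sketch, reusing the arbitrary \(\theta = 0.36\) from above) illustrates this unbiasedness empirically: the mean of a large number of simulated tosses sits very close to \(\theta\):

```r
set.seed(42)  # for reproducibility
theta <- 0.36
x <- as.numeric(runif(1e5) < theta)  # 100,000 simulated Bernoulli tosses
mean(x)  # the estimator p-hat; close to theta = 0.36
```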

We’ll defer discussion about the variance properties of this estimator for the moment, and move on to the concept of likelihood.


The probability \(P\left[ {{X_i} = {x_i}} \right]\) is for an individual outcome. What can we say about the joint probability for outcomes i and j?

Well, if (as we assume) the individual outcomes are independent (that is, the occurrence of a head/tail on the \({i^{th}}\) toss in no way influences the occurrence of a head/tail on the \({j^{th}}\) toss) then we can use the multiplication rule for probabilities to obtain the joint probability \(P\left[ {X_i} = {x_i};{X_j} = {x_j} \right]\) where \({x_i} \in \left\{ 0,1 \right\}\) and \({x_j} \in \left\{ 0,1 \right\}\).

| Outcome of \({i^{th}}\) toss, \(X_i\) | Outcome of \({j^{th}}\) toss, \(X_j\) | \(Y = X_i + X_j\) | \(P\left( Y \right)\) |
|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | \(\left( 1 - \theta \right)^2\) |
| 1 | 0 | 1 | \(\theta \left( 1 - \theta \right)\) |
| 0 | 1 | 1 | \(\left( 1 - \theta \right)\theta\) |
| 1 | 1 | 2 | \(\theta^2\) |


You can verify that \(P\left( Y \right)\) is a true probability function by showing that the probabilities sum to one viz:

\[\begin{array}{l} \left( 1 - \theta \right)^2 + \theta \left( 1 - \theta \right) + \left( 1 - \theta \right)\theta + \theta^2 = \left( 1 - \theta \right)^2 + 2\theta \left( 1 - \theta \right) + \theta^2\\ = \left\{ \left( 1 - \theta \right) + \theta \right\}^2\\ = 1 \end{array}\]
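The same check can be carried out numerically for any particular value of \(\theta\) (here the arbitrary 0.36 again):

```r
theta <- 0.36
# P(Y) for the four joint outcomes in the table above
p <- c((1 - theta)^2, theta * (1 - theta), (1 - theta) * theta, theta^2)
sum(p)  # equals 1
```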

Observe, that for this simple case corresponding to n=2 we can write:

\[P\left( Y = y \right) = \binom{2}{y} \theta^y \left( 1 - \theta \right)^{2 - y}; \qquad y \in \left\{ 0,1,2 \right\}\]

This is a special case of the binomial distribution with n=2. More generally, we write: \[Y \sim bin\left( n,\theta \right)\] which is read as “Y has a binomial distribution with parameters n and \(\theta\)”, and the formula generalizes to: \[P\left( Y = y \right) = \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y}; \qquad y \in \left\{ 0,1,2, \ldots ,n \right\}\]
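Although we have so far avoided R's intrinsic probability functions, the general formula can be checked against the built-in dbinom (n = 20 and \(\theta = 0.36\) are arbitrary choices for the check):

```r
n <- 20; theta <- 0.36
y <- 0:n
manual <- choose(n, y) * theta^y * (1 - theta)^(n - y)  # the binomial formula written out
all.equal(manual, dbinom(y, size = n, prob = theta))  # TRUE
```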

Now, let us return to our estimator \(\hat p = \frac{Y}{n}\) where \(Y = \sum\limits_{i = 1}^n {{X_i}}\).

We know that the probability model for Y is the binomial distribution: \[P\left( Y = y \right) = \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y}; \qquad y \in \left\{ 0,1,2, \ldots ,n \right\}\] which, for fixed n, is a function of the parameter \(\theta\).

Now, given an observation \(Y = y\) for a fixed n, we would like to know what’s the most likely value for \(\theta\). To make this more concrete, suppose n=20 in the coin tossing example above. Let’s look at the outcomes of the 20 individual tosses and the sum, Y from the simulation:

## Outcome for 20 tosses 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1
## Y =  7

Thus, we have observed the event \(Y = 7\) with \(n = 20\) and so \(P\left( Y = 7 \right) = \binom{20}{7} \theta^7 \left( 1 - \theta \right)^{13}\).

This function (a probability, now viewed as a function of \(\theta\)) is called the likelihood function for \(\theta\). Let’s compute and plot it for various values of \(\theta\):
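One way to do this (a sketch; the grid of candidate \(\theta\) values is an arbitrary choice) is:

```r
theta.grid <- seq(0.01, 0.99, by = 0.01)  # candidate values for theta
lik <- choose(20, 7) * theta.grid^7 * (1 - theta.grid)^13  # likelihood at each candidate
plot(theta.grid, lik, type = "l", xlab = expression(theta), ylab = "Likelihood")
theta.grid[which.max(lik)]  # the value of theta maximizing the likelihood: 0.35
```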


This plot has a distinct peak at 0.35. So, under these circumstances, our best estimate of \(\theta\) is 0.35 because it maximizes the likelihood function. In other words, the likelihood of the outcome we actually did observe is maximized for \(\theta = 0.35\). Not surprisingly, this is called the maximum likelihood estimate or mle of \(\theta\)!

Just one final step – we’ve deduced the mle for the case Y=7 and n=20, but what about more generally (i.e. any n and any Y)?

Well, we have the likelihood function: \[L\left( \theta \right) = \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y}\]

and we can simply differentiate this with respect to \(\theta\), set the result equal to zero, and solve for \(\theta\), that is we solve the equation \({\frac{{dL\left( \theta \right)}}{{d\theta }} = 0}\).

Now, \[\frac{d}{d\theta}\left\{ \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y} \right\} = \binom{n}{y} \theta^{y - 1} \left( 1 - \theta \right)^{n - y - 1} \left( y - n\theta \right)\]

and setting this equal to zero reduces to \(\theta^{y - 1} \left( 1 - \theta \right)^{n - y - 1} \left( y - n\theta \right) = 0\).

There are three possible solutions to this expression. They are: \(\begin{array}{*{20}{c}}{\theta = 0;}&{\left( 1 - \theta \right) = 0;}&{y - n\theta = 0}\end{array}\). The first two are implausible because \({\theta = 0}\) means that there is no chance of observing a head, while \({\left( 1 - \theta \right) = 0}\) or \({\theta = 1}\) means that all we would ever observe is a head; both are refuted by our evidence (7 heads and 13 tails). The third alternative is \({y - n\theta = 0}\), from which we obtain the general expression for the maximum likelihood estimator for a binomial proportion:

\[\hat \theta = \frac{Y}{n}\]

And with n=20 and Y=7 this gives \(\hat \theta = \frac{7}{{20}} = 0.35\).
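We can corroborate the calculus numerically with R's optimize function (a sketch for the observed y = 7, n = 20):

```r
# Log-likelihood for y = 7, n = 20; the binomial coefficient is constant in theta and can be dropped
loglik <- function(theta) 7 * log(theta) + 13 * log(1 - theta)
opt <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
opt$maximum  # approximately 0.35
```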

The continuous case

The binomial distribution considered above is an example of a discrete probability distribution. We next turn our attention to the estimation of a parameter of a continuous distribution. As before, we will do this in relation to a specific example.

The (negative) exponential distribution

The negative exponential distribution is a common probability model for ‘waiting times’ – for example, time spent at a supermarket checkout. In industrial settings it often provides an appropriate description of the time between failures of equipment or processes.

The probability density function (pdf) for the negative exponential distribution is: \[f_X(x) = \lambda e^{ - \lambda x}; \qquad x > 0; \quad \lambda > 0\]

As before, we consider a random sample \(\left\{ X_1, X_2, \ldots, X_n \right\}\) from this pdf. That is, the \(X_i\) are independently and identically distributed (i.i.d.) with pdf given by the expression above.

To obtain the mle for \(\lambda\) we first obtain the likelihood function: \[\begin{array}{l} L\left( \lambda \right) = \prod\limits_{i = 1}^n f_{X_i}\left( x_i \right)\\ = \prod\limits_{i = 1}^n \lambda e^{ - \lambda x_i}\\ = \lambda^n e^{ - \lambda \sum\limits_{i = 1}^n x_i} \end{array}\]

As before, the mle is found by differentiating the likelihood function, equating to zero and solving in terms of the parameter (\(\lambda\)). Alternatively (and equivalently) we can differentiate the logarithm of the likelihood function. This is often an easier expression to differentiate and makes no difference since the value of \(\lambda\) which maximizes the likelihood function is the same as the value of \(\lambda\) that maximizes the log-likelihood function.

So, taking (natural) logarithms we have the following for the log-likelihood: \[\ln\left[ L\left( \lambda \right) \right] = l\left( \lambda \right) = n\ln\left( \lambda \right) - \lambda \sum\limits_{i = 1}^n x_i\]


\[\frac{dl\left( \lambda \right)}{d\lambda} = 0 \quad \Rightarrow \quad \frac{n}{\lambda} - \sum\limits_{i = 1}^n x_i = 0\]

\[\therefore \hat \lambda = \frac{n}{\sum\limits_{i = 1}^n x_i} = \frac{1}{\bar X}\]

So, the mle for the parameter of the negative exponential distribution is the reciprocal of the sample mean.
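As a sketch, we can check this result on simulated waiting times with a known rate (here \(\lambda = 2\), an arbitrary choice):

```r
set.seed(7)  # for reproducibility
lambda <- 2
x <- rexp(1e5, rate = lambda)  # 100,000 simulated exponential waiting times
1 / mean(x)  # the mle: reciprocal of the sample mean; close to the true lambda = 2
```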

Final remark

MLEs are generally the preferred method of estimation in statistics (at least classical or Frequentist statistics). There are several reasons for this which we won’t discuss here except to say that MLEs have a number of desirable properties not shared by other estimators. For example, the distribution of an MLE is asymptotically normal (i.e. tends to approach normality as the sample size increases). Another is that MLEs have an invariance property which means that if \({\hat \theta }\) is the mle for the parameter \(\theta\) then \(g\left( {\hat \theta } \right)\) is the mle for \(g\left( \theta \right)\) for some function \(g\left( \cdot \right)\) having a single-valued inverse.

So, for example, for the negative exponential distribution just examined, \(\hat \lambda = \frac{1}{\bar X}\) and so the mle for \(\frac{1}{\lambda}\) is simply \({\bar X}\).