# Likelihood and Estimation

*Professor David R. Fox*

*March 13, 2019*

# Introduction

**Statistical Inference** is concerned with two *inter-related* issues: 1) **estimation** (of an unknown parameter); and 2) **hypothesis testing**.

This note is concerned with the first of these. We’ll set the scene with a simple example.

### Example: Is the coin fair?

Suppose you are given a coin and all that you are told is that it is biased. That is, the chances of landing ‘heads’ or ‘tails’ are not the same. Your problem is to estimate the *true* probability of observing either a head or a tail.

A classically ‘Frequentist’ approach to this problem is to flip the coin a number of times and record the number of heads and tails. The *relative frequency* with which heads/tails is observed becomes the estimate of the *true* probability of heads/tails. This simple example succinctly illustrates the concept of a relative frequency as a surrogate for a probability (and hence the name *frequentist statistics*). Furthermore, intuition suggests that the *precision* of this estimate will increase as the number of tosses of the coin increases.

We can demonstrate this quite easily with a small simulation in `R`. There are many ways of doing this, but we’ll use basic functions rather than intrinsic `R` probability functions.

```
theta <- 0.36 # arbitrarily set the true probability of a head
N <- 1000     # set the total number of tosses
# tally the number of heads and divide by the number of tosses up to that point
phat <- cumsum(as.numeric(runif(N) <= theta)) / seq(1, N)
plot(seq(1, N), phat, ylim = c(0, 1), type = "l",
     xlab = "Number of tosses", ylab = "Rel. Freq.")
abline(h = theta, col = "red") # the true value, theta
```

As can be seen, the relative-frequency estimate improves the more times we toss the coin and, indeed, eventually converges to the correct value.

Thus, our intuitive estimator \(\hat p = \frac{\sum\limits_{i = 1}^{n} X_i}{n}\) (where \(X_i\) is 1 if the \(i^{th}\) toss lands heads and zero otherwise) for the *true* probability \(\theta\) seems to work satisfactorily. But is it the ‘*best*’ estimator for \(\theta\)?

Well, it depends on how you define *best*. Statisticians generally invoke two principles to identify a *best* estimator: 1) **unbiasedness**; *and* 2) **minimum variance**.

The first of these is used to narrow the search among all candidate estimators to only those that result in the *unbiased estimation* of the unknown parameter. By this we mean that, in the long run, the estimator will neither over-estimate nor under-estimate the true value. More formally we say the *expected value* of the estimator is the *true value*.

Be careful here: the term *expected value* in statistics has a precise mathematical definition which may not coincide with what we “expect”. The **expected value** (of a random variable) is its *mean* or *arithmetic average*.

For a **discrete random variable**, as is the case here, we write:

\[E\left[ X \right] = \sum\limits_i {{x_i}} \,P\left[ {X = {x_i}} \right]\]

where \(E\left[ \cdot \right]\) is the **expectation operator** and \(P\left[ {X = {x_i}} \right]\) is the **probability function** for *X*.

In the coin-tossing example, this probability function is simply: \[P\left[ X_i = x_i \right] = \begin{cases} \theta & \text{if}\quad x_i = 1\\ 1 - \theta & \text{if}\quad x_i = 0 \end{cases}\]

This can be written more compactly as: \[P\left[ X_i = x_i \right] = \theta^{x_i} \left( 1 - \theta \right)^{1 - x_i}; \qquad x_i \in \left\{ 0,1 \right\}\]

This is called the **Bernoulli distribution**.

So, what’s the *expected value* of this random variable? Well, according to the formula above it’s: \[E\left[ X_i \right] = \sum\limits_i x_i \, P\left[ X_i = x_i \right] = \sum\limits_{x \in \left\{ 0,1 \right\}} x \, \theta^x \left( 1 - \theta \right)^{1 - x}\]

\[\begin{aligned} &= 0 \cdot \theta^0 \left( 1 - \theta \right)^{1 - 0} + 1 \cdot \theta^1 \left( 1 - \theta \right)^{1 - 1}\\ &= 0 + \theta\\ &= \theta \end{aligned}\]
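This calculation is easy to check numerically in `R` (a quick sketch; the value of \(\theta\) below is arbitrary):

```r
# numerical check of the Bernoulli expectation E[X] = theta
theta <- 0.36                                  # arbitrary choice for illustration
x <- c(0, 1)                                   # the two possible outcomes
EX <- sum(x * theta^x * (1 - theta)^(1 - x))   # sum of x * P[X = x]
EX                                             # equals theta
```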

So now we can look at the expected value of our estimator \(\hat p\): \[E\left[ \hat p \right] = E\left[ \frac{\sum\limits_{i = 1}^{n} X_i}{n} \right]\] \[\begin{aligned} &= \frac{1}{n} E\left[ \sum\limits_{i = 1}^{n} X_i \right]\\ &= \frac{1}{n} \sum\limits_{i = 1}^{n} E\left[ X_i \right]\\ &= \frac{1}{n} \sum\limits_{i = 1}^{n} \theta\\ &= \frac{1}{n}\, n \cdot \theta\\ &= \theta \end{aligned}\]

Perfect! This says that \({\hat p}\) is an *unbiased estimator* of \(\theta\) – which confirms what the empirical plot above was telling us.
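We can also check unbiasedness empirically by simulating many repetitions of the whole experiment and averaging the resulting estimates (a sketch only; the values of \(\theta\), *n*, and the number of repetitions are arbitrary):

```r
# empirical check that the mean of many p-hats is close to theta
set.seed(1)                                       # for reproducibility
theta <- 0.36                                     # arbitrary true probability
n <- 20                                           # tosses per experiment
reps <- 10000                                     # number of repeated experiments
phat <- replicate(reps, mean(runif(n) <= theta))  # one p-hat per experiment
mean(phat)                                        # close to theta
```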

We’ll defer discussion about the *variance* properties of this estimator for the moment, and move on to the concept of **likelihood**.

# Likelihood

The probability \(P\left[ {{X_i} = {x_i}} \right]\) is for an *individual outcome*. What can we say about the **joint probability** for outcomes *i* and *j*?

Well, if (as we assume) the individual outcomes are **independent** (that is, the occurrence of a head/tail on the \(i^{th}\) toss in no way influences the occurrence of a head/tail on the \(j^{th}\) toss) then we can use the **multiplication rule** for probabilities to obtain the *joint probability* \(P\left[ X_i = x_i; X_j = x_j \right]\), where \(x_i \in \left\{ 0,1 \right\}\) and \(x_j \in \left\{ 0,1 \right\}\).

| Outcome of \(i^{th}\) toss, \(X_i\) | Outcome of \(j^{th}\) toss, \(X_j\) | \(Y = X_i + X_j\) | \(P\left( Y \right)\) |
|---|---|---|---|
| 0 | 0 | 0 | \(\left( 1 - \theta \right)^2\) |
| 1 | 0 | 1 | \(\theta \left( 1 - \theta \right)\) |
| 0 | 1 | 1 | \(\left( 1 - \theta \right) \theta\) |
| 1 | 1 | 2 | \(\theta^2\) |

#### Aside

You can verify that \(P\left( Y \right)\) is a true probability function by showing that the probabilities sum to one *viz*:

\[\begin{aligned} \left( 1 - \theta \right)^2 + \theta \left( 1 - \theta \right) + \left( 1 - \theta \right) \theta + \theta^2 &= \left( 1 - \theta \right)^2 + 2\theta \left( 1 - \theta \right) + \theta^2\\ &= \left\{ \left( 1 - \theta \right) + \theta \right\}^2\\ &= 1 \end{aligned}\]
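The same can be verified numerically for any particular value of \(\theta\) (taken arbitrarily as 0.36 below):

```r
# verify that the four joint probabilities in the table sum to one
theta <- 0.36                                    # arbitrary value for illustration
probs <- c((1 - theta)^2, theta * (1 - theta),   # the four joint probabilities
           (1 - theta) * theta, theta^2)         # from the table above
sum(probs)                                       # sums to 1
```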

Observe, that for this simple case corresponding to *n=2* we can write:

\[P\left( Y = y \right) = \binom{2}{y} \theta^y \left( 1 - \theta \right)^{2 - y}; \qquad y \in \left\{ 0,1,2 \right\}\]

This is a special case of the **binomial distribution** with *n=2*. More generally, we write \[Y \sim bin\left( n, \theta \right)\] which is read as *Y has a binomial distribution with parameters n and* \(\theta\), and the formula generalizes to: \[P\left( Y = y \right) = \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y}; \qquad y \in \left\{ 0,1,2, \ldots, n \right\}\]
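As a quick check, the general formula agrees with `R`’s built-in binomial probability function `dbinom()` (the values of *n* and \(\theta\) below are arbitrary):

```r
# compare the binomial formula with R's dbinom()
n <- 20; theta <- 0.36                                   # arbitrary illustration values
y <- 0:n                                                 # all possible outcomes
manual <- choose(n, y) * theta^y * (1 - theta)^(n - y)   # the formula above
all.equal(manual, dbinom(y, size = n, prob = theta))     # TRUE
```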

Now, let us return to our estimator \(\hat p = \frac{Y}{n}\) where \(Y = \sum\limits_{i = 1}^n {{X_i}}\).

We know that the probability model for *Y* is the *binomial distribution*: \[P\left( Y = y \right) = \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y}; \qquad y \in \left\{ 0,1,2, \ldots, n \right\}\] which, for a fixed *n*, is a function of the parameter \(\theta\).

Now, *given* an observation \(Y = y\) for a fixed *n*, we would like to know what’s the **most likely** value for \(\theta\). To make this more concrete, suppose *n=20* in the coin tossing example above. Let’s look at the outcomes of the 20 individual tosses and the sum, *Y* from the simulation:

`## Outcome for 20 tosses 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1`

`## Y = 7`

Thus, we have observed the event \(Y = 7\) with \(n = 20\) and so \(P\left( Y = 7 \right) = \binom{20}{7} \theta^7 \left( 1 - \theta \right)^{13}\).

This expression, now viewed as a function of the unknown \(\theta\) rather than of the data, is called the **likelihood function** for \(\theta\). Let’s compute and plot it for various values of \(\theta\):

```
# likelihood of theta given Y = 7 heads in n = 20 tosses
L <- function(theta) {
  return(choose(20, 7) * theta^7 * (1 - theta)^13)
}
p <- seq(0, 1, by = 0.005)
plot(p, L(p), type = "l", xlab = "theta", ylab = "Likelihood")
```

This plot has a distinct peak at 0.35. So, under these circumstances, our *best* estimate of \(\theta\) is 0.35 because it maximizes the *likelihood function*. In other words, the likelihood of the outcome *we actually did observe* is maximized for \(\theta = 0.35\). Not surprisingly, this is called the **maximum likelihood estimate** or *mle* of \(\theta\)!
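Rather than reading the maximum off the plot, we can also locate it numerically with `R`’s `optimize()` function (a sketch; the search interval and tolerance are arbitrary choices):

```r
# numerically maximise the likelihood of theta given Y = 7, n = 20
L <- function(theta) choose(20, 7) * theta^7 * (1 - theta)^13
opt <- optimize(L, interval = c(0, 1), maximum = TRUE, tol = 1e-8)
opt$maximum                                   # close to 7/20 = 0.35
```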

Just one final step – we’ve deduced the *mle* for the case *Y=7* and *n=20*, but what about more generally (i.e. any *n* and any *Y*)?

Well, we have the likelihood function: \[L\left( \theta \right) = \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y}\]

and we can simply differentiate this with respect to \(\theta\), set the result equal to zero, and solve for \(\theta\), that is we solve the equation \({\frac{{dL\left( \theta \right)}}{{d\theta }} = 0}\).

Now, \[\frac{d}{d\theta} \left\{ \binom{n}{y} \theta^y \left( 1 - \theta \right)^{n - y} \right\} = \binom{n}{y} \frac{\theta^{y - 1} \left( 1 - \theta \right)^{n - y} \left( y - n\theta \right)}{\left( 1 - \theta \right)}\]

and setting this equal to zero reduces to \(\theta^{y - 1} \left( 1 - \theta \right)^{n - y} \left( y - n\theta \right) = 0\).

There are three possible solutions to this expression: \(\theta = 0\); \(\left( 1 - \theta \right) = 0\); and \(y - n\theta = 0\). The first two are implausible because \(\theta = 0\) means that there is no chance of observing a head, while \(\left( 1 - \theta \right) = 0\) (i.e. \(\theta = 1\)) means that all we would ever observe is a head – both of which are refuted by our evidence (7 heads and 13 tails). The third alternative is \(y - n\theta = 0\), from which we obtain the general expression for the maximum likelihood estimator of a binomial proportion:

\[\hat \theta = \frac{Y}{n}\]

And with *n=20* and *Y=7* this gives \(\hat \theta = \frac{7}{{20}} = 0.35\).
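Applying this formula directly to the simulated tosses listed earlier gives the same answer:

```r
# the 20 outcomes shown above (1 = head, 0 = tail)
tosses <- c(1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1)
Y <- sum(tosses)          # 7
Y / length(tosses)        # 0.35
```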

# The continuous case

The *binomial distribution* considered above is an example of a **discrete probability distribution**. We next turn our attention to the estimation of a parameter of a **continuous distribution**. As before, we will do this in relation to a specific example.

### The (negative) exponential distribution

The negative exponential distribution is a common probability model associated with ‘waiting times’ – for example, time spent at a supermarket checkout. In industrial settings it often provides an appropriate description of the *time between failure* of equipment or processes.

The *probability density function* (pdf) for the negative exponential distribution is: \[f_X(x) = \lambda e^{-\lambda x}; \qquad x > 0; \quad \lambda > 0\]

As before, we consider a *random sample* \(\left\{ X_1, X_2, \ldots, X_n \right\}\) from this *pdf*. That is, the \(X_i\) are *independently* and *identically* distributed *(i.i.d.)* with *pdf* given by the expression above.

To obtain the *mle* for \(\lambda\) we first obtain the *likelihood function*: \[\begin{aligned} L\left( \lambda \right) &= \prod\limits_{i = 1}^{n} f_{X_i}\left( x_i \right)\\ &= \prod\limits_{i = 1}^{n} \lambda e^{-\lambda x_i}\\ &= \lambda^n e^{-\lambda \sum\limits_{i = 1}^{n} x_i} \end{aligned}\]

As before, the *mle* is found by differentiating the likelihood function, equating to zero and solving in terms of the parameter (\(\lambda\)). Alternatively (and equivalently) we can differentiate the *logarithm* of the likelihood function. This is often an easier expression to differentiate and makes no difference since the value of \(\lambda\) which maximizes the likelihood function is the same as the value of \(\lambda\) that maximizes the *log-likelihood* function.

So, taking (natural) logarithms we have the following for the *log-likelihood*: \[\ln\left[ L\left( \lambda \right) \right] = l\left( \lambda \right) = n \ln\left( \lambda \right) - \lambda \sum\limits_{i = 1}^{n} x_i\]

and

\[\frac{dl\left( \lambda \right)}{d\lambda} = 0 \;\Rightarrow\; \frac{n}{\lambda} - \sum\limits_{i = 1}^{n} x_i = 0\] \[\therefore \hat\lambda = \frac{n}{\sum\limits_{i = 1}^{n} x_i} = \frac{1}{\bar X}\]

So, the *mle* for the parameter of the negative exponential distribution is the reciprocal of the sample mean.
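We can verify this result numerically by simulating exponential data and maximizing the log-likelihood with `optimize()` (a sketch; the choices of \(\lambda\), sample size, and search interval are arbitrary):

```r
# numerical check that the exponential mle equals 1/mean(x)
set.seed(1)
lambda <- 2                                   # arbitrary true rate
x <- rexp(500, rate = lambda)                 # simulated exponential sample
loglik <- function(l) length(x) * log(l) - l * sum(x)  # l(lambda) from above
opt <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE, tol = 1e-8)
c(numerical = opt$maximum, closed_form = 1 / mean(x))  # the two agree
```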

#### Final remark

*MLEs* are generally the preferred method of estimation in statistics (at least *classical* or *Frequentist* statistics). There are several reasons for this which we won’t discuss here, except to say that *MLEs* have a number of desirable properties not shared by other estimators. For example, the distribution of an *MLE* is asymptotically normal (i.e. it approaches normality as the sample size increases). Another is that *MLEs* have an **invariance property**: if \({\hat \theta }\) is the *mle* for the parameter \(\theta\), then \(g\left( {\hat \theta } \right)\) is the *mle* for \(g\left( \theta \right)\) for any function \(g\left( \cdot \right)\) having a *single-valued inverse*.

So, for example, for the negative exponential distribution just examined, \(\hat \lambda = \frac{1}{\bar X}\), and so the *mle* for \(\frac{1}{\lambda }\) is simply \({\bar X}\).