Introduction to Matrices

Introduction

A matrix is just a two-dimensional array (usually of numbers). For example:

\[A = \left( {\begin{array}{*{20}{c}} 2&4\\ { - 1}&3 \end{array}} \right)\]

The dimensions of an array are always specified as number of rows x number of columns. So, the matrix A above is a 2 x 2 matrix and matrix B below is a 2 x 3 matrix.

\[B = \left( {\begin{array}{*{20}{c}} 1&0&{ - 1}\\ 0&2&3 \end{array}} \right)\]

If either the row dimension or the column dimension is one, then it is customary to refer to the matrix as a row vector or a column vector, respectively. Thus, matrix B consists of the 2 row vectors \({b_1}\) and \({b_2}\): \({b_1} = \left( {\begin{array}{*{20}{c}}1&0&{ - 1}\end{array}} \right)\) and \({b_2} = \left( {\begin{array}{*{20}{c}}0&2&3\end{array}} \right)\)
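If you want to follow along in R (which we use later in this introduction), these matrices are easy to build with the matrix() function; a minimal sketch:

A <- matrix(c(2, 4, -1, 3), nrow = 2, byrow = TRUE)        #  the 2 x 2 matrix A above
B <- matrix(c(1, 0, -1, 0, 2, 3), nrow = 2, byrow = TRUE)  #  the 2 x 3 matrix B
dim(B)    #  2 3  --  rows then columns
B[1, ]    #  the row vector b1:  1  0 -1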

Matrix arithmetic

The usual mathematical operations of addition, subtraction, multiplication and ‘division’ apply to matrices – although not always in entirely obvious ways.

Matrix addition and subtraction

Two matrices A and B can only be added or subtracted if they have matching dimensions, in which case we say they are conformable for addition/subtraction.

Matrix addition and subtraction are elementwise operations. That is, the result of adding or subtracting two matrices is obtained by simply adding or subtracting the corresponding elements in each matrix. Thus:

\[\left( {\begin{array}{*{20}{c}} 2&4\\ { - 1}&3 \end{array}} \right) + \left( {\begin{array}{*{20}{c}} 4&{ - 2}\\ 3&1 \end{array}} \right) = \left( {\begin{array}{*{20}{c}} 6&2\\ 2&4 \end{array}} \right)\]

and

\[\left( {\begin{array}{*{20}{c}} 2&4\\ { - 1}&3 \end{array}} \right) - \left( {\begin{array}{*{20}{c}} 4&{ - 2}\\ 3&1 \end{array}} \right) = \left( {\begin{array}{*{20}{c}} { - 2}&6\\ { - 4}&2 \end{array}} \right)\]
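In R, the + and - operators applied to two matrices of the same dimensions perform exactly this elementwise arithmetic; the examples above can be checked with:

A <- matrix(c(2, 4, -1, 3), nrow = 2, byrow = TRUE)
B <- matrix(c(4, -2, 3, 1), nrow = 2, byrow = TRUE)
A + B    #  elementwise sum:         6 2 /  2 4
A - B    #  elementwise difference: -2 6 / -4 2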

Matrix multiplication

Multiplication for matrices is not an elementwise operation. Firstly, two matrices A and B are conformable for multiplication only if the number of columns of the matrix on the left is equal to the number of rows of the matrix on the right. The other dimensions (ie. the number of rows of the matrix on the left and the number of columns of the matrix on the right) can be anything. Thus, an \((m{\rm{ x }}k)\) matrix can only be multiplied with a \((k{\rm{ x }}s)\) matrix, and the result is an \((m{\rm{ x }}s)\) matrix.

\[\left( {\begin{array}{*{20}{c}} {{a_{11}}}& \ldots &{{a_{1n}}}\\ \vdots & \ddots & \vdots \\ {{a_{m1}}}& \cdots &{{a_{mn}}} \end{array}} \right){\rm{ x }}\left( {\begin{array}{*{20}{c}} {{b_{11}}}& \ldots &{{b_{1s}}}\\ \vdots & \ddots & \vdots \\ {{b_{n1}}}& \cdots &{{b_{ns}}} \end{array}} \right){\rm{ = }}\left( {\begin{array}{*{20}{c}} {\sum\limits_{i = 1}^n {{a_{1i}}{b_{i1}}} }& \ldots &{\sum\limits_{i = 1}^n {{a_{1i}}{b_{is}}} }\\ \vdots & \ddots & \vdots \\ {\sum\limits_{i = 1}^n {{a_{mi}}{b_{i1}}} }& \cdots &{\sum\limits_{i = 1}^n {{a_{mi}}{b_{is}}} } \end{array}} \right)\]

The easiest way to remember this is that the element appearing in the \({i^{th}}\) row and \({j^{th}}\) column of the product is obtained by multiplying the elements in the \({i^{th}}\) row of the matrix on the left with the corresponding elements in the \({j^{th}}\) column of the matrix on the right and summing. For example, in the product below, the element in the first row and first column is obtained from the first row of the left matrix and the first column of the right matrix: \(2 \cdot 1 + ( - 1) \cdot 2 = 0\).

Example

\[\left( {\begin{array}{*{20}{c}} 2&{ - 1}\\ 3&1 \end{array}} \right){\rm{ x }}\left( {\begin{array}{*{20}{c}} 1&0&{ - 1}\\ 2&{ - 1}&3 \end{array}} \right){\rm{ = }}\left( {\begin{array}{*{20}{c}} 0&1&{ - 5}\\ 5&{ - 1}&0 \end{array}} \right)\]
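In R, matrix multiplication is performed with the %*% operator (the ordinary * operator is elementwise and is not what we want here); reproducing the example above:

A <- matrix(c(2, -1, 3, 1), nrow = 2, byrow = TRUE)
B <- matrix(c(1, 0, -1, 2, -1, 3), nrow = 2, byrow = TRUE)
A %*% B    #  the 2 x 3 product:  0 1 -5 / 5 -1 0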

At this point you may be wondering how we ‘divide’ matrices. The answer is – we don’t. But there is an operation that effectively achieves the equivalent of division in normal arithmetic. Before discussing this, we need to introduce some special matrices and related concepts.

Special matrices and operations

Multiplication by a scalar

Multiplication of a matrix A by a scalar (ie. a single number) c is straightforward and is performed by simply multiplying each of the individual elements of the matrix by c. Thus:

\[c \cdot \left( {\begin{array}{*{20}{c}} {{a_{11}}}&{{a_{12}}}&{{a_{13}}}\\ {{a_{21}}}&{{a_{22}}}&{{a_{23}}}\\ {{a_{31}}}&{{a_{32}}}&{{a_{33}}} \end{array}} \right) = \left( {\begin{array}{*{20}{c}} {c \cdot {a_{11}}}&{c \cdot {a_{12}}}&{c \cdot {a_{13}}}\\ {c \cdot {a_{21}}}&{c \cdot {a_{22}}}&{c \cdot {a_{23}}}\\ {c \cdot {a_{31}}}&{c \cdot {a_{32}}}&{c \cdot {a_{33}}} \end{array}} \right)\]
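In R, multiplying a matrix by a single number does exactly this (the scalar is applied to every element):

A <- matrix(c(2, 4, -1, 3), nrow = 2, byrow = TRUE)
3 * A    #  every element multiplied by 3:  6 12 / -3 9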

Matrix equivalence

Two matrices A and B are equal if and only if: (i) they have identical dimensions and (ii) every element in B is the same as the corresponding element in A.

Matrix transpose

The transpose of the \(\left( {m{\rm{ x }}k} \right)\) matrix A is the \(\left( {k{\rm{ x }}m} \right)\) matrix obtained by making the rows (columns) of A the columns (rows) of the new matrix. We denote the transposed version of A as either \(A'\) or \({A^T}\). For example, if \(A = \left( {\begin{array}{*{20}{c}}2&{ - 1}\\3&1\end{array}} \right)\) and \(B = \left( {\begin{array}{*{20}{c}}1&0&{ - 1}\\2&{ - 1}&3\end{array}} \right)\) then \({A^T} = \left({\begin{array}{*{20}{c}}2&3\\{ - 1}&1\end{array}} \right)\) and \({B^T} = \left( {\begin{array}{*{20}{c}}1&2\\0&{ - 1}\\{ - 1}&3\end{array}} \right)\).

Furthermore, if we look at the product \(A \cdot B = \left( {\begin{array}{*{20}{c}}0&1&{ - 5}\\5&{ - 1}&0\end{array}} \right)\) and take its transpose we have:

\[{\left( {A \cdot B} \right)^T} = \left( {\begin{array}{*{20}{c}} 0&5\\ 1&{ - 1}\\ { - 5}&0 \end{array}} \right)\]

and it is easily verified that this is equivalent to \({B^T} \cdot {A^T}\).

While this doesn’t constitute a proof, the result does hold in general. Thus:

\[{\left( {A \cdot B} \right)^T} = {B^T} \cdot {A^T}\]
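The t() function in R returns the transpose, which makes the reversal rule easy to check numerically for the matrices A and B used in the example above:

A <- matrix(c(2, -1, 3, 1), nrow = 2, byrow = TRUE)
B <- matrix(c(1, 0, -1, 2, -1, 3), nrow = 2, byrow = TRUE)
t(A %*% B)                              #  transpose of the product
t(B) %*% t(A)                           #  same matrix
all.equal(t(A %*% B), t(B) %*% t(A))    #  TRUE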

Symmetric matrices

A symmetric matrix is one whose transpose equals itself. That is, matrix A is symmetric if:

\[{A^T} = A\]

By the definition of matrix equality (as stated above), A and its transpose must have the same dimensions, and this can only happen if A is a square matrix. That is, A has the same number of rows and columns, and the off-diagonal elements must match (the element in row i, column j equals the element in row j, column i). For example:

\[A = \left( {\begin{array}{*{20}{c}} 1&{ - 1}&4\\ { - 1}&2&5\\ 4&5&3 \end{array}} \right)\]

is a symmetric matrix.
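In R, you can check symmetry either with the base function isSymmetric() or by comparing a matrix to its transpose directly:

A <- matrix(c(1, -1, 4, -1, 2, 5, 4, 5, 3), nrow = 3, byrow = TRUE)
isSymmetric(A)    #  TRUE
all(t(A) == A)    #  TRUE -- A equals its transpose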

Diagonal matrices

A diagonal matrix is a square matrix whose off-diagonal entries are all zero. For example:

\[A = \left( {\begin{array}{*{20}{c}} 2&0&0\\ 0&3&0\\ 0&0&{ - 2} \end{array}} \right)\]

is diagonal.

The identity matrix

A very special diagonal matrix is the identity matrix. This is simply a diagonal matrix whose diagonal entries are all unity (ie. 1). For example, the 3 x 3 identity matrix is:

\[{I_3} = \left( {\begin{array}{*{20}{c}} 1&0&0\\ 0&1&0\\ 0&0&1 \end{array}} \right)\]
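In R, diag(n) creates the n x n identity matrix, and diag() with a vector argument creates a general diagonal matrix:

diag(3)            #  the 3 x 3 identity matrix
diag(c(2, 3, -2))  #  the diagonal matrix from the previous example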

The null matrix

A null matrix is any matrix whose elements are all zero.

Similarities and differences with ordinary arithmetic

The null and identity matrices work like the numbers zero and one in ordinary arithmetic. This is easy to see. For example, in ordinary arithmetic \(a - a = 0\).

The matrix equivalent is \(A - A\) which of course is a matrix of zeros – a null matrix.

In ordinary arithmetic, a number multiplied by 1 is unchanged. In matrix algebra, a matrix multiplied by a conformable identity matrix is similarly left unchanged. That is, \(A \cdot I = A\).

Furthermore, if A is a square matrix, then \(A \cdot I = I \cdot A = A\).

Some of the rules of ordinary arithmetic carry over to matrices, others do not. For example, matrix addition and subtraction are commutative (\(A \pm B = B \pm A\)) whereas, in general, matrix multiplication is not (\(A \cdot B \ne B \cdot A\)). Matrix addition and multiplication are, however, associative. That is:

\[A + \left( {B + C} \right) = \left( {A + B} \right) + C\]

and

\[A \cdot \left( {B \cdot C} \right) = \left( {A \cdot B} \right) \cdot C\]
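A quick numerical illustration in R (using two arbitrary 2 x 2 matrices) shows that the two products generally differ while the associative law still holds:

A <- matrix(c(2, -1, 3, 1), nrow = 2, byrow = TRUE)
B <- matrix(c(4, -2, 3, 1), nrow = 2, byrow = TRUE)
A %*% B                                       #  not the same as ...
B %*% A                                       #  ... this, so A.B != B.A in general
all.equal(A %*% (B %*% B), (A %*% B) %*% B)   #  TRUE -- associativity holds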

Powers of a matrix

In ordinary arithmetic,

\[{a^m} = \underbrace {a \cdot a \cdot a \cdot \ldots \cdot a}_{m{\rm{ times}}}\]

for integer m.

The same (almost) applies for matrices:

\[{A^m} = \underbrace {A \cdot A \cdot A \cdot \ldots \cdot A}_{m{\rm{ times}}}\]

but this can only work if A is square. For example: if

\[A = \left( {\begin{array}{*{20}{c}} 2&{ - 1}\\ 3&1 \end{array}} \right)\]

Then:

\[{A^2} = \left( {\begin{array}{*{20}{c}} 1&{ - 3}\\ 9&{ - 2} \end{array}} \right)\]

\[{A^3} = \left( {\begin{array}{*{20}{c}} { - 7}&{ - 4}\\ {12}&{ - 11} \end{array}} \right)\]

and so on.
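Base R has no built-in operator for integer matrix powers, so the simplest approach is repeated %*% multiplication (the expm package, if installed, also provides a %^% operator for this); reproducing the example:

A <- matrix(c(2, -1, 3, 1), nrow = 2, byrow = TRUE)
A %*% A          #  A squared:  1 -3 /  9  -2
A %*% A %*% A    #  A cubed:   -7 -4 / 12 -11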

You might be wondering if this works for negative exponents (eg: \({A^{ - 2}}\)) as well as fractional exponents (eg: \({A^{ - 0.5}}\)). It turns out it does (\({A^{ - 2}} = \frac{1}{{25}}\left( {\begin{array}{*{20}{c}}{ - 2}&3\\{ - 9}&1\end{array}} \right)\) and \({A^{ - 0.5}} = \left( {\begin{array}{*{20}{c}}{0.5294321}&{0.1636035}\\{ - 0.4908105}&{0.6930356}\end{array}} \right)\)) but that’s beyond the scope of this introduction.

Matrix inverse

We said earlier that matrix ‘division’ is undefined. That’s not strictly true, it’s just that the equivalent process in matrix algebra is not called division – it’s called finding the inverse of a matrix.

The concept is easily understood if we again appeal to the process of division for ordinary numbers.

It is self-evident that for any non-zero number a, \(\frac{a}{a} = 1\). Another way of writing this is \(a \cdot \left( {\frac{1}{a}} \right) = 1\) or \(a \cdot {a^{ - 1}} = 1\).

In other words, dividing by the number a is the same as multiplying by its inverse, \({a^{ - 1}}\).

Extending this concept to matrices, the equivalent of ‘dividing’ by a matrix A would be to multiply by its inverse, \({A^{ - 1}}\). Trouble is, finding the inverse is not straightforward. For example, in ordinary arithmetic the inverse of a number c is its reciprocal \(\frac{1}{c}\). So, you might be tempted to think that we could find the inverse of a matrix by taking the reciprocals of all its elements. Alas, if only things were that simple! You can prove in an instant that this doesn’t work (except in a very special case) by trying it with any arbitrary matrix. For example:

\[A = \left( {\begin{array}{*{20}{c}} 2&1\\ 4&5 \end{array}} \right)\]

We’ll take the reciprocal of all the elements of A and call this matrix B. Thus:

\[B = \left( {\begin{array}{*{20}{c}} {0.5}&1\\ {0.25}&{0.2} \end{array}} \right)\]

and if B were the inverse of A then we would have \(A \cdot B = I\). A quick calculation reveals this is not the case, with \[A \cdot B = \left( {\begin{array}{*{20}{c}} {1.25}&{2.2}\\ {3.25}&{5.0} \end{array}} \right)\]

and

\[B \cdot A = \left( {\begin{array}{*{20}{c}} {5.0}&{5.5}\\ {1.3}&{1.25} \end{array}} \right)\]

Inverse of a 2 x 2 matrix

Computing matrix inverses is hard and best left to a computer and good software like R. However, we’ll show you how to manually compute the inverse of a 2 x 2 matrix.

Let \(A = \left( {\begin{array}{*{20}{c}}a&b\\c&d\end{array}} \right)\). The first thing we need to do is to compute the determinant of A. Without going into details, the determinant of our 2 x 2 matrix A (written \(\left| A \right|\)) is:

\[\det (A) = \left| A \right| = a \cdot d - b \cdot c\]

Now, the inverse of A will exist only if \(\left| A \right| \ne 0\) and is given by:

\[{A^{ - 1}} = \frac{1}{{\left| A \right|}}\left( {\begin{array}{*{20}{c}} d&{ - b}\\ { - c}&a \end{array}} \right)\]

You can easily prove this is correct by working out \(A \cdot {A^{ – 1}}\).

\[A \cdot {A^{ - 1}} = \left( {\begin{array}{*{20}{c}} a&b\\ c&d \end{array}} \right) \cdot \frac{1}{{\left| A \right|}}\left( {\begin{array}{*{20}{c}} d&{ - b}\\ { - c}&a \end{array}} \right)\]

\[ = \frac{1}{{\left| A \right|}} \cdot \left( {\begin{array}{*{20}{c}} a&b\\ c&d \end{array}} \right) \cdot \left( {\begin{array}{*{20}{c}} d&{ - b}\\ { - c}&a \end{array}} \right)\]

\[ = \frac{1}{{\left| A \right|}} \cdot \left( {\begin{array}{*{20}{c}} {a \cdot d - b \cdot c}&{ - a \cdot b + b \cdot a}\\ {d \cdot c - d \cdot c}&{a \cdot d - b \cdot c} \end{array}} \right)\]

\[ = \frac{1}{{\left( {a \cdot d - b \cdot c} \right)}} \cdot \left( {\begin{array}{*{20}{c}} {a \cdot d - b \cdot c}&0\\ 0&{a \cdot d - b \cdot c} \end{array}} \right)\]

\[ = \left( {\begin{array}{*{20}{c}} 1&0\\ 0&1 \end{array}} \right)\]

So it works!
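As a sanity check, the 2 x 2 formula is easy to wrap in a small, purely illustrative R function (inv2x2 is our own name, not a built-in) and compare against R's solve():

inv2x2 <- function(M) {
  detM <- M[1, 1] * M[2, 2] - M[1, 2] * M[2, 1]    #  ad - bc
  if (detM == 0) stop("determinant is zero -- no inverse exists")
  matrix(c(M[2, 2], -M[1, 2], -M[2, 1], M[1, 1]), nrow = 2, byrow = TRUE) / detM
}
A <- matrix(c(2, 1, 4, 5), nrow = 2, byrow = TRUE)
inv2x2(A)    #  (1/6) * ( 5 -1 / -4 2 )
solve(A)     #  the same matrix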

Inverse of any matrix

It’s tedious to compute matrix inverses by hand for anything larger than a 2 x 2 or, at most, a 3 x 3 matrix. Computers are far more adept at this task and for us, this means using R.

Let’s see how to do some basic matrix arithmetic in R.

A<-matrix(c(2,1,3,-1,4,-2,3,0,6),nrow=3,byrow=TRUE)
A
##      [,1] [,2] [,3]
## [1,]    2    1    3
## [2,]   -1    4   -2
## [3,]    3    0    6
###  Matrix transpose  ###
t(A)
##      [,1] [,2] [,3]
## [1,]    2   -1    3
## [2,]    1    4    0
## [3,]    3   -2    6
###  Matrix inverse  ###
solve(A)
##      [,1]  [,2]        [,3]
## [1,]    2 -0.50 -1.16666667
## [2,]    0  0.25  0.08333333
## [3,]   -1  0.25  0.75000000
### matrix multiplication  ###
A %*% solve(A)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
###  Square of a matrix  ###
A %*% A
##      [,1] [,2] [,3]
## [1,]   12    6   22
## [2,]  -12   15  -23
## [3,]   24    3   45
###  Reciprocal of the elements of a matrix  ###
1/A
##            [,1] [,2]       [,3]
## [1,]  0.5000000 1.00  0.3333333
## [2,] -1.0000000 0.25 -0.5000000
## [3,]  0.3333333  Inf  0.1666667
###  Raise each element to a power  ###
A^2
##      [,1] [,2] [,3]
## [1,]    4    1    9
## [2,]    1   16    4
## [3,]    9    0   36

Inverse of a diagonal matrix

We noted earlier that taking the reciprocal of the elements of a matrix does not equate to finding its inverse, with one exception: when the matrix is diagonal. In that case the inverse is obtained by taking the reciprocal of each diagonal element while the off-diagonal entries remain zero. That is, the inverse of

\[A = \left( {\begin{array}{*{20}{c}} {{a_{11}}}& \ldots &0\\ \vdots & \ddots & \vdots \\ 0& \cdots &{{a_{nn}}} \end{array}} \right)\]

is

\[{A^{ - 1}} = \left( {\begin{array}{*{20}{c}} {\frac{1}{{{a_{11}}}}}& \ldots &0\\ \vdots & \ddots & \vdots \\ 0& \cdots &{\frac{1}{{{a_{nn}}}}} \end{array}} \right)\]
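This is easily verified in R with diag() (using the diagonal matrix from the earlier example):

D <- diag(c(2, 3, -2))
solve(D)                 #  the inverse computed by R ...
diag(1 / c(2, 3, -2))    #  ... equals the diagonal matrix of reciprocals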

Systems of linear equations

Matrix algebra is an extremely compact and efficient way of doing linear algebra. Consider the pair of simultaneous equations:

\[\begin{array}{l} 2x + 3y = 1\\ 3x - 2y = 8 \end{array}\]

Now, we all know how to solve this: use one of the equations to obtain an expression for either x or y in terms of the other, substitute this expression into the other equation to give one equation in one unknown, and then solve for that unknown. So, for example, from the first equation we have \(x = \frac{1}{2}\left( {1 - 3y} \right)\) and if we substitute this into the second equation we have \[\frac{3}{2}\left( {1 - 3y} \right) - 2y = 8\] or \[3\left( {1 - 3y} \right) - 4y = 16\]

\[\begin{array}{l} \Rightarrow - 13y = 13\\ \therefore y = - 1 \end{array}\]

Substituting \(y = - 1\) into either equation allows us to solve for x (\(x = 2\)).

It’s easy to set this up and solve using matrices. Thus

\[\begin{array}{l} 2x + 3y = 1\\ 3x - 2y = 8 \end{array}\]

can be written in the form \(A \cdot X = b\) where \[A = \left( {\begin{array}{*{20}{c}} 2&3\\ 3&{ - 2} \end{array}} \right)\] \[X = \left( {\begin{array}{*{20}{c}} x\\ y \end{array}} \right)\] and \[b = \left( {\begin{array}{*{20}{c}} 1\\ 8 \end{array}} \right)\].

To solve for X we pre-multiply both sides by A inverse, giving:

\[{A^{ - 1}} \cdot A \cdot X = {A^{ - 1}} \cdot b\] and we immediately have that

\[X = {A^{ - 1}} \cdot b\].

This is a one-line calculation in R:

A<-matrix(c(2,3,3,-2),nrow=2,byrow=TRUE)  #  matrix of coefficients
b<-c(1,8)  #  vector of rhs values
solve(A) %*% b  # solution
##      [,1]
## [1,]    2
## [2,]   -1

Ordinary Least Squares (OLS)

For better or worse, the real world is imperfect. Even when we know or expect a linear relationship between two variables y and x, a plot of empirical data invariably shows scatter, with the points failing to lie on a perfect straight line.

So how do we determine the equation of the ‘best’ fitting line in such circumstances? We could do this by visual inspection and a ruler, but that approach is incredibly rudimentary and suffers serious drawbacks: it is subjective, it is not reproducible, and it provides no statement of precision or uncertainty in the estimated coefficients.

You no doubt are familiar with the statistical concept of linear regression. There are whole texts devoted to this topic, so our treatment here will be brief.

The statistical model for the simple (y versus x) regression model is:

\[{Y_i} = {\beta _0} + {\beta _1}{x_i} + {\varepsilon _i}\]

where \({\beta _0}\) and \({\beta _1}\) are the true but unknown regression parameters, \({x_i}\) is a value (assumed measured without error) of the covariate; and \({\varepsilon _i}\) is a random error associated with the \({i^{th}}\) value of the dependent variable \({Y_i}\).

Furthermore, it is customary to assume that the error terms are normally distributed with zero mean and constant variance. That is: \[{\varepsilon _i} \sim N(0,\sigma _\varepsilon ^2)\]

Returning to the question of “how do we identify the best fitting line”, we first need a criterion by which we can assess the quality of any fitted line. You can probably think of many ways of doing this, but the universally accepted method deems ‘best’ to be the regression equation that minimizes the sum of the squared vertical deviations of the data points from the fitted line. This is known as the least squares criterion.

A quick search on the web or a look at an introductory statistics text will probably give the following computational formulae for the simple linear regression model:

\[{\hat \beta _1} = \frac{{\sum\limits_{i = 1}^n {\left( {{x_i} - \bar x} \right)} \left( {{y_i} - \bar y} \right)}}{{\sum\limits_{i = 1}^n {{{\left( {{x_i} - \bar x} \right)}^2}} }}\]

and

\[{{\hat \beta }_0} = \bar y - {{\hat \beta }_1}\bar x\]
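These formulae translate directly into R; a minimal sketch with some made-up data (the x and y vectors below are purely illustrative), with lm() used as a check:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)    #  made-up data for illustration only
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)          #  manual OLS estimates of the intercept and slope
coef(lm(y ~ x))    #  lm() returns the same values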

These formulae are fine and fairly easy to use. However, if we wish to fit more complex models (for example multiple linear regression models involving k predictors \({x_1},{x_2}, \ldots ,{x_k}\)) then this approach becomes laborious.

At this point it is far more convenient to re-cast the problem in matrix notation.

OLS in matrix notation

Our multiple linear regression model:

\[{Y_i} = {\beta _0} + {\beta _1}{x_{1i}} + {\beta _2}{x_{2i}} + \ldots + {\beta _k}{x_{ki}} + {\varepsilon _i}\]

can be expressed in matrix notation as:

\[\underbrace {\left( {\begin{array}{*{20}{c}} {{y_1}}\\ {{y_2}}\\ \vdots \\ {{y_n}} \end{array}} \right)}_{(n{\rm{ x }}1)} = \underbrace {\left( {\begin{array}{*{20}{c}} 1&{{x_{11}}}&{{x_{21}}}& \cdots &{{x_{k1}}}\\ 1&{{x_{12}}}&{{x_{22}}}& \cdots &{{x_{k2}}}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1&{{x_{1n}}}&{{x_{2n}}}& \cdots &{{x_{kn}}} \end{array}} \right)}_{(n{\rm{ x }}(k + 1))}\underbrace {\left( {\begin{array}{*{20}{c}} {{\beta _0}}\\ {{\beta _1}}\\ \vdots \\ {{\beta _k}} \end{array}} \right)}_{((k + 1){\rm{ x }}1)} + \underbrace {\left( {\begin{array}{*{20}{c}} {{\varepsilon _1}}\\ {{\varepsilon _2}}\\ \vdots \\ {{\varepsilon _n}} \end{array}} \right)}_{(n{\rm{ x }}1)}\]

or simply as:

\[Y = XB + E\]

where

\[Y = \left( {\begin{array}{*{20}{c}} {{y_1}}\\ {{y_2}}\\ \vdots \\ {{y_n}} \end{array}} \right)\]

\[X = \left( {\begin{array}{*{20}{c}} 1&{{x_{11}}}&{{x_{21}}}& \cdots &{{x_{k1}}}\\ 1&{{x_{12}}}&{{x_{22}}}& \cdots &{{x_{k2}}}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1&{{x_{1n}}}&{{x_{2n}}}& \cdots &{{x_{kn}}} \end{array}} \right)\]

\[B = \left( {\begin{array}{*{20}{c}} {{\beta _0}}\\ {{\beta _1}}\\ \vdots \\ {{\beta _k}} \end{array}} \right)\] \[E = \left( {\begin{array}{*{20}{c}} {{\varepsilon _1}}\\ {{\varepsilon _2}}\\ \vdots \\ {{\varepsilon _n}} \end{array}} \right)\]

Now, the OLS criterion can be expressed mathematically as: \({\rm{minimize }}\sum\limits_{i = 1}^n {{{\left( {{y_i} - {{\hat y}_i}} \right)}^2}}\) where \({{\hat y}_i}\) is a predicted value of Y at \(\left\{ {{x_{1i}},{x_{2i}}, \ldots ,{x_{ki}}} \right\}\) (ie. \({{\hat y}_i} = {{\hat \beta }_0} + {{\hat \beta }_1}{x_{1i}} + {{\hat \beta }_2}{x_{2i}} + \ldots + {{\hat \beta }_k}{x_{ki}}\)).

In matrix notation, this is: \[{\rm{minimize }}\left\{ {{{\left( {Y - \hat Y} \right)}^T}\left( {Y - \hat Y} \right)} \right\}\] or \[{\rm{minimize }}\left\{ {{{\left( {Y - X\hat B} \right)}^T}\left( {Y - X\hat B} \right)} \right\}\]

If we multiply out the last expression we obtain: \[{\rm{minimize }}\left\{ {{Y^T}Y - 2{{\hat B}^T}{X^T}Y + {{\hat B}^T}{X^T}X\hat B} \right\}\]

Although beyond the scope of this introduction, just like mathematical optimisation using scalars, we solve this minimisation by differentiating with respect to \({\hat B}\), setting the result to zero, and solving for \({\hat B}\).

Differentiating with respect to \({\hat B}\) and equating to zero gives us:

\[ - 2{X^T}Y + 2{X^T}X\hat B = 0\]

from which we can easily solve for \({\hat B}\):

\[\hat B = {\left( {{X^T}X} \right)^{ - 1}}{X^T}Y\]
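This matrix formula carries over to R almost verbatim; continuing with the made-up x and y from the earlier sketch, the design matrix X is simply a column of ones bound to the column of x values:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)    #  same illustrative data as before
X <- cbind(1, x)                    #  design matrix: intercept column plus x
Bhat <- solve(t(X) %*% X) %*% t(X) %*% y
Bhat                                #  matches coef(lm(y ~ x))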

This is just the start – we then go on to make inferences about the regression parameters \({\hat B}\) and the fitted model. Although this is where our introduction finishes, a very important result that allows this inference to proceed is:

\[Cov\left( {\hat B} \right) = {\left( {{X^T}X} \right)^{ - 1}}\,\sigma _\varepsilon ^2\]
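In practice \(\sigma _\varepsilon ^2\) is unknown and is estimated from the residuals; a hedged sketch of how this covariance matrix is typically estimated in R (continuing the previous snippet, with p = k + 1 estimated coefficients):

resid <- y - X %*% Bhat             #  residuals from the fitted model
n <- length(y); p <- ncol(X)        #  p = k + 1 coefficients
sigma2 <- sum(resid^2) / (n - p)    #  usual unbiased estimate of the error variance
solve(t(X) %*% X) * sigma2          #  estimated Cov(Bhat); compare with vcov(lm(y ~ x))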