This teach file extends FCS1 and FCS2 by giving a basic introduction to the use of matrices.
In the first two teach files, every variable that was used stood for a single real number. This included subscripted variables: something like x[i] meant the i'th value in a collection, not the whole collection. This made it possible to interpret every equation according to the normal rules of arithmetic: addition, multiplication, differentiation and so on all had their ordinary elementary meanings, even if the context in which they were used involved, in principle, large numbers of variables.
Sometimes it is useful to be able to manipulate symbols that stand for structured collections of numbers. Matrices are a case of this, and their use is common in a wide variety of fields. Here, it will be illustrated by referring back to the work on neural networks in FCS2.
Discussion of this topic in textbooks crosses the A-level/degree level boundary. Books such as that by Boas (see FCS1) give a discussion, and also contain much background and related material - not all necessarily relevant at present. An outstanding reference book is "Matrix Analysis", by R.A. Horn and C.R. Johnson (Cambridge University Press 1985, later reprints), but this gives an advanced treatment that will only be useful if you already have a fairly mathematical background.
Matrix manipulation is, par excellence, an area in which high-quality software libraries and packages have liberated ordinary users from the need to be very familiar with a lot of algorithmic detail. Thus you will find books such as "Numerical Recipes in C", by W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling (Cambridge University Press 1988, later reprints) contain summary discussions and much practical information. The Matlab package, although wider in scope now than matrix manipulation, has particular strength in this area.
The remainder of this file summarises some of the central ideas. As usual, it does not purport to replace the examples and discussion offered by textbooks.
One notation point: because this HTML file is not typeset, I will use "*" to mean multiplication (of various sorts). In textbooks this symbol is simply omitted, so that xy is written rather than x * y. If I did that here, the symbols would sometimes be hard to separate.
Section 2.1 of FCS2 introduced a linear neural network unit with output y, inputs x[1] ... x[N] and weights w[1] ... w[N]. Since its output was simply equal to its activation, it computed the function
           N
    y  =  SIGMA  w[i] * x[i]
          i = 1
Section 4.1 introduced the idea of a layer of units, in which the units were distinguished by adding an extra subscript to each y and w (but not to x because the inputs are the same for each unit). Thus, if the units in the layer are linear, this equation becomes
              N
    y[j]  =  SIGMA  w[j,i] * x[i]
             i = 1
where j just specifies which unit we are talking about.
This can be rewritten as a matrix-vector multiplication. What is done is to represent each set of quantities by a single symbol; each such collection is called a matrix or a vector. If the individual quantities have two subscripts, then their collection is a matrix; if the individual quantities have one subscript, then they form a vector. For example, the bold character y stands for y[1], y[2] and so on, one element for each unit, all put together into a single vector. Likewise w stands for all the weights put together into a matrix and x for all the inputs put together into a vector.
It is important to realise that w, which stands for a matrix, is a completely different kind of object to w[j,i], which stands for a number. The number is said to be an element of the matrix.
A vector, in this sense, is really just a particular kind of matrix - it's a matrix for which the second subscript is always 1, and so there's no point in writing it.
There is a rule for multiplying matrices and vectors. Conveniently, it is just the rule used by a single layer linear network to compute its outputs. That is, we can write
y = w * x
to mean exactly the same as the last equation above. The SIGMA formula defines the operation of multiplying a matrix and a vector; and the single layer linear network provides a paradigm of the operation.
Part of the convention is that it is the second subscript of the w[j,i] variables that becomes the summation variable when a matrix multiplication is written as a computation of individual values. That is the reason for making i the second subscript to w in the textbooks and in the earlier teach files: it fits in with the standard convention for matrices.
In summary, the computation performed by a single layer linear network is to multiply the weight matrix by the input vector. This is probably one of the easiest ways to give meaning to the idea of matrix multiplication.
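To make this concrete, here is a minimal sketch in Python using the numpy library (the choice of language, the variable names and the numbers are my own, not part of the original teach files). It computes the outputs of a small linear layer twice: once with explicit loops following the SIGMA formula, and once as a matrix-vector product, and checks that the two agree.

    import numpy as np

    # A sketch with 2 units and 3 inputs; the numbers are invented.
    w = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])     # w[j, i]: weight from input i to unit j
    x = np.array([1.0, 0.0, -1.0])      # x[i]: the inputs

    # The SIGMA formula, written out as explicit loops.
    y_loops = np.zeros(2)
    for j in range(2):
        for i in range(3):
            y_loops[j] += w[j, i] * x[i]

    # The same computation as a matrix-vector multiplication.
    y_matrix = w @ x

    print(y_loops, y_matrix)            # both give [-2. -2.]
    assert np.allclose(y_loops, y_matrix)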
It is sometimes useful to have a convention for writing out the elements of a matrix. In the case of a network with 3 output units and 4 inputs, one might draw up a table of weights like this:
              Input 1   Input 2   Input 3   Input 4

    Unit 1      3.2       2.0      -0.5       2.3
    Unit 2     -0.4       6.7       1.1      -4.2
    Unit 3      1.2      -2.5       0.3      -0.8
Here, for example, -0.4 is the weight on the connection from input 1 to unit 2. (I have just made up some arbitrary numbers.) Therefore for this network, we would have w[2, 1] = -0.4, using our standard convention for ordering subscripts.
The convention for writing down the elements of a matrix in a table is the one adopted here. That is, the first subscript says which row of the table an element is in, and the second subscript says which column it is in. That is, w[j, i] means w[row, column], which in this case means w[unit_number, input_number].
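If you hold such a table in a program, the same [row, column] convention applies, except that most languages (Python included) count subscripts from 0 rather than 1. A small numpy sketch, using the weight table above:

    import numpy as np

    # Rows are units, columns are inputs, exactly as in the table above.
    w = np.array([[ 3.2,  2.0, -0.5,  2.3],    # unit 1
                  [-0.4,  6.7,  1.1, -4.2],    # unit 2
                  [ 1.2, -2.5,  0.3, -0.8]])   # unit 3

    # w[2, 1] in the teach file's 1-based notation is w[1, 0] in 0-based Python.
    print(w[1, 0])   # prints -0.4, the weight from input 1 to unit 2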
To display a matrix, the elements are written out in a table as above, but without the row and column labels, and the whole thing is enclosed in square brackets (or in a few books, round brackets).
We can summarise this convention by writing, for our 3-unit 4-input network
           -                                  -
          |  w[1,1]  w[1,2]  w[1,3]  w[1,4]  |
    w  =  |  w[2,1]  w[2,2]  w[2,3]  w[2,4]  |
          |  w[3,1]  w[3,2]  w[3,3]  w[3,4]  |
           -                                  -
The vectors x and y are written as if they were matrices with a single column (and for this reason are sometimes called column vectors). That is, for example
           -      -
          |  y[1]  |
    y  =  |  y[2]  |
          |  y[3]  |
           -      -
and likewise for x, except it has 4 elements in this example.
If you write out the matrix multiplication
y = w * x
using the display notation, you get
     -      -      -                                  -     -      -
    |  y[1]  |    |  w[1,1]  w[1,2]  w[1,3]  w[1,4]  |    |  x[1]  |
    |  y[2]  |  = |  w[2,1]  w[2,2]  w[2,3]  w[2,4]  |    |  x[2]  |
    |  y[3]  |    |  w[3,1]  w[3,2]  w[3,3]  w[3,4]  |    |  x[3]  |
     -      -      -                                  -    |  x[4]  |
                                                            -      -
If you look at the original formula for the neural network, you should be able to see that the rule for working out y[j] is to take the elements of the j'th row of w, multiply each one by the corresponding element of the only column of x, and add them up. This "row into column" idea is one of the standard ways of presenting matrix multiplication. You should realise, though, that it is merely a result of the convention that is usually adopted for writing down matrices as tables - there is nothing fundamental about it.
You should be able to draw a diagram of the 3-unit 4-input network, write the weights in the table above in the right places, invent some input values, and work out some output values (a) by looking at the diagram, (b) by using the SIGMA formula and (c) by using the matrix display method. And you should get the same results each time. Ideally you might also do it (d) using Matlab and (e) using your own program written in the language of your choice - but that's hardly necessary if you understand what is going on.
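For what it is worth, here is a sketch of option (e) in Python with numpy (the input values are invented and the names are my own), checking that the SIGMA formula and the matrix multiplication agree for the weight table above:

    import numpy as np

    w = np.array([[ 3.2,  2.0, -0.5,  2.3],
                  [-0.4,  6.7,  1.1, -4.2],
                  [ 1.2, -2.5,  0.3, -0.8]])
    x = np.array([1.0, 2.0, 3.0, 4.0])     # invented input values

    # (b) the SIGMA formula, element by element
    y_sigma = np.array([sum(w[j, i] * x[i] for i in range(4)) for j in range(3)])

    # (c) the matrix display method, i.e. y = w * x as a matrix-vector product
    y_matrix = w @ x

    print(y_sigma)                          # approximately [ 14.9  -0.5  -6.1 ]
    assert np.allclose(y_sigma, y_matrix)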
Now suppose the single-layer network can be applied to a lot of different input vectors, and we want to specify which one we are dealing with. The obvious thing to do is to add yet another subscript to the original equation to specify a particular input.
                N
    y[j,k]  =  SIGMA  w[j,i] * x[i,k]
               i = 1
The subscript k is used to specify which of the set of inputs we are referring to. Here the ws don't get another subscript because they're going to stay the same all the time (we're ignoring learning now), but the y does get another subscript because it will be different for each different input vector.
Matrix notation handles this extension very easily. The objects x and y must now be proper matrices rather than vectors, because their individual elements have two subscripts. Given that, the equation above defines matrix-matrix multiplication just as the earlier equation defined matrix-vector multiplication; we can still write
y = w * x
In terms of the display convention, each column of the x matrix refers to a different input example, whilst each row of x refers to a different input line into the network. Each column of y refers to the output from a particular input example, whilst each row refers to a different output line from the network.
It should be reasonably obvious from this that a matrix-matrix multiplication is just like treating each column of the rightmost matrix as a separate vector, doing a matrix-vector multiplication on it, and assembling the results into the output matrix. This is like saying that our neural network treats each different input vector separately from all the others.
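A short numpy sketch of this column-by-column view (the sizes and values are arbitrary inventions of mine): multiplying the weight matrix by a matrix whose columns are input examples gives the same result as handling each input column separately and stacking the outputs.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal((3, 4))     # 3 units, 4 inputs; arbitrary weights
    x = rng.standard_normal((4, 5))     # 5 input examples, one per column

    y_all = w @ x                       # matrix-matrix product: all examples at once

    # The same thing, one column (one input example) at a time.
    y_cols = np.column_stack([w @ x[:, k] for k in range(5)])

    assert np.allclose(y_all, y_cols)   # shape (3, 5): one output column per example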
Matrix multiplication does not get any more complex than this.
There are some simple operations on matrices that you should know the conventions for. Matrix addition just means adding each element in one matrix to the corresponding element in another; obviously they must be the same shape. In symbols:
    if  z = x + y,  then  z[i,j] = x[i,j] + y[i,j]
Matrix subtraction is similar.
Scalar multiplication of a matrix just means multiplying each element by the same number. In symbols:
    if  z = k * x,  then  z[i,j] = k * x[i,j]
where k is an ordinary number (called a scalar).
Sometimes it's useful to swap the order of the subscripts in a matrix. This is called transposing it. In terms of the display convention, one writes the rows as columns, and vice versa. A "T" superscript is used to indicate the operation. In symbols:
    if  z = x^T  (that is, x with the T superscript),  then  z[i,j] = x[j,i]
Remember what these symbolic statements represent: the bit before the "then" refers to operations regarded as somehow happening to the matrices as entire objects, whilst the bit after the "then" refers to what happens to individual elements when this operation occurs.
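These element-level definitions are exactly what array libraries implement, so a few lines of numpy (with invented values; my own sketch, not part of the original file) can serve as a check:

    import numpy as np

    x = np.array([[1.0, 2.0], [3.0, 4.0]])
    y = np.array([[10.0, 20.0], [30.0, 40.0]])

    z_add   = x + y        # z[i,j] = x[i,j] + y[i,j]
    z_scale = 2.5 * x      # z[i,j] = 2.5 * x[i,j]  (scalar multiplication)
    z_trans = x.T          # z[i,j] = x[j,i]        (transpose)

    print(z_add)           # [[11. 22.] [33. 44.]]
    print(z_trans)         # [[1. 3.] [2. 4.]]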
Given that we can do matrix addition, subtraction and multiplication, what about matrix division? That is, if we can write
y = w * x
to find the outputs given the inputs to our network, can we write
x = y / w
to express a computation that finds out what inputs caused a given output, as we could for ordinary numbers?
The answer is no, for two reasons. The trivial one is that the "division" notation just isn't used (or at least it's almost never used) to mean what we want it to mean here. The more serious one is that the computation might not be possible: there might just not be enough information in y and w to say what x is.
First the notation bit. When this computation can be done, what is actually written is
    x  =  w^(-1) * y

where w^(-1) is called the inverse of w, and is itself a matrix. It is used to find x by doing a matrix multiplication on y. In the text, I will use Inv(w) to refer to it, to avoid the superscript.
Second, the problem with whether the computation is possible. It is easy to find an example of when it is not: if there are more inputs than outputs, then the inverse problem (going from outputs to inputs) is said to be underdetermined - there's (usually) not a unique solution. As a very simple example, say the net has two inputs and one output and both weights are 2. Then inputs of x[1]=2 and x[2]=2 will give an output y[1]=8. But so will x[1]=4 and x[2]=0, and so will any number of other input combinations, so there's not a single solution to the problem of finding the inputs.
The computation may also be impossible if there are more outputs than inputs. Then, the inverse problem is said to be overconstrained and there's (usually) no solution at all. Again, a trivial example will illustrate this. Suppose there is one input feeding two output units, and again both weights are 2. If y[1] is different from y[2], then there's no possible value for the input - we can only solve the problem if we assume that the outputs are compatible with the weights, which in this case means that they are both the same, and we can ignore one of them.
Sometimes, however, it is possible to find x given y and w: in other words the matrix Inv(w) exists. Given what has just been said, it clearly helps if there are the same number of inputs as outputs; that is, w is square. In addition, w must in some sense transmit all the information in x through to y in order for us to be able to go backwards. For square weight matrices the inverse exists unless one network output effectively duplicates the information available in some other outputs. Matrices where the inverse exists are called nonsingular.
Computing the inverse of a matrix is an important computational operation. Although we do not often want to literally work out the inputs to a linear network layer from its outputs, there are, for example, training methods based on the inverse of a matrix of partial derivatives (which is beyond our present scope).
What is useful to know is that if the inverse of a matrix exists, then it can be computed for particular numerical values, and that almost all numerical packages devote considerable effort to providing routines for this purpose. What you probably do not need to know are the algorithms such routines use - though any textbook on numerical analysis will give a great deal of detail to this topic.
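As an example of such a routine, numpy provides np.linalg.inv (and np.linalg.solve). A sketch with an invented square, nonsingular weight matrix:

    import numpy as np

    w = np.array([[2.0, 1.0, 0.0],
                  [0.0, 1.0, 3.0],
                  [1.0, 0.0, 1.0]])     # square (3 inputs, 3 outputs), invented values
    x = np.array([1.0, 2.0, 3.0])
    y = w @ x                            # forward computation: outputs from inputs

    w_inv = np.linalg.inv(w)             # Inv(w), computed numerically
    x_back = w_inv @ y                   # recover the inputs from the outputs
    assert np.allclose(x_back, x)

    # In practice np.linalg.solve(w, y) does the same job without forming Inv(w).
    assert np.allclose(np.linalg.solve(w, y), x)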
Matrices play such an important role in many systems, especially simulations of various kinds, that there is a large literature concerned with analysing their properties. In particular, the decomposition of a square symmetrical matrix into eigenvalues and eigenvectors is an important tool both theoretically and computationally. This is not the place to embark on a discussion of these ideas, though you might meet them in mathematics text books.
It is worth mentioning that there is one particularly useful way of tackling underdetermined and overconstrained inverse problems. This is the singular value decomposition of a matrix. When applied to an underdetermined problem (i.e. short fat matrix), it allows us to pick out one of the many solutions. (In fact, the solution it picks out is the one with the smallest sum of squares of the elements of the x vector.) When applied to an overconstrained problem (i.e. tall thin matrix) it finds an approximate result, in that it finds an input that when fed into the network will produce an output that is as close as possible to the y that we started with. (As close as possible means that the sum of squares error is the minimum.) The SVD is useful in other ways also; it is worth being aware of the existence of this technique so that you can look it up if you want it.
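In numpy, the SVD-based pseudoinverse is available as np.linalg.pinv, and np.linalg.lstsq solves the least-squares problem directly. A sketch using the two small networks described above (the specific output values are my own inventions):

    import numpy as np

    # Underdetermined: two inputs, one output, both weights 2 (the earlier example).
    w = np.array([[2.0, 2.0]])
    y = np.array([8.0])
    x = np.linalg.pinv(w) @ y            # picks the minimum-sum-of-squares solution
    print(x)                             # [2. 2.], and indeed w @ x gives 8

    # Overconstrained: one input, two outputs, incompatible output values.
    w2 = np.array([[2.0], [2.0]])
    y2 = np.array([8.0, 10.0])
    x2, *_ = np.linalg.lstsq(w2, y2, rcond=None)
    print(x2)                            # [4.5]: w2 @ x2 = [9, 9], as close as possible to y2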
In general, the analysis of matrices is closely tied in to a geometrical view of mathematics, and in particular the idea of transformations in a vector space. Such increasingly abstract ideas lead to increasing power and generality, but require considerable time and study.
This file has introduced the idea of the matrix (and vector) as an object that can be manipulated mathematically in various ways. The central idea is that a collection of quantities can be given a single symbol and manipulated as a single entity, but that what is "really" going on is a set of more elementary operations on the individual elements. It is essential to see the relation between these levels of description.
Matrices are useful both as a notational shorthand, and also in a more fundamental way when properties such as the inverse are exploited. The shorthand aspect has a programming metaphor: writing down a matrix equation is like calling a procedure that takes an array as an argument and carries out an operation on it. If the operation is well defined, the procedure can be treated as a black box, and we do not need to know what happens in detail to the individual elements.
In practice, too much use of matrix shorthand can get in the way. It is often clearer to write down what happens to individual elements, as was done throughout FCS2, and it is easier to keep track of what is meant, especially when calculus is involved. However, as soon as you want to do something equivalent to propagating information backwards through a single-layer network, then matrix analysis is the area you need to look at.
Copyright University of Sussex 1997. All rights reserved.