This teach file provides some background information for sessions 2 and 3 of the course Formal Computational Skills. These deal with the application of ideas from differential calculus to the analysis of neural networks in which the signal can be represented by continuously varying quantities.
This cannot replace textbooks - or it would be one. Rather, it's an outline intended to enable you to find out something about a set of techniques that is useful in analysing some kinds of neural nets (as well as many other systems).
Some of you will find all this familiar already. You can skim it in a few minutes and move on to something else.
Some of you will remember doing this once, but it's now rusty. You should look through this file and see whether you can still understand what is going on, especially in the examples. You might have to check the odd thing out in a textbook. You should ask about anything that isn't clear. You might have to spend an hour or two brushing the dust off the material.
Some of you will find this either new, or thoroughly lost in the mists of time. You may well need to go over old notes, or look up textbooks, and you should work through a few examples to make sure you do have the ideas straight. You should ask for help if you can't fathom something.
You are not expected to become fast and expert in all this material - that takes more time than we have. What you should aim for is to understand these techniques, so that you can follow an argument that involves them in a paper or a book.
Textbooks
Schaum's outline series is good on specific topics, but beware information overload.
If parts 2, 3 and 4 of this file are difficult, then you need to refer to an introductory textbook of about A-level standard, though books explicitly for A-level are often too closely tied to the examination syllabus. "Foundation Mathematics" 2nd edition, by D.J. Booth (Addison Wesley, 1994) looks useful, though at present it is not available in the University Library. The best book for you is largely a matter of personal taste - you should try to find something that suits you.
For parts 5 onwards, you need to look at a more advanced book. Books for mathematicians spend too much time establishing a rigorous basis for everything - you need a book of mathematics for engineers or physicists. I use "Mathematical Methods in the Physical Sciences" by M. L. Boas (Wiley, 1st edition 1966, 2nd edition 1983) - the library has multiple copies (at QE 7000 Boa), as has the bookshop. "Mathematical Techniques: an Introduction for the Engineering, Physical and Mathematical Sciences", by D.W. Jordan & P. Smith, covers some similar ground but seems to have more introductory material than Boas. The library at present has a single copy at QE 7000 Jor. You may already have a personal preference - if so, stick with it.
A dictionary of mathematics can be surprisingly handy. It won't explain things in the way a textbook will, but it is often very useful to remind oneself of some particular bit of usage. They usually have some useful tables (e.g. of derivatives). The Penguin Dictionary of Mathematics is good, as is the Oxford Dictionary.
A note on notation
For simplicity the mathematical formulae in this file are in a plain text format. For this reason, "programming" notation will be used to represent mathematical operations. That is, multiplication will be represented by "*", exponentiation (raising to a power) by "^", and raising the constant e to the power x by exp(x).
In addition, it will sometimes be neater to use brackets instead of subscripts. That is, x with subscripts i and j, usually written something like
x
 ij
will sometimes appear as x[i, j], especially when mentioned in a block of text.
Substitutions for other symbols will be introduced as they appear.
You should be familiar with the idea of a function of a variable. Roughly speaking, a function (sometimes called a mapping) can be thought of as taking as "input" one value and producing as "output" another value. The general notation is y = f(x), where x is the name of the "input" variable, or argument, f is the name of the function, and y is the name of the "output" variable. Often x is called the independent variable and y is called the dependent variable.
Many of the functions we will need take a real number as an argument. (A real number is one that can be written as a decimal value, like 3.2712 - but possibly with an unlimited number of digits.) Examples include:
y = sin(x)
y = cos(x)
y = tan(x)
y = log(x)
y = exp(x)
y = 3 * x + 2
y = 3 * x^2 + 2 * x - 333
Note that the last two do not use the f(x) notation, but still represent functions. (By the way, you should be familiar with the convention that multiplication and division are done before addition and subtraction, and that exponentiation is done first of all.)
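For example, in the last function above, 3 * x^2 means "square x, then multiply the result by 3": at x = 2 it gives 3 * 4 = 12, not (3 * 2)^2 = 36.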
It's also possible to have functions like
if x is greater than 13 then y = 1, otherwise y = 0
Mathematicians use many tools to understand the properties of functions, for example the series expansion. We will not generally need this level of analysis.
The first thing we usually need to know is how to evaluate a function - that is, how to find a value of f(x) for some specific x. You will nearly always do this with the aid of a computer program in some form - so knowing what functions you can evaluate depends on knowing something about the libraries available with your current programming language. It is possible to evaluate the functions listed above in almost every language. In fact, almost all computed evaluations of functions are approximate, and sometimes it is important to know how this affects the result of a program.
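As a concrete illustration, here is a minimal complete C program (C is the language used for the examples later in this file) that evaluates some of these functions at x = 2. It is only a sketch: it assumes nothing beyond the standard maths library, though on some systems you will need to link with the -lm option.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 2.0;
        /* sin, log and exp come from the standard maths library */
        printf("sin(x) = %g\n", sin(x));
        printf("log(x) = %g\n", log(x));
        printf("exp(x) = %g\n", exp(x));
        /* polynomials are written out directly with * and + */
        printf("3*x^2 + 2*x - 333 = %g\n", 3.0 * x * x + 2.0 * x - 333.0);
        return 0;
    }

(Note that C's log is the natural logarithm, to base e.)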
It is also often important to be able to visualise the function, by drawing its graph. Again, it is now normal to use a computer-based method for this - check out packages like Matlab. When you draw a graph, think of each point on the paper (or screen) as representing a pair of values, x and y. The curve that is plotted represents the subset of values defined by the function. The notation (x, y) can be used to represent a pair of values, as well as the point in the plane that represents that pair.
Finally, you may need to use some properties of the function. For example, the trigonometric functions mentioned above are periodic - adding 2*pi (about 6.283) to the value of x, for any x, produces the same result y (check what this means visually by drawing the graph). This property would be written down, for sin say, as sin(x) = sin(2*pi + x). Another example is that the log function always increases if its argument increases - you could write this as log(u) > log(v) if u > v. Properties like this are sometimes apparent from the graph, and are worth picking up as you go along when you encounter a particular function.
These ideas should be familiar to most people. If you are rusty, a good way to become familiar with them again is to plot some graphs using a package, or indeed by hand if you prefer. You should have a nodding acquaintance with all the functions listed above.
The basic idea of the differential calculus is that of a rate of change. Consider a function whose graph is a straight line, such as y = 3*x + 2. Any change in x produces a change 3 times as big in y. (On the graph, this can be seen by drawing a right-angled triangle below the line, with two of its sides parallel to the axes.) The slope of the line is said to be 3 in this case, for every x (because it's a straight line, the slope is the same everywhere).
When we have a curve instead of a straight line, the amount of change in y produced by a change in x may depend both on how big the change is, and what value of x we started from. However, for many functions (and for most that are practically useful), the idea of the change in y produced by a small change in x turns out to be a consistent and valuable one. The change in y divided by the change in x, as we consider smaller and smaller changes, settles down to a steady value called the derivative of y with respect to x. This is usually written dy/dx. It can still be visualised as the slope of the curve; now though, it's a property of a small section of the curve, and so depends on the value of x.
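For example, take y = x^2 and consider a small change h in x. The change in y divided by the change in x is

((x + h)^2 - x^2) / h = (2*x*h + h^2) / h = 2*x + h

and as h is made smaller and smaller, this settles down to 2*x. So for this function dy/dx = 2*x, which depends on x as expected.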
It is often important to know how a change in one quantity affects another, and so to be able to work out derivatives. To do this, there are various rules that you should be aware of. Some of the more important ones are:
Rules for specific functions:
For example,
if y = sin(x), then dy/dx = cos(x)
It is possible to work these out from first principles, but usually one would look them up in a table in a textbook, or use a symbolic computing package, to remind oneself of them. You should know where to find the rules for the functions mentioned above.
Rules for classes of functions:
Sometimes a rule is more general. One of the most useful is:
if y = x^n, then dy/dx = n * x^(n-1)
(Remember that I'm using the ^ symbol to mean raising to a power. E.g. x^2 = x squared.) This applies to a class of functions; the parameter n says which member of the class is being used; you substitute the value for your application. For example, if y = x^4, then dy/dx = 4*x^3.
A simple rule of this type is:
if y = n * x, then dy/dx = n
which should be obvious by thinking about the graph of the function. Here n is to be thought of as standing for a constant, rather than as being itself a variable.
The rule for products:
If a function can be written down as two functions multiplied together, and you can differentiate each of the two functions separately, then you can differentiate the function itself using the rule
if y = f(x) * g(x), then dy/dx = f(x) * dg(x)/dx + g(x) * df(x)/dx
(Note that df(x)/dx means dy/dx for y = f(x).)
For example,
if y = 3 * x * cos(x), then dy/dx = -3 * x * sin(x) + 3 * cos(x)
The chain rule:
If a function can be written as one function applied to the result of another function, then the derivative of the whole thing can be got using
if y = f(g(x)), then dy/dx = df(z)/dz * dg(x)/dx evaluated for z = g(x)
For example, if y = sin(x^2), then dy/dx = 2*x * cos(x^2). You get to this result by writing z = x^2.
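Written out in full: put z = x^2, so that y = sin(z). Then dg(x)/dx is dz/dx = 2*x, and df(z)/dz is cos(z). Multiplying these, and substituting z = x^2 back in, gives dy/dx = cos(x^2) * 2*x.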
Applying these last two rules is harder, but basically involves substituting one thing for another consistently. If you can't make sense of the rules, then the problem might well lie in the notation for functions, and in remembering what each symbol stands for. Although there is no need to be very fluent in this area, you should be able to understand what is going on (you should be able to see why the examples have the answers they do) and to differentiate most functions that you meet, even if you have to look up the rules.
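One way to convince yourself that a rule is right is to check it numerically, by comparing the rule's answer with the change in y divided by a small change in x. Here is a minimal sketch in C for the chain rule example above; the test point x = 1.3 and the step h = 1e-6 are arbitrary choices, not magic ones.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 1.3;     /* an arbitrary test point */
        double h = 1e-6;    /* a small change in x */
        /* change in y divided by change in x, for y = sin(x^2) */
        double estimate = (sin((x + h) * (x + h)) - sin(x * x)) / h;
        /* what the chain rule predicts: dy/dx = 2*x*cos(x^2) */
        double exact = 2.0 * x * cos(x * x);
        printf("estimate = %g   exact = %g\n", estimate, exact);
        return 0;
    }

The two numbers printed should agree to several decimal places.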
For many applications, the idea of a function outlined above needs to be generalised to functions of more than one real variable. A function of two variables might be written z = f(x, y). You can think of x and y as inputs and z as the output. A very simple example is z = x + y.
Usually, such functions are built out of the 1-dimensional functions described above. When there are two inputs and one output, it is often useful to visualise the function as a surface or landscape: the arguments x and y represent position on a 2-D plane, and the value z represents height above that plane (or below it if negative). Packages such as Matlab are very good at displaying these surfaces.
For functions of more than two variables, there is no simple way to visualise the whole function. Nonetheless, such functions are often discussed in a way that is analogous with the two-variable case.
If a function has many arguments, it may not make sense to give them all separate names. You might see something like
y = f(x , x , ..., x )
       1   2        N
(or in the notation I am using y = f(x[1], x[2], ..., x[N]) ), meaning that f is a function of N variables, which are distinguished by subscripts rather than by having completely different names. This kind of thing is very common in neural network analysis.
It is often necessary to know something about how the value of a function with several inputs is changed by small changes to its arguments - that is, we need to differentiate it. How can this be done?
The basic idea is quite simple. Consider the function z = x * y. Suppose that instead of being a variable, y simply stood for a fixed value - let's say 5. Then the function would be z = x * 5, and so it would follow that in this particular case dz/dx = 5 (it's the straight line equation again). If we didn't know the particular value of y, but we did know that it was fixed, we could still write dz/dx = y, with the understanding that y was being treated as a fixed quantity rather than a variable. This derivative, found by pretending that y is a fixed quantity, is called the partial derivative of the function with respect to x.
In order to distinguish this from an ordinary derivative, some special notation is used: a curly d instead of a normal d. In printed versions of this document this might be represented properly, but in this online plain text file, the best I can do is to use the combination c), which, if you run the two characters together, looks quite like the symbol in question; also the c might be taken to stand for "curly". Thus the expression
c) z / c) x = y
means "the partial derivative of z with respect to x" - that is, the change in z when x is varied and all other arguments are kept constant.
It is generally quite easy to find partial derivatives, once you have understood the principle of pretending that everything except the variable in question behaves just like a numerical constant. For example:
if z = 3 * y^2 + y * sin(x + 10*v)
then c) z / c) x = y * cos(x + 10*v)
c) z / c) y = 6 * y + sin(x + 10*v)
c) z / c) v = 10 * y * cos(x + 10*v)
If you can't verify the results in this example, it's probably because you need to check the rules for basic differentiation, rather than because partial differentiation is itself a problem.
Note that the partial derivative may be a function of all or some of the arguments to the original function.
The partial derivative tells us how a function is affected by a perturbation to one of its arguments. This in itself can be very useful. Sometimes it is necessary, though, to know how a function changes when a change is made to many or all of its arguments. This will only make sense if the changes to the arguments are coordinated in some way; that is, the arguments themselves are functions of some other variable that is changing. (This is often the case in neural networks.)
To be definite, suppose z depends on (is a function of) u and v, so z = f(u, v), and u and v both depend on some other variable x, so u = g(x) and v = h(x). (Here, g and h are names of functions.) The question is, how does z vary if x changes?
The answer is given by the chain rule for partial differentiation, which is the most advanced idea to be mentioned in this file. It says that
d z   c) z   d u   c) z   d v
--- = ---- * --- + ---- * ---
d x   c) u   d x   c) v   d x
(The thing on the left is just dz/dx written differently.) Note that all the quantities on the right can be worked out from the expressions for f, g and h. Putting them together gives the result that is needed. Since nothing is kept constant when x changes, the result on the left of the equation is an ordinary derivative.
This should make some kind of intuitive sense, along these lines: x controls each of u and v, and u and v together control z. So a change in x produces a change in z by two different routes. The effect along the u route is the effect of x on u times the effect of u on z. Similarly for the v route. The two effects get added together.
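For example, suppose z = u * v, with u = x^2 and v = sin(x). Then

c) z / c) u = v        c) z / c) v = u
d u / d x = 2*x        d v / d x = cos(x)

so the rule gives dz/dx = v * 2*x + u * cos(x) = 2*x * sin(x) + x^2 * cos(x). You can check this by substituting first: z = x^2 * sin(x), and applying the product rule directly gives the same answer.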
Sometimes, there are other variables which affect u and v, in addition to x. In this case these other variables have to be held constant while we investigate the effect of x on z. Then the ordinary derivatives in the formula become partial derivatives too, to indicate that these other things are staying constant.
Textbooks will give a proof of this formula, and sometimes a graphical way to think about it as well.
As light relief, it may be worth mentioning that differential calculus abounds in variants of the letter D. So far, we have only used 2 kinds, but you will encounter others in the literature. To try to avoid confusing them, here is a little table - although you do not need to be familiar with the use of any but the first two at this stage, it is worth knowing that the others exist.
Name             Written in these text files as   Proper symbol                    Used for
----             ------------------------------   -------------                    --------
small d          d                                d                                derivative
curly d          c)                               a curly d                        partial derivative
small delta      delta                            the small Greek letter delta     a small change in a variable
capital delta    DELTA                            the capital Greek letter delta   an arbitrary change in a variable
del or nabla     DEL                              an upside-down capital delta     a kind of vector derivative
capital D        D                                D                                the differential operator
Finally, you should be able to read the notation for forming sums - that is, adding a set of things together. This uses the capital Greek letter sigma, which looks a little like this:
---
\
/
---
and which will be written in plain text files as SIGMA. Here is a simple example of how it is used:
  5
SIGMA (k * x)
k = 1
and this expands into (is equal to)
x + 2 * x + 3 * x + 4 * x + 5 * x
In general, there is some variable (in this case k) which takes a set of values (in this case 1,2,3,4 and 5). For each of these values, an expression involving the variable (in this case k*x) is evaluated, and the results added together. In the form in which it is being used here, the variable takes integer values, starting from the one specified below the SIGMA, and going up to the value specified above. (There are a few alternative forms of the notation, but this is the most common.) In this case, there is another variable, x, in the expression, but there might be several other variables, or none.
It is extremely common for the summation variable to form a subscript in the expression, rather than being an arithmetic element as above. For example
  6        2     2     2     2     2
SIGMA     x  =  x  +  x  +  x  +  x
j = 3      j     3     4     5     6
The name given to the summation variable (here j) can be chosen arbitrarily but must then be used consistently, as for all variable names.
There is a nice concrete way to think about the summation notation, if you are a programmer. A summation sign acts like a loop in a program, and indeed programs that implement theories involving sums do have corresponding loops. If you happen to know C, for example, then it may help to know that the following line of code implements (with suitable declarations of course) the first example above. It leaves the variable sum set to the value of the whole SIGMA expression, assuming that x has been given a value beforehand:
for (sum = 0.0, k = 1; k <= 5; k++) sum += k * x;
whilst the second example would translate into something like
for (sum = 0.0, j = 3; j <= 6; j++) sum += x[j] * x[j];
Summation gets complicated when you encounter nested summation signs - one SIGMA being applied to an expression containing another SIGMA. There is a fairly safe way to make sure you understand what is going on in cases like this: write out a few terms of the whole expression. You should be able to understand the following:
  2     2            2
SIGMA SIGMA x    = SIGMA ( x   + x   )
i = 1 j = 1  ij    i = 1    i1    i2

                 = x    + x    + x    + x
                     11     12     21     22
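In the same C style as the earlier loops (again with suitable declarations, and assuming the array x has been filled in beforehand), the nested sum corresponds to nested loops:

for (sum = 0.0, i = 1; i <= 2; i++)
    for (j = 1; j <= 2; j++)
        sum += x[i][j];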
For those who feel this holds no mysteries, it is worth mentioning that there is a convenient shorthand which is sometimes used for sums, called the repeated suffix convention, or tensor notation. In this convention, any suffix which appears twice in an expression is taken to be summed over - an implicit SIGMA appears before the expression with the repeated variable as the summation variable. This is only useful when the range of summation is obvious. The convention is often a very useful alternative to matrix notation.
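For example, if the suffixes are understood to run from 1 to 3, then a[i] * b[i] stands for

a[1] * b[1] + a[2] * b[2] + a[3] * b[3]

that is, SIGMA with summation variable i applied to a[i] * b[i].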
Finally, something fairly hard. Let's use the summation notation to generalise the chain rule for partial differentiation. Suppose our output variable z is affected by a load of different intermediate variables - say N of them, which we will call u[1], u[2], ... u[N]. Suppose that x affects each u (possibly in a different way for each). Now if we want to know the effect of x on z, it's going to look like
d z     N    c) z     d u[i]
--- = SIGMA ------- * ------
d x   i = 1 c) u[i]    d x
(with partial instead of ordinary derivatives if there are some other variables being held constant).
If this looks daunting the first thing to do is to write it out in full with N equal to 2. Then the relationship to the earlier formula for the chain rule should become clear.
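Written out in full for N = 2, it reads

d z    c) z     d u[1]    c) z     d u[2]
--- = ------- * ------ + ------- * ------
d x   c) u[1]    d x     c) u[2]    d x

which is just the earlier formula for the chain rule with u[1] and u[2] in place of u and v.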
Copyright University of Sussex 1997. All rights reserved.