Common tasks:

- Using data from credit card usage, derive a rule which
identifies people that represent a bad credit risk.
- Using data mapping visual signals to pedal/wheel
movements, derive a model which allows a robot to drive a
car down a motorway.

But modern research is increasingly focussed on practical tasks.

- Annual rainfall in Sussex for the last twenty years.
- Age and salary for all members of Sussex faculty.
- Number of iPads sold in Brighton per week.

Each datapoint combines a particular set of variables, e.g., age, salary and IQ specifically for the Informatics HoD.

Datapoints are also called **vectors** in
neural networks, and **records** in computer science.

A datapoint may also be called a **datum**.

The relevant variable name often appears at the head of each column.

| NAME | AGE | SALARY | IQ |
|---|---|---|---|
| smith | 42 | 36K | 130 |
| bloggs | 29 | 30K | 140 |
| bush | 50 | 60K | 120 |
| ... | | | |

A very common task in ML involves predicting one variable value from all the others.

Where this is the aim, it is usual to put the to-be-predicted variable last.
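The convention of putting the to-be-predicted variable last can be sketched in a few lines of Python. This is only an illustration: it uses the three rows shown in the table, stores salaries such as 36K as thousands, and the helper name `split_inputs_target` is hypothetical.

```python
# The three rows shown in the table; SALARY is stored in thousands (36K -> 36).
COLUMNS = ["NAME", "AGE", "SALARY", "IQ"]
ROWS = [
    ("smith", 42, 36, 130),
    ("bloggs", 29, 30, 140),
    ("bush", 50, 60, 120),
]


def split_inputs_target(rows):
    """Treat the last column as the to-be-predicted variable."""
    inputs = [row[:-1] for row in rows]   # every column except the last
    targets = [row[-1] for row in rows]   # the last column only
    return inputs, targets


inputs, targets = split_inputs_target(ROWS)
for x, y in zip(inputs, targets):
    print(x, "->", COLUMNS[-1], "=", y)
```

With the target in the last column, the same split works for any table laid out this way, whatever the number of other variables.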

- Univariate, discrete: one variable with integer/symbolic values.
- Univariate, continuous: one variable with real/continuous values.
- Multivariate, discrete: more than one variable with integer/symbolic
values.
- Multivariate, continuous: more than one variable with
real/continuous values.
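The four categories can be illustrated with a small Python sketch. It assumes each datapoint is a tuple with one entry per variable, and it makes the simplifying assumption that a Python `float` marks a continuous variable while integers and strings are discrete. The function name `describe` and the sample values are made up for illustration.

```python
def describe(datapoints):
    """Label a dataset as univariate/multivariate and discrete/continuous."""
    n_vars = len(datapoints[0])
    arity = "univariate" if n_vars == 1 else "multivariate"
    # Simplification: any float value makes the whole dataset "continuous".
    has_float = any(isinstance(v, float) for point in datapoints for v in point)
    kind = "continuous" if has_float else "discrete"
    return f"{arity}, {kind}"


# Made-up values, for illustration only.
weekly_sales = [(12,), (7,), (9,)]              # one integer variable
rainfall_mm = [(812.5,), (790.25,)]             # one real-valued variable
colour_and_size = [("red", 3), ("blue", 5)]     # two symbolic/integer variables
height_and_weight = [(1.82, 80.5), (1.65, 61.0)]  # two real-valued variables

for data in (weekly_sales, rainfall_mm, colour_and_size, height_and_weight):
    print(describe(data))
```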

We will be interested in a dataset's structure.

But 'structure' has two meanings.

Explicit structure = the actual values seen in the datapoints.

Implicit structure = patterns that are seen across the values.

Explicit structure is the year and grade values.

We also see *implicit* structure: a gradual increase in
values over time.

There are various ways to model this implicit structure.

We could compute the difference between each pair of consecutive years and then average these differences.

This might reveal that grades increase by 0.3% per year on average.
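A minimal sketch of this averaging step, in Python. The grade values below are made up for illustration (the original figures are not reproduced here), the years are assumed to be consecutive, and `mean_yearly_change` is a hypothetical helper name.

```python
# Hypothetical average grades (in percent) by year; not real data.
grades = {2019: 60.0, 2020: 60.2, 2021: 60.7, 2022: 60.9}


def mean_yearly_change(series):
    """Average the differences between consecutive years."""
    years = sorted(series)
    diffs = [series[later] - series[earlier]
             for earlier, later in zip(years, years[1:])]
    return sum(diffs) / len(diffs)


print(round(mean_yearly_change(grades), 2))
```

Because the intermediate values cancel when the differences are summed, the result equals the last value minus the first, divided by the number of steps. The average therefore depends only on the two endpoints.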

- Prediction, i.e., predict the average grade for the
next year.
- Discounting: work out what current grades are 'worth' in
terms of previous years.
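Both uses can be sketched in Python under a strong assumption: that grades change by the same average amount every year. The function names are hypothetical, and treating discounting as "subtract the average change once per year elapsed" is one possible reading, not necessarily the one intended here.

```python
def mean_change(values):
    """Average difference between consecutive values."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


def predict_next(values):
    """Prediction: extend the series by one average step."""
    return values[-1] + mean_change(values)


def discount(grade, years_back, values):
    """Discounting (assumed linear): express a current grade in terms of
    the scale that applied `years_back` years earlier."""
    return grade - years_back * mean_change(values)


# Made-up yearly average grades, oldest first.
history = [60.0, 61.0, 62.0]
print(predict_next(history))      # next year's average under the linear assumption
print(discount(62.0, 2, history)) # a 62 today, on the scale of two years ago
```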

- With computers managing/mediating many aspects of our
lives, there has been a huge increase in accumulation of
electronic data.
- With computers increasingly up to the demands of complex
modeling, it is getting easier to process very large
datasets.
- Suspicion is growing in fields such as NLP (Natural
Language Processing) that approaches based on hand-coded
solutions are unlikely to succeed.

These logs embody vast quantities of data and are therefore hard to analyse using traditional methods.

Machine Learning can be used to identify patterns in the data.

These may help identify potentially significant patterns of customer behaviour, enabling better management of the supermarket.

The supermarket could make use of this fact in manipulating sales of cheese and ice-cream.

Modeling these patterns can reveal behavioural rules which increase profit.

For example, the discovery that sharp increases in the price of gold tend to be preceded by long periods of price stability might be the basis for an investment rule.

- Create a dataset where the values represent transactions
and the attributes of account holders.
- Add a variable which records whether the transaction was
fraudulent or not.
- Mine the data to find implicit structure which predicts
whether a transaction is fraudulent or not.
- Use the model to detect fraud.
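The four steps above can be sketched in Python. Everything here is hypothetical: the transactions are invented, the "model" is just a single threshold on the transaction amount chosen to minimise errors on the data, and the names `fit_threshold` and `is_suspicious` are illustrative. A realistic system would use many more variables and a proper learning method.

```python
# Steps 1 and 2: a made-up dataset of (amount, account_age_years, fraudulent).
DATA = [
    (20.0, 5, False),
    (35.0, 8, False),
    (900.0, 1, True),
    (15.0, 3, False),
    (1200.0, 2, True),
    (60.0, 10, False),
]


def fit_threshold(data):
    """Step 3: pick the amount threshold that misclassifies the fewest rows.

    Only the amount is used; account age is ignored in this sketch.
    """
    candidates = sorted({amount for amount, _, _ in data})
    best_threshold, best_errors = None, len(data) + 1
    for t in candidates:
        errors = sum((amount >= t) != fraud for amount, _, fraud in data)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def is_suspicious(amount, threshold):
    """Step 4: flag a new transaction using the learned rule."""
    return amount >= threshold


threshold = fit_threshold(DATA)
print("threshold:", threshold)
print(is_suspicious(1000.0, threshold), is_suspicious(25.0, threshold))
```

The learned rule is an example of implicit structure: no single row states it, but it summarises a pattern across the rows.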

- Machine Learning involves identifying and representing
patterns in data, for purposes of obtaining a desired
behaviour.
- Data expressed in terms of variables and datapoints.
- Tabulation conventions.
- Univariate v. multivariate, discrete v. continuous.
- Explicit v. implicit structure.
- ML involves modeling implicit structure on the basis of
explicit structure.

- If a supermarket wants to increase its sales of frozen
pizzas, what data should it aim to collect?
- In univariate discrete data, how many values would we
expect to find in each datapoint?
- How many data should we expect to find in a multivariate
dataset?
- How many variables are involved in the specification of
multivariate data?
- When tabulating data, how is the number of columns determined?
- In the domain of politics, give one example of a continuous
variable and one example of a discrete variable.
- Newspapers sometimes rank universities in terms of numbers of applicants. What is the explicit structure of the data? Suggest some possible forms of implicit structure.