Machine Learning - Lecture 1: Introduction to the Topic
Chris Thornton
What is Machine Learning?
Machine Learning (ML) is the use of data to acquire the
rules for a desired behaviour.
Common tasks:
- Using data from credit card usage, derive a rule which
identifies people that represent a bad credit risk.
- Using data mapping visual signals to pedal/wheel
movements, derive a model which allows a robot to drive a
car down a motorway.
Is it to do with human learning?
Traditionally, ML has involved ideas about how human
learning works.
But modern research is increasingly focussed on practical
tasks.
What do we mean by `data'?
By `data' we mean sets of variable values, e.g.,
- Annual rainfall in Sussex for the last twenty years;
- Age and salary for all members of Sussex faculty.
- Number of iPads sold in Brighton per week.
Datapoints
Values are organised in structures called datapoints.
Each datapoint combines a particular set of variables, e.g.,
age, salary and IQ specifically for the Informatics HoD.
Datapoints are also called vectors in
neural-networks, and records in computer science.
A datapoint may also be called a datum.
Tabulation
Data are often presented in a tabulated form, with one
datapoint per row, and one variable per column.
The relevant variable name often appears at the head of each
column.
NAME AGE SALARY IQ
smith 42 36K 130
bloggs 29 30K 140
bush 50 60K 120
...
A very common task in ML involves predicting one variable
value from all the others.
Where this is the aim, it is usual to put the
to-be-predicted variable last.
Basic data-types
Data are classified according to the number and character of
variables involved.
- Univariate, discrete: one variable with integer/symbolic values.
- Univariate, continuous: one variable with real/continuous values.
- Multivariate, discrete: more than one variable with integer/symbolic
values.
- Multivariate, continuous: more than one variable with
real/continuous values.
Explicit and implicit structure
A dataset is a body of data, i.e., a collection of
datapoints.
We will be interested in a dataset's structure.
But two meanings for `structure'.
Explicit structure = the actual values seen in the
datapoints.
Implicit structure = patterns that are seen across the
values.
Example: A-level grades
Dataset containing average A-level grades for the past ten years.
Explicit structure is the year and grade values.
We also see implicit structure---a gradual increase in
values over time.
Various ways to model this implicit structure.
We could compute the difference between all years and then
average.
This might reveal that grades increase by 0.3% per year on
average.
Ways of using the model
The model could then be used for
- Prediction, i.e., predict the average grade for the
next year.
- Discounting: work out what current grades are `worth' in
terms of previous years.
Why machine learning now?
Machine learning is an increasingly central topic in
informatics.
- With computers managing/mediating many aspects of our
lives, there has been a huge increase in accumulation of
electronic data.
- With computers increasingly up to the demands of complex
modeling, it is getting easier to process very large
datasets.
- Suspicion is growing in fields such as NLP (Natural
Language Processing) that approaches based on hand-coded
solutions are unlikely to succeed.
Real-world applications: learning consumer behaviour
Use of CCTV and automatic checkout machines in modern
supermarkets enables detailed logs to be kept of purchases
made, reductions on offer, counter locations etc.
These logs embody vast quantities of data and are therefore
hard to analyse using traditional methods.
Machine Learning can be used to identify patterns in the
data.
These may help identify potentially significant patterns of
customer behaviour, enabling better management of the
supermarket.
Cheese and ice cream
Modeling might reveal that increases in purchases of
ice-cream tend to be accompanied by small reductions in
purchases of cheese.
The supermarket could make use of this fact in
manipulating sales of cheese and ice-cream.
Example: mining financial data
In this application, the data are price fluctuations and the
aim is to extract regularities reflecting
investment opportunities.
Modeling these patterns can reveal behavioural rules which
increase profit.
For example, the discovery that sharp increases in the price
of gold tends to be preceded by long periods of price
stability might be the basis for an investment rule.
Fraud detection
Predicting fraudulant cases in credit-card transations
- Create a dataset where the values represent transactions
and the attributes of account holders.
- Add a variable which records whether the transaction was
fraudulent or not.
- Mine the data to find implicit structure which predicts
whether a transaction is fraudulent or not.
- Use the model to detect fraud.
Summary
- Machine Learning involves identifying and representing
patterns in data, for purposes of obtaining a desired
behaviour.
- Data expressed in terms of variables and datapoints.
- Tabulation conventions.
- Univariate v. multivarite, discrete v. continuous
- Explicit v. implicit structure.
- ML involves modeling implicit structure on the basis of
explicit structure.
Questions
- If a supermarket wants to increase its sales of frozen
pizzas, what data should it aim to collect?
- In univariate discrete data, how many values would we
expect to find in each datapoint?
- How many data should we expect to find in a multivariate
dataset?
- How many variables are involved in the specification of
multivariate data?
- When tabulating data, how is the number of columns determined?
- In the domain of politics, give one example of a continuous
variable and one example of a discrete variable.
- Newspapers sometimes rank universities in terms of numbers
of applicants. What is the explicit structure of the data?
Suggest some possible forms of implicit structure.