Lab work for Machine Learning

Chris Thornton

First lab

Labs commence in week 2. There are no labs in week 1.

The arrangement is that you go into the lab at the start of the session and begin work straight away, picking up where you left off in the previous session if necessary. Use the first five minutes of each lab to review the last two lectures. There should be a list of questions at the end of each lecture page; you can use these to test your knowledge.

It's understood people will progress through the exercises at their own speed. I don't expect everyone to be doing the same exercise at the same time.

People will be on hand (either me or a lab tutor) to answer questions.

If you've not had the lecture on k-means clustering by the time of your first lab, use the time to go through the website and look at the arrangements for assessment. Once you've had the lecture on k-means, proceed with the exercises below.

Lab tasks

The first labs are used for learning the k-means clustering method, one of the simplest ML methods for numeric data. Let's say we have the following nine data points, which specify classifications for nine combinations of VAR1 and VAR2.

   VAR1     VAR2     CLASS

  1.713    1.586       0
  0.180    1.786       1
  0.353    1.240       1
  0.940    1.566       0
  1.486    0.759       1
  1.266    1.106       0
  1.540    0.419       1
  0.459    1.799       1
  0.773    0.186       1

The problem is to predict a classification for a case where VAR1=0.906 and VAR2=0.606, using the result of k-means clustering with 3 means (i.e., 3 centroids).

If you're learning programming for the first time this term, you may want to solve this problem by hand-simulating the k-means clustering process. You'll need a big piece of paper for this, but it shouldn't take more than an hour.
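To check a hand simulation, a single pass of the procedure can be sketched in Python as follows. This is only a sketch under stated assumptions, not the specification from the lecture: the use of Euclidean distance, the choice of the first three rows as starting centroids, the handling of empty clusters, and the names `distance`, `nearest` and `one_iteration` are all illustrative choices.

```python
import math

# The nine (VAR1, VAR2) pairs from the table above; CLASS is left out here.
POINTS = [
    (1.713, 1.586), (0.180, 1.786), (0.353, 1.240),
    (0.940, 1.566), (1.486, 0.759), (1.266, 1.106),
    (1.540, 0.419), (0.459, 1.799), (0.773, 0.186),
]

def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(point, centroids):
    """Index of the centroid closest to point (lowest index wins ties)."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

def one_iteration(points, centroids):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of the points assigned to it."""
    assignment = [nearest(p, centroids) for p in points]
    updated = []
    for i, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            updated.append(old)  # assumption: an empty cluster keeps its old centroid
    return assignment, updated

# Assumption: the first three rows serve as the starting centroids.
assignment, centroids = one_iteration(POINTS, POINTS[:3])
print(assignment)
print(centroids)
```

Feeding the returned centroids back into `one_iteration` repeats the pass; on paper, the same two steps are repeated until no point changes cluster.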

If you already have programming skills then you should aim to implement a k-means clustering program that uses 3 means. Ideally, you should construct the program from scratch using the specification from the lecture. If you're completely stuck, you can base your work on the very simple Java application you'll find here. But note that this program uses only 2 means, not 3. You'll need to modify it to get the desired result.

Once you've done all this, the next task is to modify the program so that you can set it to run with any number of means.

The final task is to modify the program so that it will automatically handle prediction tasks, such as the one above. You'll need to set things up so that your program can tell the difference between given values of data and to-be-predicted values (e.g. classifications). It also needs to be able to detect when the model has stabilized, and generate an appropriate prediction at that point.
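One possible shape for such a program is sketched below in Python. It is a sketch under assumptions rather than the method specified in the lecture: clustering uses only VAR1 and VAR2, the first k rows act as starting centroids, the model counts as stabilized when no case changes cluster between passes, and the prediction is the majority CLASS in the cluster whose centroid lies nearest the query. The names `k_means` and `predict` are hypothetical.

```python
import math
from collections import Counter

# (VAR1, VAR2, CLASS) rows from the table above.
DATA = [
    (1.713, 1.586, 0), (0.180, 1.786, 1), (0.353, 1.240, 1),
    (0.940, 1.566, 0), (1.486, 0.759, 1), (1.266, 1.106, 0),
    (1.540, 0.419, 1), (0.459, 1.799, 1), (0.773, 0.186, 1),
]

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, k, max_iterations=100):
    """Return (centroids, assignment) once the assignment stops changing."""
    centroids = list(points[:k])  # assumption: first k points as starting centroids
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # stabilized: no point changed cluster
        assignment = new_assignment
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

def predict(data, query, k):
    """Majority class of the cluster whose centroid is nearest to query."""
    points = [row[:-1] for row in data]  # given values
    labels = [row[-1] for row in data]   # to-be-predicted values
    centroids, assignment = k_means(points, k)
    cluster = nearest(query, centroids)
    votes = Counter(l for l, a in zip(labels, assignment) if a == cluster)
    return votes.most_common(1)[0][0]

print(predict(DATA, (0.906, 0.606), k=3))
```

Separating the last column from the others is one simple way to distinguish given values from to-be-predicted ones; a different starting configuration may lead to different clusters, so the result depends on the initialization assumed here.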

If you complete all of this before week 5, you can go on to the decision-tree exercise below. If you haven't finished these tasks by week 5, you should probably call a halt and move on to the main assignment at that point. But get advice from the lab tutor on this.

Decision-tree exercise

The following training examples map descriptions of individuals onto high, medium and low credit risk.

  medium   skiing   design      single   twenties no  -> highRisk
  high     golf     trading     married  forties  yes -> lowRisk
  low      speedway transport   married  thirties yes -> medRisk
  medium   football banking     single   thirties yes -> lowRisk
  high     flying   media       married  fifties  yes -> highRisk
  low      football security    single   twenties no  -> medRisk
  medium   golf     media       single   thirties yes -> medRisk
  medium   golf     transport   married  forties  yes -> lowRisk
  high     skiing   banking     single   thirties yes -> highRisk
  low      golf     unemployed  married  forties  yes -> highRisk

Input attributes are (from left to right) income, recreation, job, status, age-group, home-owner.

Identify any contradictions in the data.
What is the unconditional probability of 'golf' in the dataset?
What is the conditional probability of 'single' given 'medRisk' in the dataset?
Show how Bayes' rule would be applied to probabilities derived from the dataset to calculate the conditional probability of 'highRisk' given 'low'.
Draw out the tree that would be constructed by the decision-tree method for these examples. If you have no way to formally calculate uniformity (entropy) values, estimate these informally.
Calculate the classification error rate generated by your decision tree for the following unseen examples.

  medium   flying   banking     married  thirties yes -> lowRisk
  high     speedway media       single   forties  yes -> highRisk
  low      golf     transport   married  thirties yes -> medRisk

List the ways the data representation might be changed to promote better generalisation.
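
For those who want to check their counting, the probability and entropy questions above can be approached with a short Python sketch like the one below. It assumes probabilities are estimated as simple relative frequencies over the ten training examples and that uniformity is measured as Shannon entropy in bits; the helper names `probability`, `conditional`, `entropy` and `split_entropy` are hypothetical and not taken from the lecture.

```python
import math
from collections import Counter
from fractions import Fraction

ATTRIBUTES = ["income", "recreation", "job", "status", "age-group", "home-owner"]

# Training examples from above; the last field is the class.
TRAINING = [
    ("medium", "skiing", "design", "single", "twenties", "no", "highRisk"),
    ("high", "golf", "trading", "married", "forties", "yes", "lowRisk"),
    ("low", "speedway", "transport", "married", "thirties", "yes", "medRisk"),
    ("medium", "football", "banking", "single", "thirties", "yes", "lowRisk"),
    ("high", "flying", "media", "married", "fifties", "yes", "highRisk"),
    ("low", "football", "security", "single", "twenties", "no", "medRisk"),
    ("medium", "golf", "media", "single", "thirties", "yes", "medRisk"),
    ("medium", "golf", "transport", "married", "forties", "yes", "lowRisk"),
    ("high", "skiing", "banking", "single", "thirties", "yes", "highRisk"),
    ("low", "golf", "unemployed", "married", "forties", "yes", "highRisk"),
]
CLASS = 6  # column index of the class label

def probability(rows, column, value):
    """Relative frequency of value in the given column."""
    return Fraction(sum(1 for r in rows if r[column] == value), len(rows))

def conditional(rows, column, value, given_column, given_value):
    """P(value | given_value), counted within the rows matching given_value."""
    subset = [r for r in rows if r[given_column] == given_value]
    return probability(subset, column, value)

def entropy(rows):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(r[-1] for r in rows)
    return -sum((n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

def split_entropy(rows, column):
    """Weighted average class entropy of the subsets made by splitting on column."""
    groups = {}
    for r in rows:
        groups.setdefault(r[column], []).append(r)
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

# Bayes' rule: P(highRisk | low) = P(low | highRisk) * P(highRisk) / P(low)
via_bayes = (conditional(TRAINING, 0, "low", CLASS, "highRisk")
             * probability(TRAINING, CLASS, "highRisk")
             / probability(TRAINING, 0, "low"))
print(via_bayes, conditional(TRAINING, CLASS, "highRisk", 0, "low"))

# Lower split entropy means more uniform subsets after the split.
for i, name in enumerate(ATTRIBUTES):
    print(name, round(split_entropy(TRAINING, i), 3))
```

The attribute with the lowest split entropy would be the natural candidate for the root of the tree under these assumptions; the same function can then be applied to each subset in turn.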