Each time a split is introduced, the model is made a bit more detailed.
More training cases are covered.
The model is said to be incrementally refined using the data as a source of reference.
In this case there is the danger of reaching a point where refinements are simply `learning the errors'.
This is known as over-training or over-fitting.
As model refinement continues we expect to see an improvement in performance on both seen and unseen data.
If the data contains some errors, there will come a point where refinements are simply modeling errors.
At this point, we should see generalization deteriorate, even while performance on seen data continues to improve.
We can use cross-validation methods to detect the point at which refinements begin to over-fit.
The idea is to terminate learning as soon as we go past the point where performance on unseen examples starts to deteriorate.
Unfortunately, in some situations, we see quite significant variations in generalization performance prior to any over-fitting.
Spotting the critical moment can be quite challenging.
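The stopping rule described above can be sketched as follows. This is a minimal illustration, not a definitive method: the function name, the `patience` parameter and the error values are all assumptions made for the example. Tolerating a few non-improving steps is one common way of coping with early variations in generalization performance.

```python
def early_stopping_point(val_errors, patience=3):
    """Return the index of the refinement step at which to stop.

    val_errors: error on unseen (validation) data after each refinement.
    patience:   number of non-improving steps tolerated before stopping
                (a hypothetical tuning knob, not taken from the text).
    """
    best_step = 0
    best_error = float("inf")
    for step, err in enumerate(val_errors):
        if err < best_error:
            best_step, best_error = step, err
        elif step - best_step >= patience:
            break  # no improvement for `patience` steps: assume over-fitting
    return best_step


# Made-up validation errors: a noisy bump at step 2, the true minimum
# at step 5, then steady deterioration.
val_errors = [0.40, 0.30, 0.33, 0.28, 0.25, 0.22, 0.24, 0.27, 0.31, 0.35]

print(early_stopping_point(val_errors, patience=1))  # stops at the early bump
print(early_stopping_point(val_errors, patience=3))  # rides out the bump
```

With `patience=1` the rule is fooled by the early bump. With `patience=3` it carries on to the real minimum, at the cost of running a few extra refinements.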
More strongly biased methods are more limited in terms of the patterns they can represent.
So another way of making sure we don't end up `training on noise' is to use a method whose bias effectively rules such patterns out.
In practice, this may be hard to achieve if we don't know what the errors are.
However, the general rule applies: the stronger the bias, the less scope there is for modeling noise.
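A small sketch may help here. It compares a strongly biased learner (a single threshold on one input) with a weakly biased one (1-nearest-neighbour) on data containing two mislabeled cases. The data, function names and choice of learners are all assumptions made for illustration.

```python
def fit_threshold(xs, ys):
    """Strongly biased learner: can only represent rules of the form x > t."""
    pts = sorted(xs)
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])]

    def train_errors(t):
        return sum((x > t) != y for x, y in zip(xs, ys))

    return min(candidates, key=train_errors)


def predict_1nn(xs, ys, x):
    """Weakly biased learner: copies the label of the nearest training case."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[nearest]


def error_rate(preds, labels):
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


# Made-up data: the true rule is x > 0.5, with two training labels flipped.
train_x = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
train_y = [x > 0.5 for x in train_x]
train_y[2] = not train_y[2]  # error in the data
train_y[7] = not train_y[7]  # error in the data
test_x = [0.12, 0.22, 0.32, 0.42, 0.62, 0.72, 0.82, 0.92]
test_y = [x > 0.5 for x in test_x]  # unseen data, no errors

t = fit_threshold(train_x, train_y)
thr_train = error_rate([x > t for x in train_x], train_y)
thr_test = error_rate([x > t for x in test_x], test_y)
nn_train = error_rate([predict_1nn(train_x, train_y, x) for x in train_x], train_y)
nn_test = error_rate([predict_1nn(train_x, train_y, x) for x in test_x], test_y)

print("threshold: seen", thr_train, "unseen", thr_test)
print("1-NN:      seen", nn_train, "unseen", nn_test)
```

The nearest-neighbour learner fits the seen data perfectly, errors included, and pays for it on unseen data. The threshold learner cannot represent the isolated flipped labels, so it is forced to ignore them.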
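The VC dimension of a representation system is defined to be the size of the largest set of datapoints that the system can shatter, i.e., label in every possible way.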
More powerful representations are able to shatter larger sets of datapoints. These have higher VC dimension.
Less powerful representations can only shatter smaller sets of datapoints.
These then have lower VC dimension.
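Shattering can be checked by brute force for simple representations. A set of datapoints is shattered if every possible labeling of the points can be produced. The sketch below does this for two illustrative one-dimensional representations (thresholds and intervals); the function names and the pool of points are assumptions for the example. Searching a finite pool only gives a lower bound on VC dimension in general, though for these two representations it matches the known values.

```python
from itertools import combinations


def _cuts(points):
    """Candidate boundaries: below, between and above the sorted points."""
    pts = sorted(points)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return [pts[0] - 1] + mids + [pts[-1] + 1]


def threshold_labelings(points):
    """All labelings of `points` produced by rules of the form x > t."""
    return {tuple(x > t for x in points) for t in _cuts(points)}


def interval_labelings(points):
    """All labelings produced by rules of the form a <= x <= b."""
    cuts = _cuts(points)
    return {tuple(a <= x <= b for x in points) for a in cuts for b in cuts}


def shatters(labelings_fn, points):
    """True if every one of the 2^n labelings can be produced."""
    return len(labelings_fn(points)) == 2 ** len(points)


def vc_dimension(labelings_fn, pool, max_size):
    """Size of the largest subset of `pool` (up to max_size) that is shattered."""
    best = 0
    for k in range(1, max_size + 1):
        if any(shatters(labelings_fn, list(s)) for s in combinations(pool, k)):
            best = k
    return best


pool = [1.0, 2.0, 3.0, 4.0]
print(vc_dimension(threshold_labelings, pool, 3))  # thresholds
print(vc_dimension(interval_labelings, pool, 3))   # intervals
```

Intervals are the more powerful representation: they can shatter pairs of points, where thresholds can only shatter single points.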
From the intuitive point of view, the fact that VC dimension depends only on the largest set that can be fully shattered makes it less than ideal as a general measure of bias strength.
We could have a system with very low VC dimension that is actually quite weakly biased. This would happen, for example, if the system was able to almost shatter large datasets, while only being able to fully shatter very small ones.
VC dimension remains important in learning theory because it can be used to derive an upper bound on generalization error.
The mathematics of this are quite complex.
The basic idea is that reducing VC dimension has the effect of eliminating potential generalization errors.
So if we have some notion of how many generalization errors are possible, VC dimension gives an indication of how many could be made in any given context.
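To give a feel for how such a bound behaves, the sketch below evaluates one commonly quoted form of the VC bound: with probability at least 1 - delta, error on unseen data is at most error on seen data plus sqrt((d(ln(2n/d) + 1) + ln(4/delta)) / n), where d is the VC dimension and n the number of training cases. The function name and the example numbers are assumptions for illustration, and other forms of the bound exist.

```python
import math


def vc_gap_bound(d, n, delta=0.05):
    """Upper bound on (unseen error - seen error), holding with
    probability at least 1 - delta, in one commonly quoted form.

    d: VC dimension, n: number of training cases (requires 0 < d < n).
    """
    if not 0 < d < n:
        raise ValueError("this form of the bound assumes 0 < d < n")
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)


# Illustrative numbers only.
for d in (10, 50):
    for n in (1000, 10000):
        print("d =", d, "n =", n, "gap <=", round(vc_gap_bound(d, n), 3))
```

The bound tightens as VC dimension falls or as the number of training cases grows, which is the sense in which reducing VC dimension eliminates potential generalization errors.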
The subfield of Computational Learning Theory is concerned with deriving VC-dimension bounds in different training scenarios.