In learning theory, the VC dimension is a measure of capacity of a class of hypotheses $\mathcal{H}$ (e.g., a set of classifiers). This notion of capacity indicates how complicated $\mathcal{H}$ is. Although complicated $\mathcal{H}$ may be able to fit well to the dataset at hand, yielding a low training error, there is a possibility that it overfits and gives high generalization error. The VC dimension provides a tool to analyze the generalization error of a class of hypotheses based on how complicated it is, independent of the input distribution, the target function, and the learning algorithm (i.e., a systematic approach to choose the best hypothesis $g \in \mathcal{H}$).

Capacity in VC theory is captured by the concept called shattering. Here we focus only on binary classification problems.

A hypothesis class $\mathcal{H}$ is said to shatter $n$ points if there exists a dataset $\boldsymbol{X} = \{x_1,\ldots, x_n\}$ such that for any label assignment $\boldsymbol{y} = (y_1, \ldots, y_n)$ where $y_i \in \{-1, +1\}$, there exists a hypothesis $h \in \mathcal{H}$ which can produce $\boldsymbol{y} = h(\boldsymbol{X})$.

In a simpler terms, we say $\mathcal{H}$ shatters $n$ points if there exists a configuration of $\boldsymbol{X} = \{x_1,\ldots, x_n\}$ such that $\mathcal{H}$ can produce all possible $2^n$ assignments of $\boldsymbol{y}$. Things worth noting are

• If $\mathcal{H}$ can produce any assignment of $\boldsymbol{y}$ on just even one configuration of $n$ points, then we say $\mathcal{H}$ can shatter $n$ points. So, when constructing an example, it makes sense to imagine a configuration of points such that $\mathcal{H}$ can shatter easily.
• If $\mathcal{H}$ can shatter $n$ points, then obviously it can shatter less than $n$ points.
• Likewise, if $\mathcal{H}$ cannot shatter $n$ points, then it cannot shatter more than $n$ points.

The VC dimension of $\mathcal{H}$, denoted by $d_{VC}(\mathcal{H})$, is the largest number of points $\mathcal{H}$ can shatter.

As an example, the VC dimension of a linear classifier in two-dimensional space is 3. That is, three is the highest number of points a line can produce all possible $\{-1, +1\}$ assignments. With four points, there are two cases out of 16 possible assignments a line cannot produce. In general, $d_{VC}(\text{linear classifiers}) = d+1$ where $d$ is the input dimension.

The VC dimension can be used to bound probabilistically the difference between the training and test errors. This result is known as VC inequality.

• Lecture 5 (Training versus Testing)
• Lecture 6 (Theory of Generalization)
• Lecture 7 (The VC Dimension)