| Kernel PCA

The goal of the ordinary PCA (Principal Component Analysis) is to find a set of orthogonal directions such that the variance of the projected data is maximized. Assume we have samples \(\{x_{i}\}_{i=1}^{n}\). This idea can be formalized as \[ u_{1}=\arg\max_{\|u\|\leq1}\frac{1}{n}\sum_{i=1}^{n}\left(u^{\top}x_{i}-u^{\top}\bar{x}\right)^{2} \] where \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) and \(u\) is the direction we wish to find. Define \(C=\frac{1}{n}XHX^{\top}\) where \[\begin{eqnarray*} X & = & \left(x_{1}|\cdots|x_{n}\right)\\ H & = & I-\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^{\top}. \end{eqnarray*}\] Then the PCA problem can be solved by considering the eigenvalue problem: \[ Cu=\lambda u. \]

KPCA

In kernel PCA or KPCA, the goal is to do PCA in the feature space. That is \(\{x_{i}\}_{i=1}^{n}\) are mapped into a high (or even infinite) dimensional space and PCA is carried out in that space. Let \(\phi\) be such feature map. KPCA can be formalized as \[ f_{1}=\arg\max_{\|f\|_{\mathcal{H}}\leq1}\frac{1}{n}\sum_{i=1}^{n}\left\langle f,\phi(x_{i})-\bar{\phi}\right\rangle _{\mathcal{H}}^{2} \] where \(\bar{\phi}=\frac{1}{n}\sum_{i=1}^{n}\phi(x_{i})\), \(f\) is the direction (can be infinite-dimensional) we wish to find, and \(\mathcal{H}\) is assumed to be a reproducing kernel Hilbert space with the inner product given by a kernel \(k\) i.e., \(\left\langle \phi(x),\phi(y)\right\rangle _{\mathcal{H}}=k(x,y)\). Let \(\tilde{\phi}_{i}:=\phi(x_{i})-\bar{\phi}\) for convenience. The squared inner product in the sum can be simplified as \[\begin{eqnarray*} \left\langle f,\tilde{\phi}_{i}\right\rangle _{\mathcal{H}}^{2} & = & \left\langle f,\tilde{\phi}_{i}\right\rangle _{\mathcal{H}}\left\langle f,\tilde{\phi}_{i}\right\rangle _{\mathcal{H}}=\left\langle f,\left\langle f,\tilde{\phi}_{i}\right\rangle _{\mathcal{H}}\tilde{\phi}_{i}\right\rangle _{\mathcal{H}}\\ & = & \left\langle f,\left(\tilde{\phi}_{i}\otimes\tilde{\phi}_{i}\right)f\right\rangle _{\mathcal{H}}. \end{eqnarray*}\] With this form, the optimization problem can be rewritten as \[ f_{1}=\arg\max_{\|f\|_{\mathcal{H}}\leq1}\left\langle f,Cf\right\rangle _{\mathcal{H}} \] where \(C:=\frac{1}{n}\sum_{i=1}^{n}\tilde{\phi}_{i}\otimes\tilde{\phi}_{i}\). Notice that \(Cf\) involves an inner product of \(f\) and \(\tilde{\phi}_{i}\). This implies that if \(f\) has any component orthogonal to \(\tilde{\phi}_{i}\), it will vanish after taking the inner product. Hence, we can conclude that \(f\) must be a linear combination of \(\{\tilde{\phi}_{i}\}_{i}\) (lie in the span of \(\{\tilde{\phi}_{i}\}_{i}\)): \[ f_{l}=\sum_{i=1}^{n}\alpha_{li}\tilde{\phi}_{i}=\tilde{X}\boldsymbol{\alpha}_{l} \] where we define \(\tilde{X}:=\left(\tilde{\phi}_{1}|\cdots|\tilde{\phi}_{n}\right)\) and \(\boldsymbol{\alpha}_{l}\) are coefficients. With the definition of \(\tilde{X}\), we have \(C=\frac{1}{n}\tilde{X}\tilde{X}^{\top}\).

Eigenvalue problem for KPCA

The KPCA can be solved by considering the following eigenvalue problem. \[ f_{l}\lambda_{l}=Cf_{l} \] Recall that \(f_{l}\) can be infinite-dimensional. To solve the problem, we multiply both sides to the left by \(\tilde{X}^{\top}\). \[ \tilde{X}^{\top}f_{l}\lambda_{l}=\tilde{X}^{\top}\tilde{X}\boldsymbol{\alpha}_{l}\lambda_{l}=\tilde{K}\boldsymbol{\alpha}_{l}\lambda_{l} \] where \(\tilde{K}:=\tilde{X}^{\top}\tilde{X}=HKH\) is the centered gram matrix. For the right hand side, \[\begin{eqnarray*} \tilde{X}^{\top}Cf_{l} & = & \frac{1}{n}\tilde{X}^{\top}\tilde{X}\tilde{X}^{\top}f_{l}=\frac{1}{n}\tilde{X}^{\top}\tilde{X}\tilde{X}^{\top}\tilde{X}\boldsymbol{\alpha}_{l}\\ & = & \frac{1}{n}\tilde{K}^{2}\boldsymbol{\alpha}_{l}. \end{eqnarray*}\] Equating the previous two equations, we have \[ n\lambda_{l}\tilde{K}\boldsymbol{\alpha}_{l}=\tilde{K}^{2}\boldsymbol{\alpha}_{l}\iff n\lambda_{l}\boldsymbol{\alpha}_{l}=\tilde{K}\boldsymbol{\alpha}_{l} \] which is just an \(n\)-dimensional eigenvalue problem.

Norm constraint of KPCA

Recall that we have the constraint \(\|f\|_{\mathcal{H}}\leq1\). Since the maximization problem benefits from scaling up \(f\), the solution will satisfy \(\|f\|_{\mathcal{H}}=1=\|f\|_{\mathcal{H}}^{2}\). \[\begin{eqnarray*} \|f\|_{\mathcal{H}}^{2} & = & 1=\left\langle \tilde{X}\boldsymbol{\alpha},\tilde{X}\boldsymbol{\alpha}\right\rangle _{\mathcal{H}}\\ & = & \boldsymbol{\alpha}\tilde{K}\boldsymbol{\alpha}\\ & = & n\lambda\boldsymbol{\alpha}^{\top}\boldsymbol{\alpha}=1 \end{eqnarray*}\] where we use the identity from the eigenvalue equation in the last two lines. The last line implies that \[ \|\boldsymbol{\alpha}\|_{\mathcal{H}}^{2}=\frac{1}{n\lambda}\iff\|\boldsymbol{\alpha}\|_{\mathcal{H}}=\frac{1}{\sqrt{n\lambda}}. \] This means that to force the norm constraint \(\|f\|_{\mathcal{H}}=1\), we need to make sure that \(\|\boldsymbol{\alpha}\|_{\mathcal{H}}=\frac{1}{\sqrt{n\lambda}}\). This can be done by renormalizing \(\boldsymbol{\alpha}\) obtained from the eigenvalue problem by \[ \boldsymbol{\alpha}\leftarrow\frac{\boldsymbol{\alpha}}{\|\boldsymbol{\alpha}\|_{2}}\frac{1}{\sqrt{n\lambda}}. \]

KPCA projection

Define the projection operator \(P_{d}\) for projection onto the first \(d\) eigenvectors as \[\begin{eqnarray*} P_{d}\phi(x^{*}) & = & \sum_{i=1}^{d}P_{f_{i}}\phi(x^{*})\\ P_{f}\phi(x^{*}) & = & \left\langle f,\phi(x^{*})\right\rangle _{\mathcal{H}}f. \end{eqnarray*}\] Given a new point \(x^{*}\) its projection can be found with \[ y^{*}=\arg\min_{y\in\mathcal{X}}\|\phi(y)-P_{d}\phi(x^{*})\|_{\mathcal{H}}^{2}. \] This is useful, for example, in image denoising where \(x^{*}\) is a noisy image, and a denoised (projected) image \(y^{*}\) is the goal.

KPCA coordinates

If dimensionality reduction or visualization is the goal, then the projection step is not needed. In these cases, one simply needs the first \(d\) coordinates (where \(d\) is usually chosen to be low). The \(d\) coordinates are given by \[ \left(\begin{array}{c} \left\langle \phi(x^{*}),f_{1}\right\rangle _{\mathcal{H}}\\ \left\langle \phi(x^{*}),f_{2}\right\rangle _{\mathcal{H}}\\ \vdots\\ \left\langle \phi(x^{*}),f_{d}\right\rangle _{\mathcal{H}} \end{array}\right)=\left(\begin{array}{c} \left\langle \phi(x^{*}),\tilde{X}\boldsymbol{\alpha}_{1}\right\rangle _{\mathcal{H}}\\ \left\langle \phi(x^{*}),\tilde{X}\boldsymbol{\alpha}_{2}\right\rangle _{\mathcal{H}}\\ \vdots\\ \left\langle \phi(x^{*}),\tilde{X}\boldsymbol{\alpha}_{d}\right\rangle _{\mathcal{H}} \end{array}\right)=\left(\begin{array}{c} \sum_{i=1}^{n}\tilde{k}(x^{*},x_{i})\alpha_{1,i}\\ \sum_{i=1}^{n}\tilde{k}(x^{*},x_{i})\alpha_{2,i}\\ \vdots\\ \sum_{i=1}^{n}\tilde{k}(x^{*},x_{i})\alpha_{d,i} \end{array}\right). \] Note that \(f_{1}\) corresponds to the largest eigenvector, \(f_{2}\) the second largest, and so on.