Let $m$ be the number of features. $\ell_1$-LSMI attempts to find an $m$-dimensional sparse weight vector $\boldsymbol{w}$ that maximizes the squared-loss mutual information (SMI) between the weighted input variable $X$ and the output $Y$:

$$\widehat{\boldsymbol{w}} = \operatorname*{arg\,max}_{\boldsymbol{w} \in \mathbb{R}^m} \; \widehat{I}_s\!\left(\operatorname{diag}(\boldsymbol{w})X,\, Y\right) \quad \text{subject to} \quad \|\boldsymbol{w}\|_1 \le z, \; \boldsymbol{w} \ge \boldsymbol{0},$$

where $\widehat{I}_s$ is Least-Squares Mutual Information (LSMI), an estimator of SMI, and $z>0$ is the radius of the $\ell_1$-ball, which controls the sparseness of $\boldsymbol{w}$. Features are selected according to the non-zero coefficients of the learned $\widehat{\boldsymbol{w}}$.
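For reference, SMI is the Pearson divergence between the joint density and the product of the marginals, which LSMI estimates by directly modeling the density ratio without separate density estimation:

$$I_s(X, Y) = \frac{1}{2} \iint \left( \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)} - 1 \right)^2 p_x(x)\, p_y(y)\, \mathrm{d}x\, \mathrm{d}y.$$

SMI is zero if and only if $X$ and $Y$ are independent, which is what makes it a suitable dependency measure for feature selection.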


A Matlab implementation of $\ell_1$-LSMI is available: l1lsmi.zip. A full source tree can also be obtained from this Github page; see that page for a usage demo. Alternatively, you may follow the demo below.


Define a toy dataset with an XOR relationship as follows:
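The package's demo is in Matlab; as a language-neutral illustration, here is a Python sketch of such a dataset. The sample size, feature distribution, and the use of the sign of $X_1$ and $X_2$ are illustrative assumptions, not the exact specification used by the package:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500   # sample size (illustrative choice)
m = 10    # total number of features: 2 relevant + 8 distractors

# All features are drawn from the same distribution; only X1 and X2
# carry information about the label.
X = rng.uniform(-1.0, 1.0, size=(n, m))

# XOR relationship: Y depends on the signs of X1 and X2 jointly,
# so neither feature is informative about Y on its own.
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)
```

Because the dependency is purely joint, methods that score features one at a time fail here, which is exactly why the XOR data is a useful sanity check for $\ell_1$-LSMI.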

This is a binary classification problem with two relevant features ($X_1$ and $X_2$) and eight distracting features. By setting the desired number of features $k$ to 2, $\ell_1$-LSMI automatically finds a value of $z$ such that exactly two features are selected. The learned sparse weight vector $\widehat{\boldsymbol{w}}$ is shown below.

[Figure: learned weight vector on the XOR artificial data]

It can be seen that $\ell_1$-LSMI correctly identifies the dependent features $X_1$ and $X_2$, while all other features receive zero weight. Since only the weights of $X_1$ and $X_2$ are non-zero, only $X_1$ and $X_2$ need to be kept.
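The automatic tuning of $z$ mentioned above can be sketched as a bisection search: assuming the number of non-zero weights is non-decreasing in the radius $z$, grow the $\ell_1$-ball when too few features survive and shrink it when too many do. The `solve` argument below is a hypothetical stand-in for the inner $\ell_1$-LSMI optimization, not the package's actual API:

```python
def tune_radius(solve, k, z_lo=0.0, z_hi=10.0, tol=1e-6):
    """Bisect on the l1-ball radius z until solve(z) returns a weight
    vector with exactly k non-zero entries. Assumes the non-zero count
    is non-decreasing in z (a simplifying assumption for this sketch)."""
    while z_hi - z_lo > tol:
        z = 0.5 * (z_lo + z_hi)
        w = solve(z)
        nnz = sum(1 for wi in w if abs(wi) > 1e-8)
        if nnz < k:
            z_lo = z   # ball too small: too few features survive
        elif nnz > k:
            z_hi = z   # ball too large: too many features survive
        else:
            return z, w
    return z_hi, solve(z_hi)

# Toy stand-in for the inner solver: feature i becomes active once z
# exceeds its threshold, so the number of non-zeros grows with z.
def toy_solve(z):
    return [max(z - t, 0.0) for t in (0.5, 1.0, 2.0, 3.0)]

z, w = tune_radius(toy_solve, k=2)
```

With the toy solver, the search settles on a radius at which exactly two of the four weights are non-zero, mirroring how $k$ is hit in the XOR demo.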