$\ell_1$-LSMI
Let $m$ be the number of features. $\ell_1$-LSMI attempts to find an $m$-dimensional sparse weight vector $\boldsymbol{w}$ that maximizes the squared-loss mutual information (SMI) between the input variable $X$ and the output $Y$.
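One way to write this optimization problem is sketched below. The elementwise rescaling of the input by $\mathrm{diag}(\boldsymbol{w})$ and the nonnegativity constraint on $\boldsymbol{w}$ are assumptions made here so that the $\ell_1$-ball of radius $z$ becomes a linear constraint; the references give the exact formulation.

$$
\widehat{\boldsymbol{w}} = \mathop{\mathrm{argmax}}_{\boldsymbol{w} \in \mathbb{R}^m} \; \widehat{I}_s\big(\mathrm{diag}(\boldsymbol{w})X,\, Y\big) \quad \text{subject to} \quad \boldsymbol{1}^\top \boldsymbol{w} \le z, \;\; \boldsymbol{w} \ge \boldsymbol{0}.
$$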
Here, $\widehat{I}_s$ is Least-Squares Mutual Information (LSMI), an estimator of SMI, and $z>0$ is the radius of the $\ell_1$-ball, which controls the sparseness of $\boldsymbol{w}$. Features are selected according to the nonzero coefficients of the learned $\widehat{\boldsymbol{w}}$.
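For reference, SMI is commonly defined as the Pearson divergence between the joint density and the product of the marginals. The density notation $p(x,y)$, $p(x)$ and $p(y)$ is introduced here for illustration and does not appear elsewhere on this page.

$$
I_s(X, Y) = \frac{1}{2} \iint \left( \frac{p(x,y)}{p(x)\,p(y)} - 1 \right)^2 p(x)\,p(y)\, \mathrm{d}x\, \mathrm{d}y
$$

Under this definition, $I_s(X,Y) \ge 0$, with equality if and only if $X$ and $Y$ are statistically independent, which is what makes it usable as a dependence measure for feature selection.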
Download
MATLAB implementation of $\ell_1$-LSMI: l1lsmi.zip. A full source tree can also be obtained from the GitHub page. See the GitHub page for a usage demo. Alternatively, you may:
 Run startup.m to add all files to the search path.
 Open and run demo_pglsmi.m.
Examples
Define a toy dataset with an XOR relationship as follows:
 $Y = \text{xor}(X_1,X_2)$, where $\text{xor}(X_1,X_2)$ denotes the XOR function of $X_1$ and $X_2$.
 $X_1,\ldots,X_5 \sim \mathrm{Bernoulli}(0.5)$, where $\mathrm{Bernoulli}(p)$ denotes the Bernoulli distribution taking value 1 with probability $p$.
 $X_6,\ldots,X_{10} \sim \mathrm{Bernoulli}(0.75)$.
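The dataset definition above can be sketched in a few lines of Python using only the standard library. This is an illustration, not part of the MATLAB package: the function name `make_xor_toy_dataset`, the sample size `n=400`, and the `seed` parameter are all assumptions made for this example.

```python
import random


def make_xor_toy_dataset(n, seed=0):
    """Draw n samples of the 10-feature XOR toy problem (hypothetical helper)."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        # X_1..X_5 ~ Bernoulli(0.5)
        row = [int(rng.random() < 0.5) for _ in range(5)]
        # X_6..X_10 ~ Bernoulli(0.75)
        row += [int(rng.random() < 0.75) for _ in range(5)]
        X.append(row)
        # Y = xor(X_1, X_2); the other 8 features do not influence Y
        Y.append(row[0] ^ row[1])
    return X, Y


# n=400 is an arbitrary sample size chosen for illustration
X, Y = make_xor_toy_dataset(n=400)
```

Each row of `X` holds one sample of the 10 features and `Y` holds the corresponding binary labels.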
This is a binary classification problem with 2 true features ($X_1$ and $X_2$) and 8 distracting features. By setting the desired number of features ($k$) to 2, $\ell_1$-LSMI automatically finds a value of $z$ that yields two selected features. The learned sparse weight vector $\widehat{\boldsymbol{w}}$ is shown as follows.
It can be seen that $\ell_1$-LSMI correctly identifies the dependent features $X_1$ and $X_2$. All other features receive zero weight. Since only the weights of $X_1$ and $X_2$ are nonzero, only $X_1$ and $X_2$ need to be kept.
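The selection rule, keeping the features whose weights are nonzero, can be sketched as follows. The helper `select_features`, its tolerance `tol`, and the weight values in `w_hat` are hypothetical and chosen only to mirror the outcome described above; they are not actual output of the MATLAB code.

```python
def select_features(w, tol=1e-8):
    """Return 1-based indices of features whose weight magnitude exceeds tol."""
    return [j + 1 for j, w_j in enumerate(w) if abs(w_j) > tol]


# Hypothetical weight vector for illustration only (not real l1-LSMI output):
# nonzero weights on X_1 and X_2, zero weights on the 8 distracting features.
w_hat = [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

selected = select_features(w_hat)
print(selected)  # [1, 2]
```

A small tolerance is used instead of an exact comparison with zero because a numerical solver may return tiny nonzero values for weights that are zero in exact arithmetic.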
References

Jitkrittum, W., Hachiya, H., Sugiyama, M.
Feature Selection via $\ell_1$-Penalized Squared-Loss Mutual Information
IEICE Transactions on Information and Systems, vol.E96-D, no.7, pp.1513-1524, 2013.

Jitkrittum, W., Hachiya, H., Sugiyama, M.
Feature Selection via $\ell_1$-Penalized Squared-Loss Mutual Information
arXiv:1210.1960

Suzuki, T., Sugiyama, M., Kanamori, T., & Sese, J.
Mutual information estimation reveals global associations between stimuli and biological processes.
BMC Bioinformatics, vol.10, no.1, pp.S52, 2009.

Kanamori, T., Hido, S., & Sugiyama, M.
A least-squares approach to direct importance estimation.
Journal of Machine Learning Research, vol.10 (Jul.), pp.1391-1445, 2009.