Publications

Kernel Stein Tests for Multiple Model Comparison 2019. To appear in NeurIPS 2019.
We address the problem of nonparametric multiple model comparison: given l candidate models, decide whether each candidate is as good as the best one(s) in the list (negative), or worse (positive). We propose two statistical tests, each controlling a different notion of decision errors. The first test, building on the post selection inference framework, provably controls the fraction of best models that are wrongly declared worse (false positive rate). The second test is based on multiple correction, and controls the fraction of the models declared worse that are in fact as good as the best (false discovery rate). We prove that under some conditions the first test can yield a higher true positive rate than the second. Experimental results on toy and real (CelebA, Chicago Crime data) problems show that the two tests have high true positive rates with well-controlled error rates. By contrast, the naive approach of choosing the model with the lowest score without correction leads to a large number of false positives.
@misc{nrel_test2019, author = {{Lim}, Jenning and {Yamada}, Makoto and {Sch\"{o}lkopf}, Bernhard and {Jitkrittum}, Wittawat}, title = {Kernel Stein Tests for Multiple Model Comparison}, year = {2019}, wj_note = {To appear in NeurIPS 2019} }

A Kernel Stein Test for Comparing Latent Variable Models ArXiv 2019
We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018, Jitkrittum et al., 2018) which required exact access to unnormalized densities. Our new test relies on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As our main theoretical contribution, we prove that the new test, with a properly corrected threshold, has a well-controlled type-I error. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test (Bounliphone et al., 2015), which cannot exploit the latent structure.
@article{latent_stein_test2019, author = {{Kanagawa}, Heishiro and {Jitkrittum}, Wittawat and {Mackey}, Lester and {Fukumizu}, Kenji and {Gretton}, Arthur}, title = {A Kernel Stein Test for Comparing Latent Variable Models}, journal = {ArXiv}, keywords = {Statistics - Machine Learning, Computer Science - Machine Learning}, year = {2019}, month = jul, eid = {arXiv:1907.00586}, pages = {arXiv:1907.00586}, archiveprefix = {arXiv}, eprint = {1907.00586}, primaryclass = {stat.ML}, wj_http = {https://arxiv.org/abs/1907.00586} }

Generate Semantically Similar Images with Kernel Mean Matching Women in Computer Vision Workshop, CVPR 2019. Oral presentation. 6 out of 64 accepted papers.
@misc{cagan_wicv2019, author = {Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and {Sch{\"o}lkopf}, Bernhard}, title = {Generate Semantically Similar Images with Kernel Mean Matching}, howpublished = {Women in Computer Vision Workshop, CVPR}, year = {2019}, note = {The first two authors contributed equally. }, wj_http = {/assets/papers/cagan_wicv2019.pdf}, wj_highlight = {Oral presentation. 6 out of 64 accepted papers.}, wj_code = {https://github.com/wittawatj/cadgan} }

Kernel Mean Matching for Content Addressability of GANs ICML 2019. Long oral presentation.
We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model, e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (of arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed approach, based on kernel mean matching, is applicable to any generative model which transforms latent vectors to samples, and does not require retraining of the model. Experiments on various high-dimensional image generation problems (CelebA-HQ, LSUN bedroom, bridge, tower) show that our approach is able to generate images which are consistent with the input set, while retaining the image quality of the original model. To our knowledge, this is the first work that attempts to construct, at test time, a content-addressable generative model from a trained marginal model.
@inproceedings{cagan_icml2019, author = {Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and {Sch{\"o}lkopf}, Bernhard}, title = {Kernel Mean Matching for Content Addressability of {GANs}}, booktitle = {ICML}, year = {2019}, note = {The first two authors contributed equally.}, wj_http = {https://arxiv.org/abs/1905.05882}, wj_code = {https://github.com/wittawatj/cadgan}, wj_poster = {/assets/poster/cadgan_poster_icml2019.pdf}, wj_slides = {https://docs.google.com/presentation/d/1XdsSP7jji2QB_tZf8QAYCJhs6QzGDmR_BLmOHLFCMqY/edit?usp=sharing}, wj_talk = {https://slideslive.com/38917639/applicationscomputervision}, wj_highlight = {Long oral presentation} }

Witnessing Adversarial Training in Reproducing Kernel Hilbert Spaces ArXiv 2019
Modern implicit generative models such as generative adversarial networks (GANs) are generally known to suffer from instability and lack of interpretability as it is difficult to diagnose what aspects of the target distribution are missed by the generative model. In this work, we propose a theoretically grounded solution to these issues by augmenting the GAN’s loss function with a kernel-based regularization term that magnifies local discrepancy between the distributions of generated and real samples. The proposed method relies on so-called witness points in the data space which are jointly trained with the generator and provide an interpretable indication of where the two distributions locally differ during the training procedure. In addition, the proposed algorithm is scaled to higher dimensions by learning the witness locations in a latent space of an autoencoder. We theoretically investigate the dynamics of the training procedure, prove that a desirable equilibrium point exists, and the dynamical system is locally stable around this equilibrium. Finally, we demonstrate different aspects of the proposed algorithm by numerical simulations of analytical solutions and empirical results for low- and high-dimensional datasets.
@article{witness_gan_rkhs2019, author = {{Mehrjou}, Arash and {Jitkrittum}, Wittawat and {Sch{\"o}lkopf}, Bernhard and {Muandet}, Krikamol}, title = {{Witnessing Adversarial Training in Reproducing Kernel Hilbert Spaces}}, journal = {ArXiv}, keywords = {Computer Science - Machine Learning, Statistics - Machine Learning}, year = {2019}, month = jan, eid = {arXiv:1901.09206}, eprint = {1901.09206}, primaryclass = {cs.LG}, wj_http = {https://arxiv.org/abs/1901.09206} }

Fisher Efficient Inference of Intractable Models ArXiv 2018. To appear in NeurIPS 2019.
The Maximum Likelihood Estimator (MLE) has many good properties. For example, the asymptotic variance of the MLE solution attains equality in the asymptotic Cramér-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such an MLE solution requires calculating the likelihood function, which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via a density ratio estimation procedure and a Stein operator. We study the problem of model inference using DLE. We prove its consistency and show that the asymptotic variance of its solution can also attain equality in the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pretrained deep neural network.
@article{2018arXiv180507454L, author = {{Liu}, Song and {Kanamori}, Takafumi and {Jitkrittum}, Wittawat and {Chen}, Yu}, title = {{Fisher Efficient Inference of Intractable Models}}, journal = {ArXiv}, keywords = {Statistics - Machine Learning, Computer Science - Machine Learning}, year = {2018}, month = may, eid = {arXiv:1805.07454}, pages = {arXiv:1805.07454}, archiveprefix = {arXiv}, eprint = {1805.07454}, primaryclass = {stat.ML}, wj_http = {https://arxiv.org/abs/1805.07454}, wj_note = {To appear in NeurIPS 2019} }

Large Sample Analysis of the Median Heuristic ArXiv 2018
In kernel methods, the median heuristic has been widely used as a way of setting the bandwidth of RBF kernels. While its empirical performance makes it a safe choice under many circumstances, there is little theoretical understanding of why this is the case. Our aim in this paper is to advance our understanding of the median heuristic by focusing on the setting of the kernel two-sample test. We collect new findings that may be of interest to both theoreticians and practitioners. In theory, we provide a convergence analysis that shows the asymptotic normality of the bandwidth chosen by the median heuristic in the setting of the kernel two-sample test. Systematic empirical investigations are also conducted in simple settings, comparing the performance based on the bandwidths chosen by the median heuristic with that based on bandwidths chosen by maximizing the test power.
@article{median_heu_2018, author = {{Garreau}, Damien and {Jitkrittum}, Wittawat and {Kanagawa}, Motonobu}, title = {Large Sample Analysis of the Median Heuristic}, journal = {ArXiv}, eprint = {1707.07269}, keywords = {Mathematics - Statistics Theory, Statistics - Machine Learning, 62E20, 62G30}, year = {2018}, wj_http = {https://arxiv.org/abs/1707.07269} }

Informative Features for Model Comparison NeurIPS 2018. A linear-time test of relative goodness of fit of two models on a dataset. The test can produce evidence indicating where one model is better than the other. Applicable to implicit models such as GANs.
Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.
@inproceedings{jitkrittum_kmod2018, title = {Informative Features for Model Comparison}, author = {Jitkrittum, Wittawat and Kanagawa, Heishiro and Sangkloy, Patsorn and Hays, James and Sch\"{o}lkopf, Bernhard and Gretton, Arthur}, booktitle = {NeurIPS}, year = {2018}, wj_http = {https://arxiv.org/abs/1810.11630}, wj_code = {https://github.com/wittawatj/kernelmod}, wj_poster = {/assets/poster/kmod_nips2018_poster.pdf}, wj_img = {cover_nips2018.png}, wj_summary = { A linear-time test of relative goodness of fit of two models on a dataset. The test can produce evidence indicating where one model is better than the other. Applicable to implicit models such as GANs. } }

A Linear-Time Kernel Goodness-of-Fit Test NeurIPS 2017. NeurIPS 2017 Best paper. 3 out of 3240 submissions. A linear-time test of goodness of fit of an unnormalized density function on a dataset. The test can produce evidence indicating where (in the data domain) the model does not fit well.
We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein’s method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.
@inproceedings{jitkrittum_lineartime_2017, title = {A Linear-Time Kernel Goodness-of-Fit Test}, url = {http://arxiv.org/abs/1705.07673}, booktitle = {NeurIPS}, author = {Jitkrittum, Wittawat and Xu, Wenkai and Szabo, Zoltan and Fukumizu, Kenji and Gretton, Arthur}, year = {2017}, wj_img = {cover_nips2017.png}, wj_summary = { A linear-time test of goodness of fit of an unnormalized density function on a dataset. The test can produce evidence indicating where (in the data domain) the model does not fit well. }, wj_http = {http://arxiv.org/abs/1705.07673}, wj_code = {https://github.com/wittawatj/kernelgof}, wj_poster = {/assets/poster/kgof_nips2017_poster.pdf}, wj_slides = {/assets/talks/kgof_nips2017_oral.pdf}, wj_talk = {https://www.facebook.com/nipsfoundation/videos/1553635538061013/}, wj_highlight = {NeurIPS 2017 Best paper. 3 out of 3240 submissions.} }

Kernel-Based Distribution Features for Statistical Tests and Bayesian Inference 2017. My PhD Thesis. Gatsby Unit, University College London.
@phdthesis{phdthesis2017, author = {Jitkrittum, Wittawat}, title = {Kernel-Based Distribution Features for Statistical Tests and {Bayesian} Inference}, school = {University College London}, year = {2017}, month = nov, url = {http://discovery.ucl.ac.uk/10037987/}, wj_http = {http://discovery.ucl.ac.uk/10037987/}, wj_highlight = {My PhD Thesis. Gatsby Unit, University College London.} }

An Adaptive Test of Independence with Analytic Kernel Embeddings ICML 2017
A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample size n). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distribution and the product of the marginals differ most. Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing O(n) and O(n log n) tests.
@inproceedings{pmlrv70jitkrittum17a, title = {An Adaptive Test of Independence with Analytic Kernel Embeddings}, author = {Jitkrittum, Wittawat and Szab{\'o}, Zolt{\'a}n and Gretton, Arthur}, booktitle = {ICML}, year = {2017}, editor = {Precup, Doina and Teh, Yee Whye}, volume = {70}, url = {http://proceedings.mlr.press/v70/jitkrittum17a.html}, wj_http = {http://proceedings.mlr.press/v70/jitkrittum17a.html}, wj_code = {https://github.com/wittawatj/fsictest}, wj_slides = {/assets/talks/fsic_icml2017_oral.pdf}, wj_talk = {https://vimeo.com/255244123}, wj_poster = {/assets/poster/fsic_icml2017_poster.pdf} }

Cognitive Bias in Ambiguity Judgements: Using Computational Models to Dissect the Effects of Mild Mood Manipulation in Humans PLOS ONE 2016
@article{iigaya2016, title = {Cognitive Bias in Ambiguity Judgements: Using Computational Models to Dissect the Effects of Mild Mood Manipulation in Humans}, author = {Iigaya, Kiyohito and Jolivald, Aurelie and Jitkrittum, Wittawat and Gilchrist, Iain and Dayan, Peter and Paul, Elizabeth and Mendl, Michael}, year = {2016}, month = oct, journal = {PLOS ONE}, issn = {1932-6203}, publisher = {Public Library of Science}, wj_http = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165840} }

Interpretable Distribution Features with Maximum Testing Power NeurIPS 2016. Oral presentation. 1.8% of total submissions.
Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e., features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using these features. The result is a parsimonious and interpretable indication of how and where two distributions differ locally. An empirical estimate of the test power criterion converges with increasing sample size, ensuring the quality of the returned features. In real-world benchmarks on high-dimensional text and image data, linear-time tests using the proposed semimetrics achieve comparable performance to the state-of-the-art quadratic-time maximum mean discrepancy test, while returning human-interpretable features that explain the test results.
@inproceedings{fotest2016, author = {Jitkrittum, Wittawat and Szab\'{o}, Zolt\'{a}n and Chwialkowski, Kacper and Gretton, Arthur}, title = {Interpretable Distribution Features with Maximum Testing Power}, booktitle = {NeurIPS}, year = {2016}, url = {http://papers.nips.cc/paper/6148interpretabledistributionfeatureswithmaximumtestingpower}, wj_http = {http://papers.nips.cc/paper/6148interpretabledistributionfeatureswithmaximumtestingpower}, wj_poster = {/assets/poster/fotest_poster.pdf}, wj_code = {https://github.com/wittawatj/interpretabletest}, wj_talk = {https://channel9.msdn.com/Events/NeuralInformationProcessingSystemsConference/NeuralInformationProcessingSystemsConferenceNIPS2016/InterpretableDistributionFeatureswithMaximumTestingPower}, wj_slides = {/assets/talks/fotest_oral.pdf}, wj_highlight = {Oral presentation. 1.8\% of total submissions.} }

K2ABC: Approximate Bayesian Computation with Infinite Dimensional Summary Statistics via Kernel Embeddings AISTATS 2016. Oral presentation. 6.5% of total submissions.
Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will "leak" information, which leads to ABC algorithms yielding samples from an incorrect (partial) posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2ABC, uses maximum mean discrepancy (MMD) as a dissimilarity measure between the distributions over observed and simulated data. MMD is easily estimated as the squared difference between their empirical kernel embeddings. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm.
@inproceedings{part_k2abc_2015_arxiv, author = {Park, Mijung and Jitkrittum, Wittawat and Sejdinovic, Dino}, title = {{K2ABC}: Approximate {B}ayesian Computation with Infinite Dimensional Summary Statistics via Kernel Embeddings}, booktitle = {AISTATS}, year = {2016}, url = {http://jmlr.org/proceedings/papers/v51/park16.html}, wj_http = {http://jmlr.org/proceedings/papers/v51/park16.html}, wj_poster = {/assets/poster/k2abc_AISTATS2016_poster.pdf}, wj_code = {https://github.com/wittawatj/k2abc}, wj_slides = {/assets/talks/k2abc_AISTATS2016.pdf}, wj_highlight = {Oral presentation. 6.5\% of total submissions.} }

Bayesian Manifold Learning: The Locally Linear Latent Variable Model NeurIPS 2015
We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. The model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data. Thus, the LL-LVM encapsulates the local-geometry preserving intuitions that underlie non-probabilistic methods such as locally linear embedding (LLE). Its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships, select the intrinsic dimensionality of the manifold, construct out-of-sample extensions and to combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold.
@inproceedings{Park2015, author = {Park, Mijung and Jitkrittum, Wittawat and Qamar, Ahmad and Szab\'{o}, Zolt\'{a}n and Buesing, Lars and Sahani, Maneesh}, title = {Bayesian Manifold Learning: The Locally Linear Latent Variable Model}, booktitle = {NeurIPS}, year = {2015}, url = {http://arxiv.org/abs/1410.6791}, wj_http = {http://arxiv.org/abs/1410.6791}, wj_code = {https://github.com/mijungi/lllvm}, wj_slides = {/assets/talks/csml_lllvm.pdf} }

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages UAI 2015
We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.
@inproceedings{jitkrittum_kernelbased_2015, title = {Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages}, author = {Jitkrittum, Wittawat and Gretton, Arthur and Heess, Nicolas and Eslami, S. M. Ali and Lakshminarayanan, Balaji and Sejdinovic, Dino and Szab\'{o}, Zolt\'{a}n}, url = {http://arxiv.org/abs/1503.02551}, booktitle = {UAI}, year = {2015}, wj_http = {http://arxiv.org/abs/1503.02551}, wj_poster = {/assets/poster/kjit_uai2015_poster.pdf}, wj_code = {https://github.com/wittawatj/kernelep} }

Performance of synchrony and spectral-based features in early seizure detection: exploring feature combinations and effect of latency International Workshop on Seizure Prediction (IWSP) 2015: Epilepsy Mechanisms, Models, Prediction and Control 2015
@misc{adam+al:2015:iwsp7, author = {Adam, Vincent and Soldado-Magraner, Joana and Jitkrittum, Wittawat and Strathmann, Heiko and Lakshminarayanan, Balaji and Ialongo, Alessandro Davide and Bohner, Gergo and Huh, Ben Dongsung and Goetz, Lea and Dowling, Shaun and Serban, Iulian Vlad and Louis, Matthieu}, title = {Performance of synchrony and spectral-based features in early seizure detection: exploring feature combinations and effect of latency}, booktitle = {International Workshop on Seizure Prediction (IWSP) 2015: Epilepsy Mechanisms, Models, Prediction and Control}, year = {2015}, wj_http = {http://www.iwsp7.org/}, wj_code = {https://github.com/vincentadam87/gatsbyhackathonseizure} }

High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso Neural Computation 2014
The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.
@article{YamadaJSXS14, author = {Yamada, Makoto and Jitkrittum, Wittawat and Sigal, Leonid and Xing, Eric P. and Sugiyama, Masashi}, title = {High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso}, journal = {Neural Computation}, volume = {26}, number = {1}, year = {2014}, pages = {185--207}, url = {http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00537#.U9O7Idtsylg}, wj_http = {http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00537#.U9O7Idtsylg}, wj_code = {http://www.makotoyamadaml.com/hsiclasso.html} }

Feature Selection via L1-Penalized Squared-Loss Mutual Information IEICE Transactions 2013
Feature selection is a technique to screen out less important features. Many existing supervised feature selection algorithms use redundancy and relevancy as the main criteria to select features. However, feature interaction, potentially a key characteristic in real-world problems, has not received much attention. As an attempt to take feature interaction into account, we propose L1-LSMI, an L1-regularization based algorithm that maximizes a squared-loss variant of mutual information between selected features and outputs. Numerical results show that L1-LSMI performs well in handling redundancy, detecting non-linear dependency, and considering feature interaction.
@article{Jitkrittum2013, author = {Jitkrittum, Wittawat and Hachiya, Hirotaka and Sugiyama, Masashi}, title = {Feature Selection via L1-Penalized Squared-Loss Mutual Information}, journal = {IEICE Transactions}, year = {2013}, volume = {96-D}, pages = {1513--1524}, number = {7}, wj_pdf = {http://wittawat.com/pages/files/L1LSMI.pdf}, wj_code = {https://github.com/wittawatj/l1lsmi} }

Squared-loss Mutual Information Regularization: A Novel Information-theoretic Approach to Semi-supervised Learning ICML 2013
We propose squared-loss mutual information regularization (SMIR) for multi-class probabilistic classification, following the information maximization principle. SMIR is convex under mild conditions and thus improves the nonconvexity of mutual information regularization. It offers all of the following four abilities to semi-supervised algorithms: analytical solution, out-of-sample/multi-class classification, and probabilistic output. Furthermore, novel generalization error bounds are derived. Experiments show SMIR compares favorably with state-of-the-art methods.
@inproceedings{Niu2013, author = {Niu, Gang and Jitkrittum, Wittawat and Dai, Bo and Hachiya, Hirotaka and Sugiyama, Masashi}, title = {Squared-loss Mutual Information Regularization: A Novel Information-theoretic Approach to Semi-supervised Learning}, booktitle = {ICML}, year = {2013}, volume = {28}, pages = {10--18}, url = {http://jmlr.org/proceedings/papers/v28/niu13.pdf}, wj_pdf = {http://jmlr.org/proceedings/papers/v28/niu13.pdf}, wj_code = {https://github.com/wittawatj/smir} }

QAST: Question Answering System for Thai Wikipedia Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions 2009
We propose an open-domain question answering system using Thai Wikipedia as the knowledge base. Two types of information are used for answering a question: (1) structured information extracted and stored in the form of Resource Description Framework (RDF), and (2) unstructured texts stored as a search index. For the structured information, a SPARQL transformed query is applied to retrieve a short answer from the RDF base. For the unstructured information, a keyword-based query is used to retrieve the shortest text span containing the question’s key terms. In the experimental results, the system which integrates both approaches achieved an average MRR of 0.47 on 215 test questions.
@inproceedings{Jitkrittum2009, author = {Jitkrittum, Wittawat and Haruechaiyasak, Choochart and Theeramunkong, Thanaruk}, title = {{QAST}: Question Answering System for {Thai} Wikipedia}, booktitle = {Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions}, year = {2009}, series = {KRAQ '09}, pages = {11--14}, publisher = {Association for Computational Linguistics}, url = {http://dl.acm.org/citation.cfm?id=1697288.1697291}, wj_http = {http://dl.acm.org/citation.cfm?id=1697288.1697291} }

Implementing News Article Category Browsing Based on Text Categorization Technique Web Intelligence/IAT Workshops 2008
We propose a feature called category browsing to enhance the full-text search function of a Thai-language news article search engine. Category browsing allows users to browse and filter search results based on some predefined categories. To implement the category browsing feature, we applied and compared several text categorization algorithms including decision tree, Naive Bayes (NB) and Support Vector Machines (SVM). To further increase the performance of text categorization, we evaluated several feature selection techniques including document frequency thresholding (DF), information gain (IG) and χ² (CHI). Based on our experiments using a large news corpus, the SVM algorithm with the IG feature selection yielded the best performance with the F1 measure equal to 95.42%.
@inproceedings{Haruechaiyasak2008, author = {Haruechaiyasak, Choochart and Jitkrittum, Wittawat and Sangkeettrakarn, Chatchawal and Damrongrat, Chaianun}, title = {Implementing News Article Category Browsing Based on Text Categorization Technique}, booktitle = {Web Intelligence/IAT Workshops}, year = {2008}, pages = {143--146}, ee = {http://dx.doi.org/10.1109/WIIAT.2008.61}, wj_http = {http://dx.doi.org/10.1109/WIIAT.2008.61} }

Proximity-Based Semantic Relatedness Measurement on Thai Wikipedia International Conference on Knowledge, Information and Creativity Support Systems (KICSS) 2008
@inproceedings{proximity2008, author = {Jitkrittum, Wittawat and Theeramunkong, Thanaruk and Haruechaiyasak, Choochart}, title = {Proximity-Based Semantic Relatedness Measurement on {Thai} {Wikipedia}}, booktitle = {International Conference on Knowledge, Information and Creativity Support Systems (KICSS)}, year = {2008} }

Managing Offline Educational Web Contents with Search Engine Tools International Conference on Asian Digital Libraries 2007
In this paper, we describe our ongoing project to help alleviate the digital divide problem among high schools in rural areas of Thailand. The idea is to select, organize, index and distribute useful educational Web contents to schools where an Internet connection is not available. These Web contents can be used by teachers and students to enhance teaching and learning in many class subjects. We have collaborated with a group of teachers from different high schools in order to gather the requirements for designing our software tools. One of the challenging issues is the variation in computer hardware and network configuration found in different schools. Some schools have PCs connected to the school’s server via a Local Area Network (LAN), while other schools have low-performance PCs without any network connection. To support both cases, we provide two solutions via two different search engine tools. These tools support content administrators, e.g., teachers, with features to organize and index the contents. The tools also provide general users with features to browse and search for needed information. Since the contents and index are stored locally on a hard disk or on removable media such as CD-ROM, an Internet connection is not needed.
@inproceedings{Haruechaiyasak2007, author = {Haruechaiyasak, Choochart and Sangkeettrakarn, Chatchawal and Jitkrittum, Wittawat}, title = {Managing Offline Educational Web Contents with Search Engine Tools}, booktitle = {International Conference on Asian Digital Libraries}, year = {2007}, pages = {444--453}, ee = {http://dx.doi.org/10.1007/9783540770947_56}, wj_http = {http://dx.doi.org/10.1007/9783540770947_56} }