1. ABCDP: Approximate Bayesian Computation with Differential Privacy Park, Mijung and Vinaroz, Margarita and Jitkrittum, Wittawat Entropy 2021
We developed a novel approximate Bayesian computation (ABC) framework, ABCDP, which produces differentially private (DP) and approximate posterior samples. Our framework takes advantage of the sparse vector technique (SVT), widely studied in the differential privacy literature. SVT incurs the privacy cost only when a condition (whether a quantity of interest is above/below a threshold) is met. If the condition is sparsely met during the repeated queries, SVT can drastically reduce the cumulative privacy loss, unlike the usual case where every query incurs the privacy loss. In ABC, the quantity of interest is the distance between observed and simulated data, and only when the distance is below a threshold can we take the corresponding prior sample as a posterior sample. Hence, applying SVT to ABC is an organic way to transform an ABC algorithm to a privacy-preserving variant with minimal modification, but yields the posterior samples with a high privacy level. We theoretically analyzed the interplay between the noise added for privacy and the accuracy of the posterior samples. We apply ABCDP to several data simulators and show the efficacy of the proposed framework.
@article{abcdp2021,
author = {Park, Mijung and Vinaroz, Margarita and Jitkrittum, Wittawat},
title = {ABCDP: Approximate Bayesian Computation with Differential Privacy},
journal = {Entropy},
volume = {23},
year = {2021},
number = {8},
article-number = {961},
url = {https://www.mdpi.com/1099-4300/23/8/961},
issn = {1099-4300},
doi = {10.3390/e23080961},
wj_http = {https://www.mdpi.com/1099-4300/23/8/961}
}


2. Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces Rawat, Ankit Singh and Menon, Aditya Krishna and Jitkrittum, Wittawat and Jayasumana, Sadeep and Yu, Felix X. and Reddi, Sashank and Kumar, Sanjiv ICML 2021
Negative sampling schemes enable efficient training given a large number of classes, by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels. Further, we provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance. We empirically verify our findings on long-tail classification and retrieval benchmarks.
@inproceedings{rawat2021,
title = {Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces},
author = {Rawat, Ankit Singh and Menon, Aditya Krishna and Jitkrittum, Wittawat and Jayasumana, Sadeep and Yu, Felix X. and Reddi, Sashank and Kumar, Sanjiv},
booktitle = {ICML},
year = {2021},
wj_http = {https://arxiv.org/abs/2105.05736}
}


3. A Witness Two-Sample Test Kübler, Jonas M. and Jitkrittum, Wittawat and Schölkopf, Bernhard and Muandet, Krikamol ArXiv 2021
We propose data-dependent test statistics based on a one-dimensional witness function, which we call witness two-sample tests (WiTS tests). We first optimize the witness function by maximizing an asymptotic test-power objective and then use as the test statistic the difference in means of the witness evaluated on two held-out test samples. When the witness function belongs to a reproducing kernel Hilbert space, we show that the optimal witness is given via kernel Fisher discriminant analysis, whose solution we compute in closed form. We show that the WiTS test based on a characteristic kernel is consistent against any fixed alternative. Our experiments demonstrate that the WiTS test can achieve higher test power than existing two-sample tests with optimized kernels, suggesting that learning a high- or infinite-dimensional representation of the data may not be necessary for two-sample testing. The proposed procedure works beyond kernel methods, allowing practitioners to apply it within their preferred machine learning framework.
@misc{kuebler2021witness,
title = {A Witness Two-Sample Test},
author = {Kübler, Jonas M. and Jitkrittum, Wittawat and Schölkopf, Bernhard and Muandet, Krikamol},
journal = {ArXiv},
year = {2021},
eprint = {2102.05573},
archiveprefix = {arXiv},
primaryclass = {cs.LG},
wj_http = {https://arxiv.org/abs/2102.05573}
}


4. Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation Zhu, Jia-Jie and Jitkrittum, Wittawat and Diehl, Moritz and Schölkopf, Bernhard Proceedings of The 24th International Conference on Artificial Intelligence and Statistics 2021
We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.
@inproceedings{zhu2021kernel,
title = { Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation },
author = {Zhu, Jia-Jie and Jitkrittum, Wittawat and Diehl, Moritz and Sch{\"o}lkopf, Bernhard},
booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
pages = {280--288},
year = {2021},
editor = {Banerjee, Arindam and Fukumizu, Kenji},
volume = {130},
series = {Proceedings of Machine Learning Research},
month = {13--15 Apr},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v130/zhu21a/zhu21a.pdf},
url = {http://proceedings.mlr.press/v130/zhu21a.html},
wj_http = {http://proceedings.mlr.press/v130/zhu21a.html}
}


5. More Powerful Selective Kernel Tests for Feature Selection Lim, Jen Ning and Yamada, Makoto and Jitkrittum, Wittawat and Terada, Yoshikazu and Matsui, Shigeyuki and Shimodaira, Hidetoshi Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics 2020
Refining one’s hypotheses in light of data is a commonplace scientific practice, however,this approach introduces selection bias and can lead to specious statisticalanalysis.One approach of addressing this phenomena is via conditioning on the selection procedure, i.e., how we have used the data to generate our hypotheses, and prevents information to be used again after selection.Many selective inference (a.k.a. post-selection inference) algorithms typically take this approach but will “over-condition”for sake of tractability. While this practice obtains well calibrated p-values,it can incur a major loss in power. In our work, we extend two recent proposals for selecting features using the Maximum Mean Discrepancyand Hilbert Schmidt Independence Criterion to condition on the minimalconditioning event. We show how recent advances inmultiscale bootstrap makesthis possible and demonstrate our proposal over a range of synthetic and real world experiments.Our results show that our proposed test is indeed more powerful in most scenarios.
@inproceedings{pmlr-v108-lim20a,
title = {More Powerful Selective Kernel Tests for Feature Selection},
author = {Lim, Jen Ning and Yamada, Makoto and Jitkrittum, Wittawat and Terada, Yoshikazu and Matsui, Shigeyuki and Shimodaira, Hidetoshi},
booktitle = {Proceedings of The 23rd International Conference on Artificial Intelligence and Statistics},
pages = {820--830},
year = {2020},
editor = {Chiappa, Silvia and Calandra, Roberto},
volume = {108},
series = {Proceedings of Machine Learning Research},
month = {26--28 Aug},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v108/lim20a/lim20a.pdf},
url = {http://proceedings.mlr.press/v108/lim20a.html},
wj_http = {http://proceedings.mlr.press/v108/lim20a.html},
wj_code = {https://github.com/jenninglim/multiscale-features}
}


6. Learning Kernel Tests Without Data Splitting Kübler, Jonas M. and Jitkrittum, Wittawat and Schölkopf, Bernhard and Muandet, Krikamol NeurIPS 2020
Modern large-scale kernel-based tests such as maximum mean discrepancy (MMD) and kernelized Stein discrepancy (KSD) optimize kernel hyperparameters on a held-out sample via data splitting to obtain the most powerful test statistics. While data splitting results in a tractable null distribution, it suffers from a reduction in test power due to smaller test sample size. Inspired by the selective inference framework, we propose an approach that enables learning the hyperparameters and testing on the full sample without data splitting. Our approach can correctly calibrate the test in the presence of such dependency, and yield a test threshold in closed form. At the same significance level, our approach’s test power is empirically larger than that of the data-splitting approach, regardless of its split proportion.
@incollection{kbler2020learning,
title = {Learning Kernel Tests Without Data Splitting},
booktitle = {NeurIPS},
author = {K\"{u}bler, Jonas M. and Jitkrittum, Wittawat and Sch\"{o}lkopf, Bernhard and Muandet, Krikamol},
year = {2020},
wj_http = {https://arxiv.org/abs/2006.02286},
wj_code = {https://github.com/MPI-IS/tests-wo-splitting},
wj_talk = {https://nips.cc/virtual/2020/protected/poster_44f683a84163b3523afe57c2e008bc8c.html}
}


7. Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem Zhu, Jia-Jie and Jitkrittum, Wittawat and Diehl, Moritz and Schölkopf, Bernhard ArXiv 2020
In order to anticipate rare and impactful events, we propose to quantify the worst-case risk under distributional ambiguity using a recent development in kernel methods – the kernel mean embedding. Specifically, we formulate the generalized moment problem whose ambiguity set (i.e., the moment constraint) is described by constraints in the associated reproducing kernel Hilbert space in a nonparametric manner. We then present the tractable optimization formulation and its theoretical justification. As a concrete application, we numerically test the proposed method in characterizing the worst-case constraint violation probability in the context of a constrained stochastic control system.
@misc{zhu2020worstcase,
title = {Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem},
author = {Zhu, Jia-Jie and Jitkrittum, Wittawat and Diehl, Moritz and Schölkopf, Bernhard},
journal = {ArXiv},
year = {2020},
eprint = {2004.00166},
archiveprefix = {arXiv},
primaryclass = {math.OC},
wj_http = {https://arxiv.org/abs/2004.00166}
}


8. Testing Goodness of Fit of Conditional Density Models with Kernels Jitkrittum, Wittawat and Kanagawa, Heishiro and Schölkopf, Bernhard The Conference on Uncertainty in Artificial Intelligence (UAI) 2020
We propose two nonparametric statistical tests of goodness of fit for conditional distributions: given a conditional probability density function p(y|x) and a joint sample, decide whether the sample is drawn from p(y|x)rx(x) for some density rx. Our tests, formulated with a Stein operator, can be applied to any differentiable conditional density model, and require no knowledge of the normalizing constant. We show that 1) our tests are consistent against any fixed alternative conditional model; 2) the statistics can be estimated easily, requiring no density estimation as an intermediate step; and 3) our second test offers an interpretable test result providing insight on where the conditional model does not fit well in the domain of the covariate. We demonstrate the interpretability of our test on a task of modeling the distribution of New York City’s taxi drop-off location given a pick-up point. To our knowledge, our work is the first to propose such conditional goodness-of-fit tests that simultaneously have all these desirable properties.
@conference{jitkrittum2020testing,
title = {Testing Goodness of Fit of Conditional Density Models with Kernels},
author = {Jitkrittum, Wittawat and Kanagawa, Heishiro and Sch\"{o}lkopf, Bernhard},
booktitle = {The Conference on Uncertainty in Artificial Intelligence (UAI)},
year = {2020},
url = {https://arxiv.org/abs/2002.10271},
wj_http = {https://arxiv.org/abs/2002.10271},
wj_code = {https://github.com/wittawatj/kernel-cgof},
}


9. Kernel Conditional Moment Test via Maximum Moment Restriction Muandet, Krikamol and Jitkrittum, Wittawat and Kübler, Jonas The Conference on Uncertainty in Artificial Intelligence (UAI) 2020
We propose a new family of specification tests called kernel conditional moment (KCM) tests. Our tests are built on conditional moment embeddings (CMME)—a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS). After transforming the conditional moment restrictions into a continuum of unconditional counterparts, the test statistic is defined as the maximum moment restriction within the unit ball of the RKHS. We show that the CMME fully characterizes the original conditional moment restrictions, leading to consistency in both hypothesis testing and parameter estimation. The proposed test also has an analytic expression that is easy to compute as well as closed-form asymptotic distributions. Our empirical studies show that the KCM test has a promising finite-sample performance compared to existing tests.
@conference{mmrtest2020,
author = {{Muandet}, Krikamol and {Jitkrittum}, Wittawat and {K{\"u}bler}, Jonas},
booktitle = {The Conference on Uncertainty in Artificial Intelligence (UAI)},
title = {Kernel Conditional Moment Test via Maximum Moment Restriction},
year = {2020},
wj_http = {https://arxiv.org/abs/2002.09225},
wj_code = {https://github.com/krikamol/kcm-test}
}


10. Kernel Stein Tests for Multiple Model Comparison Lim, Jen Ning and Yamada, Makoto and Schölkopf, Bernhard and Jitkrittum, Wittawat NeurIPS 2019
We address the problem of nonparametric multiple model comparison: given l candidate models, decide whether each candidate is as good as the best one(s) in the list (negative), or worse (positive). We propose two statistical tests, each controlling a different notion of decision errors. The first test, building on the post selection inference framework, provably controls the fraction of best models that are wrongly declared worse (false positive rate). The second test is based on multiple correction, and controls the fraction of the models declared worse that are in fact as good as the best (false discovery rate). We prove that under some conditions the first test can yield a higher true positive rate than the second. Experimental results on toy and real (CelebA, Chicago Crime data) problems show that the two tests have high true positive rates with well-controlled error rates. By contrast, the naive approach of choosing the model with the lowest score without correction leads to a large number of false positives.
@incollection{NIPS2019_8496,
title = {Kernel Stein Tests for Multiple Model Comparison},
author = {Lim, Jen Ning and Yamada, Makoto and Sch\"{o}lkopf, Bernhard and Jitkrittum, Wittawat},
booktitle = {NeurIPS},
editor = {Wallach, H. and Larochelle, H. and Beygelzimer, A. and d\textquotesingle Alch\'{e}-Buc, F. and Fox, E. and Garnett, R.},
pages = {2240--2250},
year = {2019},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/8496-kernel-stein-tests-for-multiple-model-comparison.pdf},
wj_http = {https://arxiv.org/abs/1910.12252},
wj_code = {https://github.com/jenninglim/model-comparison-test},
wj_poster = {https://github.com/jenninglim/model-comparison-test/raw/master/docs/poster.pdf}
}


11. Fisher Efficient Inference of Intractable Models Liu, Song and Kanamori, Takafumi and Jitkrittum, Wittawat and Chen, Yu NeurIPS 2019
Maximum Likelihood Estimators (MLE) has many good properties. For example, the asymptotic variance of MLE solution attains equality of the asymptotic Cramér-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such MLE solution requires calculating the likelihood function which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via density ratio estimation procedure and Stein operator. We study the problem of model inference using DLE. We prove its consistency and show the asymptotic variance of its solution can also attain the equality of the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pre-trained deep neural network.
@incollection{NIPS2019_9083,
title = {Fisher Efficient Inference of Intractable Models},
author = {Liu, Song and Kanamori, Takafumi and Jitkrittum, Wittawat and Chen, Yu},
booktitle = {NeurIPS},
editor = {Wallach, H. and Larochelle, H. and Beygelzimer, A. and d\textquotesingle Alch\'{e}-Buc, F. and Fox, E. and Garnett, R.},
pages = {8790--8800},
year = {2019},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/9083-fisher-efficient-inference-of-intractable-models.pdf},
wj_http = {https://arxiv.org/abs/1805.07454},
wj_code = {https://github.com/anewgithubname/Stein-Density-Ratio-Estimation},
wj_slides = {https://github.com/lamfeeling/Stein-Density-Ratio-Estimation/blob/master/slides.pdf}
}


12. A Kernel Stein Test for Comparing Latent Variable Models Kanagawa, Heishiro and Jitkrittum, Wittawat and Mackey, Lester and Fukumizu, Kenji and Gretton, Arthur ArXiv 2019
We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018, Jitkrittum et al., 2018) which required exact access to unnormalized densities. Our new test relies on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As our main theoretical contribution, we prove that the new test, with a properly corrected threshold, has a well-controlled type-I error. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test (Bounliphone et al., 2015) , which cannot exploit the latent structure.
@article{latent_stein_test2019,
author = {{Kanagawa}, Heishiro and {Jitkrittum}, Wittawat and {Mackey}, Lester and {Fukumizu}, Kenji and {Gretton}, Arthur},
title = {A Kernel Stein Test for Comparing Latent Variable Models},
journal = {ArXiv},
keywords = {Statistics - Machine Learning, Computer Science - Machine Learning},
year = {2019},
month = jul,
eid = {arXiv:1907.00586},
pages = {arXiv:1907.00586},
archiveprefix = {arXiv},
eprint = {1907.00586},
primaryclass = {stat.ML},
wj_http = {https://arxiv.org/abs/1907.00586}
}


13. Generate Semantically Similar Images with Kernel Mean Matching Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and Schölkopf, Bernhard Women in Computer Vision Workshop, CVPR 2019
Oral presentation. 6 out of 64 accepted papers.
@misc{cagan_wicv2019,
author = {Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and {Sch{\"o}lkopf}, Bernhard},
title = {Generate Semantically Similar Images with Kernel Mean Matching},
howpublished = {Women in Computer Vision Workshop, CVPR},
year = {2019},
note = {The first two authors contributed equally. },
wj_http = {/assets/papers/cagan_wicv2019.pdf},
wj_highlight = {Oral presentation. 6 out of 64 accepted papers.},
}


14. Kernel Mean Matching for Content Addressability of GANs Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and Schölkopf, Bernhard ICML 2019
Long oral presentation
We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed approach, based on kernel mean matching, is applicable to any generative models which transform latent vectors to samples, and does not require retraining of the model. Experiments on various high-dimensional image generation problems (CelebA-HQ, LSUN bedroom, bridge, tower) show that our approach is able to generate images which are consistent with the input set, while retaining the image quality of the original model. To our knowledge, this is the first work that attempts to construct, at test time, a content-addressable generative model from a trained marginal model.
@inproceedings{cagan_icml2019,
author = {Jitkrittum, Wittawat and Sangkloy, Patsorn and Gondal, Muhammad Waleed and Raj, Amit and Hays, James and {Sch{\"o}lkopf}, Bernhard},
title = {Kernel Mean Matching for Content Addressability of {GANs}},
booktitle = {ICML},
year = {2019},
note = {The first two authors contributed equally.},
wj_http = {https://arxiv.org/abs/1905.05882},
wj_talk = {https://slideslive.com/38917639/applications-computer-vision},
wj_highlight = {Long oral presentation}
}


15. Kernel-Guided Training of Implicit Generative Models with Stability Guarantees Mehrjou, Arash and Jitkrittum, Wittawat and Muandet, Krikamol and Schölkopf, Bernhard ArXiv 2019
Modern implicit generative models such as generative adversarial networks (GANs) are generally known to suffer from issues such as instability, uninterpretability, and difficulty in assessing their performance. If we see these implicit models as dynamical systems, some of these issues are caused by being unable to control their behavior in a meaningful way during the course of training. In this work, we propose a theoretically grounded method to guide the training trajectories of GANs by augmenting the GAN loss function with a kernel-based regularization term that controls local and global discrepancies between the model and true distributions. This control signal allows us to inject prior knowledge into the model. We provide theoretical guarantees on the stability of the resulting dynamical system and demonstrate different aspects of it via a wide range of experiments.
@article{witness_gan_rkhs2019,
author = {{Mehrjou}, Arash and {Jitkrittum}, Wittawat and {Muandet}, Krikamol and {Sch{\"o}lkopf}, Bernhard},
title = {{Kernel-Guided Training of Implicit Generative Models with Stability Guarantees}},
journal = {ArXiv},
keywords = {Computer Science - Machine Learning, Statistics - Machine Learning},
year = {2019},
month = jan,
eid = {arXiv:1901.09206},
eprint = {1901.09206},
primaryclass = {cs.LG},
wj_http = {https://arxiv.org/abs/1901.09206}
}


16. Informative Features for Model Comparison Jitkrittum, Wittawat and Kanagawa, Heishiro and Sangkloy, Patsorn and Hays, James and Schölkopf, Bernhard and Gretton, Arthur NeurIPS 2018
A linear-time test of relative goodness of fit of two models on a dataset. The test can produce evidence indicating where one model is better than the other. Applicable to implicit models such as GANs.
Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.
@inproceedings{jitkrittum_kmod2018,
title = {Informative Features for Model Comparison},
author = {Jitkrittum, Wittawat and Kanagawa, Heishiro and Sangkloy, Patsorn and Hays, James and Sch\"{o}lkopf, Bernhard and Gretton, Arthur},
booktitle = {NeurIPS},
year = {2018},
wj_http = {https://arxiv.org/abs/1810.11630},
wj_code = {https://github.com/wittawatj/kernel-mod},
wj_poster = {/assets/poster/kmod_nips2018_poster.pdf},
wj_img = {cover_nips2018.png},
wj_summary = { A linear-time test of relative goodness of fit of two models on a dataset.  The test can produce evidence indicating where one model is better than the other. Applicable to implicit models such as GANs.  }
}


17. Large Sample Analysis of the Median Heuristic Garreau, Damien and Jitkrittum, Wittawat and Kanagawa, Motonobu ArXiv 2018
In kernel methods, the median heuristic has been widely used as a way of setting the bandwidth of RBF kernels. While its empirical performances make it a safe choice under many circumstances, there is little theoretical understanding of why this is the case. Our aim in this paper is to advance our understanding of the median heuristic by focusing on the setting of kernel two-sample test. We collect new findings that may be of interest for both theoreticians and practitioners. In theory, we provide a convergence analysis that shows the asymptotic normality of the bandwidth chosen by the median heuristic in the setting of kernel two-sample test. Systematic empirical investigations are also conducted in simple settings, comparing the performances based on the bandwidths chosen by the median heuristic and those by the maximization of test power.
@article{median_heu_2018,
author = {{Garreau}, Damien and {Jitkrittum}, Wittawat and {Kanagawa}, Motonobu},
title = {Large Sample Analysis of the Median Heuristic},
journal = {ArXiv},
eprint = {1707.07269},
keywords = {Mathematics - Statistics Theory, Statistics - Machine Learning, 62E20, 62G30},
year = {2018},
wj_http = {https://arxiv.org/abs/1707.07269}
}


18. A Linear-Time Kernel Goodness-of-Fit Test Jitkrittum, Wittawat and Xu, Wenkai and Szabo, Zoltan and Fukumizu, Kenji and Gretton, Arthur NeurIPS 2017
NeurIPS 2017 Best paper. 3 out of 3240 submissions.
A linear-time test of goodness of fit of an unnormalized density function on a dataset. The test can produce evidence indicating where (in the data domain) the model does not fit well.
We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein’s method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.
@inproceedings{jitkrittum_linear-time_2017,
title = {A Linear-Time Kernel Goodness-of-Fit Test},
url = {http://arxiv.org/abs/1705.07673},
booktitle = {NeurIPS},
author = {Jitkrittum, Wittawat and Xu, Wenkai and Szabo, Zoltan and Fukumizu, Kenji and Gretton, Arthur},
year = {2017},
wj_img = {cover_nips2017.png},
wj_summary = { A linear-time test of goodness of fit of an unnormalized density function on a dataset. The test can produce evidence indicating where (in the data domain) the model does not fit well.  },
wj_http = {http://arxiv.org/abs/1705.07673},
wj_code = {https://github.com/wittawatj/kernel-gof},
wj_poster = {/assets/poster/kgof_nips2017_poster.pdf},
wj_slides = {/assets/talks/kgof_nips2017_oral.pdf},
wj_highlight = {NeurIPS 2017 Best paper. 3 out of 3240 submissions.}
}


19. Kernel-Based Distribution Features for Statistical Tests and Bayesian Inference Jitkrittum, Wittawat 2017
My PhD Thesis. Gatsby Unit, University College London.
@phdthesis{phdthesis2017,
author = {Jitkrittum, Wittawat},
title = {Kernel-Based Distribution Features for Statistical Tests and {Bayesian} Inference},
school = {University College London},
year = {2017},
month = nov,
url = {http://discovery.ucl.ac.uk/10037987/},
wj_http = {http://discovery.ucl.ac.uk/10037987/},
wj_highlight = {My PhD Thesis. Gatsby Unit, University College London.}
}


20. An Adaptive Test of Independence with Analytic Kernel Embeddings Jitkrittum, Wittawat and Szabó, Zoltán and Gretton, Arthur ICML 2017
A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample size n). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distribution and the product of the marginals differ most. Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing O(n) and O(n log n) tests.
@inproceedings{pmlr-v70-jitkrittum17a,
title = {An Adaptive Test of Independence with Analytic Kernel Embeddings},
author = {Jitkrittum, Wittawat and Szab{\'o}, Zolt{\'a}n and Gretton, Arthur},
booktitle = {ICML},
year = {2017},
editor = {Precup, Doina and Teh, Yee Whye},
volume = {70},
url = {http://proceedings.mlr.press/v70/jitkrittum17a.html},
wj_http = {http://proceedings.mlr.press/v70/jitkrittum17a.html},
wj_code = {https://github.com/wittawatj/fsic-test},
wj_slides = {/assets/talks/fsic_icml2017_oral.pdf},
wj_talk = {https://vimeo.com/255244123},
wj_poster = {/assets/poster/fsic_icml2017_poster.pdf}
}


21. Cognitive Bias in Ambiguity Judgements: Using Computational Models to Dissect the Effects of Mild Mood Manipulation in Humans Iigaya, Kiyohito and Jolivald, Aurelie and Jitkrittum, Wittawat and Gilchrist, Iain and Dayan, Peter and Paul, Elizabeth and Mendl, Michael PLOS ONE 2016
@article{iigaya2016,
title = {Cognitive Bias in Ambiguity Judgements: Using Computational Models to Dissect the Effects of Mild Mood Manipulation in Humans},
author = {Iigaya, Kiyohito and Jolivald, Aurelie and Jitkrittum, Wittawat and Gilchrist, Iain and Dayan, Peter and Paul, Elizabeth and Mendl, Michael},
year = {2016},
month = oct,
journal = {PLOS ONE},
issn = {1932-6203},
publisher = {Public Library of Science},
wj_http = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165840}
}


22. Interpretable Distribution Features with Maximum Testing Power Jitkrittum, Wittawat and Szabó, Zoltán and Chwialkowski, Kacper and Gretton, Arthur NeurIPS 2016
Oral presentation. 1.8% of total submission.
Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e, features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using these features. The result is a parsimonious and interpretable indication of how and where two distributions differ locally. An empirical estimate of the test power criterion converges with increasing sample size, ensuring the quality of the returned features. In real-world benchmarks on high-dimensional text and image data, linear-time tests using the proposed semimetrics achieve comparable performance to the state-of-the-art quadratic-time maximum mean discrepancy test, while returning human-interpretable features that explain the test results.
@inproceedings{fotest2016,
author = {Jitkrittum, Wittawat and Szab\'{o}, Zolt\'{a}n and Chwialkowski, Kacper and Gretton, Arthur},
title = {Interpretable Distribution Features with Maximum Testing Power},
booktitle = {NeurIPS},
year = {2016},
url = {http://papers.nips.cc/paper/6148-interpretable-distribution-features-with-maximum-testing-power},
wj_http = {http://papers.nips.cc/paper/6148-interpretable-distribution-features-with-maximum-testing-power},
wj_poster = {/assets/poster/fotest_poster.pdf},
wj_code = {https://github.com/wittawatj/interpretable-test},
wj_talk = {https://channel9.msdn.com/Events/Neural-Information-Processing-Systems-Conference/Neural-Information-Processing-Systems-Conference-NIPS-2016/Interpretable-Distribution-Features-with-Maximum-Testing-Power},
wj_slides = {/assets/talks/fotest_oral.pdf},
wj_highlight = {Oral presentation. 1.8\% of total submission.}
}


23. K2-ABC: Approximate Bayesian Computation with Infinite Dimensional Summary Statistics via Kernel Embeddings Park, Mijung and Jitkrittum, Wittawat and Sejdinovic, Dino AISTATS 2016
Oral presentation. 6.5% of total submissions.
Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will "leak" information, which leads to ABC algorithms yielding samples from an incorrect (partial) posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2-ABC, uses maximum mean discrepancy (MMD) as a dissimilarity measure between the distributions over observed and simulated data. MMD is easily estimated as the squared difference between their empirical kernel embeddings. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm.
@inproceedings{part_k2abc_2015_arxiv,
author = {Park, Mijung and Jitkrittum, Wittawat and Sejdinovic, Dino},
title = {{K2-ABC}: Approximate {B}ayesian Computation with Infinite Dimensional Summary Statistics via Kernel Embeddings},
booktitle = {AISTATS},
year = {2016},
url = {http://jmlr.org/proceedings/papers/v51/park16.html},
wj_http = {http://jmlr.org/proceedings/papers/v51/park16.html},
wj_poster = {/assets/poster/k2abc_AISTATS2016_poster.pdf},
wj_code = {https://github.com/wittawatj/k2abc},
wj_slides = {/assets/talks/k2abc_AISTATS2016.pdf},
wj_highlight = {Oral presentation. 6.5\% of total submissions.}
}


24. Bayesian Manifold Learning: The Locally Linear Latent Variable Model Park, Mijung and Jitkrittum, Wittawat and Qamar, Ahmad and Szabó, Zoltán and Buesing, Lars and Sahani, Maneesh NeurIPS 2015
We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. The model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data. Thus, the LL-LVM encapsulates the local-geometry preserving intuitions that underlie non-probabilistic methods such as locally linear embedding (LLE). Its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships, select the intrinsic dimensionality of the manifold, construct out-of-sample extensions and to combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold.
@inproceedings{Park2015,
author = {Park, Mijung and Jitkrittum, Wittawat and Qamar, Ahmad and Szab\'{o}, Zolt\'{a}n and Buesing, Lars and Sahani, Maneesh},
title = {Bayesian Manifold Learning: The Locally Linear Latent Variable Model},
booktitle = {NeurIPS},
year = {2015},
url = {http://arxiv.org/abs/1410.6791},
wj_http = {http://arxiv.org/abs/1410.6791},
wj_code = {https://github.com/mijungi/lllvm},
wj_slides = {/assets/talks/csml_lllvm.pdf}
}


25. Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages Jitkrittum, Wittawat and Gretton, Arthur and Heess, Nicolas and Eslami, S. M. Ali and Lakshminarayanan, Balaji and Sejdinovic, Dino and Szabó, Zoltán UAI 2015
We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.
@inproceedings{jitkrittum_kernel-based_2015,
title = {Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages},
author = {Jitkrittum, Wittawat and Gretton, Arthur and Heess, Nicolas and Eslami, S. M. Ali and Lakshminarayanan, Balaji and Sejdinovic, Dino and Szab\'{o}, Zolt\'{a}n},
url = {http://arxiv.org/abs/1503.02551},
booktitle = {UAI},
year = {2015},
wj_http = {http://arxiv.org/abs/1503.02551},
wj_poster = {/assets/poster/kjit_uai2015_poster.pdf},
wj_code = {https://github.com/wittawatj/kernel-ep}
}


26. Performance of synchrony and spectral-based features in early seizure detection: exploring feature combinations and effect of latency Adam, Vincent and Soldado-Magraner, Joana and Jitkrittum, Wittawat and Strathmann, Heiko and Lakshminarayanan, Balaji and Ialongo, Alessandro Davide and Bohner, Gergo and Huh, Ben Dongsung and Goetz, Lea and Dowling, Shaun and Serban, Iulian Vlad and Louis, Matthieu International Workshop on Seizure Prediction (IWSP) 2015: Epilepsy Mechanisms, Models, Prediction and Control 2015
@misc{adam+al:2015:iwsp7,
author = {Adam, Vincent and Soldado-Magraner, Joana and Jitkrittum, Wittawat and Strathmann, Heiko and Lakshminarayanan, Balaji and Ialongo, Alessandro Davide and Bohner, Gergo and Huh, Ben Dongsung and Goetz, Lea and Dowling, Shaun and Serban, Iulian Vlad and Louis, Matthieu},
title = {Performance of synchrony and spectral-based features in early
seizure detection: exploring feature combinations and effect
of latency},
booktitle = {International Workshop on Seizure Prediction (IWSP) 2015:
Epilepsy Mechanisms, Models, Prediction and Control},
year = {2015},
wj_http = {http://www.iwsp7.org/},
}


27. High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso Yamada, Makoto and Jitkrittum, Wittawat and Sigal, Leonid and Xing, Eric P. and Sugiyama, Masashi Neural Computation 2014
The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, nonredundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.
@article{YamadaJSXS14,
author = {Yamada, Makoto and Jitkrittum, Wittawat and Sigal, Leonid and Xing, Eric P. and Sugiyama, Masashi},
title = {High-Dimensional Feature Selection by Feature-Wise Kernelized
Lasso},
journal = {Neural Computation},
volume = {26},
number = {1},
year = {2014},
pages = {185-207},
url = {http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00537#.U9O7Idtsylg},
wj_http = {http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00537#.U9O7Idtsylg},
}


28. Feature Selection via L1-Penalized Squared-Loss Mutual Information Jitkrittum, Wittawat and Hachiya, Hirotaka and Sugiyama, Masashi IEICE Transactions 2013
Feature selection is a technique to screen out less important features. Many existing supervised feature selection algorithms use redundancy and relevancy as the main criteria to select features. However, feature interaction, potentially a key characteristic in real-world problems, has not received much attention. As an attempt to take feature interaction into account, we propose L1-LSMI, an L1-regularization based algorithm that maximizes a squared-loss variant of mutual information between selected features and outputs. Numerical results show that L1-LSMI performs well in handling redundancy, detecting non-linear dependency, and considering feature interaction.
@article{Jitkrittum2013,
author = {Jitkrittum, Wittawat and Hachiya, Hirotaka and Sugiyama, Masashi},
title = {Feature Selection via L1-Penalized Squared-Loss Mutual Information},
journal = {IEICE Transactions},
year = {2013},
volume = {96-D},
pages = {1513-1524},
number = {7},
wj_pdf = {http://wittawat.com/pages/files/L1LSMI.pdf},
wj_code = {https://github.com/wittawatj/l1lsmi}
}


29. Squared-loss Mutual Information Regularization: A Novel Information-theoretic Approach to Semi-supervised Learning Niu, Gang and Jitkrittum, Wittawat and Dai, Bo and Hachiya, Hirotaka and Sugiyama, Masashi ICML 2013
We propose squared-loss mutual information regularization (SMIR) for multi-class probabilistic classification, following the information maximization principle. SMIR is convex under mild conditions and thus improves the nonconvexity of mutual information regularization. It offers all of the following four abilities to semi-supervised algorithms: Analytical solution, out-of-sample/multi-class classification, and probabilistic output. Furthermore, novel generalization error bounds are derived. Experiments show SMIR compares favorably with state-of-the-art methods.
@inproceedings{Niu2013,
author = {Niu, Gang and Jitkrittum, Wittawat and Dai, Bo and Hachiya, Hirotaka and Sugiyama, Masashi},
title = {Squared-loss Mutual Information Regularization: A Novel Information-theoretic
Approach to Semi-supervised Learning},
booktitle = { ICML},
year = {2013},
volume = {28},
pages = {10-18},
url = {http://jmlr.org/proceedings/papers/v28/niu13.pdf},
wj_pdf = {http://jmlr.org/proceedings/papers/v28/niu13.pdf},
wj_code = {https://github.com/wittawatj/smir}
}


30. QAST: Question Answering System for Thai Wikipedia Jitkrittum, Wittawat and Haruechaiyasak, Choochart and Theeramunkong, Thanaruk Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions 2009
We propose an open-domain question answering system using Thai Wikipedia as the knowledge base. Two types of information are used for answering a question: (1) structured information extracted and stored in the form of Resource Description Framework (RDF), and (2) unstructured texts stored as a search index. For the structured information, SPARQL transformed query is applied to retrieve a short answer from the RDF base. For the unstructured information, keyword-based query is used to retrieve the shortest text span containing the questions’s key terms. From the experimental results, the system which integrates both approaches could achieve an average MRR of 0.47 based on 215 test questions.
@inproceedings{Jitkrittum2009,
author = {Jitkrittum, Wittawat and Haruechaiyasak, Choochart and Theeramunkong, Thanaruk},
title = {{QAST}: Question Answering System for {Thai} Wikipedia},
booktitle = {Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering
Questions},
year = {2009},
series = {KRAQ '09},
pages = {11--14},
publisher = {Association for Computational Linguistics},
url = {http://dl.acm.org/citation.cfm?id=1697288.1697291},
wj_http = {http://dl.acm.org/citation.cfm?id=1697288.1697291}
}


31. Implementing News Article Category Browsing Based on Text Categorization Technique Haruechaiyasak, Choochart and Jitkrittum, Wittawat and Sangkeettrakarn, Chatchawal and Damrongrat, Chaianun Web Intelligence/IAT Workshops 2008
We propose a feature called category browsing to enhance the full-text search function of Thai-language news article search engine. The category browsing allows users to browse and filter search results based on some predefined categories. To implement the category browsing feature, we applied and compared among several text categorization algorithms including decision tree, Naive Bayes (NB) and Support Vector Machines (SVM). To further increase the performance of text categorization, we performed evaluation among many feature selection techniques including document frequency thresholding (DF), information gain (IG) and x2 (CHI). Based on our experiments using a large news corpus, the SVM algorithm with the IG feature selection yielded the best performance with the F1 measure equal to 95.42%.
@inproceedings{Haruechaiyasak2008,
author = {Haruechaiyasak, Choochart and Jitkrittum, Wittawat and Sangkeettrakarn, Chatchawal and Damrongrat, Chaianun},
title = {Implementing News Article Category Browsing Based on Text Categorization
Technique},
booktitle = {Web Intelligence/IAT Workshops},
year = {2008},
pages = {143-146},
ee = {http://dx.doi.org/10.1109/WIIAT.2008.61},
wj_http = {http://dx.doi.org/10.1109/WIIAT.2008.61}
}


32. Proximity-Based Semantic Relatedness Measurement on Thai Wikipedia Jitkrittum, Wittawat and Theeramunkong, Thanaruk and Haruechaiyasak, Choochart International Conference on Knowledge, Information and Creativity Support Systems (KICSS) 2008
@inproceedings{proximity2008,
author = {Jitkrittum, Wittawat and Theeramunkong, Thanaruk and Haruechaiyasak, Choochart},
title = {Proximity-Based Semantic Relatedness Measurement on {Thai} {Wikipedia}},
booktitle = {International Conference on Knowledge, Information and Creativity Support Systems (KICSS)},
year = {2008}
}


33. Managing Offline Educational Web Contents with Search Engine Tools Haruechaiyasak, Choochart and Sangkeettrakarn, Chatchawal and Jitkrittum, Wittawat International Conference on Asian Digital Libraries 2007
In this paper, we describe our ongoing project to help alleviate the digital divide problem among high schools in rural areas of Thailand. The idea is to select, organize, index and distribute useful educational Web contents to schools where the Internet connection is not available. These Web contents can be used by teachers and students to enhance the teaching and learning for many class subjects. We have collaborated with a group of teachers from different high schools in order to gather the requirements for designing our software tools. One of the challenging issues is the variation in computer hardwares and network configuration found in different schools. Some shools have PCs connected to the school’s server via the Local Area Network (LAN). While some other schools have low-performance PCs without any network connection. To support both cases, we provide two solutions via two different search engine tools. These tools support content administrators, e.g., teachers, with the features to organize and index the contents. The tools also provide general users with the features to browse and search for needed information. Since the contents and index are locally stored on hard disk or some removable media such as CD-ROM, the Internet connection is not needed.
@inproceedings{Haruechaiyasak2007,
author = {Haruechaiyasak, Choochart and Sangkeettrakarn, Chatchawal and Jitkrittum, Wittawat},
title = {Managing Offline Educational Web Contents with Search Engine Tools},
booktitle = {International Conference on Asian Digital Libraries},
year = {2007},
pages = {444-453},
ee = {http://dx.doi.org/10.1007/978-3-540-77094-7_56},
wj_http = {http://dx.doi.org/10.1007/978-3-540-77094-7_56}
}