Learning the parameters of determinantal point process kernels
This post summarizes this paper:
Learning the Parameters of Determinantal Point Process Kernels
R. H. Affandi, E. B. Fox, R. P. Adams and B. Taskar.
ICML 2014.
A determinantal point process (DPP) is a probability measure on the $2^{N}$ possible subsets $A$ of a discrete set $\Omega={x_{1},\ldots,x_{N}}$: where $L_{A}$ is a submatrix of the $N\times N$ matrix $L$ indexed by the members in the set $A$. Given a set of $T$ samples $S={A^{1},\ldots,A^{T}}$, the parameters $\theta$ of the kernel $L$ can be learned by performing a gradient ascent on the log likelihood $\mathcal{L}(\theta)$: When $N$ is large, however, computing the likelihood or the derivative is inefficient. A Bayesian learning of DPPs relies on sampling techniqes such as MetropolisHastings (MG) and slice sampling. In MH, we use a proposal distribution $f(\hat{\theta}\theta_{i})$ to generate a candidate $\hat{\theta}$ given the current parameters $\theta_{i}$ which are then accepted or rejected with probability $\min(1,r)$ where Notice that the normalizer $\det(L+I)$ depends on $\theta$ and does not cancel out in the ratio in $r$. In fact, evaluation of the normalizer^{1} $\det(L(\theta)+I)=\prod_{n=1}^{\infty}(\lambda_{n}(\theta)+1)$ is difficult. The main idea of @Affandi2014 is to use lower and upper bounds of $p(\hat{\theta}S)$ in place of $p(\hat{\theta}S)$ in the MH acceptance ratio.
where $\mathop{\mathrm{tr}}(L)$ can be computed as either $\sum_{i=1}^{N}L_{ii}$ in the discrete case or $\int_{\Omega}L(x,x)\,\mathrm{d}x$ in the continuous case. In each MH step, two quantities $r^{+}$ and $r^{}$ are computed: If $u<\min(1,r^{1})$ where $u\sim\text{Uniform}[0,1]$, the immediately reject because it follows that $u<\min(1,r)$ as well. Similarly, if $u>\min(1,r^{+})$, immediately accept the proposal. If $u\in(r^{},r^{+})$, then tighten the bounds until a decision can be made. This idea is also applicable to the slice sampling.

Here we consider $\Omega$ to be an uncountable set; hence, the operator $L$ has infinitely many eigenvalues. ↩