|Bayesian Nonparametrics in Julia|
Julia is a high-level, high-performance language for scientific computing. It provides a distributed environment where algorithms can be executed on multiple cores across multiple machines. This makes Julia a promising platform for distributed implementation of Bayesian nonparametric algorithms. This work aims to contribute to open-source Julia library of distributed Baeysian nonparametric algorithms that include: DP-means, Dirichlet Process Mixture Model (DPMM) and Hierarchical Dirichlet Process Mixture Model (HDPMM)
People Involved: Vadim S. Smolyakov, John W. Fisher III
Dirichlet Process (DP) K-means is a Bayesian Nonparametric extension of the K-means algorithm based on Small Variance Asymptotics (SVA) approximation of the Dirichlet Process Mixture Model.
It does not require prior knowledge of the number of clusters K. The cluster penalty parameter lambda is set based on the data by taking the maximum distance to the K++ means initialization.
B. Kulis and M. Jordan, "Revisiting k-means: New Algorithms via Bayesian Nonparametrics"
Dirichlet Process Mixture Models (DPMMs): belong to a class of infinite mixture models, in which we do not impose any prior knowledge on the number of clusters K. DPMM models learn the number of clusters from the data using a non-parametric prior based on the Dirichlet Process (DP).
The figure above shows clustering results of the collapsed Gibbs sampler for Gaussian and Categorical data. The DPMM correctly identifies 5 clusters and infers over 20 topics on NIPS corpus of 1740 docs, 500 vocab, and 144K tokens, the first four topic clusters are shown in the figure.
Y. W. Teh, "Dirichlet Processes", Encyclopedia of Machine Learning, 2010
Hierarchical Dirichlet Processes (HDPs) model problems involving groups of data, where each observation within a group is a draw from a mixture model and mixture components are shared between groups. An on-line variational inference algorithm for HDP was used to fit a topic model on the Associated Press (AP) dataset in Julia.
The dataset consists of 2246 documents and over 100K unique tokens. The top-level truncation was set to T=20 topics and the second level truncation was set to K=8 topics. The concentration parameters were chosen as gamma 1.0 at the top-level and alpha 0.1 at the group level to yield a broad range of shared topics that are concentrated at the group level. The figure above shows a sample of the global topics inferred by the online variational HDP algorithm.
C. Wang, J. Paisley, and D. Blei, "Online variational inference for the Hierarchical Dirichlet Process." JASA, 2005
The Julia code can be found at: https://github.mit.edu/SLI/julia_bnp