Conclusion and Future Work

This report set out to determine whether a Dirichlet process mixture model, and its extension to the hierarchical Dirichlet process mixture model, can be used effectively to classify high-dimensional gene expression data. All development work was done using purely open-source languages and software, in order to investigate how effectively such software can be used. In addition, modern cloud-based capabilities were utilised to perform computationally expensive simulations, again to experiment with the kind of computational power that is now readily available to everyone.

Dirichlet process mixture models provide an elegant and flexible method that can potentially sidestep the need for model selection and averaging techniques, and hopefully this report has given some intuition as to what a Dirichlet process is and what affects its success when used in a mixture model. Early on in this investigation it became apparent that the original Gaussian-Wishart prior used as a base distribution was not suitable at the dimensionality required. This led to an analysis of the impact, and the limits, of the different covariance structures that result from imposing different constraints on the mixture model. The differences in the final classification results highlight just how much these changes matter.
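To give a rough sense of why the covariance structure matters so much at high dimensionality, the following minimal sketch counts the free covariance parameters per mixture component under three common constraints. The function name and the constraint labels (full, diagonal, spherical) are illustrative assumptions and may not correspond exactly to the models used in this report.

```python
def covariance_free_parameters(d):
    """Count free covariance parameters per component in d dimensions.

    The three labels are illustrative, not the report's own model names.
    """
    return {
        "full": d * (d + 1) // 2,  # symmetric d x d matrix
        "diagonal": d,             # one variance per dimension
        "spherical": 1,            # a single shared variance
    }


# Compare a low-dimensional case with dimensionalities closer to gene expression data.
for d in (2, 50, 1000):
    print(d, covariance_free_parameters(d))
```

The count for an unconstrained covariance matrix grows quadratically with the dimension, whereas the constrained forms grow linearly or not at all, which is one reason constrained structures become attractive as the dimension increases.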

Although the initial aim was to find a way to utilise data integration with gene expression data, it became apparent that creating a non-hierarchical Dirichlet process mixture model that can classify data at even one level is exceptionally challenging. Developing the model so as to enable data integration across a variety of datasets therefore remains an ongoing goal.

The level of impact that the hyperparameters can have if set incorrectly (particularly with the elliptical model) indicates the need for more precise settings. Ideally some sort of hyperpriors (priors over the hyperparameters) would be used instead of fixed hyperparameter values so that the model can better adjust to various forms of input. As an alternative, a better technique for preprocessing the data in order to infer the right parameters could also be explored. This could take the form of several smaller runs on subsets of the data before a final full run is performed.
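As one illustration of what a hyperprior can look like in practice, the sketch below resamples the Dirichlet process concentration parameter using the auxiliary-variable update of Escobar and West, assuming a Gamma(shape a, rate b) prior on that parameter. This is only an example of the general idea: the report does not specify which hyperparameters would receive hyperpriors, and the function name, argument names, and default values here are assumptions made for illustration.

```python
import math
import random


def resample_concentration(alpha, n, k, a=1.0, b=1.0, rng=random):
    """One Escobar-West update of the DP concentration parameter.

    Assumes alpha ~ Gamma(shape=a, rate=b) a priori, with n observations
    currently allocated to k occupied clusters (k >= 1).
    """
    # Auxiliary variable eta | alpha, n ~ Beta(alpha + 1, n).
    eta = rng.betavariate(alpha + 1.0, n)
    eta = max(eta, 1e-300)  # guard against log(0)
    rate = b - math.log(eta)  # log(eta) <= 0, so rate >= b > 0

    # Mixture weight between the two Gamma components.
    odds = (a + k - 1.0) / (n * rate)
    pi_eta = odds / (1.0 + odds)

    shape = a + k if rng.random() < pi_eta else a + k - 1.0
    # random.gammavariate takes (shape, scale), so pass 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


# Hypothetical usage inside a sampler: n and k would come from the
# current cluster allocation rather than being fixed as they are here.
rng = random.Random(0)
alpha = 1.0
for _ in range(100):
    alpha = resample_concentration(alpha, n=200, k=8, rng=rng)
```

In a full sampler this step would be interleaved with the updates of the cluster allocations, so that the concentration parameter adapts to the data instead of being fixed in advance.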