A machine learning algorithm is one that learns from experience with respect to some class of tasks and a performance measure. Machine learning techniques are useful in bioinformatics because they can infer or explain complex relationships within very high-dimensional data by constructing classifiers or hypotheses. Often these hypotheses are then given to a domain expert, who can suggest experiments to validate or refute them. This feedback loop is an important characteristic of machine learning in bioinformatics.
There are two general types of machine learning: supervised learning, where some data labels have been provided a priori (in the case of cancer data, this would mean already knowing some of the patients' cancer subtypes); and unsupervised learning, where no labelled data is available. In this report the task is to classify, characterise, and cluster the input data, and the approach taken is the more complex unsupervised learning. A number of clustering algorithms are available to tackle such problems, from more heuristic methods such as hierarchical clustering to statistical models such as Gaussian mixtures. However, most of these techniques require either prior knowledge of the number of clusters, or multiple runs so that model averaging or model selection techniques can be applied. In addition, it is often not trivial to calculate the posterior over the number of clusters in the data, which is of particular interest in this kind of problem since a measure of certainty over the number of subtypes would be useful.
Figure: an example of clustering performed on 3-D data.
Different Clustering Methods
k-means
The k-means algorithm is simple and fast. Given a pre-specified number of clusters, k, it partitions the data set into exactly k disjoint subsets.
The nature of k-means is to iteratively optimise an objective function, typically the sum of squared distances between each point and its assigned cluster centre, stopping once the value of the objective function has converged.
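A minimal sketch of this iterative optimisation in Python with NumPy is shown below (the data matrix X and the number of clusters k are assumed to be given; empty clusters are not handled):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Minimal Lloyd's algorithm over an (n, d) data matrix X."""
        rng = np.random.default_rng(seed)
        # Initialise the centroids by picking k distinct data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid moves to the mean of its members.
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Converged once the centroids stop moving.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids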
Although fast, k-means does have some issues when classifying gene expression data. Firstly, the number of clusters k, despite most likely being unknown, must be set in advance. This means that, in order to find the true number of clusters, the algorithm must be run several times with varying k and the results compared, as sketched below. For larger data sets, such as gene expression data, this may not be practical. Secondly, and more significantly, the k-means algorithm forces every data point, perhaps unnaturally, to be part of a cluster. Gene expression data typically contains a large amount of noise, to which k-means, by forcing data points in this way, is sensitive.
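One common way to compare results across values of k, sketched here with scikit-learn, is a model selection criterion such as the silhouette score (the data matrix X and the candidate range of k are assumptions):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Fit k-means for several candidate values of k and score each fit;
    # the k with the highest silhouette score is a plausible cluster count.
    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)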
Hierarchical clustering
Unlike k-means, hierarchical clustering generates a series of nested clusters which can be graphically represented by a tree, called a dendrogram. The branches of a dendrogram record both the formation of clusters and the similarity between clusters. Hierarchical clustering can be further divided into agglomerative (bottom-up) and divisive (top-down) approaches, based on how the dendrogram is formed. The divisive approach starts with one cluster containing all the data objects, which is iteratively split until only clusters containing individual objects remain. In contrast, the agglomerative approach starts with every data point in its own cluster and iteratively merges the closest clusters until only one remains. After joining two clusters, the distances between all other clusters and the newly joined cluster are recalculated. There are different criteria for the inter-cluster distance used to decide which clusters to join next: complete linkage, average linkage, and single linkage use the maximum, average, and minimum distance between the members of two clusters respectively.
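As an illustration, agglomerative clustering under each of the three linkage criteria can be sketched with SciPy (the data matrix X is assumed, and cutting the tree into four flat clusters is purely illustrative):

    from scipy.cluster.hierarchy import linkage, fcluster

    # Build a dendrogram under each linkage criterion and cut it
    # into four flat clusters.
    flat = {}
    for method in ("complete", "average", "single"):
        Z = linkage(X, method=method)  # merge history, one row per join
        flat[method] = fcluster(Z, t=4, criterion="maxclust")
    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree.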
Model-based clustering
Model-based clustering provides a statistical framework to model the cluster structure of data. It is an example of a generative method, and the various models in this report are described as such. A generative method is one which attempts to model the underlying probability distribution that created the data, often with the aid of hidden variables (otherwise known as latent variables) such as class labels. These models include the Gaussian mixture model, naive Bayes, and hidden Markov models. Discriminative methods, in contrast to generative methods, simply try to classify data without attempting to understand, or model, the underlying distribution that led to its production. Examples of discriminative methods include support vector machines (SVMs), traditional neural networks, and logistic regression.
In this report the data is assumed to be generated from a finite mixture of Gaussian distributions, an example of a mixture distribution. A mixture in this sense is a linear combination of several simple distributions, enabling the modelling of much more complicated probability distributions (an example of which can be seen in the figure below). By using a mixture model, almost any continuous density can be approximated to arbitrary accuracy by suitably adjusting the parameters of each component (in this case a number of Gaussians).
Figure: an example of a univariate (one-dimensional) Gaussian mixture model, showing how a complex distribution that cannot be modelled by a single Gaussian can be modelled by a mixture of Gaussians.
When using a Gaussian mixture in the context of data classification, the data is assumed to be distributed such that every point is drawn from one of a number of M-dimensional Gaussian distributions, where M is the number of attributes in the data. When defined this way, each Gaussian represents a different underlying class. The probability density of the mixture is calculated by summing the density contributed by each component, with each component weighted so that the result is a valid probability density.
That is, to ensure that the probability density integrates to one over the entire probability space.
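Concretely, the mixture density is p(x) = Σₖ πₖ N(x | μₖ, Σₖ), where the mixing weights πₖ sum to one. A minimal sketch with SciPy is given below (the two components, their means, covariances, and weights are purely illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Illustrative two-component mixture in two dimensions.
    weights = np.array([0.3, 0.7])            # mixing proportions, sum to one
    means = [np.zeros(2), np.array([3.0, 3.0])]
    covs = [np.eye(2), 2.0 * np.eye(2)]

    def mixture_density(x):
        """Weighted sum of the component Gaussian densities at x."""
        return sum(w * multivariate_normal(m, c).pdf(x)
                   for w, m, c in zip(weights, means, covs))

    p = mixture_density(np.array([1.0, 1.0]))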
In this report a Dirichlet process is used so that the number of clusters can remain unknown. This is a particularly useful trait when clustering data, since clustering is often one of the first steps in data mining and knowledge discovery, where a clustering algorithm should require minimal prior knowledge about the data before being applied. An additional advantage of the Dirichlet process mixture is that, unlike hierarchical and k-means clustering, the model can be extended to make use of data fusion, enabling multiple data sources to be used to infer a cluster structure.
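One readily available approximation, not necessarily the inference scheme used in this report, is scikit-learn's truncated variational Dirichlet process mixture: a generous cap on the number of components is set, and superfluous components receive negligible weight, so the effective number of clusters is inferred from the data (the data matrix X and the cap of 20 components are assumptions):

    from sklearn.mixture import BayesianGaussianMixture

    # Truncated variational approximation to a Dirichlet process
    # Gaussian mixture: unneeded components are pruned via their weights.
    dpgmm = BayesianGaussianMixture(
        n_components=20,
        weight_concentration_prior_type="dirichlet_process",
        random_state=0,
    ).fit(X)
    labels = dpgmm.predict(X)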