# «Abstract. Meaningfully integrating massive multi-experimental genomic data sets is becoming critical for the understanding of gene function. We have ...»

Heterogeneous Data Integration with

the Consensus Clustering Formalism

Vladimir Filkov1 and Steven Skiena2

CS Dept., UC Davis, One Shields Avenue, Davis, CA 95616

filkov@cs.ucdavis.edu

CS Dept. SUNY at Stony Brook, Stony Brook, NY 11794

skiena@cs.sunysb.edu

Abstract. Meaningfully integrating massive multi-experimental genomic

data sets is becoming critical for the understanding of gene function. We

have recently proposed methodologies for integrating large numbers of

microarray data sets based on consensus clustering. Our methods combine gene clusters into a uniﬁed representation, or a consensus, that is insensitive to mis-classiﬁcations in the individual experiments. Here we extend their utility to heterogeneous data sets and focus on their reﬁnement and improvement. First of all we compare our best heuristic to the popular majority rule consensus clustering heuristic, and show that the former yields tighter consensuses. We propose a reﬁnement to our consensus algorithm by clustering of the source-speciﬁc clusterings as a step before ﬁnding the consensus between them, thereby improving our original results and increasing their biological relevance. We demonstrate our methodology on three data sets of yeast with biologically interesting results. Finally, we show that our methodology can deal successfully with missing experimental values.

1 Introduction High-throughput experiments in molecular biology and genetics are providing us with wealth of data about living organisms. Sequence data, and other largescale genomic data, like gene expression, protein interaction, and phylogeny, provide a lot of useful information about the observed biological systems. Since the exact nature of the relationships between genes is not known, in addition to their individual value, combining such diverse data sets could potentially reveal diﬀerent aspects of the genomic system.

Recently we have built a framework for integrating large-scale microarray data based on clustering of individual experiments [1]. Given a group of sourceor experiment-) speciﬁc clusterings we sought to identify a consensus clustering close to all of them. The resulting consensus was both a representative of the integrated data, and a less noisy version of the original data sets. In other words, the consensus clustering had the role of an average, or median between the given clusterings.

The formal problem was to ﬁnd a clustering that had the smallest sum of

**distances to the given clusterings:**

CONSENSUS CLUSTERING (CC): Given k clusterings, C1, C2,..., Ck, ﬁnd a consensus clustering C ∗ that minimizes k

After simplifying the clusterings to set-partitions we developed very fast heuristics for CC with the symmetric diﬀerence distance as the distance metric between partitions. The heuristics were based on local search (with single element move between clusters) and Simulated Annealing for exploring the space of solutions [1]. Practically, we could get real-time results on instances of thousands of genes and hundreds of experiments on a desktop PC.

In CC it was assumed that the given data in each experiment were classiable and it came clustered. The clustering was assumed to be hard, i.e. each gene belongs to exactly one cluster (which is what the most popular microarray data clustering software yields [2]). We did not insist on the clustering or classiﬁcation method and were not concerned with the raw data directly (although as parts of the software tool we did provide a number of clustering algorithms). Although clear for microarray data it is important to motivate the case that clustering other genomic data is possible and pertinent. Data classiﬁcation and/or clustering is pervasive in high-throughput experiments, especially during the discovery phase (i.e. ﬁshing expeditions). Genomic data is often used as categorical data, where if two entities are in the same category a structural or functional connection between them is implied (i.e. guilt by association).

As massive data resides in software databases, it is relatively easy to submit queries and quickly obtain answers to them. Consequently, entities are classied into those that satisfy the query and those that do not; often into more than two, more meaningful categories. Even if not much is known about the observed biological system, clustering the experimental data obtained from it can be very useful in pointing out similar behavior within smaller parts of the system, which may be easier to analyze further. Besides microarray data, other examples of clustered/classiﬁed genomic data include functional classiﬁcations of proteins [3], clusters of orthologous proteins [4], and phylogenetic data clusters (http://bioinfo.mbb.yale.edu/genome/yeast/cluster).

In this paper we reﬁne and extend our consensus clustering methodology and show that it is useful for heterogeneous data set integration. First of all, we show that our best heuristic, based on Simulated Annealing and local search, does better than a popular Quota Rule heuristic, based on hierarchical clustering. In our previous study we developed a measure, the Average sum of distances, to assess the quality of data sets integration. Some of the data sets we tried to integrate did not show any beneﬁt from the integration. Here we reﬁne our consensus clustering approach by initially identifying groups of similar experiments which are likelier to beneﬁt from the integration than a general set of experiments. This is equivalent to clustering the given clusterings. After performing this step the consensuses are much more representative of the integrated data. We demonstrate this improved data integration on three heterogeneous data sets, two of microarray data, and one of phylogenetic proﬁles. Lastly we address the issue of missing data. Because of diﬀerent naming conventions, or due to experiment errors data may be missing from the data sets. The eﬀect of missing data is that only the data common to all sets can be used, which might be a signiﬁcant reduction of the amount available. We propose a method for computational imputation of missing data, based on our consensus clustering method, which decreases the negative consequences of missing data. We show that it compares well with a popular microarray missing value imputation method, KNNimpute [5].

This paper is organized as follows. In the rest of this section we review related work on data integration and consensus clustering. We review and compare our existing methodology and the popular quota rule in Sec. 2. In Sec. 3 we present our reﬁned method for consensus clustering, and we illustrate it on data in Sec. 4.

The missing data issue is addressed in Sec. 5. We discuss our results and give a future outlook in Sec. 6.

1.1 Related Work The topic of biological data integration is getting increasingly important and is approached by researchers from many areas in computer science. In a recent work, Hammer and Schneider [6] identify categories of important problems in genomic data integration and propose an extensive framework for processing and querying genomic information. The consensus clustering framework can be used to addresses some of the problems identiﬁed, like for example multitude and heterogeneity of available genomic repositories (C1), incorrectness due to inconsistent and incompatible data (C8), and extraction of hidden and creation of new knowledge (C11).

An early study on biological data integration was done by Marcotte et al. [7], who give a combined algorithm for protein function prediction based on microarray and phylogeny data, by classifying the genes of the two diﬀerent data sets separately, and then combining the genes’ pair-wise information into a single data set. Their approach does not scale immediately. Our methods extends theirs to a general combinatorial data integration framework based on pair-wise relationships between elements and any number of experiments.

In machine learning, Pavlidis et al. [8] use a Support Vector Machine algorithm to integrate similar data sets as we do here in order to predict gene functional classiﬁcation. Their methods use a lot of hand tuning with any particular type of data both prior and during the integration for best results. Troyanskaya et al. [9] use a Bayesian framework to integrate diﬀerent types of genomic data in yeast. Their probabilistic approach is a parallel to our combinatorial approaches.

A lot of work has been done on speciﬁc versions of the consensus clustering problem, based on the choice of a distance measure between the clusterings and the optimization criterion. Strehl et al. [10] use a clustering distance function derived from information theoretic concepts of shared information. Recently, Monti et al. [11] used consensus clustering as a method for better clustering and class discovery. Among other methods they use the quota rule to ﬁnd a consensus, an approach we describe and compare to our heuristics in Sec. 2.2.

Other authors have also used the quota rule in the past [12]. Finally, Cristofor and Simovici [13] have used Genetic Algorithms as a heuristic to ﬁnd median partitions. They show that their approach does better than several others among which a simple element move (or transfer) algorithm, which coincidently our algorithm has also been shown to be better than recently [1].

It would be interesting in the near future to compare the machine learning methods with our combinatorial approach.

**2 Set Partitions and Median Partition Heuristics**

In this paper we focus on the problem of consensus clustering in the simplest case when clusterings are considered to be set-partitions. A set-partition, π, of a set {1, 2,..., n}, is a collection of disjunct, non-empty subsets (blocks) that cover the set completely. We denote the number of blocks in π by |π|, and label them B1, B2,..., B|π|. If a pair of diﬀerent elements belong to the same block of π then they are co-clustered, otherwise they are not.

Our consensus clustering problem is based on similarities (or distances) between set-partitions. There exist many diﬀerent distance measures to compare two set-partitions (see for example [1, 14, 15]). Some are based on pair counting, others on shared information content.

For our purposes we use a distance measure based on pair counting, known as the symmetric diﬀerence distance. This measure is deﬁned on the number of coclustered and not co-clustered pairs in the partitions. Given two set-partitions π1 and π2, let, a equal the number of pairs of elements co-clustered in both partitions, b equal the number of pairs of elements co-clustered in π1, but not in π2, c equal the number of pairs co-clustered in π2, but not in π1, and d the number of pairs not co-clustered in both partitions. (in other words a and d count the number of agreements in both partitions, while b and c count the disagreements). Then, the symmetric diﬀerence is deﬁned as n d(π1, π2 ) = b + c = − (a + d). (2) This distance is a metric and is related to the Rand Index, R = (a + d)/ n 2 and other pair-counting measures of partition similarity [16]. The measure is not corrected for chance though, i.e. the distance between two independent set partitions is non-zero on the average, and is dependent on n. A version corrected for chance exists and is related to the Adjusted Rand Index [17]. We note that the symmetric diﬀerence metric has a nice property that it is computable in linear time [16] which is one of the reasons why we chose it (the other is the fast update as described later). The adjusted Rand index is given by a complex formula, and although it can be also computed in linear time, we are not aware of a fast update scheme for it. We will use the Adjusted Rand in Sec. 3 where the algorithm complexity is not an issue.

The consensus clustering problem on set-partitions is known as the median partition problem [18].

MEDIAN PARTITION (MP): Given k partitions, π1, π2,..., πk, ﬁnd a median partition π that minimizes k S= d(πi, π). (3) i=1 When d(.,.) is the symmetric diﬀerence distance, MP is known to be NPcomplete in general [19, 20].