FREE ELECTRONIC LIBRARY - Books, dissertations, abstract

Pages:   || 2 | 3 |

«Abstract. Meaningfully integrating massive multi-experimental genomic data sets is becoming critical for the understanding of gene function. We have ...»

-- [ Page 1 ] --

Heterogeneous Data Integration with

the Consensus Clustering Formalism

Vladimir Filkov1 and Steven Skiena2

CS Dept., UC Davis, One Shields Avenue, Davis, CA 95616


CS Dept. SUNY at Stony Brook, Stony Brook, NY 11794


Abstract. Meaningfully integrating massive multi-experimental genomic

data sets is becoming critical for the understanding of gene function. We

have recently proposed methodologies for integrating large numbers of

microarray data sets based on consensus clustering. Our methods combine gene clusters into a unified representation, or a consensus, that is insensitive to mis-classifications in the individual experiments. Here we extend their utility to heterogeneous data sets and focus on their refinement and improvement. First of all we compare our best heuristic to the popular majority rule consensus clustering heuristic, and show that the former yields tighter consensuses. We propose a refinement to our consensus algorithm by clustering of the source-specific clusterings as a step before finding the consensus between them, thereby improving our original results and increasing their biological relevance. We demonstrate our methodology on three data sets of yeast with biologically interesting results. Finally, we show that our methodology can deal successfully with missing experimental values.

1 Introduction High-throughput experiments in molecular biology and genetics are providing us with wealth of data about living organisms. Sequence data, and other largescale genomic data, like gene expression, protein interaction, and phylogeny, provide a lot of useful information about the observed biological systems. Since the exact nature of the relationships between genes is not known, in addition to their individual value, combining such diverse data sets could potentially reveal different aspects of the genomic system.

Recently we have built a framework for integrating large-scale microarray data based on clustering of individual experiments [1]. Given a group of sourceor experiment-) specific clusterings we sought to identify a consensus clustering close to all of them. The resulting consensus was both a representative of the integrated data, and a less noisy version of the original data sets. In other words, the consensus clustering had the role of an average, or median between the given clusterings.

The formal problem was to find a clustering that had the smallest sum of

distances to the given clusterings:

CONSENSUS CLUSTERING (CC): Given k clusterings, C1, C2,..., Ck, find a consensus clustering C ∗ that minimizes k

–  –  –

After simplifying the clusterings to set-partitions we developed very fast heuristics for CC with the symmetric difference distance as the distance metric between partitions. The heuristics were based on local search (with single element move between clusters) and Simulated Annealing for exploring the space of solutions [1]. Practically, we could get real-time results on instances of thousands of genes and hundreds of experiments on a desktop PC.

In CC it was assumed that the given data in each experiment were classiable and it came clustered. The clustering was assumed to be hard, i.e. each gene belongs to exactly one cluster (which is what the most popular microarray data clustering software yields [2]). We did not insist on the clustering or classification method and were not concerned with the raw data directly (although as parts of the software tool we did provide a number of clustering algorithms). Although clear for microarray data it is important to motivate the case that clustering other genomic data is possible and pertinent. Data classification and/or clustering is pervasive in high-throughput experiments, especially during the discovery phase (i.e. fishing expeditions). Genomic data is often used as categorical data, where if two entities are in the same category a structural or functional connection between them is implied (i.e. guilt by association).

As massive data resides in software databases, it is relatively easy to submit queries and quickly obtain answers to them. Consequently, entities are classied into those that satisfy the query and those that do not; often into more than two, more meaningful categories. Even if not much is known about the observed biological system, clustering the experimental data obtained from it can be very useful in pointing out similar behavior within smaller parts of the system, which may be easier to analyze further. Besides microarray data, other examples of clustered/classified genomic data include functional classifications of proteins [3], clusters of orthologous proteins [4], and phylogenetic data clusters (http://bioinfo.mbb.yale.edu/genome/yeast/cluster).

In this paper we refine and extend our consensus clustering methodology and show that it is useful for heterogeneous data set integration. First of all, we show that our best heuristic, based on Simulated Annealing and local search, does better than a popular Quota Rule heuristic, based on hierarchical clustering. In our previous study we developed a measure, the Average sum of distances, to assess the quality of data sets integration. Some of the data sets we tried to integrate did not show any benefit from the integration. Here we refine our consensus clustering approach by initially identifying groups of similar experiments which are likelier to benefit from the integration than a general set of experiments. This is equivalent to clustering the given clusterings. After performing this step the consensuses are much more representative of the integrated data. We demonstrate this improved data integration on three heterogeneous data sets, two of microarray data, and one of phylogenetic profiles. Lastly we address the issue of missing data. Because of different naming conventions, or due to experiment errors data may be missing from the data sets. The effect of missing data is that only the data common to all sets can be used, which might be a significant reduction of the amount available. We propose a method for computational imputation of missing data, based on our consensus clustering method, which decreases the negative consequences of missing data. We show that it compares well with a popular microarray missing value imputation method, KNNimpute [5].

This paper is organized as follows. In the rest of this section we review related work on data integration and consensus clustering. We review and compare our existing methodology and the popular quota rule in Sec. 2. In Sec. 3 we present our refined method for consensus clustering, and we illustrate it on data in Sec. 4.

The missing data issue is addressed in Sec. 5. We discuss our results and give a future outlook in Sec. 6.

1.1 Related Work The topic of biological data integration is getting increasingly important and is approached by researchers from many areas in computer science. In a recent work, Hammer and Schneider [6] identify categories of important problems in genomic data integration and propose an extensive framework for processing and querying genomic information. The consensus clustering framework can be used to addresses some of the problems identified, like for example multitude and heterogeneity of available genomic repositories (C1), incorrectness due to inconsistent and incompatible data (C8), and extraction of hidden and creation of new knowledge (C11).

An early study on biological data integration was done by Marcotte et al. [7], who give a combined algorithm for protein function prediction based on microarray and phylogeny data, by classifying the genes of the two different data sets separately, and then combining the genes’ pair-wise information into a single data set. Their approach does not scale immediately. Our methods extends theirs to a general combinatorial data integration framework based on pair-wise relationships between elements and any number of experiments.

In machine learning, Pavlidis et al. [8] use a Support Vector Machine algorithm to integrate similar data sets as we do here in order to predict gene functional classification. Their methods use a lot of hand tuning with any particular type of data both prior and during the integration for best results. Troyanskaya et al. [9] use a Bayesian framework to integrate different types of genomic data in yeast. Their probabilistic approach is a parallel to our combinatorial approaches.

A lot of work has been done on specific versions of the consensus clustering problem, based on the choice of a distance measure between the clusterings and the optimization criterion. Strehl et al. [10] use a clustering distance function derived from information theoretic concepts of shared information. Recently, Monti et al. [11] used consensus clustering as a method for better clustering and class discovery. Among other methods they use the quota rule to find a consensus, an approach we describe and compare to our heuristics in Sec. 2.2.

Other authors have also used the quota rule in the past [12]. Finally, Cristofor and Simovici [13] have used Genetic Algorithms as a heuristic to find median partitions. They show that their approach does better than several others among which a simple element move (or transfer) algorithm, which coincidently our algorithm has also been shown to be better than recently [1].

It would be interesting in the near future to compare the machine learning methods with our combinatorial approach.

2 Set Partitions and Median Partition Heuristics

In this paper we focus on the problem of consensus clustering in the simplest case when clusterings are considered to be set-partitions. A set-partition, π, of a set {1, 2,..., n}, is a collection of disjunct, non-empty subsets (blocks) that cover the set completely. We denote the number of blocks in π by |π|, and label them B1, B2,..., B|π|. If a pair of different elements belong to the same block of π then they are co-clustered, otherwise they are not.

Our consensus clustering problem is based on similarities (or distances) between set-partitions. There exist many different distance measures to compare two set-partitions (see for example [1, 14, 15]). Some are based on pair counting, others on shared information content.

For our purposes we use a distance measure based on pair counting, known as the symmetric difference distance. This measure is defined on the number of coclustered and not co-clustered pairs in the partitions. Given two set-partitions π1 and π2, let, a equal the number of pairs of elements co-clustered in both partitions, b equal the number of pairs of elements co-clustered in π1, but not in π2, c equal the number of pairs co-clustered in π2, but not in π1, and d the number of pairs not co-clustered in both partitions. (in other words a and d count the number of agreements in both partitions, while b and c count the disagreements). Then, the symmetric difference is defined as n d(π1, π2 ) = b + c = − (a + d). (2) This distance is a metric and is related to the Rand Index, R = (a + d)/ n 2 and other pair-counting measures of partition similarity [16]. The measure is not corrected for chance though, i.e. the distance between two independent set partitions is non-zero on the average, and is dependent on n. A version corrected for chance exists and is related to the Adjusted Rand Index [17]. We note that the symmetric difference metric has a nice property that it is computable in linear time [16] which is one of the reasons why we chose it (the other is the fast update as described later). The adjusted Rand index is given by a complex formula, and although it can be also computed in linear time, we are not aware of a fast update scheme for it. We will use the Adjusted Rand in Sec. 3 where the algorithm complexity is not an issue.

The consensus clustering problem on set-partitions is known as the median partition problem [18].

MEDIAN PARTITION (MP): Given k partitions, π1, π2,..., πk, find a median partition π that minimizes k S= d(πi, π). (3) i=1 When d(.,.) is the symmetric difference distance, MP is known to be NPcomplete in general [19, 20].

Pages:   || 2 | 3 |

Similar works:

«Modulhandbuch 17 831 Molekulare Biotechnologie Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt Technische Universität München http://www.tum.de/ www.wzw.tum.de Modulhandbuch Generiert am 25.08.2013 Verzeichnis Modulbeschreibungen CH0935: Anorganische Chemie (Anorganische Chemie) CH0936: Biochemie 1 (Biochemie 1) WZ2002: Einführung in die Genetik (Einführung in die Genetik) CH0937: Mathematik (Mathematik) PH9913: Experimentalphysik inkl. Praktikum...»

«FACETTEN JOSEPHINE BAKER DER Christa Bruckner-Haring und Ildikó Keikutt-Licht »Sie ist die echteste. In ihr ist das Negertum am reinsten. Sie gibt den Rhythmus. Kommt aus dem Blute. Aus dem Urwald.« So schreibt die Berliner Börsen-Zeitung am 7. Januar 1926 über Josephine Baker als den Star der Show La Revue nègre. Baker eroberte während ihres kurzen Gastspiels im Nelson Theater Berlin in ebensolchem Sturm wie sie auch Paris überwältigt hatte. Sie war die wilde Urwaldschönheit, die...»

«Bishop M. Tottori Papers Finding Aid AJA 005 Archives & Manuscripts Department University of Hawaii at Manoa Library August 2006 Table of Contents Introductory Information..1 Administrative Information.. 2 Biographical Sketch.. 3 Scope & Content Note.. 5 Series Descriptions.. 6 Inventory... 8 Introductory Information Bishop M. Tottori Papers Collection Name: AJA 005 Collection Number: 1943-1975, 2004-2007 Dates: 2.0 linear feet Size of Collection: Bishop M. Tottori Creator of Papers: Sadako...»

«Bedienungsanleitung A.D.J. Supply Europe B.V. Junostraat 2 6468 EW Kerkrade The Netherlands www.americandj.eu 11/10 Inhaltsangabe EINLEITUNG ALLGEMEINE INFORMATIONEN EIGENSCHAFTEN SICHERHEITSHINWEISE LASER – WARNUND SICHERHEITSHINWEISE LASER HINWEISSCHILDER BETRIEB REINIGUNG FEHLERBEHEBUNG TECHNISCHE SPEZIFIKATION: ROHS und WEEE BEMERKUNGEN A.D.J. Supply Europe B.V. – www.americandj.eu – Micro Star Bedienungsanleitung Seite 2 EINLEITUNG Auspacken: Vielen Dank, dass Sie sich für den Kauf...»

«Appendix J Malcolm Grant 2011/USFWS Fencing exclosure to protect shorebirds from predators Predator and Competitor Management Plan for Monomoy National Wildlife Refuge Background and Introduction Background and Introduction Throughout North America, the presence of a single mammalian predator (e.g., coyote, skunk, and raccoon) or avian predator (e.g., great horned owl, black-crowned night-heron) at a nesting site can result in adult bird mortality, decrease or prevent reproductive success of...»

«SERIES IZA DP No. 8961 PAPER Immigrants’ Effect on Native Workers: New Analysis on Longitudinal Data Mette Foged DISCUSSION Giovanni Peri March 2015 Forschungsinstitut zur Zukunft der Arbeit Institute for the Study of Labor Immigrants’ Effect on Native Workers: New Analysis on Longitudinal Data Mette Foged University of Copenhagen Giovanni Peri University of California, Davis and IZA Discussion Paper No. 8961 March 2015 IZA P.O. Box 7240 53072 Bonn Germany Phone: +49-228-3894-0 Fax:...»

«Taxiway embankment over soft ground using staged construction R. Wells, X. Barrett & T. Wells Trigon | Kleinfelder, Inc., Greensboro, North Carolina, United States ABSTRACT: The Piedmont Triad International Airport (PTIA), Greensboro, NC has been undergoing an expansion including a new runway (Runway 5L-23R) and taxiway (Taxiway E) as a result of the construction of a Mid-Atlantic regional hub for Federal Express. The presence of soft ground in the wetlands area at Taxiway E crossing required...»

«Laser-Accelerated Proton Beams as a New Particle Source Laserbeschleunigte Protonenstrahlen als neue Teilchenquelle Zur Erlangung des Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigte Dissertation von Dipl.-Phys. Frank Nürnberg aus Offenbach am Main November 2010 — Darmstadt — D 17 Fachbereich Physik Institut für Kernphysik Laser-Accelerated Proton Beams as a New Particle Source Laserbeschleunigte Protonenstrahlen als neue Teilchenquelle genehmigte Dissertation von...»

«ST. KRISTOF ZAMÁRDI SCAMARD Anno 1082 SCAMARD was the first name of ZAMÁRDI, officially mentioned in a document in 1082. Later on, in 1171 it appeared as ZAMARD in an authentic diploma. Starting in the 18th Century it was referred to as Szamárdi with the first mention of Zamárdi only appearing in 1828. The proximity of lake Balaton and the ferry between Tihany and Szántód gave it importance and attracted man to settle in and around Zamárdi as of 5500 B.C. The area is very valuable for...»

«Materialmappe zur Inszenierung von JOHNNY HÜBNER GREIFT EIN Theaterabenteuer von Hartmut El Kurdi Premiere: 02.10.2008, Überraschungsort Inszenierung: Wilhelm Schlotterer Raum/Kostüm: Britta Lammers Wir setzen die Segel, wir wetzen die Messer, wir fahren raus aufs Meer (JOHNNY HÜBNER) Nele Neitzke Theater Ulm Herbert-von-Karajan-Platz 1 89073 Ulm Tel: 0731-161 4411 E-Mail: theaterpaedagogik@ulm.de Inhalt Einleitung S. 2 Der Autor S. 3 Das Stück S. 3 Premierenkritik S. 4...»

<<  HOME   |    CONTACTS
2016 www.book.dislib.info - Free e-library - Books, dissertations, abstract

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.