CAREER: Evolving and Self-Managing Data Integration Systems

AnHai Doan, University of Illinois at Urbana-Champaign

Project Description

1 Introduction

Data integration has been a long-standing challenge for the database community. Indeed, all six “white papers” on future research directions that our community has published (in 1989-2003) acknowledged the growing need for integrating data from multiple sources [15, 136, 137, 135, 16, 3]. This need has now become critical in numerous contexts, including integrating data on the Web and at enterprises, building e-commerce market places, and analyzing data for scientific research [3].

Consequently, much research has been conducted on data integration, and many integration architectures have been proposed [56, 134, 122, 138, 139, 5, 22, 141, 61, 71] (see [55] for detailed surveys). A well-known and important architecture is that of virtual data integration systems, which provide a uniform query interface over a multitude of data sources [65, 98, 141, 82, 92, 63, 87, 66, 86]. Over the past decade, such systems have been researched intensively. The bulk of the work has focused on architectural and query processing aspects [76, 65, 98, 141, 71, 82, 92, 9, 53, 54, 147, 125, 6, 24], and has laid down a solid foundation on which to build data integration systems. Indeed, if such systems can now be built and deployed widely, they would revolutionize the way we access data [3, 143], and provide a basis on which to build even more advanced information processing systems, such as recently proposed peer data management systems [33, 14, 69] and systems that integrate Web services [113].

Unfortunately, today data integration systems are still extremely hard to build and costly to maintain. They must be taught in tedious detail how to interact with the data sources and must constantly be adjusted for changes at the sources (as Section 1.1 will explain). The laborious teaching and adjustment incur huge expenses, and pose a key bottleneck for the widespread deployment of data integration systems in practice. Today, at enterprises, where data integration is frequently a must [3], it is carried out at a tremendous cost, often at 35% of the IT budget [88]. On the Web, where data integration systems can vastly simplify the search for information, there are currently few such systems and at limited scales. In many domains such as fire fighting in rural Illinois, where data integration has been identified by the state of Illinois as crucial for effective community defense [29], it has not been carried out due to the complexity and the high cost involved [130]. The recent advent of languages and mediums for creating and exchanging semi-structured data, such as XML, OWL, and the Semantic Web [145, 13], will further fuel data integration applications and exacerbate the above problem. Thus it has now become critical to develop techniques that enable the efficient construction and maintenance of data integration systems.

In this proposal I describe an integrated research and education plan that addresses the above problem. I want to achieve the widespread use of data integration systems by making them much easier to use, with far less need for human supervision. Hence, my research goal is to develop techniques to build data integration systems that learn to evolve and self manage over time. As such, this research fits into the emerging paradigm of building computing systems (e.g., databases, storage devices, and Internet services) that manage themselves, motivated by the fact that the complexity of such systems is growing rapidly and can soon make it impossible for humans to effectively optimize and maintain them in real time. Prime examples of initiatives in this paradigm include autonomic computing at IBM [81], self-tuning databases at Microsoft, IBM Almaden, and others [79, 23, 101, 100, 142], and recovery oriented computing at Berkeley [80]. The research also fits into the grand challenge on conquering the complexity of large-scale information systems that the Computing Research Association (CRA) recently proposed [30].

My education goal is to build on the research to prepare students at all levels for the novel challenges posed by data processing in our Internet world, and to educate and engage the public in meeting these challenges. The proposed research and education plan thus lays the foundation for a lifetime career in Computer Science, with a focus on the effective management of distributed and heterogeneous data.


1.1 The Problem

I will now describe the architecture of data integration systems, then explain the construction and maintenance problem in detail. Consider a data integration system over three Web sources that list houses for sale, as shown in Figure 1. To construct such a system, the system builder begins by creating a global schema that captures the relevant aspects of the real-estate domain. The global schema may contain attributes such as address, price, and description, listing the house address, price, and description, respectively.

Next, for each data source, the builder creates a source schema that describes the content of the source, and a wrapper, which is a program that knows how to query and extract data from the source via the source’s query interface. The wrapper also knows how to translate between the source data (e.g., in HTML format) and the data that conform to the source schema (e.g., a set of tuples).

Finally, the system builder creates a set of semantic mappings that relate the attributes of the global schema to those of the source schemas. (These mappings are shown as bold arrows in Figure 1.) Examples of such mappings are “attribute address of the global schema maps to attribute location of the schema of realestate.com” and “price maps to listed-price”.
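As a minimal sketch, the two example mappings above could be recorded as a simple lookup table and used to translate a source tuple into global-schema terms. The representation as a Python dictionary and the names SEMANTIC_MAPPINGS and to_global are assumptions made for illustration only; they are not part of the proposal.

```python
# Hypothetical representation of semantic mappings:
# source name -> {global-schema attribute: source-schema attribute}.
# Only the two mappings quoted in the text are included.
SEMANTIC_MAPPINGS = {
    "realestate.com": {
        "address": "location",
        "price": "listed-price",
    },
}


def to_global(source, row):
    """Rename the attributes of one source tuple to global-schema names.

    Attributes that have no mapping are dropped.
    """
    inverse = {src: glob for glob, src in SEMANTIC_MAPPINGS[source].items()}
    return {inverse[attr]: value for attr, value in row.items() if attr in inverse}


listing = {"location": "12 Elm St", "listed-price": 180000, "agent": "J. Smith"}
print(to_global("realestate.com", listing))
```

In this toy form each mapping is a one-to-one attribute renaming; mappings in practice can be considerably more complex.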

Now given a user query such as “find houses with 2 bedrooms priced under $200K”, which is formulated over the global schema, the system can use the semantic mappings to reformulate the query into queries over the source schemas. Next, it optimizes these queries, executes them with the help of the wrappers, then combines the data returned from the sources. Since in practice data sources often contain duplicate items (e.g., the same house listing) [78, 140, 47], the builder frequently must write a program to detect and eliminate duplicates from the combined data, before presenting the final answers to the user query.
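The query-answering steps just described can be sketched end to end as follows. This is an illustration under stated assumptions, not the system's implementation: the second source (homes.example), the attribute names beds, addr, and num-bedrooms, the in-memory tables standing in for wrappers, and the rule that two listings with the same normalized address are duplicates are all hypothetical. Query optimization is omitted.

```python
import operator

# Comparison operators supported by this toy query language (an assumption).
OPS = {"<": operator.lt, "=": operator.eq}

# Hypothetical semantic mappings: global attribute -> source attribute.
MAPPINGS = {
    "realestate.com": {"address": "location", "price": "listed-price", "bedrooms": "beds"},
    "homes.example": {"address": "addr", "price": "price", "bedrooms": "num-bedrooms"},
}

# In-memory tables standing in for the data a wrapper would extract.
SOURCES = {
    "realestate.com": [
        {"location": "12 Elm St", "listed-price": 180000, "beds": 2},
        {"location": "9 Oak Ave", "listed-price": 250000, "beds": 2},
    ],
    "homes.example": [
        {"addr": "12 elm st", "price": 180000, "num-bedrooms": 2},
        {"addr": "4 Pine Rd", "price": 150000, "num-bedrooms": 3},
    ],
}


def reformulate(query, mapping):
    """Rewrite a query over the global schema into one over a source schema."""
    return [(mapping[attr], op, value) for attr, op, value in query]


def run_source(source, source_query):
    """Stand-in for a wrapper: filter the source's tuples by the query."""
    return [
        row for row in SOURCES[source]
        if all(OPS[op](row[attr], value) for attr, op, value in source_query)
    ]


def answer(query):
    """Reformulate, execute at each source, combine, and drop duplicates."""
    seen, results = set(), []
    for source, mapping in MAPPINGS.items():
        inverse = {src: glob for glob, src in mapping.items()}
        for row in run_source(source, reformulate(query, mapping)):
            house = {inverse[attr]: value for attr, value in row.items()}
            key = house["address"].strip().lower()  # naive duplicate key
            if key not in seen:
                seen.add(key)
                results.append(house)
    return results


# "find houses with 2 bedrooms priced under $200K"
query = [("bedrooms", "=", 2), ("price", "<", 200000)]
print(answer(query))
```

Real duplicate detection is much harder than comparing normalized addresses; the exact-match key here only marks where that step sits in the pipeline.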

The Problem: As described, when constructing a data integration system, the system builder will have to carry out a set of fundamental tasks, such as global schema creation, wrapper construction, schema matching, and so on. These tasks are well known to be very difficult [10, 126, 3, 90, 37], primarily because they require reasoning about data semantics (e.g., is house-description the same as house-style?). Even though some semi-automatic techniques have been proposed, no satisfactory solution has been found. Hence today the system builder still executes these tasks largely by hand, in an extremely labor-intensive and error-prone process [126, 37].

To make matters worse, in dynamic environments, such as the Web, sources often change their query interface, data formats, or presentation styles [89, 96]. Such changes can invalidate a wrapper or a semantic mapping, causing system failure. Hence, the builder must continuously monitor the deployed system, to detect and repair failures [27]. The prohibitive cost of manual monitoring exacerbates the problem, and makes data integration systems virtually impractical on a Web scale.

1.2 The Proposed Solution

Vision: To address the above problem, I propose to build data integration systems that learn to evolve and self manage. Imagine that we want to construct a system over 100 real-estate Web sources. Instead of spending months building the system, and having parts of it already stop working (due to changes at the sources) even before the system is completely built, we can start by building only a small system over, say, 5 sources. This way we will soon have a system up and running. The system then evolves by expanding to cover new data sources, and eventually will cover all 100 sources. (The system may also evolve by “probing” the current sources under its “control” to gather more meta data, such as source latency and quality statistics [114, 116, 60]; but we leave this scenario as a future extension of the current research.) While evolving, the system maintains the sources under its “control”, by continuously monitoring them to detect and repair failures. The system will still require interaction with humans (i.e., the system builder), but far less than current practice demands. Such interaction may be frequent at the beginning, but as the system learns more, the amount of interaction decreases.

Key Idea: To realize the above vision, a key idea underlying my solution is that the system can learn from many types of information and entities in the environment to evolve and self manage. It can learn from past construction activities, data in the domain, other systems, domain knowledge supplied by the builder, behaviors of systems and sources, and even from the multitude of users. For example, when evolving, the system can learn from past schema matching activities (at the sources already within the system), to successfully match the schemas of new data sources. When self maintaining, the system can learn from the behaviors of the sources to detect failures. Suppose the system knows that two specific sources behave similarly, in that they return very similar answers to the same query (e.g., analogous to the way amazon.com and barnsandnobles.com behave with respect to book listings). If one source then starts to behave in a manner highly dissimilar to the other, the system can predict that a source failure has happened. The research plan (Section 3) gives examples of other learning scenarios.
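The two-source scenario can be illustrated with a small sketch. It assumes that the answers of each source are reduced to sets of item identifiers, that similarity is measured with the Jaccard coefficient, and that a failure is suspected when the latest similarity falls well below the historical average. The function names and the drop threshold of 0.5 are hypothetical choices for illustration, not methods stated in the proposal.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of answer identifiers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty answers are treated as identical
    return len(a & b) / len(a | b)


def suspect_failure(history, current, drop=0.5):
    """Flag a possible source failure.

    history: similarity scores observed for the same pair of sources in the past
    current: similarity score for the latest query
    drop:    how far below the historical mean counts as suspicious (assumed value)
    """
    baseline = sum(history) / len(history)
    return current < baseline - drop


# Past queries: the two sources returned largely overlapping answers.
history = [0.9, 0.8, 1.0]

# Latest query: one source now returns almost nothing in common.
latest = jaccard(["isbn-1", "isbn-2", "isbn-3"], ["isbn-3", "isbn-7", "isbn-8"])
print(latest, suspect_failure(history, latest))
```

A check of this kind only signals that the pair has diverged; deciding which of the two sources has failed requires further evidence, such as comparison against a third source.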

A very interesting type of entity that the system can learn from is the multitude of users who use the current or other systems in the domain. Indeed, whenever the system employs automatic techniques to arrive at a prediction (e.g., “this URL contains a query interface” or “the wrapper no longer extracts data for price correctly”), it can ask the users to judge the correctness of the prediction. This way, the enormous burden of the system builder will be spread thinly over a mass of users.

Mass collaboration has been employed quite successfully in open-source software (e.g., Linux), product reviews (e.g., amazon.com), and collaborative filtering [129]. It has recently attracted the attention of database and data mining researchers. Raghu Ramakrishnan proposed to use mass collaboration to build tech support websites [127], while Rakesh Agrawal and Pedro Domingos applied it to manage user trust in collaborative environments [4]. In this research, I propose to consider applying it in conjunction with automatic techniques to build data integration systems.
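One simple way to combine user judgments with an automatic prediction is sketched below. It assumes that each user answers yes or no to a prediction, and that the system accepts or rejects the prediction only once enough users have answered and a clear majority agrees. The function name aggregate_votes and the values of min_votes and threshold are hypothetical; the proposal does not prescribe a particular aggregation scheme.

```python
def aggregate_votes(votes, min_votes=3, threshold=0.7):
    """Combine yes/no user judgments about one automatic prediction.

    votes: list of booleans, True meaning "the prediction is correct".
    Returns True (accept), False (reject), or None (undecided, so ask
    more users or defer to the system builder).
    """
    if len(votes) < min_votes:
        return None
    share_yes = sum(votes) / len(votes)
    if share_yes >= threshold:
        return True
    if share_yes <= 1 - threshold:
        return False
    return None


# Users judging the prediction "this URL contains a query interface".
print(aggregate_votes([True, True, True, False]))   # clear majority says yes
print(aggregate_votes([True, False]))               # too few answers so far
print(aggregate_votes([True, False, True, False]))  # users disagree
```

Plain majority voting treats all users as equally reliable; a scheme that weights users by their past accuracy would be a natural refinement.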

1.3 Objectives, Feasibility, and Significance

The goal of this proposal is to make fundamental contributions toward realizing the vision, drawing from the above core idea. Specifically, I will make contributions to the following central challenges:

System Creation & Evolution: When constructing the initial system as well as when evolving it, we have to perform a set of labor-intensive tasks. How can we develop effective semi-automatic techniques for these tasks, to reduce human labor? Prior research has not exploited useful sources of information, such as past activities and external data. I will develop novel techniques that leverage such information to maximize task accuracy.

System Maintenance: When sources change, system components fail. How can a system detect such failures, with minimal human intervention? Very little work has been conducted on this topic, and the existing solutions are ad hoc. I will develop principled methods that learn from the current system state, the environment, and the behavior of the sources to efficiently detect system failures.

Mass Collaboration: Can the system further reduce the tremendous labor of the system builder by spreading much of it thinly over the mass of users? Though promising, mass collaboration has not been applied to building data integration systems. I will develop techniques to apply mass collaboration and examine its limitations.
