CAREER: Evolving and Self-Managing Data Integration Systems
AnHai Doan, University of Illinois at Urbana-Champaign

Project Description

1 Introduction
Data integration has been a long-standing challenge for the database community. Indeed, all six "white papers" on future research directions that our community published between 1989 and 2003 acknowledged the growing need for integrating data from multiple sources [15, 136, 137, 135, 16, 3]. This need has now become critical in numerous contexts, including integrating data on the Web and at enterprises, building e-commerce marketplaces, and analyzing data for scientific research.
Consequently, much research has been conducted on data integration, and many integration architectures have been proposed [56, 134, 122, 138, 139, 5, 22, 141, 61, 71] (see the detailed surveys). A well-known and important architecture is that of virtual data integration systems, which provide a uniform query interface over a multitude of data sources [65, 98, 141, 82, 92, 63, 87, 66, 86]. Over the past decade, such systems have been researched intensively. The bulk of the work has focused on architectural and query processing aspects [76, 65, 98, 141, 71, 82, 92, 9, 53, 54, 147, 125, 6, 24], and has laid a solid foundation on which to build data integration systems. Indeed, if such systems can now be built and deployed widely, they will revolutionize the way we access data [3, 143], and provide a basis on which to build even more advanced information processing systems, such as the recently proposed peer data management systems [33, 14, 69] and systems that integrate Web services.
Unfortunately, today data integration systems are still extremely hard to build and costly to maintain. They must be taught in tedious detail how to interact with the data sources, and must constantly be adjusted for changes at the sources (as Section 1.1 will explain). The laborious teaching and adjustment incur huge expenses and pose a key bottleneck for the widespread deployment of data integration systems in practice. Today, at enterprises, where data integration is frequently a must, it is carried out at tremendous cost, often consuming 35% of the IT budget. On the Web, where data integration systems could vastly simplify the search for information, there are currently few such systems, and those operate only at limited scale. In many domains, such as fire fighting in rural Illinois, where data integration has been identified by the state of Illinois as crucial for effective community defense, it has not been carried out due to the complexity and high cost involved. The recent advent of languages and media for creating and exchanging semi-structured data, such as XML, OWL, and the Semantic Web [145, 13], will further fuel data integration applications and exacerbate the above problem. Thus it has now become critical to develop techniques that enable the efficient construction and maintenance of data integration systems.
In this proposal I describe an integrated research and education plan that addresses the above problem.
I want to achieve the widespread use of data integration systems by making them much easier to build and maintain, with far less need for human supervision. Hence, my research goal is to develop techniques to build data integration systems that learn to evolve and self-manage over time. As such, this research fits into the emerging paradigm of building computing systems (e.g., databases, storage devices, and Internet services) that manage themselves, motivated by the fact that the complexity of such systems is growing rapidly and may soon make it impossible for humans to effectively optimize and maintain them in real time. Prime examples of initiatives in this paradigm include autonomic computing at IBM, self-tuning databases at Microsoft, IBM Almaden, and elsewhere [79, 23, 101, 100, 142], and recovery-oriented computing at Berkeley. The research also fits into the grand challenge of conquering the complexity of large-scale information systems that the Computing Research Association (CRA) recently proposed.
My education goal is to build on the research to prepare students at all levels for the novel challenges posed by data processing in our Internet world, and to educate and engage the public in meeting these challenges. The proposed research and education plan thus lays the foundation for a lifetime career in Computer Science, with a focus on the effective management of distributed and heterogeneous data.
1.1 The Problem

I will now describe the architecture of data integration systems, then explain the construction and maintenance problem in detail. Consider a data integration system over three Web sources that list houses for sale, as shown in Figure 1. To construct such a system, the system builder begins by creating a global schema that captures the relevant aspects of the real-estate domain. The global schema may contain attributes such as address, price, and description, listing the house address, price, and description, respectively.
Next, for each data source, the builder creates a source schema that describes the content of the source, and a wrapper, which is a program that knows how to query and extract data from the source via the source’s query interface. The wrapper also knows how to translate between the source data (e.g., in HTML format) and the data that conform to the source schema (e.g., a set of tuples).
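To make the wrapper's role concrete, here is a toy Python sketch. It is purely illustrative: the source markup pattern, the schema attributes (location, listed-price, comments), and the function name are all invented, not taken from any actual system. The wrapper translates a hypothetical source's HTML result page into tuples that conform to the source schema.

```python
import re

# Hypothetical source markup: each listing appears as
#   <li>ADDRESS | $PRICE | DESCRIPTION</li>
# The invented source schema is (location, listed-price, comments).
LISTING_RE = re.compile(r"<li>(.*?)\s*\|\s*\$([\d,]+)\s*\|\s*(.*?)</li>")

def wrap(html_page):
    """Translate raw HTML from the source into tuples of the source schema."""
    tuples = []
    for location, price, comments in LISTING_RE.findall(html_page):
        tuples.append({
            "location": location.strip(),
            "listed-price": int(price.replace(",", "")),  # normalize "$185,000"
            "comments": comments.strip(),
        })
    return tuples

page = "<ul><li>12 Oak St, Urbana IL | $185,000 | 2br ranch</li></ul>"
print(wrap(page))
# [{'location': '12 Oak St, Urbana IL', 'listed-price': 185000, 'comments': '2br ranch'}]
```

Real wrappers must of course also handle query submission, pagination, and far messier markup; the sketch shows only the translation step from raw source data to schema-conformant tuples.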
Finally, the system builder creates a set of semantic mappings that relate the attributes of the global schema to those of the source schemas. (These mappings are shown as bold arrows in Figure 1.) Examples of such mappings are "attribute address of the global schema maps to attribute location of the schema of realestate.com" and "price maps to listed-price".
Now given a user query such as "find houses with 2 bedrooms priced under $200K", which is formulated over the global schema, the system can use the semantic mappings to reformulate the query into queries over the source schemas. Next, it optimizes these queries, executes them with the help of the wrappers, then combines the data returned from the sources. Since in practice data sources often contain duplicate items (e.g., the same house listing) [78, 140, 47], the builder frequently must write a program to detect and eliminate duplicates from the combined data, before presenting the final answers to the user query.
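The reformulation step can be sketched in a few lines. Here the semantic mappings are represented, purely for illustration, as per-source dictionaries from global-schema attributes to source-schema attributes; the source names, attribute names, and query encoding are invented.

```python
# Invented semantic mappings: for each source, a dictionary from
# global-schema attributes to that source's own attributes.
MAPPINGS = {
    "realestate.com": {"address": "location", "price": "listed-price",
                       "description": "comments"},
    "homeseekers.com": {"address": "addr", "price": "asking-price",
                        "description": "details"},
}

def reformulate(source, selections):
    """Rewrite a query, given as {global-attr: predicate}, into the
    vocabulary of one source's schema."""
    m = MAPPINGS[source]
    return {m[attr]: pred for attr, pred in selections.items()}

# "houses priced under $200K", posed against the global schema:
query = {"price": "< 200000"}
for src in MAPPINGS:
    print(src, reformulate(src, query))
# realestate.com {'listed-price': '< 200000'}
# homeseekers.com {'asking-price': '< 200000'}
```

A full query processor would also push the reformulated predicates through the wrappers, optimize across sources, and merge (and deduplicate) the answers; the sketch isolates only the mapping-driven rewrite.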
The Problem: As described, when constructing a data integration system, the system builder must carry out a set of fundamental tasks, such as global schema creation, wrapper construction, schema matching, and so on. These tasks are well known to be very difficult [10, 126, 3, 90, 37], primarily because they require reasoning about data semantics (e.g., is house-description the same as house-style?). Even though some semi-automatic techniques have been proposed, no satisfactory solution has been found. Hence today the system builder still executes these tasks largely by hand, in an extremely labor-intensive and error-prone process [126, 37].
To make matters worse, in dynamic environments such as the Web, sources often change their query interfaces, data formats, or presentation styles [89, 96]. Such changes can invalidate a wrapper or a semantic mapping, causing system failure. Hence, the builder must continuously monitor the deployed system to detect and repair failures. The prohibitive cost of manual monitoring exacerbates the problem, and makes data integration systems virtually impractical at Web scale.
1.2 The Proposed Solution

Vision: To address the above problem, I propose to build data integration systems that learn to evolve and self-manage. Imagine that we want to construct a system over 100 real-estate Web sources. Instead of spending months building the system, and having parts of it stop working (due to changes at the sources) even before the system is completely built, we can start by building a small system over, say, 5 sources. This way we will soon have a system up and running.
The system then evolves by expanding to cover new data sources, and eventually will cover all 100 sources.
(The system may also evolve by "probing" the current sources under its "control" to gather more metadata, such as source latency and quality statistics [114, 116, 60]; but we leave this scenario as a future extension of the current research.) While evolving, the system maintains the sources under its "control", continuously monitoring them to detect and repair failures. The system will still require interaction with humans (i.e., the system builder), but far less than current practice demands. Such interaction may be frequent at the beginning; as the system learns more, the amount of interaction decreases.
Key Idea: To realize the above vision, a key idea underlying my solution is that the system can learn from many types of information and entities in the environment in order to evolve and self-manage. It can learn from past construction activities, data in the domain, other systems, domain knowledge supplied by the builder, behaviors of systems and sources, and even from the multitude of users. For example, when evolving, the system can learn from past schema matching activities (at the sources already within the system) to successfully match the schemas of new data sources. When self-maintaining, the system can learn from the behaviors of the sources to detect failures. Suppose the system knows that two specific sources behave similarly, in that they return very similar answers to the same query (e.g., analogous to the way amazon.com and barnesandnoble.com behave with respect to book listings). If one source then starts to behave in a manner highly dissimilar to the other, the system can predict that a source failure has happened. The research plan (Section 3) gives examples of other learning scenarios.
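The failure-detection idea just described can be illustrated with a deliberately simplified sketch (not the proposed method itself; the similarity measure, threshold, and all names are invented): compare two correlated sources' answer sets to the same probe query using Jaccard similarity, and flag a likely failure when the current similarity falls far below its historical average.

```python
def jaccard(a, b):
    """Overlap between two answer sets, as a fraction of their union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def detect_failure(history, current, threshold=0.5):
    """Flag a likely source failure when two sources that historically
    returned near-identical answers suddenly diverge.
    history: past similarity scores for this source pair.
    current: the pair's answer sets to the latest probe query."""
    baseline = sum(history) / len(history)
    return baseline - jaccard(*current) > threshold

# Two sources agreed closely on past probe queries...
past_similarities = [0.9, 0.95, 0.88]
# ...but one now returns an empty answer to the same query:
print(detect_failure(past_similarities, (["h1", "h2", "h3"], [])))
# True
```

In practice one would track many source pairs, weight recent probes more heavily, and distinguish a failed wrapper from a genuinely changed source, but the sketch captures the core signal: a sudden drop in cross-source agreement.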
A particularly interesting type of entity that the system can learn from is the multitude of users who use the current or other systems in the domain. Indeed, whenever the system employs automatic techniques to arrive at a prediction (e.g., "this URL contains a query interface" or "the wrapper no longer extracts data for price correctly"), it can ask the users to judge the correctness of the prediction. This way, the enormous burden on the system builder is spread thinly over a mass of users.
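As a toy illustration of how such user judgments might be aggregated (the voting scheme, threshold, and names here are invented, not part of the proposal), a system prediction can be accepted or rejected by majority vote once enough independent users have weighed in:

```python
from collections import Counter

def verdict(judgments, min_votes=3):
    """Accept or reject a system prediction from independent user judgments
    (True = user says the prediction is correct); abstain until enough
    votes have accumulated."""
    if len(judgments) < min_votes:
        return "undecided"
    tally = Counter(judgments)
    return "correct" if tally[True] > tally[False] else "incorrect"

# Prediction: "this URL contains a query interface". Five users weigh in:
print(verdict([True, True, False, True, True]))  # correct
print(verdict([True]))                           # undecided
```

A real deployment would also have to weight users by past reliability and guard against malicious voters, issues the mass-collaboration literature cited above examines.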
Mass collaboration has been employed quite successfully in open-source software (e.g., Linux), product reviews (e.g., amazon.com), and collaborative filtering. It has recently attracted the attention of database and data mining researchers. Raghu Ramakrishnan proposed using mass collaboration to build tech-support websites, while Rakesh Agrawal and Pedro Domingos applied it to managing user trust in collaborative environments. In this research, I propose to apply it in conjunction with automatic techniques to build data integration systems.
1.3 Objectives, Feasibility, and Significance

The goal of this proposal is to make fundamental contributions toward realizing the vision, drawing on the above core idea. Specifically, I will make contributions to the following central challenges:
System Creation & Evolution: When constructing the initial system, as well as when evolving it, we must perform a set of labor-intensive tasks. How can we develop effective semi-automatic techniques for these tasks, to reduce human labor? Prior research has not exploited useful sources of information, such as past activities and external data. I will develop novel techniques that leverage such information to maximize task accuracy.
System Maintenance: When sources change, system components fail. How can a system detect such failures with minimal human intervention? Very little work has been conducted on this topic, and existing solutions are ad hoc. I will develop principled methods that learn from the current system state, the environment, and the behavior of the sources to efficiently detect system failures.
Mass Collaboration: Can the system further reduce the tremendous labor of the system builder by spreading much of it thinly over the mass of users? Though promising, mass collaboration has not been applied to building data integration systems. I will develop techniques to apply mass collaboration and examine its limitations.