
Data Mining Model for the Data Retrieval from Central Server Configuration

Srivatsan Sridharan, Kausal Malladi and Yamini Muralitharan

Department of Computer Science, IIIT - Bangalore, India.

{vatsansridharan,kausalmalladi}@gmail.com

Department of Software Engineering, IIIT - Bangalore, India.

{yams25}@gmail.com

Abstract.

A server that has to keep track of heavy document traffic is unable to filter the documents that are most relevant and up to date for continuous text search queries. This paper focuses on handling continuous text extraction while sustaining high document traffic. The main objective is to retrieve the most recently updated documents that are most relevant to the query by applying a sliding window technique. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. It also eliminates duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback, and higher-ranked documents are given priority for retrieval.

Keywords.

Continuous Text Queries, MapReduce Technique, Sliding Window.

1. Introduction

Data-intensive applications such as electronic mail, news feeds, telecommunication management, and automation of business reporting raise the need for a continuous text search and monitoring model. In this model the documents arrive at the monitoring server in the form of a stream.

Each query Q continuously retrieves, from a sliding window of the most recent documents, the k documents that are most similar to a fixed set of search terms.

Sliding window. This window reflects the interest of the users in the newest available documents. It can be defined in two alternative ways: a) a count-based window contains the N most recent documents, for some constant N; b) a time-based window contains only the documents that arrived within the last N time units.
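To make the two window definitions concrete, the following is a minimal Python sketch (not taken from the paper) that maintains either a count-based or a time-based window over an incoming document stream; the class and field names are illustrative assumptions.

```python
from collections import deque
import time

class SlidingWindow:
    """Keeps only the documents admitted by a count-based or time-based window."""

    def __init__(self, mode="count", N=1000):
        self.mode = mode          # "count": keep the N newest docs; "time": keep docs newer than N seconds
        self.N = N
        self.window = deque()     # items are (arrival_time, doc_id), oldest at the head

    def add(self, doc_id, arrival_time=None):
        arrival_time = arrival_time if arrival_time is not None else time.time()
        self.window.append((arrival_time, doc_id))
        self._expire(arrival_time)

    def _expire(self, now):
        if self.mode == "count":
            while len(self.window) > self.N:                  # drop the oldest documents beyond N
                self.window.popleft()
        else:
            while self.window and now - self.window[0][0] > self.N:
                self.window.popleft()                         # drop documents older than N time units

    def contents(self):
        return [doc_id for _, doc_id in self.window]
```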

Thus, although a document which may be relevant to a query, it is ignored, because it may not satisfy the time and count constraints of the user. Incremental threshold method. The quintessence of the algorithm is to employ threshold-based techniques to derive the initial result for a query, and then continue to update the threshold to reflect document arrivals and expirations. At its core lies a memory-based index similar to the conventional inverted file, complimented with fast updated techniques. MapReduce technique. MapReduce is a powerful

platform for large scale data processing. This technique involves two steps namely a) map step:

The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level structure. The worker node processes the smaller problem, and passes the answer back to its master node, b) reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. Unsupervised duplicate detection. [3] The problem of identifying objects in databases that refer to the same real world entity, is known, among others, as duplicate detection or record linkage. Here this method is used to identify documents that are all alike and prevent them from being prepared in the result set. Our paper also focuses on ranking the documents based on user feedback. The user is allowed to give feedback for each document that has been retrieved. This feedback is used to rank the document and hence increase the probability of the document to appear in the sliding window.
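The paper cites [3] for unsupervised duplicate detection without fixing a particular algorithm. As one hedged illustration, the sketch below flags a newly arrived document as a duplicate when the Jaccard similarity of its term set with an already retrieved document exceeds a threshold; the threshold value and the term-set representation are assumptions, not the authors' specification.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_duplicate(new_terms, accepted_term_sets, threshold=0.9):
    """True if the new document's term set is near-identical to any
    document already placed in the result set."""
    return any(jaccard(new_terms, terms) >= threshold for terms in accepted_term_sets)

# Example: the first candidate is suppressed as a duplicate, the second is kept.
accepted = [{"stream", "query", "window", "index"}]
print(is_duplicate({"stream", "query", "window", "index"}, accepted))   # True
print(is_duplicate({"mapreduce", "worker", "master"}, accepted))        # False
```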

Visual Web Spider is a fully automated, multi-threaded web crawler that allows us to index and collect specific web pages on the Internet. Once installed, it enables us to browse the Web in an automated manner, indexing pages that contain specific keywords and phrases and exporting the indexed data to a database on our local computer in the format of our choice. We want to collect website links to build our own specialized web directory. Visual Web Spider is easy to configure: the program's friendly, wizard-driven interface lets us customize the search in a step-by-step manner.

To index relevant web pages, we follow a simple sequence of steps. After opening the wizard, enter the starting web page URL or let the program generate URL links based on specific keywords or phrases. Then set the crawling rules and depth according to the search strategy. Finally, specify the data to index and the project filename. That is pretty much it: clicking on 'Start' sets the crawler to work. Crawling is fast, thanks to multi-threading that allows up to 50 simultaneous threads. Another useful feature is that Visual Web Spider can index the content of any HTML tag, such as the page title (TITLE tag), page text (BODY tag), HTML code (HTML tag), header text (H1-H6 tags), bold text (B tags), anchor text (A tags), alt text (IMG tag, ALT attribute), keywords and description (META tags), and others. The program can also list each page's size and last modified date. Once the web pages have been indexed, Visual Web Spider can export the indexed data to any of the following formats: Microsoft Access, Excel (CSV), TXT, HTML, and MySQL script.





1.1. Key Features

A personal, customizable web crawler. Its key features are:
- Crawling rules.
- Multi-threaded technology (up to 50 threads).
- Support for the robots exclusion protocol/standard (robots.txt file and Robots META tags).
- Indexing of the contents of any HTML tag.
- Indexing rules.
- Export of the indexed data into a Microsoft Access database, TEXT file, Excel file (CSV), HTML file, or MySQL script file.
- Crawling from a list of URLs specified by the user.
- Crawling using keywords and phrases.
- Storage of web pages and media files on the local disk.
- Auto-resolution of the URLs of redirected links.
- Detection of broken links.
- Filtering of the indexed data.

2. Existing System

The drawbacks of the existing servers that handle heavy document traffic are: they cannot efficiently monitor a data stream with highly dynamic document traffic; the server alone does the processing, which is very time consuming; for continuous text search queries and extraction, the entire document set has to be scanned every time in order to find the relevant documents; there is no guarantee that duplicate documents are not retrieved for a given query; and a large number of documents cannot be kept in main memory, as that involves a large CPU cost.

Naïve solution. The most straightforward approach to evaluating the continuous queries defined above is to scan the entire window contents D after every update, or at fixed time intervals, compute all the document scores, and report the top-k documents. This method incurs high processing costs due to the need for frequent recomputations from scratch.
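A minimal sketch of this naive evaluation (illustrative code, not the paper's implementation): after every update the whole window D is rescanned, every document score is recomputed from scratch, and the top-k is reported. The variable names loosely follow the paper's notation and are otherwise assumptions.

```python
import heapq

def score(doc_terms, query_terms):
    # S(d|Q): sum of the term frequencies w_dt for the query terms present in d.
    return sum(doc_terms.get(t, 0) for t in query_terms)

def naive_topk(window, query_terms, k):
    """Rescan the entire window contents D and recompute every score from scratch."""
    scored = ((score(doc_terms, query_terms), doc_id) for doc_id, doc_terms in window.items())
    return heapq.nlargest(k, scored)

# window maps doc_id -> {term: frequency}; this is rerun after every arrival or expiration.
window = {"d1": {"stream": 3, "query": 1}, "d2": {"query": 4}, "d3": {"index": 2}}
print(naive_topk(window, {"stream", "query"}, k=2))   # [(4, 'd2'), (4, 'd1')]
```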

3. Proposed System

3.1. Problem Formulation

In our model, a stream of documents flows into a central server. The user registers text queries at the server, which is then responsible for continuously monitoring/reporting their results. As in most stream processing systems, we store all the data in main memory in order to cope with frequent updates, and design our methods with the primary goal of minimizing the CPU cost. Moreover, it is necessary to reduce the workload of the monitoring server.

3.2. Proposed Solution

In our solution we use the MapReduce technique in order to reduce the workload of the central server: the server acts as the master node, which splits the processing task across several worker nodes. The number of worker nodes assigned to the processing task depends on the nature of the query that has been put up by the user. The master node, upon receiving a query from the user, assigns the workers to compute the relevant partial result sets for the query and return them to the master node. The master node, after receiving the partial solutions from the workers, integrates the results to produce the final result set for the given query. This is shown schematically in Fig. 1. Each worker/slave node uses the incremental threshold algorithm for computing the result set of the k most relevant and recent documents for the given query. The overall system architecture is shown in Fig. 2.

Figure 1. System architecture for the proposed data retrieval model.
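As a hedged sketch of the master/worker flow just described (not the authors' implementation), the master below partitions the document stream among worker nodes, each worker computes its local top-k for the query, and the master merges the partial results into the final answer. In a real deployment the workers would be separate processes or machines; here plain function calls stand in for the network layer, and all names are illustrative.

```python
import heapq

def worker_topk(doc_partition, query_terms, k):
    """One worker node: score only its own partition and return a partial top-k."""
    scored = ((sum(terms.get(t, 0) for t in query_terms), doc_id)
              for doc_id, terms in doc_partition.items())
    return heapq.nlargest(k, scored)

def master_query(partitions, query_terms, k):
    """Master node: dispatch the query to every worker and merge the partial results."""
    partial_results = [worker_topk(p, query_terms, k) for p in partitions]
    return heapq.nlargest(k, (entry for partial in partial_results for entry in partial))

partitions = [
    {"d1": {"stream": 3, "query": 1}},        # worker 1's share of the stream
    {"d2": {"query": 4}, "d3": {"index": 2}}  # worker 2's share
]
print(master_query(partitions, {"stream", "query"}, k=2))   # [(4, 'd2'), (4, 'd1')]
```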

Figure 2. A data retrieval system using MapReduce.

Each element of the input stream comprises a document d, a unique document identifier, the document arrival time, and a composition list. The composition list contains one (t, wdt) pair for each term t ∈ T that appears in the document, where wdt is the frequency of the term t in the document d. The notations in this model are listed in Fig. 3.

Figure 3. A detailed list of the notations.
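For concreteness, one possible representation of a stream element and its composition list is sketched below; the field names are assumptions consistent with the description above, not the paper's definitions.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class StreamDocument:
    doc_id: str                       # unique document identifier
    arrival_time: float               # time the document entered the stream
    text: str                         # raw document content
    composition: Dict[str, int] = field(default_factory=dict)  # term t -> frequency w_dt

d = StreamDocument(doc_id="d42", arrival_time=17.0, text="stream query window",
                   composition={"stream": 1, "query": 1, "window": 1})
```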

The worker node maintains an inverted index over the terms t that appear in the documents. With the inverted index, a query Q is processed as follows: the inverted lists of the terms t ∈ Q are scanned, and the partial wdt scores of each encountered document d are accumulated to produce S(d|Q). The documents with the highest scores at the end are returned as the result.
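A minimal sketch of this query evaluation over the inverted index (illustrative code, not the paper's): the inverted list of each query term is scanned and the partial wdt scores are accumulated into S(d|Q).

```python
from collections import defaultdict
import heapq

# Inverted index: term t -> inverted list of impact entries (doc_id, w_dt).
inverted_index = {
    "stream": [("d1", 3)],
    "query":  [("d1", 1), ("d2", 4)],
    "index":  [("d3", 2)],
}

def evaluate(query_terms, k):
    scores = defaultdict(int)
    for t in query_terms:                          # scan the inverted list L_t of every query term
        for doc_id, w_dt in inverted_index.get(t, []):
            scores[doc_id] += w_dt                 # accumulate partial scores into S(d|Q)
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

print(evaluate({"stream", "query"}, k=2))          # [(4, 'd2'), (4, 'd1')]
```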

3.3. Incremental Threshold Algorithm

Fig. 3 shows the data structures used in this system. The valid documents D are stored in a single list, shown at the bottom of the figure. Each element of the list holds the full information of a document (identifier, text content, composition list, arrival time). D contains the most recent documents for both count-based and time-based windows. Since documents expire in a first-in-first-out manner, D is maintained efficiently by inserting arriving documents at the end of the list and deleting expiring ones from its head. On top of the list of valid documents we build an inverted index. The structure at the top of the figure is the dictionary of search terms. It is an array that contains an entry for each term t ∈ T.

The dictionary entry for t stores a pointer to the corresponding inverted list Lt. Lt holds an impact entry for each document d that contains t, together with a pointer to d's full information in the document list. When a document d arrives, an impact entry (d, wdt) (derived from d's composition list) is inserted into the inverted list of each term t that appears in d. Likewise, the impact entries of an expiring document are removed from the respective inverted lists. To keep the inverted lists sorted on wdt while supporting fast (logarithmic) insertions and deletions, each list can be maintained as an ordered structure such as a balanced search tree.
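The arrival/expiration bookkeeping described above can be sketched as follows, reusing the StreamDocument sketch shown earlier (a simplified illustration, not the authors' exact structures): the valid documents D live in a FIFO list, and each inverted list is kept ordered on wdt. For brevity the sketch uses Python's bisect module on ordinary lists, which gives logarithmic search but linear-time insertion; the logarithmic updates required by the paper would call for a balanced tree or skip list instead.

```python
import bisect
from collections import deque, defaultdict

documents = deque()                 # D: valid documents, oldest at the head (FIFO)
inverted = defaultdict(list)        # term t -> inverted list of (w_dt, doc_id), kept sorted on w_dt

def on_arrival(doc):
    """Append the arriving document to D and insert its impact entries."""
    documents.append(doc)
    for t, w_dt in doc.composition.items():
        bisect.insort(inverted[t], (w_dt, doc.doc_id))   # keep L_t ordered on w_dt

def on_expiration():
    """Remove the oldest document from D and delete its impact entries."""
    doc = documents.popleft()
    for t, w_dt in doc.composition.items():
        lst = inverted[t]
        i = bisect.bisect_left(lst, (w_dt, doc.doc_id))  # locate the entry to delete
        if i < len(lst) and lst[i] == (w_dt, doc.doc_id):
            del lst[i]
    return doc
```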


