International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.2, March 2013
DATA, TEXT, AND WEB MINING FOR BUSINESS
INTELLIGENCE: A SURVEY
Abdul-Aziz Rashid Al-Azmi
Department of Computer Engineering, Kuwait University, Kuwait
The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations and predicting future events from vast amounts of data.
This uncovered knowledge helps in gaining competitive advantages, building better customer relationships, and even detecting fraud. In this survey, we describe how these techniques work and how they are implemented.
Furthermore, we discuss how business intelligence is achieved using these mining tools. We then look into some case studies of success stories using mining tools. Finally, we present some of the main challenges that limit the potential of mining technologies.
KEYWORDS business intelligence, competitive advantage, data mining, information systems, knowledge discovery
1. INTRODUCTION We live in a data-driven world, the direct result of advances in information and communication technologies. Millions of knowledge resources are made available by the Internet and Web 2.0 collaboration technologies. No longer do we live in isolation from vast amounts of data.
The Information and Communication Technologies revolution has given us convenient and easy access to information, mobile communications, and even the ability to contribute to this body of information. The need to extract information from these vast amounts of data is even more pressing for enterprises. Mining information from raw data is a vital but tedious process in today’s information-driven world. Enterprises today rely on a set of automated tools for knowledge discovery to gain business insight and intelligence. Many branches of knowledge discovery tools were developed to help businesses thrive in today’s competitive markets. The world’s electronic economy has also increased the pressure on enterprises to adapt to this new business environment. The main tools for extracting information from these vast amounts of data are automated mining tools, specifically data mining, text mining, and web mining.
Data Mining (DM) is defined as the process of analysing large databases, usually data warehouses or the Internet, to discover new information, hidden patterns, and behaviours. It examines these databases in multiple dimensions and from multiple angles, producing a summary of the general trends found in the dataset, along with relationships and models that fit the dataset. DM is a relatively new interdisciplinary field involving computer science, statistical modelling, artificial intelligence, information science, and machine learning. One of the main uses of DM is business intelligence and risk management. Enterprises must make business-critical decisions based on large datasets stored in their databases, so DM directly affects decision-making. DM is relied on in the retail, telecommunication, investment, insurance, education, and healthcare industries because they are data-driven. Other uses of DM include biological research, such as DNA analysis and the Human Genome Project, and geospatial and weather research, where raw data is analysed to study geological phenomena.
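To make the idea of discovering hidden relationships in a dataset more concrete, the following is a minimal Python sketch that counts how often pairs of items appear together in retail transactions. It is only an illustration, not a method described in this survey: the function name `pair_support`, the toy transactions, and the 0.5 threshold are all assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return, for each item pair, the fraction of transactions containing it."""
    counts = Counter()
    for basket in transactions:
        # De-duplicate and sort so each unordered pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical toy data standing in for records from a data warehouse.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]

support = pair_support(transactions)
# Keep only pairs that occur in at least half of the transactions.
frequent = {pair: s for pair, s in support.items() if s >= 0.5}
print(frequent)
```

In this toy dataset only the bread and milk pair passes the threshold. Practical DM tools apply far more scalable algorithms to the same basic question.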
A related field is Text Mining (TM), which deals with textual data rather than records. TM is defined as the automatic discovery of hidden patterns, traits, or unknown information from textual data. Aside from multimedia, textual data makes up a huge share of the data found on the World Wide Web (WWW). TM is related to DM but differs in the techniques and methodologies it uses.
TM is also an interdisciplinary field encompassing computational linguistics, statistics, and machine learning. TM uses complex Natural Language Processing (NLP) techniques. It involves a training period in which the TM tool learns to recognise patterns and hidden relations. Mining text documents involves linguistic and semantic analysis of the plain text, which gives the text structure. The tool then relates and induces hidden traits found in the text, such as word frequencies, extracted entities, and document summaries. Aside from business applications, TM is used for scientific research, specifically medical and biological research.
TM is very useful for finding and matching protein names and acronyms, and for finding hidden relations between millions of documents.
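Word frequency is the simplest of the text traits mentioned above, and it can be sketched in a few lines of Python. This is a hedged illustration only: the tokenizer is a plain regular expression rather than a real NLP pipeline, and the stop-word list, the function name `term_frequencies`, and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real TM tools use much larger ones.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def term_frequencies(text, top_n=2):
    """Return the top_n most frequent non-stop-word terms in the text."""
    # Lowercase the text and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)


sample = (
    "The protein binds the receptor. The receptor activates a second "
    "protein, and the protein signals the cell."
)
top_terms = term_frequencies(sample)
print(top_terms)
```

Entity extraction and summarization build on the same first step of turning plain text into countable units, but they need linguistic models that are beyond this sketch.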
The other mining technique is Web Mining (WM). WM is defined as the automatic crawling and extraction of relevant information from the artefacts, activities, and hidden patterns found on the WWW. WM is used for tracking customers’ online behaviour, most importantly through cookie tracking and hyperlink correlation. Unlike search engines, which send agents to crawl the web searching for keywords, WM agents are far more intelligent. WM works by sending intelligent agents to certain targets, such as competitors’ sites. These agents collect information from the host web server and gather as much information as possible from analysing the web page itself. They mainly look for hyperlinks, cookies, and traffic patterns. Using this collected knowledge, enterprises can establish better customer relationships and offers, and target potential buyers with exclusive deals.
The WWW is very dynamic, and web crawling is a repetitive process in which continuous iteration achieves effective results. WM is used for business and stochastic purposes, and for criminal and juridical purposes, mainly in network forensics.
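One step a WM agent performs when analysing a web page, collecting its hyperlinks, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the class `LinkCollector`, the function `extract_links`, the sample page, and the example.com URLs are hypothetical, and the page is an in-memory string, so no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip anchors without an href (for example, named anchors).
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content; a real agent would fetch this from a server.
page = (
    '<html><body><a href="/offers">Offers</a> '
    '<a href="https://example.org/about">About</a>'
    '<a name="top">anchor</a></body></html>'
)
links = extract_links(page, "https://example.com/index.html")
print(links)
```

A crawler would repeat this step on each collected link, which is the iterative process described above. Cookie and traffic-pattern analysis would need server-side data that this sketch does not cover.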
In this survey paper, we shall look at the main mining technologies used through information systems for business applications to gain new levels of business intelligence. Furthermore, we shall look at how these techniques can help in achieving both business leadership and risk management by illustrating real enterprises’ own experience using mining techniques. In addition, we shall look at the main challenges facing data, web, and text mining today.
2. HISTORY AND BACKGROUND Many developments led to the mining technologies we have today. These developments date back to the early days of mathematical models and statistical analysis using regression and Bayesian methods in the mid-1700s. With the advent of commercial electronic computers after World War II, large data sets were stored on magnetic tapes to automate the work. In the 1960s, data stored in computers helped analysts answer simple predictive questions. With the development of programming languages, specifically the COmmon Business Oriented Language (COBOL), and Relational Database Management Systems (RDBMS), querying databases became possible, meaning that more complex information and knowledge could be extracted.
The development of advanced object-oriented languages such as C++ and Java, multi-dimensional databases, data warehousing, and Online Analytical Processing (OLAP) made way for an automated, algorithmic way of extracting patterns and knowledge from such large data sets. DM tools today are more advanced and provide more than reporting capabilities; they can discover hidden patterns and knowledge. These DM tools were developed in the 1990s.
After the Internet and WWW revolution in the early 1990s, much research and development went into automating the search and exploration of the net, especially of the text found at URLs.
Developments in NLP, neural networks, and text processing ultimately led to the development of search engines. The need for better search algorithms led to textual exploration of web pages.
These developments greatly enhanced search engines and opened the door for text mining to be applied in several other applications. Search engine technologies were centred on agents that could map the vast WWW and correlate keywords with other similar keywords. These developments led to more intelligent agents that search the WWW not only for keywords but also for site visitors’ patterns. Ultimately, the developments in both DM and TM led to the notion of WM, where the WWW is used as a source of new knowledge hidden away somewhere. WM agents are small standalone programs that crawl the WWW, acquiring logging data, cookies, and site visit behaviour found on the servers and other machines attached to the WWW.
The tremendous advancements made in mining technologies have shifted thought from data collection to knowledge discovery and collection. With today’s powerful and relatively inexpensive hardware and network infrastructure, matched with advanced software for mining, enterprises are adopting mining technologies as essential business processes. In addition, the Internet has an integral role: as networks and communications are ubiquitous today, mining is carried out across the world through networked databases. This vast amount of knowledge is consumed not only at the senior management level but at all other levels of an enterprise as well.
Today mining software utilizes complex algorithms for searching, pattern recognition, and forecasting complex stock market changes. IBM and Microsoft are racing to produce the best DM software to date; this race is also influenced by security and intelligence agencies such as the FBI and CIA. Multilingual and semantic TM is a hot new research topic. Modern as it is, WM has also become an increasingly adopted business process. WM is better suited to e-commerce than DM and TM, because the nature of e-commerce invites direct exploitation of customers’ online behaviour. Many analysts, such as Gartner Group, predict that e-commerce will be worth over 5 billion dollars of business in the coming years. WM is heavily used for e-education and e-business, as the WWW is their main platform. Given the huge developments in hardware support for mining techniques in the 1990s and the further leaps achieved by modern software, mining techniques are now a necessity, not merely a commonplace, for modern business.
A relatively new and emerging set of mining techniques is known collectively as Reality Mining. Reality mining is the collection of the transactions individuals make daily in order to understand how they live and react. It aims to develop our understanding of modern societies, economies, and politics. This technology is made possible by the ICT world we live in today. Reality mining, which is very controversial because it intrudes on individuals’ privacy, is catching the attention of governments and corporations, as it can be used for potential business benefits. Reality mining mines what are known as reality traces, which include all patterns of human life in digital form. Traces include banking transactions, travel tickets, mobile telecommunications calls, blogs, and every other possible digital transaction.
The aim of this emerging technology is to better understand societies as well as individuals and to develop solutions aimed at them. The main problem facing this new mining technology is privacy concerns from individuals and governments, as data spread across the Internet is not really owned by any legislative body.
3. RELATED WORK Much work has been done in surveying business applications of the aforementioned mining techniques. However, most of this work considers each mining technique separately from the others. One earlier survey provided an overview of Knowledge Discovery in Databases (KDD) approaches and classified them according to software characteristics. Other authors demonstrated how modern technologies shifted the process of decision-making from manual data analysis using modelling and stochastic methods to an automated, computer-driven process.