Data Mining And Document Mining

So a company has a data and/or document warehouse. Now, what tools are needed to access the information that resides in each?

Data mining is the process of extracting previously unknown information from large databases or data warehouses and using it to make crucial business decisions. Data mining tools find patterns in the data and infer rules from them. The extracted information can be used to form a prediction or classification model, identify relations between database records, or provide a summary of the database(s) being mined. Those patterns and rules can be used to guide decision-making and forecast the effect of those decisions, and data mining can speed analysis by focusing attention on the most important variables.

Data mining is taking off for several reasons: organizations are gathering more data about their businesses, storage costs have dropped enormously, competitive business pressures are mounting, companies want to leverage existing information technology investments, and the cost/performance ratio of computer systems has fallen dramatically. Another reason is the rise of data warehousing. In the past, it was often necessary to gather the data, cleanse it, and merge it before any mining could begin. Now, in many cases, the data is already sitting in a data warehouse ready to be used.

There are four basic mining operations, each supported by numerous mining techniques: predictive model creation, supported by supervised induction techniques; link analysis, supported by association discovery and sequence discovery techniques; database segmentation, supported by clustering techniques; and deviation detection, supported by statistical techniques. To these techniques one has to add various forms of visualization. Even though visualization does not automatically extract information, it helps the user identify patterns hidden in data and better comprehend the information extracted by the other techniques. The common types of information that can be derived from data mining operations are associations, sequences, classifications, clusters, and forecasting.
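As a rough illustration of deviation detection supported by a statistical technique, the sketch below flags values that lie far from the mean of a series. The function name, the threshold of two standard deviations, and the sample figures are assumptions chosen for illustration; real tools use a range of more robust methods.

```python
from statistics import mean, stdev

def find_deviations(values, threshold=2.0):
    """Return (position, value) pairs lying more than `threshold`
    sample standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # every value is identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Hypothetical daily claim totals with one unusual day.
daily_totals = [100, 102, 98, 101, 99, 100, 250]
outliers = find_deviations(daily_totals)
```

A single extreme value inflates both the mean and the standard deviation, which is one reason a simple threshold like this is only a starting point.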

Associations happen when occurrences are linked in a single event. In sequences, events are linked over time. Classification is probably the most common data mining activity today. It recognizes patterns that describe the group to which an item belongs. It does this by examining existing items that already have been classified and inferring a set of rules from them. Clustering is related to classification, but differs in that no groups have yet been defined. Using clustering, the data mining tool discovers different groupings within the data. All of these applications may involve predictions. The fifth application type, forecasting, is a different form of prediction. It estimates the future value of continuous variables based on patterns within the data.

A number of tools are used in data mining. These include, but are not limited to, neural networks, decision trees, rule induction, factor analysis, genetic algorithms, and data visualization. Another tool is supervised induction. Other tools are based on combinations of these methods.
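To make the idea of an association concrete, here is a minimal sketch that counts how often pairs of items occur in the same event and reports simple "if A then B" rules. The support and confidence measures are the conventional ones for association discovery, but the function name, the thresholds, and the sample baskets are assumptions made for illustration, not part of any particular tool.

```python
from collections import Counter
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all events containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # share of lhs events that also contain rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical purchase events.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
rules = association_rules(baskets)
strong_rules = association_rules(baskets, min_confidence=0.9)
```

Raising `min_confidence` narrows the output to the rules that hold most consistently, which is how such a tool focuses attention on the strongest links.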

Using the described data mining tools, an organization can access and analyze the 10 percent of its information that is structured. To access the rest, a different technique is required – document mining.

Shifting Paradigms
The transition from data mining of structured information to document mining of unstructured information is not trivial. Text, a collection of words meant to convey meaning to humans, is a much more complicated, subtle beast than the numbers, time series, and similar data with which data mining traditionally deals. Taking a tool built for mining numbers and applying it to text is not going to work, and whatever the goal, one has to be very humble about what can possibly be achieved. There are very fundamental limits to our understanding of how text conveys meaning to humans. We are even more limited in understanding how text could convey meaning to machines.

The Difficulty With Documents
Though documents contain the majority of corporate information, reaching that information is difficult. The problem of searching the semantics of numerical or coded data is quite different from the problem of searching the semantics of documents, which may be text, audio, video, etc. In contrast to data mining, document management indexes documents using some predetermined model for how those documents will be retrieved in the future, and that index is searched using key word queries. Document management responds to simple questions. The query identifies the source document in which the information is contained: this is query and retrieval. It is not the same as discovery and decision, which we associate with data mining.

For example, a database index can be queried and all documents with a specified key word retrieved. Additional specificity may be obtained by applying some form of Boolean logic (and, or, not) to document indices. Provided the key word was identified in the index, all documents with the key word or words will be retrieved. If, through lack of foresight or as a result of changing requirements, the key word is not in the index, it must either be entered (hugely labor intensive) or ignored (lost opportunity). At this level, documents identified through the key word index search need to be retrieved and reviewed to identify where the key word(s) occurred for cross-correlation to other documents. This is query and retrieval against the database, not against the warehouse of data.
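The key word index and Boolean query described above can be sketched roughly as follows. The function names, the sample documents, and the choice of indexed key words are all hypothetical; the point is only to show why a term left out of the predetermined index cannot be found.

```python
from collections import defaultdict

def build_index(documents, keywords):
    """Map each predetermined key word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        for keyword in keywords:
            if keyword in words:
                index[keyword].add(doc_id)
    return index

def query_and(index, *terms):
    """Documents containing every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def query_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))

def query_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

# Hypothetical documents; only two key words were chosen for indexing.
documents = {
    1: "accident report for red convertible",
    2: "claim form for convertible",
    3: "accident claim summary",
}
index = build_index(documents, keywords={"accident", "claim"})

both = query_and(index, "accident", "claim")
either = query_or(index, "accident", "claim")
no_claim = query_not(index, documents.keys(), "claim")
missing = query_or(index, "convertible")  # present in the text, absent from the index
```

The last query returns nothing even though two documents mention the word, which is the lost opportunity the paragraph describes.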

The next level of indenture employs data mining and pattern recognition to reveal the rules and patterns in a given database, predicting future cases on the basis of these rules. This means that somehow the documents must be reduced to traditional rows and columns, exactly what we are trying to avoid, because this mines the database and not the warehouse.

What are we left with? Perhaps the greatest ally to document mining is the Web. Here, Internet image and text search engines abound and vie for user attention. Most offer Boolean search software to call up documents of interest. More sophisticated post-Boolean search software will find synonyms, disambiguate terms, build an internal thesaurus dynamically, and interpret text at the discourse level. With the addition of new Web technologies, it is easy to imagine that the Internet will have a profound impact on our ability to respond to the growing need to extract information from documents. These technologies include metatags for HTML, push technology, intelligent agents, and multifunctional browsers that include all the helper applications for voice and video.

Getting The Most From Documents
In all of this we begin with a key word and are left with a document to read. Though becoming more intelligent all the time, this is still search and retrieval, not discovery and decision.

You can see the problem. Most of the advanced text-processing technology available today is aimed at the problem of information retrieval, finding a relatively small number of documents on a specific topic. As the volume of electronic text increases, the effectiveness of retrieval technology for browsing (looking for interesting information in a broader topic area) decreases.

Coming To A Document Near You
Tools will evolve that drill down into the documents at lower levels of specificity to organize information. This is the next level of indenture approaching document mining: browsing documents so that the relationships and relevance among several independent issues are exposed. This is different from search and retrieval because now we can perform hundreds of queries simultaneously against a document collection using various combinations of terms. Add tools such as pattern recognition, allowing synonyms and related words to be entered, and you have the makings of document mining.
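One way to picture running many queries at once is to generate every pairing of a set of terms, expand each term with its synonyms, and record which documents satisfy each pairing. The sketch below does exactly that and nothing more; the synonym table, documents, and function names are hypothetical, and a real tool would add ranking, stemming, and pattern recognition on top.

```python
from itertools import combinations

def expand(term, synonyms):
    """Return the term together with any synonyms supplied for it."""
    return synonyms.get(term, {term})

def batch_cooccurrence(documents, terms, synonyms):
    """Run one query per pair of terms and return the matching document ids."""
    tokenized = {doc_id: set(text.lower().split())
                 for doc_id, text in documents.items()}
    results = {}
    for a, b in combinations(sorted(terms), 2):
        a_words, b_words = expand(a, synonyms), expand(b, synonyms)
        results[(a, b)] = {doc_id for doc_id, words in tokenized.items()
                           if words & a_words and words & b_words}
    return results

# Hypothetical synonym table and document collection.
synonyms = {
    "car": {"car", "auto", "vehicle"},
    "accident": {"accident", "collision", "crash"},
}
documents = {
    1: "auto collision at traffic light",
    2: "car crash on friday",
    3: "vehicle inspection on friday",
}
matches = batch_cooccurrence(documents, ["car", "accident", "friday"], synonyms)
```

With three terms this runs three queries; with a few dozen terms the same loop runs hundreds, and the resulting table shows which issues tend to appear together.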

Here are a couple of quick illustrations. An insurance company puts all of its records, forms, accident reports, and demographic and statistical information into a warehouse. It finds that there is a higher per capita occurrence of minor accidents at traffic lights on Friday afternoons with red convertibles driven by women between 20 and 24 years of age in Tennessee than in California. The company can lower premiums in California and capture a larger market by putting the burden of cost in Tennessee. Too far-fetched? How about a healthcare provider that puts all patient records, technical bulletins, pharmaceutical information, and references into a warehouse and is able to correlate streptococcal infections in patients treated with pseudoephedrine hydrochloride versus those treated with diphenhydramine HCl at an early age to lymphatic tumors in female adults? The healthcare provider can alert the medical industry, reduce future costs, and avoid potential lawsuits. Last one: a retail chain finds a correlation between the sale of beer and diapers on Friday afternoons and adjusts its marketing and inventory to increase profits. The first two examples are pure fantasy; the last one is real.

The future of document mining will be determined by the availability and capability of its tools. Parallels between data mining and document mining can be drawn, but document mining is still in the conception phase, whereas data mining is a fairly mature technology. In the realm of documents, text mining is the most mature tool. Image, pattern recognition, audio, and video mining tools are not yet available or are in their infancy.

The basic tenet of data or document mining is to find answers to questions that you haven't thought to ask: to discover information within warehouses (data or document) that queries and reports can't effectively reveal, and to yield information regarding associations, sequences, classifications, clusters, and forecasting to support the decision-making process. The potential payoffs from data or document mining are enormous if you pick the right tools and use them effectively.

The future will witness an explosion of modern information and database systems containing everything from medical information to the arts. And while many companies have implemented sophisticated systems to manage record-based data, the transition to electronic document management, document warehouses, and document mining has been slower and more haphazard. With 80 to 95 percent of corporate information located in paper and electronic documents, there is a substantial productivity gain to be realized by organizations that are willing to shift their view of information management from "data-centric" to "document-centric." The question is: can an enterprise flourish, let alone survive, if it ignores over 80 percent of its corporate information?

Data Mining Is Not New
Data mining utilizes venerable analytical methods (induction, association, data visualization, and so on) for the rather pedestrian statistical goals of predictive modeling, database segmentation, link analysis, and deviation detection. In fact, the U.S. government has been using these methods in one form or another for census and military applications since at least WWII. Some data mining techniques (most notably neural networks) are really just recycled versions of what passed as artificial intelligence (AI) just 10 years ago. In contrast to AI's promises of yesteryear, today's data mining methods, when properly implemented, are yielding the kind of practical, actionable business knowledge that AI could not.

The Warehouse Alternative: Report Mining
Report mining is here and it works. The idea is simple – instead of mining data buried in a central database, users mine data buried in reports for online access to information. The new wave of data access technology known as report mining is a hybrid approach to delivering corporate information. It borrows some of the best features of traditional hard copy reports and data warehousing tools and combines them in one package.

Many organizations have recognized the need to provide online access to data held in their operational systems. Because these systems are transaction-oriented, they typically do not support complex queries and analyses. This problem has given rise to data warehousing technology, in which data is moved into a data warehouse or data mart that is specifically designed to facilitate online analytical processing.

While powerful, data warehouses are expensive and time-consuming to build and maintain. They require highly skilled professionals to install and operate, and can end up costing millions of dollars. Furthermore, operational data must be modeled, extracted, cleansed, and denormalized before it can even be brought into the warehouse. Report mining tools offer an alternative for online analytical processing. Instead of pulling data from a database or data warehouse, report mining tools use existing reports as a source for data. Report files that would ordinarily be sent to a printer are parsed, recognized, and transformed into live data that users can access and manipulate. The report file becomes a proxy for the database.
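The parsing step can be pictured with a small sketch: a print-image report is read line by line, detail lines are recognized by their shape, and headers, rules, and totals are skipped. The report layout, the regular expression, and the field names below are invented for illustration; commercial report mining tools define such templates in their own ways.

```python
import re
from collections import defaultdict

# Hypothetical print-image report, as it might be sent to a printer.
REPORT = """\
SALES REPORT                PAGE 1
REGION     PRODUCT      UNITS   REVENUE
---------- ---------- ------- ---------
EAST       WIDGET          12    240.00
EAST       GADGET           5    125.50
WEST       WIDGET           7    140.00
TOTAL                      24    505.50
"""

# A detail line is two words, an integer, and an amount with two decimals.
DETAIL = re.compile(
    r"^(?P<region>[A-Z]+)\s+(?P<product>[A-Z]+)\s+"
    r"(?P<units>\d+)\s+(?P<revenue>\d+\.\d{2})\s*$"
)

def parse_report(text):
    """Turn detail lines into records; skip headers, rules, and totals."""
    rows = []
    for line in text.splitlines():
        match = DETAIL.match(line)
        if match:
            rows.append({
                "region": match["region"],
                "product": match["product"],
                "units": int(match["units"]),
                "revenue": float(match["revenue"]),
            })
    return rows

rows = parse_report(REPORT)

# Once parsed, the data can be regrouped in ways the report never offered.
revenue_by_region = defaultdict(float)
for row in rows:
    revenue_by_region[row["region"]] += row["revenue"]
```

The regrouping at the end is the payoff: the user builds a new summary from an existing report without touching the production database.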

Advantages Of Report Mining:
  • Data is instantly available. Every report represents a ready-made database. No data restructuring is necessary.
  • Cross-platform compatibility. Because the computer industry has adopted standard conventions for sending characters to printers, report mining tools can read report files generated in virtually any computing environment. The tools don't care how, when, or where the reports were created.
  • Data security. Since report mining tools read report files and not the central database, production data stays secure and out-of-reach.
  • Existing reports are leveraged. The IS organization has put tremendous effort into building existing reports and reporting systems. Report mining tools leverage that investment.
  • Conservation of IS resources. This benefit takes several forms. Since report mining tools run on the desktop, host systems are not impacted. And fewer demands are placed on IS to produce custom reports because users can create new reports with data extracted from existing reports. Most importantly, report mining takes pressure off the IS organization to deploy warehouse technology before it is ready.