Ten years ago, it appeared that corporations were about to enter the golden age of information management. For decades, businesses had been gathering data about their customers, expecting that someday they would find a way to make sense of it.
Finally, it seemed, that day arrived in the mid-1990s when developers of databases grandly announced that they were on the verge of creating a solution that would gather all of a company's unstructured data into a central repository.
What happened? At most companies, very little. In fact, analysts calculate that 85 percent of corporate data is still unstructured.
But now, according to a Reuters report [1], there is reason for renewed optimism. IBM has developed new search technologies that will simplify the way people can scour the data inside the corporation for the information they need. Instead of using keywords, the new tools rely on facts and concepts. Most significantly, they actually analyze the information to discover subtle relationships, facts, and ideas that are hidden in the unstructured data.
Structured information is the data that is stored in databases, such as personnel files, shipping information, and so on. Most of the corporate world's data is unstructured, in the form of e-mail, memos, newspaper articles, reports, and anything else that is not entered into a database.
IBM's new technology is called Unstructured Information Management Architecture, or UIMA for short. With UIMA, the company's executives believe they have the market to themselves, at least for the moment. The head of search technology at IBM Research, Arthur Ciccolo, told Reuters [2], "I don't see any of the major players moving into this area," including Google, Microsoft, and Yahoo. Those companies' search engines allow users to search the Internet, not an individual company's data.
It took four years for IBM Research to develop UIMA, with support from the U.S. Defense Advanced Research Projects Agency (DARPA). Other organizations that were involved in the research include several leading universities, The Mayo Clinic, and three defense contractors: Science Applications International Corporation, BBN Technologies, and Mitre Corporation.
Another tool that IBM has recently introduced is WebSphere OmniFind. With this software, users can search for information stored in unstructured data in a variety of languages and formats. For example, a manager doing research for a report could search for data in the company's databases, e-mail archives, videos, pictures, and audio files.
Meanwhile, software tools developed at The University of Manchester's Institute of Science and Technology, or IST, are promising similar benefits. The new tools are called PARMENIDES.
According to an IST press release [3], the Greek Ministry of Defense used the tools to search its own intelligence files and published newspaper reports about terrorist attacks to create profiles of suspected terrorists. Previously, intelligence agents had to read every article; now the software combs the newspaper reports automatically, sifting out the useful data. By combining all of the data, analysts might discover that a terrorist group is changing its strategy from planting car bombs to staging suicide attacks.
Unilever uses PARMENIDES to gather and analyze data from newspaper articles and reports in scientific journals to create a comprehensive portrait of the connections between people's health, weight, and food.
The software tools work by making connections between words through the use of ontologies. An ontology is a list of all the words related to a specific subject, such as military intelligence, health care, or consumer complaints. With the help of ontologies, computers can recognize each word in its context to achieve a level of "understanding."
According to the Institute of Science and Technology, PARMENIDES uses one ontology to analyze unstructured text, another to analyze databases, and a third to unify the two data sets. While a newspaper might talk of a "terrorist" or "bomber," a military database might use the terms "hostile" or "enemy agent" or specific names of terrorists.
Each data type has its own ontology for the context, in this case terrorism. A third ontology harmonizes the two, and that enables PARMENIDES to create the framework.
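The three-ontology arrangement described above can be illustrated with a minimal Python sketch. It is not the PARMENIDES implementation, whose internals the sources do not describe; the dictionaries, the concept labels, and the `shared_concepts` function are all hypothetical, and simple phrase matching stands in for whatever text analysis the real tools perform.

```python
import re

# Hypothetical ontology for newspaper text: surface term -> source-local concept.
NEWS_ONTOLOGY = {
    "terrorist": "news:attacker",
    "bomber": "news:attacker",
    "car bomb": "news:vehicle_bombing",
    "suicide attack": "news:suicide_bombing",
}

# Hypothetical ontology for a military database, which uses different vocabulary.
DATABASE_ONTOLOGY = {
    "hostile": "db:threat_actor",
    "enemy agent": "db:threat_actor",
}

# Hypothetical third ontology: maps each source-local concept to a shared one,
# so that findings from both sources can be compared.
UNIFYING_ONTOLOGY = {
    "news:attacker": "THREAT_ACTOR",
    "db:threat_actor": "THREAT_ACTOR",
    "news:vehicle_bombing": "ATTACK_METHOD",
    "news:suicide_bombing": "ATTACK_METHOD",
}


def shared_concepts(text, source_ontology):
    """Return the shared concepts whose source-specific terms occur in the text."""
    lowered = text.lower()
    found = set()
    for term, local_concept in source_ontology.items():
        # Whole-word (or whole-phrase) match, case-insensitive via lowercasing.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(UNIFYING_ONTOLOGY[local_concept])
    return found


article = "Police say the bomber planted a car bomb near the market."
record = "Subject classified as HOSTILE; possible enemy agent."

news_concepts = shared_concepts(article, NEWS_ONTOLOGY)
db_concepts = shared_concepts(record, DATABASE_ONTOLOGY)

# Concepts that both sources mention, despite sharing no vocabulary.
common = news_concepts & db_concepts
print(sorted(common))
```

In this sketch the article and the database record have no words in common, yet both resolve to the same shared concept, which is the kind of harmonization the third ontology is said to provide.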
Based on this trend, we foresee three compelling developments:
First, expect UIMA to become the standard technology for corporate data retrieval. IBM will dominate this market for reasons that go beyond the first-mover advantage. Instead of trying to protect its proprietary technology, the company's strategy is to make UIMA open to other software firms, literally giving it away as a free download. As a result, it is likely to be adopted by a wide variety of software firms that are developing programs that allow business users to search for data, analyze text, and manage knowledge. More than a dozen companies are already using UIMA, among them ClearForest, Factiva, SAS, and Schemalogic.
Second, even the early applications for the new software tools offer the potential to produce extraordinary results. Consider the challenge of catching quality control problems before they become widespread, leading to high-cost recalls and damage to a company's brand image. Businesses can now use a bundle of software from IBM, ClearForest, Attensity, iPhrase, and Kana to search the Internet for consumer complaints about their products and to find data within the company that can be used to fix the problem. Another early application of the tools is to analyze unstructured data and develop business intelligence. For example, BioVista, a consulting firm for the biotech industry, uses PARMENIDES to search data in the public domain, such as help-wanted ads and press releases of various companies, to figure out the changes in the companies' research priorities and to predict which new drugs each firm is trying to develop.
Third, ultimately, tools like UIMA and PARMENIDES will give way to even more powerful and accurate search engines for sifting through unstructured data. With the aid of far more sophisticated ontologies, the next generation of tools will enable users to enter a search term in a simple interface that will yield results from databases within the corporation, from the archives of the world's libraries, from the entire body of published literature, and from every movie, song, painting, and photograph ever reproduced. In addition, highly advanced filtering tools will reduce the millions of potential hits to the handful that are most relevant to the user, and analytical tools will automate much of the work that humans currently do in making connections and predictions based on the raw data.
References List:
1. Reuters, August 8, 2005, "Search Concepts, Not Keywords, IBM Tells Business," by Eric Auchard. © Copyright 2005 by Reuters. All rights reserved.
2. Ibid.
3. For information about PARMENIDES, visit the Information Society Technologies website at: istresults.cordis.lu