History of Text Data mining

Labor-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have let the field progress swiftly during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%) is currently stored as text, text mining is believed to have high potential commercial value.
Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.

Swanson - Father of modern text mining

Swanson's system remains far from fully automated and is highly specific to the medical domain, and to my knowledge Swanson has never referred to it as text mining. Still, it meets the criteria at least partially, and Swanson has been recognized as an early pioneer by self-described text mining practitioners Marti Hearst (1999) and Ronald Kostoff (1999).

Text Mining history

  • H. P. Luhn (1958), in a seminal paper on automatic abstracting, noted "the resolving power of significant words" in primary text. Lauren B. Doyle (1961) also captured the spirit of text mining and related methods when he said that "natural characterization and organization of information can come from analysis of frequencies and distributions of words in libraries" ("libraries" representing what we would now more generally call collections or corpora). Text mining per se may be new, but the dream of training a computer to extract information from "mountains" of textual data is nearly as old as IR itself.
  • Don R. Swanson (1988) articulated the idea that the scientific literature should be regarded as a natural phenomenon worthy of "exploration, correlation, and synthesis." He contrasted scientists' attitudes toward information usage with those of intelligence analysts.
  • 'To the working scientist or engineer, time spent gathering information or writing reports is often regarded as a wasteful encroachment on time that would otherwise be spent producing results that he believes to be new' [Weinberg et al, 1963] …. The intelligence analyst, by contrast, is much more intimate with the available base of recorded information. New knowledge, or finished intelligence, is seen as emerging from large numbers of individually unimportant but carefully hoarded fragments that were not necessarily recognized as related to one another at the time they were acquired. Use of stored data is intensively interactive; "information retrieval" is an inadequate and even misleading metaphor. The analyst is continually interacting with units of stored data as though they were pieces selected from a thousand scrambled jigsaw puzzles. Relevant patterns, not relevant documents, are sought.
  • Swanson called upon scientists to be more like intelligence analysts; to "take seriously the idea that new knowledge is to be gained from the library as well as the laboratory [and] to develop attitudes toward information indistinguishable from attitudes toward research itself."
  • Not content to lecture scientists from a theoretical pedestal, by the time these words were published Swanson had already put the idea into practice by developing a system to discover meaningful new knowledge in the biomedical literature. The software, now called ARROWSMITH, helps by finding common keywords and phrases in "complementary and noninteractive" sets of articles or "literatures" and juxtaposing representative citations likely to reveal interesting co-occurrences. Two literatures are "complementary if together they can reveal useful information not apparent in the two sets considered separately": for example, one may reveal a natural relationship between A and B, and the other a relationship between B and C, so that together they suggest a relationship between A and C. The two literatures are "noninteractive" if their articles do not cross-cite and are not co-cited elsewhere in the literature. Swanson has discovered at least three biomedically important relationships using this system: between fish oil and Raynaud's syndrome, magnesium and migraines and epilepsy, and arginine and somatomedin C (Lindsay & Gordon, 1999). Most recently he has used it to identify several dozen viruses as potential bioweapons (Swanson, Smalheiser, & Bookstein, 2001).
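
The A-B-C idea described above can be illustrated with a small Python sketch. This is not ARROWSMITH's actual implementation. It only shows the core step of finding candidate "B" terms shared by two literatures. The function names, the stopword list, and the four titles are invented for illustration (they are not real citations), and the check that the two literatures are noninteractive (no cross-citation or co-citation) is assumed to have been done beforehand.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "with", "by"}


def terms(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def doc_freq(titles):
    """Count, for each term, how many titles contain it."""
    counts = Counter()
    for title in titles:
        counts.update(set(terms(title)))
    return counts


def shared_terms(literature_a, literature_c):
    """Return candidate 'B' terms that occur in both literatures.

    Terms are ranked by the smaller of their two document frequencies
    (highest first), with ties broken alphabetically.
    """
    df_a = doc_freq(literature_a)
    df_c = doc_freq(literature_c)
    common = set(df_a) & set(df_c)
    return sorted(common, key=lambda w: (-min(df_a[w], df_c[w]), w))


# Invented toy titles standing in for two literatures (not real citations).
literature_a = [
    "Fish oil lowers blood viscosity",
    "Fish oil inhibits platelet aggregation",
]
literature_c = [
    "Raynaud syndrome and high blood viscosity",
    "Platelet aggregation in Raynaud patients",
]

candidates = shared_terms(literature_a, literature_c)
print(candidates)
```

On this toy input the shared terms are the words linking the two sets of titles; a human reader would still have to judge whether any of them points to a meaningful A-to-C relationship, which matches the observation that Swanson's approach is far from fully automated.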
 