...
.....

DARPA TIPSTER Activities (TREC and MUC)
Comparison of TREC to Our Approach
Review of Prevailing IR Techniques

I. Relevance Feedback and Probabilistic Models in Information Retrieval
II. Knowledge-Based Systems for IR
III. Learning Systems for IR

Neural Networks
Symbolic Learning

Scalable semantics represents both a conceptual model for semantic interoperability and a concrete path towards practical algorithms. The model is based upon the levels of abstraction needed to represent knowledge, while the algorithms are based upon statistical correlations on the items at each level, which compute semantic relationships. The services provided by these algorithms thus generate indexes useful for semantic retrieval.

Our preliminary results from working with concept spaces have shown a way towards scalable semantics. That is, unlike the domain-specific heuristic rules used in promising message understanding systems (e.g., MUC projects in TIPSTER), the technology for concept spaces is domain-independent. Consistent with new research results generated by successful information retrieval and filtering systems (e.g., INQUERY system in TREC of TIPSTER), query expansions using concept spaces were found to significantly improve search performance. Moreover, concept spaces were able to help find other semantics-rich textual and image "concepts" automatically instead of simply other relevant syntactic phrases (e.g., PhraseFinder in INQUERY).

The proposed concept space and category map research relies on fast computation of traditional statistical text analysis, such as co-occurrence matrices, combined with categorization techniques for concept comparisons using recently matured mathematical AI pattern analysis (e.g., Hopfield nets and Kohonen maps). The level of semantics is thus more shallow than parsing the sentences for meaning extraction, as in computational linguistics, but the algorithms are fully automatic without domain heuristics.

We believe that concepts are the "right" generic level for semantic protocols for the foreseeable future state of technology. Concepts provide some degree of semantics and automatic indexing and noun phrase extraction appear computationally feasible for large collections of diverse materials. Concept spaces actually provide semi-automatic categorization. They are useful for interactive retrieval, with suggestion by machine but selection by human. All our experience thus far indicates that augmentation not automation of human performance is what is technologically feasible for semantic interoperability in digital libraries.

Artificial intelligence provides too high a level for semantics. Semantic analysis with natural language processing does provide deeper semantics but has proven to be fragile. Such techniques depend on specific heuristics and lexicons which are not transferable to other domains, although they are effective for semantics within a particular domain. The heuristics-based message parsing technology from the Tipster projects, for example, was effective for newspapers but has not proven applicable to general purpose subject domains. It is scalable in number of objects, but not in number of domains. Thus, information analysis across multiple subject domains requires a different technology.

Information retrieval provides too low a level for semantics. It concentrates on being fast and dumb and general rather than slow and smart and specific. Traditional retrieval indexes words from text documents and supports phrase matching as the fundamental operation. This is the technology used for the current search systems within the Web, such as the commercial spin-offs from the DARPA CSTR projects like Lycos and Yahoo. While they scale to large collections, they cannot provide deeper semantics since the base objects are text documents (e.g., HTML with images). To support correlation in the world of a billion repositories, indexing must take place at a deeper level to better represent the structure and the semantics of the objects.

Concept spaces and category maps are generated using statistical information retrieval techniques, but are correlated using artificial intelligence pattern analysis techniques. They are pitched at an abstract concept rather than a concrete object level. The trick is to show that the manipulation techniques remain effective across objects and across domains as the scale increases. We believe that we have identified a happy medium of practical scalable semantics and that embedding concept spaces and category maps within a complete analysis environment will provide new fundamental infrastructure for network information systems.

DARPA TIPSTER Activities (TREC and MUC)

To hedge our bets, however, for a complete analysis environment for the Interspace, we present a survey below of other approaches to semantic retrieval, which we will carefully track in case some scalable technology other than concept spaces and category maps arise.

In particular, we are collaborating with an algorithms project on the DARPA TIPSTER project, with PI Paul Kantor of Rutgers. He is looking at deeper mathematical approaches to vocabulary switching and data fusion, among other linguistic investigation. He will be our bridge into any generic techniques that arise from the TIPSTER world that are suitable for scalable analysis across types and subjects. Kantor will cross-reference our collaboration in his DARPA project proposals to TIPSTER.

The DARPA-funded TIPSTER projects consist of three main areas of research: TREC (text retrieval and routing), MUC (message understanding), and MET (multilingual task). TREC and MUC, in particular, are most relevant to our proposed research.

Many advanced systems presented in the recent TREC-3 and TREC-4 Conferences help reveal the state of the art in ``intelligent'' information retrieval and filtering. These systems present a good sampling of the various techniques discussed above and also inherit the problems associated with their underlying techniques. A brief comparison of representative systems with our proposed concept space and category map techniques is provided below.

Recent TREC-3 and TREC-4 results have shown that automatic or manual expansion of search topics using either automatic thesaurus or a selection of terms from the top retrieved documents is effective for searching and routing tasks. The use of passage retrieval, subdocuments, and local weighting also brings consistent performance improvements.

Croft's INQUERY system (and the related InRoute system) [Broglio, Callan, Croft and Nechbar 1995] has been consistently ranked high in TREC evaluations. INQUERY performs sophisticated query processing and combine multiple sources of evidence for query expansion. INQUERY uses the JTAG tagger to identify parts of speech for words in the query (2-word phrases only). A PhraseFinder technique was then adopted for corpus-based query expansion. Phrase are added from a PhraseFinder database to the original query. Based on tf*idf and default probability, INQUERY adjusts its weights for queries.

Similar to INQUERY, the Cornell's SMART system [Buckley, Saltonl, Allan, and Singhal 1995], which is based on Salton's vector space model, adopts various statistical weighting algorithms and performs massive relevance feedback (300+ terms) to expand user queries. Automatic relevance feedback is typically based on simple addition and removal of keywords in user queries. No concept inferencing or induction is performed. Kwok's PRICS [Kwok, Grunfeld, and Lewis 1995] is also based on probabilistic retrieval and suffers problems similar to those of INQUERY and SMART.

In the Message Understanding Workshop (MUC) and Multilingual Entity Tasks (MET), lexicons, part-of-speech-taggers (POSTs), and linguistic parsing rules have been developed to extract noun phrase entities (e.g., names of organizations, persons, locations) and generic objects and slots in unstructured text -- an approach adopted by such successful systems developed at SRA, SRI, and BBN. The state of the art in scalable, domain-independent text parsing seems to point conclusively to use of NLP phrase extraction.

GE/NYU's natural language retrieval system [Strzalkowski, Carballo and Marinescu 1995] and Jacobs' (GE) SCISOR system [Jacobs and Rau 1993], are based on natural language processing, and, as in other NLP-based systems, their performances are often sub-par for general-domain (unrestricted) applications and the hand-crafted lexicons or parsing rules are fragile for large-scale general-purpose information access.

Comparison of TREC to Our Approach

The validity and usefulness of query expansion are confirmed in our digital library research. In particular, our proposed concept space techniques are similar to the INQUERY PhraseFinder database in principle. However, several significant improvements have been made in our research.

Firstly, our concept space techniques do not need to have any a priori default probabilities (which are difficult to obtain) and thus are more scalable and efficient.

Secondly, several domain-independent noun phrase extraction techniques have been developed in our research, which include phrase extraction automatic indexing techniques (up to 5-word phrases, unlike the two-word phrases in INQUERY) and natural language-based linguistic noun phrase extraction techniques (any numbers of words for syntactically-correct phrases). Currently, we have experimented with the MIT Media Lab Chopper (based on Princeton's WordNet), the Brill parser (University of Pennsylvania), and the NPTool parser (Helsinki University).

Thirdly, none of the TREC or MUC systems are designed to perform categorization or classification of topics like our proposed category map algorithms. Categorization, however, is an analysis process critical to vocabulary switching and semantic mapping.

Fourthly, the proposed concept space and category map techniques have been optimized for efficient large-scale computations (based on our earlier work in high-performance computing). In addition, several versions of the proposed techniques will be available for batch processing, incremental update, and real-time on-the-fly term suggestions.

Lastly, unlike TREC or MUC research which was designed for textual applications only, the proposed research is designed for a general and scalable analysis environment for both textual and image analysis. Several texture and region based image indexing techniques will be integrated with the concept space and category map algorithms.

In summary, we believe our proposed research is more scalable (in domains, media types, techniques, and sizes) than previous TIPSTER research, and yet it extracts semantics at a level suitable for large-scale algorithmic computation. Such a "scalable semantics" approach is unique and will be able to help create a powerful concept-based analysis environment.

Review of Prevailing IR Techniques

I. Relevance Feedback and Probabilistic Models in Information Retrieval

In classical information retrieval models, relevance feedback, document space modification, probabilistic techniques, and Bayesian inference networks are among the techniques most relevant to our research.

One method for automatically generating improved query formulations is the well-known relevance-feedback process [Salton 1989] [Rocchio 1971] [Ide 1971] [Ide and Salton 1971]. A query can be improved iteratively by taking an available query vector (of terms) and adding terms from the relevant documents, while subtracting terms from the irrelevant documents. A single iteration of relevance feedback usually produces improvements of from 40 to 60 percent in search precision [Salton 1989]. A similar approach can also be used to alter the document representation. Document-vector modification changes and improves document indexes based on the user relevance feedback of relevant and irrelevant documents [Brauen 1971].

In probabilistic information retrieval, the goal is to estimate the probability of relevance of a given document to a user with respect to a given query. Probabilistic assumptions about the distribution of elements in the representations within relevant and irrelevant documents are required. Using relevance feedback from a few documents, the model can be applied in order to estimate the probability of relevance for the remaining documents in a collection [Fuhr and Buckley 1991] [Fuhr and Pfeifer 1994] [Gordon 1988]. In order to simplify computation, an assumption is usually made that terms are distributed independently [Maron and Kuhns 1960].

The use of Bayesian classification and inference networks for information retrieval and indexing represents an extension of the probabilistic models [Maron and Kuhns 1960] [Turtle and Croft 1990]. The basic inference network consists of a document network and a query network [Turtle and Croft 1990] [Turtle and Croft 1991] [Tzeras and Hartmann 1993] that is intended to capture all of the significant probabilistic dependencies among the variables represented by nodes in the document and query networks. In [Turtle and Croft 1991], Turtle and Croft showed that given equivalent document representations and query forms, the inference network model performed better than conventional probabilistic models.

Although relevance feedback and probabilistic models exhibit interesting query or document refinement capabilities, their abstraction processes are based on either simple addition/removal of terms or probabilistic assumptions and principles. Their learning behaviors are somewhat limited in comparison with those developed in symbolic machine learning and neural networks. In addition, many probabilistic assumptions and constraints (e.g., independence, sampling) many not hold in real-life IR sessions.

II. Knowledge-Based Systems for IR

Creating computer systems with knowledge or ``intelligence'' has long been the goal of researchers in artificial intelligence. Many interesting knowledge-based systems have been developed in the past few decades for such applications as medical diagnosis, engineering troubleshooting, and business decision making [Hayes-Roth and Jocobstein 1994]. Most of these systems have been developed based on the manual knowledge acquisition process, a significant bottleneck for knowledge-based systems development.

CoalSORT [Monarch and Carbonell 1987], a knowledge-based system, facilitates the use of bibliographic databases in coal technology. A semantic network, representing an expert's domain knowledge, embodies the system's intelligence. PLEXUS [Vickery and Brooks 1987], developed by Vickery and Brooks, is an expert system that helps users find information about gardening. Natural language queries are accepted. The system has a knowledge base of search strategies and term classifications similar to a thesaurus. GRANT [Cohen and Kjeldsen 1987], developed by Cohen and Kjeldsen, is an expert system for finding sources of funding for given research proposals. Its search method - constrained spreading activation in a semantic network - makes inferences about the goals of the user and thus finds information that the user has not explicitly requested but that is likely to be useful. Fox's CODER system [Fox 1987]consists of a thesaurus that was generated from the Handbook of Artificial Intelligence and Collin's Dictionary.

The ``Intelligent Intermediary for Information Retrieval'' (I3R), developed by Croft [Croft and Thompson 1987], consists of a group of ``experts'' that communicate via a common data structure, called the blackboard. The system consists of a user model builder, a query model builder, a thesaurus expert, a search expert (for suggesting statistics-based search strategies), a browser expert, and an explainer. Chen and Dhar's METACAT [Chen and Dhar 1991]incorporates several human search strategies and a portion of the Library of Congress Subject Headings (LCSH) for bibliographic search. The system also includes a branch-and-bound algorithm for an automatic thesaurus (LCSH) consultation process.

Despite successes in numerous domains, the development process for knowledge-based systems is often slow and painstaking. Knowledge engineers or system designers need to be able to identify subject and classification knowledge from some sources (usually some domain experts) and to represent the knowledge in computer systems. The inference engines of such systems, which mainly emulate human problem-solving strategies and cognitive processes [Chen and Dhar 1991], may not be applicable across different applications.

III. Learning Systems for IR

Our proposed research of concept protocols is also relevant to learning systems, such as neural networks and symbolic learning. However, we also plan to incorporate existing knowledge structures (e.g., Inspec thesaurus) created by domain experts.

Neural Networks

Neural networks computing seems to fit well with conventional retrieval models such as the vector space model [Salton 1989] and the probabilistic model [Maron and Kuhns 1960]. The work of Belew is probably the earliest connectionist model adopted in IR. In AIR Belew 1989], he developed a three-layer neural network of authors, index terms, and documents. The system used relevance feedback from its users to change its representation of authors, index terms, and documents over time. The result was a representation of the consensual meaning of keywords and documents shared by some group of users. Kwok [Kwok 1989] also developed a similar three-layer network of queries, index terms, and documents. A modified Hebbian learning rule was used to reformulate probabilistic information retrieval.

While the above systems represent information retrieval applications in terms of their main components of documents, queries, index terms, authors, etc. other researchers used different neural networks for more specific tasks. [Lin, Soergel and Marchionini 1991] adopted a Kohonen network for information retrieval. Kohonen's feature map, which produced a two-dimensional grid representation for N-dimensional features, was applied to construct a self-organizing (unsupervised learning), visual representation of the semantic relationships between input documents. In [MacLeod and Robertson 1991], a neural algorithm developed by MacLeod was used for document clustering. The algorithm compared favorably with conventional clustering algorithms.

Chen et al [Chen and Lynch 1992] [Chen, Lynch, Basu and Ng 1993] [Chen and Ng 1995] reported a series of experiments and system developments which generated an automatically-created weighted network of keywords from large textual databases and integrated it with several existing man-made thesauri (e.g., LCSH). Instead of using a three-layer design, Chen's systems developed a single-layer, interconnected, weighted/labelled network of keywords (concepts) for ``concept-based'' information retrieval. A blackboard-based design which supported browsing and automatic concept exploration using the Hopfield neural network's parallel relaxation method was adopted to facilitate the usage of several thesauri [Chen, Lynch, Basu and Ng 1993]. In [Chen and Ng 1995] the performance of a branch-and-bound serial search algorithm was compared with that of the parallel Hopfield network activation in a hybrid neural-semantic network (one neural network and two semantic networks). Both methods achieved similar performance, but the Hopfield activation method appeared to activate concepts from different networks more evenly.

Symbolic Learning

Despite the popularity of using neural networks for information retrieval, we see only limited use of symbolic learning techniques for IR. In [Blosseville, Hebrail, Monteil and Penot 1992], the researchers used discriminant analysis and a simple symbolic learning technique for automatic text classification. Their symbolic learning process represented the numeric classification results in terms of IF-THEN rules. Text classification involves the task of classifying documents with respect to a set of two or more predefined classes [Lewis 1992]. A number of systems were built based on human categorization rules (a knowledge-based system approach) [Rau and Jacobs 1991]. However, a range of statistical techniques including probabilistic models, factor analysis, regression, and nearest neighbor methods have been adopted [Lewis 1992] [Massand, Gordon and Waltz 1992] [Blosseville, Hebrail, Monteil and Penot 1992]. Fuhr et al.[ Fuhr et. al. 1991] adopted regressions methods and ID3 for their feature-based automatic indexing technique. Crawford, Fung, and their coworkers [Fung and Crawford 1990] [Crawford, Fung and Appelbaum 1991] [Crawford and Fung 1992] have developed a probabilistic induction technique called CONSTRUCTOR and have compared it with the popular CART algorithm [Breiman, Friedman, Olshen, and Stone 1984]. Their experiment showed that CONSTRUCTOR's output is more interpretable than that produced by CART, but CART can be applied to more situations (e.g., real-valued training sets).

In (Chen and She 1994), Chen and She adopted ID3 and the incremental ID5R algorithm for information retrieval. Both algorithms were able to use user-supplied samples of desired documents to construct decision trees of important keywords which could represent users' queries.

Despite of the diversity in techniques, many recent systems are hybrid in nature, adopting different techniques for different purposes, e.g., a natural language front-end for query processing, a human-generated thesaurus for query expansion, and a machine learning component for inductive relevance feedback. 

 

 

 

INTERSPACE PROPOSAL MAIN PAGE

1. INNOVATIVE CLAIMS

2. DELIVERABLES

3. STATEMENT OF WORK

4. DESCRIPTION OF RESULTS,
PRODUCTS, TRANSFERABLE TECHNOLOGY,
AND TECHNOLOGY TRANSFER PATH

5. COLLABORATIONS

6. SCHEDULE AND MILESTONES

7. TECHNICAL RATIONALE, APPROACH, AND PLAN

8. COMPARISONS TO OTHER RESEARCH

9. KEY PERSONNEL

10. PREVIOUS ACCOMPLISHMENTS

11. BIBLIOGRAPHY

 

    ........