...
.....

The purpose of this proposal is to build and evaluate a complete prototype of an analysis environment based on scalable semantics. First, algorithms for scalable semantics will be developed, which together will form a suite of components for semantic interoperability. Each algorithm will be evaluated on a large collection well suited to its particular functionality. Second, an analysis environment will be developed, which integrates the semantic services provided by the algorithms into a coherent framework. This environment will enable all the different services and indexes to be used together to support searching within and across repositories. Finally, a complete prototype will be constructed, using the environment to manipulate the collections. This Interspace prototype will support interactive correlation across object types and subject domains, thus enabling experiments with information infrastructure for routine analysis.

Our suite of components reproduces automatically, for any collection, equivalents to all of the standard library indexes. Some of these indexes represent abstract spaces for concepts and categories above concrete collections of objects and units. A "concept space" records the co-occurrence between units within objects, such as words within documents or textures within images. Much like a subject thesaurus, it is useful for suggesting other words while searching (if your specified word doesn't retrieve desired documents, try another word which appears together with it in another context). A "category space" records the co-occurrence between objects within concepts, such as two documents with significant overlap of concept words. Much like a classification scheme, it is useful for identifying clusters of similar objects for browsing (to locate which subcollection should be searched for desired items).
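As an illustration only, the two collection-level indexes described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the proposed algorithms: the function names (`build_concept_space`, `suggest_terms`, `build_category_space`), the `min_overlap` threshold, and the toy documents are all hypothetical, and raw co-occurrence counts stand in for whatever statistical weighting a scalable implementation would use.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(objects):
    """Count how often each pair of units co-occurs within the same object."""
    cooccur = defaultdict(Counter)
    for units in objects.values():
        for a, b in combinations(sorted(set(units)), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur


def suggest_terms(cooccur, term, k=3):
    """Suggest up to k units that most often co-occur with `term`."""
    neighbours = cooccur.get(term, Counter())
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [unit for unit, _ in ranked[:k]]


def build_category_space(objects, min_overlap=2):
    """Link pairs of objects that share at least `min_overlap` units."""
    links = {}
    for (id_a, units_a), (id_b, units_b) in combinations(sorted(objects.items()), 2):
        shared = set(units_a) & set(units_b)
        if len(shared) >= min_overlap:
            links[(id_a, id_b)] = shared
    return links


# Toy collection (hypothetical data): object id -> units found in that object.
docs = {
    "d1": ["bridge", "steel", "fatigue"],
    "d2": ["bridge", "steel", "corrosion"],
    "d3": ["bridge", "concrete", "cracking"],
}

concept_space = build_concept_space(docs)
suggestions = suggest_terms(concept_space, "bridge", k=2)
category_space = build_category_space(docs)
```

In this toy collection, a search on "bridge" would be offered "steel" first, since the two words appear together in two documents, and the category space groups d1 and d2 because they share two concept words.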

Some of the semantic indexes are the automatic equivalent of standard library object indexes rather than collection indexes as above. A "meta-data" records the concepts that a given object is related to, such as words describing a document or textures describing an image. Much like subject descriptors, it is useful for retrieving an object by searching for concepts contained in the metadata but not in the object contents. A "meta-map" records the categories that a given concept is related to, such as which concepts in which categories are similar to this concept. Much like vocabulary switching, it is useful for searching across subject domains (by suggesting similar terminology in a foreign subject to that already known in a native subject).
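The object-level indexes can be sketched in the same spirit. The code below is a hedged illustration, assuming that a concept space is available as a mapping from each concept to its co-occurring concepts with counts. The function names, the Jaccard overlap used to compare concepts across domains, and the two toy concept spaces are assumptions made for the example, not the method this proposal commits to.

```python
def generate_metadata(units, concept_space, extra=2):
    """Descriptors for one object: its own units plus related concepts
    from the concept space that the object itself does not contain."""
    own = set(units)
    scores = {}
    for unit in own:
        for other, count in concept_space.get(unit, {}).items():
            if other not in own:
                scores[other] = scores.get(other, 0) + count
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return sorted(own) + ranked[:extra]


def jaccard(a, b):
    """Overlap of two sets of concepts, between 0.0 and 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def meta_map(term, native_space, foreign_space, k=1):
    """Suggest foreign-domain concepts whose co-occurrence neighbours
    overlap most with those of `term` in the native domain."""
    context = native_space.get(term, {})
    scored = [(jaccard(context, neighbours), concept)
              for concept, neighbours in foreign_space.items()]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    return [c for _, c in scored[:k]]


# Hypothetical concept spaces for two subject domains: concept -> {neighbour: count}.
civil = {"fatigue": {"steel": 3, "crack": 2, "load": 1}}
materials = {
    "embrittlement": {"steel": 2, "crack": 2, "hydrogen": 1},
    "sintering": {"powder": 2, "ceramic": 1},
}

descriptors = generate_metadata(["fatigue"], civil, extra=2)
switched = meta_map("fatigue", civil, materials)
```

Here the object containing only "fatigue" also becomes retrievable under "steel" and "crack", and a user who knows "fatigue" from the native domain is pointed to "embrittlement" in the foreign one because the two concepts share co-occurring terms.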

The concept spaces and category spaces will be tested across object types for both text and image. The text collection will be from Compendex and Inspec -- it consists of 10M journal abstracts from across 1000 community repositories spanning all of engineering. The image collection will be from the Map and Imagery Library at the University of California at Santa Barbara. It consists of 10K aerial photographs and satellite images of the surrounding southern California area, comprising some 1M texture units when digitized. There is also a spatial gazetteer to serve as a bridge between image and text, since it gives spatial coordinates with corresponding names of objects that can be located within abstracts from Compendex or from Georef (geography journals). These collections will be adequate to demonstrate semantic mapping across object types of image and text [Chen, Smith, et al. 1997].

The meta-data generation for interoperable repositories will be tested using the personal collections of the developers and colleagues at NCSA. This is the test of computer-assisted user classification of objects for small collections -- the simulation of the typical community repository maintained by a subject expert who is not a professional indexer. These collections will include such object types as HTML and Word/PowerPoint.

The meta-map generation will be the most direct test of semantic interoperability (computer-assisted user queries across subject domains). The purely automatic statistical approach will be tested with the collections above, such as the multiple related domains across engineering for text or the semantic retrieval across image and text collections using the geography maps.

A mixed approach, however, is likely to prove most effective during the proposal period. Medicine is chosen as the subject domain with the best manual indexing which can augment automatic semantic mapping. Concept spaces for 1M journal abstracts each from CancerLit (cancer articles, part of Medline) and from Toxline (toxicology articles from Medline and from Biosis) will be computed. The UMLS (Unified Medical Language System) from the National Library of Medicine contains mappings between the terms across some 40 medical thesauri. Although the thesaurus terms do not cover all of the concepts, their mappings can provide a significant boost to switching concepts across categories. Semantic mapping experiments will consist of interactively switching terms in one concept space to those within another as a suggestion phase before search.
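The mixed approach can be pictured as a lookup that prefers manual thesaurus mappings and falls back to automatic suggestions when the thesaurus does not cover a term. The sketch below is illustrative only: `switch_term`, the `thesaurus_links` table, and the stub fallback are hypothetical names and data, and they do not reflect the actual structure or interface of the UMLS.

```python
def switch_term(term, thesaurus_links, foreign_vocabulary, fallback=None):
    """Map `term` into a foreign concept space via manual thesaurus links,
    falling back to an automatic suggester when the thesaurus has no usable entry."""
    linked = [t for t in thesaurus_links.get(term, []) if t in foreign_vocabulary]
    if linked:
        return linked, "thesaurus"
    if fallback is not None:
        return fallback(term), "statistical"
    return [], "none"


# Hypothetical manual mappings between terms of two thesauri.
thesaurus_links = {"neoplasm": ["tumor", "cancer"]}

# Hypothetical set of terms present in the foreign concept space.
foreign_vocabulary = {"tumor", "dose", "exposure"}


def statistical_stub(term):
    """Stand-in for an automatic concept-space suggester (assumption)."""
    return ["exposure"]


covered = switch_term("neoplasm", thesaurus_links, foreign_vocabulary, statistical_stub)
uncovered = switch_term("carcinogen", thesaurus_links, foreign_vocabulary, statistical_stub)
```

Returning the source of each suggestion alongside the terms lets an interactive interface show the user whether a switched term came from a manual mapping or from the statistical index.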

Analysis environments based on scalable semantics can provide significant interactive support for information correlation. We have designed an architecture for an Interspace environment and are beginning to implement the kernel. This environment will enable generation of interoperable repositories with semantic indexing and semantic retrieval within and across these repositories.

We propose to implement a complete analysis environment supporting semantic information management. Much of this effort involves building semantic services by integrating the scalable semantic algorithms into the architectural framework. This environment will be a prototype of the generation beyond next, and thus assumes the existence of global distributed objects. The kernel implementation will be based on a Smalltalk front end and Versant back ends, with CORBA as the intercommunication medium. The architectural framework can thus be thought of as CORBA-like interfaces for semantic information management.
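To make the idea of interface-based semantic services concrete, the following sketch shows one way a shared interface could let a single query run across several repositories. It is written in Python purely for illustration and is not the Smalltalk, Versant, and CORBA kernel described above. The class names, method names, and sample objects are assumptions, and an in-memory index stands in for distributed objects.

```python
from abc import ABC, abstractmethod


class SemanticRepository(ABC):
    """Hypothetical interface each repository would expose to the environment."""

    @abstractmethod
    def index(self, object_id, units):
        """Record the units (words, textures, ...) found in one object."""

    @abstractmethod
    def search(self, concept):
        """Return the ids of objects indexed under `concept`."""


class InMemoryRepository(SemanticRepository):
    """Toy stand-in for a remote repository reached through the interface."""

    def __init__(self, name):
        self.name = name
        self._postings = {}

    def index(self, object_id, units):
        for unit in units:
            self._postings.setdefault(unit, set()).add(object_id)

    def search(self, concept):
        return sorted(self._postings.get(concept, set()))


def federated_search(repositories, concept):
    """Run one query across several repositories through the shared interface."""
    return {repo.name: repo.search(concept) for repo in repositories}


# Hypothetical repositories of different object types.
text = InMemoryRepository("text")
text.index("abstract-1", ["bridge", "steel"])
images = InMemoryRepository("images")
images.index("photo-7", ["bridge", "coastline"])

results = federated_search([text, images], "bridge")
```

Because callers depend only on the abstract interface, a repository backed by a different store or reached over the network could be substituted without changing the federated search.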

The environment will support both browsing of existing information (such as correlating across sources using concept mapping) and sharing of new information (such as indexing created repositories using concept spaces) for community information distributed across the network. Incorporation of the semantic services will provide a full range of indexing and search across object types and subject domains. By combining the collections from the algorithm development and the software from the systems development, a full-scale prototype of the Interspace will be created. It will be used on a daily basis by the developers and colleagues, and will be available for distribution on an experimental basis to the broader DARPA community.

Specific Deliverables

1. Algorithm development for scalable semantics: concept spaces, category spaces, meta-data generation, meta-map generation

Scalability Tests for:

A. Large professional repositories across object types (engineering text & geography image)
B. Small personal repositories with user classification (web documents and slides)
C. Search correlation across subject domains (medical abstracts with thesaurus meta-maps)

2. Systems prototype for analysis environment: semantic indexing, semantic retrieval, multi-view interface, semantic federation

Evaluation Tests by:

A. Architecture framework based on scalable semantics
B. Implementation built on simulation of Internet-wide distributed objects
C. Functionality of index and search across object types and subject domains 

 

 

 


 

    ........