Objective: To design, implement, validate, and package an analysis environment prototype that achieves semantic interoperability scalable across subject domain, media type, and collection size.

Approach: The Interspace Prototype may be viewed as a collection of semantic indexes organized into hierarchical domains of knowledge accessible within an advanced analysis environment. Underlying these domains are actual physical repositories of objects. The approach thus develops automatic techniques for semantic indexing integrated into the analysis environment. Indexing techniques include concept identification/extraction, concept co-occurrence, concept assignment, concept switching, object categorization, and semi-automated and fully automated generation of hierarchies of knowledge. The analysis environment consists of a kernel based on persistent information units in an applications environment that includes searching, linking, access control, path recognition and matching, and access to objects external to the system.

A brief overview of test collections is in order before going into detail on specific approaches to semantic indexing and analysis. To demonstrate scalability across subject area, the contractor is conducting indexing and analysis experiments on collections encompassing the natural and applied sciences (e.g., biology, medicine, physics, geography, geology, petrology, aircraft manufacturing, civil engineering, industrial engineering, mechanical engineering, electrical and computer engineering, software engineering, computer science). To demonstrate scalability across media type, the contractor is performing experiments with collections in the diverse media domains of text, photographic images, and numerical data including satellite imagery (AVHRR).
To demonstrate scalability in collection size, the contractor is conducting experiments on collections ranging across subject area and media type from 10 items to 10 million items in size.

Approaches to concept identification/extraction can be broken down by media type. In the textual domain, the contractor is developing heuristic approaches to phrase and noun-phrase extraction, including the use of AI techniques based on part-of-speech tagging. In the domain of digitized photographs, the contractor is employing Gabor filters to perform feature extraction, followed by flow analysis for use in region formation. In the domain of numerical data, the contractor has developed custom heuristics to extract temperature and vegetation as conceptual units from channels in AVHRR data.

Approaches to object categorization and the determination of concept-concept relationships include the automatic computation of Category Map and Concept Space indexes. The contractor is investigating algorithms for object categorization, including variants on the Kohonen Self-Organizing Map and Multi-Dimensional Scaling. Implementations of these algorithms are being used to generate content-based textual and visual classifications (the latter termed visual thesauri) for large numbers of multi-media items. Algorithms to compute Concept Space indexes for term suggestion are based on prior research, and the contractor is actively optimizing both these and other algorithms to compute Category Map and Concept Space indexes on demand in an analysis environment supporting dynamic semantic indexing. Dynamic semantic indexing promises to be useful in interactive environments such as the electronic command post of the future.

Approaches to automatic assignment of key concepts to objects (e.g., keyword assignment to text abstracts, labels to Category Map regions) are based on convergent spreading activation in a Hopfield network.
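As a rough illustration of how a co-occurrence index can drive term suggestion through convergent spreading activation, the following Python sketch may help. It is not the contractor's implementation: the asymmetric weighting formula, the sigmoid update rule, the parameters `bias`, `gain`, and `eps`, and all function names are assumptions made here for illustration only.

```python
import math
from collections import defaultdict
from itertools import permutations


def cooccurrence_weights(docs):
    """Assumed weighting: w[a][b] = (docs containing a and b) / (docs containing a)."""
    doc_count = defaultdict(int)
    pair_count = defaultdict(int)
    for terms in docs:
        unique = set(terms)
        for term in unique:
            doc_count[term] += 1
        for a, b in permutations(unique, 2):
            pair_count[(a, b)] += 1
    weights = {}
    for (a, b), n in pair_count.items():
        weights.setdefault(a, {})[b] = n / doc_count[a]
    return weights


def spread_activation(weights, seeds, bias=0.5, gain=10.0, eps=1e-4, max_iter=50):
    """Repeat a sigmoid-thresholded update until activations change by less than eps."""
    seeds = set(seeds)
    terms = set(weights) | seeds
    act = {t: (1.0 if t in seeds else 0.0) for t in terms}
    for _ in range(max_iter):
        nxt = {}
        for j in terms:
            if j in seeds:
                nxt[j] = 1.0  # seed concepts stay clamped on
                continue
            # Weighted sum of the activations flowing into concept j.
            net = sum(act[i] * weights.get(i, {}).get(j, 0.0) for i in terms)
            nxt[j] = 1.0 / (1.0 + math.exp(-gain * (net - bias)))
        delta = sum(abs(nxt[t] - act[t]) for t in terms)
        act = nxt
        if delta < eps:
            break  # converged
    return act


def suggest(act, seeds, k=3):
    """Return the k most strongly activated concepts that were not seeds."""
    ranked = sorted((t for t in act if t not in seeds), key=lambda t: (-act[t], t))
    return ranked[:k]


# Hypothetical toy collection: each document is a list of extracted concepts.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "protein"],
    ["protein", "cell"],
    ["river", "flood"],
    ["flood", "levee"],
]
weights = cooccurrence_weights(docs)
activation = spread_activation(weights, {"gene"})
print(suggest(activation, {"gene"}, k=2))
```

In this toy collection, activation started at "gene" stays within the biology cluster, while the unrelated flood-related concepts remain near zero. The iteration cap is a safeguard for cases where the update does not settle.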
The contractor is also applying this Hopfield-based spreading activation to the problem of mapping from one conceptual cluster to another when mapping concepts between domains (e.g., concept/vocabulary switching). Other approaches to concept switching include the determination of regions of conceptual density (i.e., semantic locality) in Concept Spaces based on co-word analysis, as well as the application of automatic techniques of object categorization to Concept Spaces, for example, creating a Category Map of a Concept Space. The contractor is also exploring the creation of inclusive domain hierarchies formed by aggregating, in ancestor nodes, the content of descendant subgraphs in order to bootstrap the process of concept switching.

Approaches to the generation of hierarchies of domains include semi-automatic methods, which are based upon existing, manually generated hierarchies. Examples include the automatic assignment of objects to the MeSH and SNOMED medical classification systems. The contractor is also pursuing the fully automatic generation of hierarchical domains of knowledge based on collection content. Approaches under investigation include the recursive generation of object categorizations using the Kohonen Self-Organizing Map algorithm.

The Interspace Prototype Analysis Environment is based upon a unified object model called an information unit (IU). The IU is the fundamental structural unit of all the objects in the kernel environment. All IUs have the ability to display, link, and index themselves. Objects in the analysis environment are associated with one or more view IU objects, which provide the means by which an object is manipulated. Multiple simultaneous views of objects can be embedded into any number of independent views of a given IU. A mechanism exists for updating the independent views of an IU when one view changes the underlying object.

Linking between IUs involves one or more IUs or sets of IUs placed in the link list of an IU.
The system provides a default link list that is publicly readable and writable. To the user, linking an IU appears as embedding the object inside another object. The linking system is completely independent of the inheritance, versioning, and domain systems (i.e., links are not used to represent inheritance, etc.).

Indexing and searching are closely related operations in the analysis environment. An IU indexes itself, and this process in turn allows the IU to be located during a search. IU objects also have a base set of properties associated with them, including ownership, accessibility, and versions. IUs can also be units of functionality (termed method IUs), which can be invoked as if the IU were an independent program.

In terms of infrastructure, components of the Analysis Environment are built according to a layered strategy. Each layer corresponds to a particular role or task required by the analysis environment, and objects are able to implement one or more of these roles. The system model layer reflects the actual underlying implementation. The user semantic model layer ensures that the user is presented with inter-object relationships that preserve semantics. For example, if a user asks an object for a list of categories in which that object appears, a list of category objects, and not just a series of string labels, is returned. The interaction/presentation (I/P) layer provides additional capabilities that depend upon the display technologies available in the environment. The media layer defines various simple generic media, such as List, Text, or Graphics. Corresponding to each such medium, there exists a set of objects in the role of an I/P wrapper which may be presented on that medium. Media are not specific to any particular application. Although the goal of the analysis environment is to emphasize the primacy of the user semantic model, there remains a use for application-specific browsers.
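The information unit model and the user semantic model described in this section can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the class names `InformationUnit` and `Category`, the word-level index, and every method signature are hypothetical and are not taken from the Interspace kernel.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so IUs can reference each other freely
class InformationUnit:
    """Hypothetical IU: it can display, link, and index itself."""
    name: str
    text: str = ""
    owner: str = "public"
    links: list = field(default_factory=list)       # stands in for the default link list
    categories: list = field(default_factory=list)  # Category objects, not string labels

    def link(self, other):
        """Place another IU in this IU's link list."""
        self.links.append(other)

    def index_self(self, index):
        """Register this IU under each lowercase word of its text (assumed scheme)."""
        for word in set(self.text.lower().split()):
            index.setdefault(word, []).append(self)

    def display(self):
        linked = ", ".join(iu.name for iu in self.links)
        return f"{self.name} [links: {linked}]"


@dataclass(eq=False)
class Category(InformationUnit):
    """A category is itself an IU, so it can be linked, indexed, and displayed."""
    members: list = field(default_factory=list)

    def add(self, iu):
        self.members.append(iu)
        iu.categories.append(self)


def search(index, word):
    """Locate IUs through the index they built for themselves."""
    return index.get(word.lower(), [])


index = {}
report = InformationUnit("report", text="Flood levee survey")
photo = InformationUnit("photo")
report.link(photo)
report.index_self(index)
hydrology = Category("hydrology")
hydrology.add(report)
print(report.display())
print([c.name for c in report.categories])
```

Asking `report` for its categories yields `Category` objects that can in turn be displayed, linked, or searched, which is the point of preserving semantics at the user level rather than returning bare labels.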
Such browsers are intended to provide for application-specific code that makes use of generic Media components, and they form the final browser layer. An example is a Browser having both List and Text media, with the List used to present search results and the Text used to present the body of a textual document selected from the List.

The contractor has also begun the design and implementation of a path recording and matching service within the Interspace Analysis Environment. Path recording provides a mechanism for capturing, representing, and making persistent the activities of Interspace users. Path matching provides a mechanism for allowing users to locate prior work within the Interspace that may be similar in nature along one or more dimensions, such as topic, time frame, or purpose.

The contractor's approach to evaluation of the Interspace Prototype includes manual precision/recall experiments complemented by semi-automatic methods of validation. Semi-automatic validation is a topic of renewed interest on the part of the digital library research community in light of the tremendous increase in the amount of digitally available materials. As an example of semi-automatic validation, the contractor has developed a methodology for comparing Concept Spaces. When the Concept Spaces are treated as weighted graphs, a semantic difference or distance measure can be computed between two collections. This approach is used in determining the correctness of optimizations (e.g., parallelization) to the Concept Space algorithm. Such measures are used in conjunction with "gold standard" indexes established by long practice, user evaluation, or domain experts.

Our CANIS Laboratory at UIUC organized a specialty workshop on the topic of Metrics for Digital Libraries in June 1998 at the ACM Conference on Digital Libraries '98. This workshop was co-sponsored by the DARPA D-Lib Metrics Working Group.
Presentations were given by Jim Thomas (Battelle), Paul Kantor (Rutgers), Ed Fox (VTI), Jim French (UVa), Michael Ortega (UIUC), and William M. Pottenger (CANIS Lab, UIUC).
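To make the evaluation approach described in this section more concrete, the following Python sketch compares two Concept Spaces represented as weighted graphs and computes precision and recall for a retrieval result. The report does not specify the distance measure used, so the weighted Jaccard distance here is an assumption chosen for illustration, as are the function names and the edge-map representation.

```python
def graph_distance(space_a, space_b):
    """Weighted Jaccard distance between two edge-weight maps.

    Each map takes a (source_term, target_term) pair to a non-negative weight.
    0.0 means the two graphs are identical; 1.0 means they share no weight.
    """
    edges = set(space_a) | set(space_b)
    overlap = sum(min(space_a.get(e, 0.0), space_b.get(e, 0.0)) for e in edges)
    total = sum(max(space_a.get(e, 0.0), space_b.get(e, 0.0)) for e in edges)
    if total == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - overlap / total


def precision_recall(retrieved, relevant):
    """Standard precision and recall over sets of item identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical check that an optimized computation reproduces a reference index.
serial = {("gene", "protein"): 1.0, ("protein", "gene"): 0.5}
parallel = dict(serial)
other = {("gene", "protein"): 0.5, ("gene", "cell"): 0.5}

print(graph_distance(serial, parallel))
print(graph_distance(serial, other))
print(precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d5"]))
```

A distance at or near zero between the serial and parallel results would suggest that an optimization preserved the index, while a larger value against a "gold standard" index would flag a semantic difference worth inspecting. In practice a small tolerance may be needed to absorb floating-point differences.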
Plans for FY99 include the following: The current computation of visual thesauri will be completed for 1000 aerial photographs, providing semantic indexing on 1M image regions. Additional computations will also be made on AVHRR data, and the results will be integrated into an improved analysis environment in the Interspace Geographic Information System. This innovative environment and technology hold genuine promise for military applications beyond disaster relief.

The contractor will continue conducting experiments in the automatic generation of domain hierarchies, computation of concept spaces, automatic assignment of key concepts, and research in techniques for concept switching for large collections in the domains of medicine and biology. These experiments will include providing category maps for an entire discipline-scale collection, MEDLINE with 10M abstracts. All semantic indexes on text collections will be generated using the integrated analysis environment prototype, as part of its continuing evolution. Experiments will also be run with on-demand dynamic semantic indexing, using the environment to provide indexing interactively. Once mature, this technology has significant defense applications in automated battlefield analysis.

The contractor will continue research into scalable techniques of validation and will continue conducting evaluations based on established measures of precision and recall.

The major technology transitions in FY98 have involved the integration of noun-phrasing technology, automatic categorization technology for text, and high-performance HDF geoindexing infrastructure for simulated sensor data into the DARPA/ISI GeoWorlds disaster relief scenario demonstration system.