| ... | |||
| ..... |
RATIONALE The Net will be radically different ten years from now. There will be a billion repositories spread around the world. These repositories will contain the knowledge of communities ranging in size from a few co-located individuals to giant distributed organizations. Problem solving using the Net will be an everyday occurrence. There will be analysis environments enabling cross-correlation of information from a wide variety of sources across different subject domains. Correlation will become the fundamental operation within the Net. There will be powerful facilities for dynamically grouping people and information. These advanced capabilities require a new information infrastructure supporting user-centered semantic information retrieval systems. This proposal is about bringing to fruition 10 years of research on developing such an infrastructure.
The functionality of the Net can be viewed in waves as Figure 1 illustrates. In the upswing of a wave, the fundamental research is being done for a new level of functionality. Prototype research systems begin in the trough and evolve into mass commercial systems in the peak. The functionality of this wave is then polished and finalized in the downswing period before the start of the next wave. The First Wave was the level of access, of transmission of data. It began with the coming of the ARPANET and evolved through large-scale distributed file systems such as AFS, roughly 10 years on the upswing and 10 on the down. The Second Wave is the level of organization, of retrieval of information. It began with distributed multimedia network information systems, such as PI Schatz's Telesophy system, which was featured in his invited talk at the 20th Anniversary Symposium for the ARPANET in 1989 as the example of technology which might make possible future worldwide information spaces. That same year saw the initiation of the World-Wide Web project at CERN, which when coupled with the NCSA Mosaic interface became the technology that brought global information spaces to the world at large. As one result of his work on the Telesophy system, Schatz became the scientific advisor for information systems to NCSA throughout this entire period, in both the pre and post Mosaic eras. The information wave took about 10 years to peak and should continue for another 5 in the consolidation phase, through the near future period of distributed objects and Internet-wide operating systems. The Third Wave will bring the level of analysis, of correlation of knowledge. It will move past search of individual repositories, beyond federation across repositories, to analysis of diverse groups of information across sources and subjects. To develop the new technology needed for this new wave 10 years hence, one needs to begin now so that widespread research prototypes can be available for the new millennium supporting global semantics. The project proposed here will develop on-going research into a fully fledged analysis environment and perform large-scale tests of its scalability and effectiveness. To enable concentration on the new functionality, we will simply assume that the information wave is complete and that the distributed objects exist, then develop revolutionary architectures for the protocols and environments underlying the Third Wave. In this third wave, the Interspace, there will be distributed services to manipulate concepts across domains just as the ARPANET had distributed services to transfer files across machines and the Internet is having distributed services to transfer objects across repositories. We propose an architecture for the Interspace environment which supports fundamental manipulation of concept spaces: indexing and retrieval, grouping and sharing. The research will also implement a prototype analysis environment to handle the community repositories and personal information, and test this environment on small sample user populations. At the end of the contract period, the software environment and the test repositories would be available for technology transfer to the ARPA community and other interested parties. The third wave, the Interspace, will reach widespread usage in the early part of the 21st century, 10 years from now. This wave will support correlation across multiple information sources and create a global concept space relating together the knowledge in the billions of community repositories. The Interspace will be constructed above information infrastructure that supports connection of spaces, just as the Internet is constructed upon infrastructure that supports connection of networks. Correlating information in the world of a billion community repositories will require direct interaction with spaces that enable users to manipulate concepts rather than objects. It is possible today for a revolutionary systems project to build a research prototype of the future Interspace [Schatz, 1997]. Analysis environments provide basic infrastructure support for interactive correlation of information from multiple sources across the Net. We propose building such a prototype of a complete information infrastructure upon which powerful user-centered analysis environments can be constructed. This infrastructure will be based upon abstract spaces for concepts and categories, which provide semantic correlation at levels above the concrete objects and networks. The foundational research for the underlying technology has been carried out within the Digital Library Initiative (DLI) project at the University of Illinois, partially funded by ARPA ITO, with the same senior personnel as here. The implementation will use existing advanced software technology such as Smalltalk and CORBA to simulate the everyday world of distributed objects ten years hence. The Interspace infrastructure is intended to be universal for all repositories of all types. Its semantics is thus based upon statistical clustering techniques, arising from information retrieval and image processing research. These techniques support scalable semantics, where the clustering works automatically to index all objects for all collections. In particular, the concept spaces utilize co-occurrence frequency of terms within objects, while the category spaces utilize co-occurrence frequency of objects within repositories. Our major research claim is that concept spaces are the most effective generic level of interoperability for semantic manipulation of object repositories. At the heart of semantic interoperability is the ability to extract the meaning of diverse materials and map "semantics" uniformly across these. Concept Spaces [Chen, Schatz 1996a,1996c] are collections of abstract concepts generated from concrete objects. The concepts are typically labelled with text phrases and the collections have traditionally been text documents. However, except for the generation operations, the logical concepts are independent of the physical objects they represent. Generalized vocabulary switching across concept spaces is thus a concrete approach to semantic interoperability. These statistical indexes are similar to those which grew out of the DARPA TIPSTER project, which also addressed information retrieval on large collections. In particular, our proposed concept space techniques are similar to the INQUERY PhraseFinder database in principle. However, several significant improvements have been made in our research. Firstly, our concept space techniques do not need to have any a priori default probabilities (which are difficult to obtain) and thus are more scalable and efficient. Secondly, several domain-independent noun phrase extraction techniques have been developed in our research, which include phrase extraction automatic indexing techniques (up to 5-word phrases, unlike the two-word phrases in INQUERY) and natural language-based linguistic noun phrase extraction techniques (any numbers of words for syntactically-correct phrases). Currently, we have experimented with the MIT Media Lab Chopper (based on Princeton's WordNet), the Brill parser (University of Pennsylvania), and the NPTool parser (Helsinki University). Thirdly, none of the TIPSTER systems are designed to perform categorization or classification of topics like our proposed category map algorithms. Categorization, however, is an analysis process critical to vocabulary switching and semantic mapping. Fourthly, our statistical algorithms are intended to be universal and have already been tested on large collections across a wide variety of text with, for example, different subjects (biology and engineering) and different types (journal articles and web pages). In addition, unlike TIPSTER research which was designed for textual applications only, the proposed research is designed for a general and scalable analysis environment for both textual and image analysis. Several texture and region based image indexing techniques will be integrated with the concept space and category map algorithms. In summary, we believe our proposed research is more scalable (in domains, media types, techniques, and sizes) than previous TIPSTER research, and yet it extracts semantics at a level suitable for large-scale algorithmic computation. Such a "scalable semantics" approach is unique and will be able to help create a powerful concept-based analysis environment. Finally, the proposed concept space and category map techniques have been optimized for efficient large-scale computations (based on our earlier work in high-performance computing). Several versions of the proposed techniques will be available for batch processing, incremental update, and real-time on-the-fly term suggestions. We intend to continue to perform high-end supercomputer computations to simulate the personal computers of 10 years hence. These simulations will test parallel optimization techniques and identify limitations of current architectures. We intend to use these tests to guide design requirements for special-purpose machines that can perform real-time semantic indexing and categorization. The implementations of such designs may be very significant practically for real-life military situations. APPROACH Our research concentrates on building information analysis environments for distributed object repositories. In particular, the new infrastructure will support semantic retrieval and semantic mapping across object types and subject domains. The Interspace framework supports integrated interoperability of semantic services based on statistical algorithms for information management. Our suite of statistical algorithms will not only be complete (in the sense of mimic-ing all the standard manual classification techniques with automatic procedures) but also computationally feasible. We have done the foundational research identifying suitable algorithms for: concept spaces (co-occurrence matrices), category spaces (Kohonen maps), meta-data generation (Hopfield nets), and meta-map generation (spreading activation). The research in this proposal will evaluate each algorithm on a suitably large collection and evolve the implementation until adequate speed and reliability is reached. We propose a set of major experiments to test the scalability of the Interspace prototype. Each experiment will test a part of the information infrastructure using the high-performance technology now which will be the everyday technology ten years hence. These information infrastructure experiments will be performed using the unique advanced computational infrastructure at the new NCSA (National Computational Science Alliance). On the repository end, large collections will be used to simulate the world of a billion community repositories. For text objects (technical documents), we have already assembled 1000 repositories across all of science and engineering from 10,000,000 journal abstracts. For image objects (spatial maps), we are assembling 100 repositories across Southern California from 10,000 aerial photographs and satellite images. Experiments with large collections of both text and image will test whether concept spaces and category maps are effective for interactive retrieval across object types. The integration provided by the environment will enable users to issue composite queries which require multiple correlations between multiple sources to answer. Examples of queries we expect that the Interspace prototype will be able to answer after the geography databases are incorporated include the following: "Find me a up-scale residential area in Santa Barbara county which was not flooded in 1994" "Find me information about orchards along the Santa Cruz river in Arizona" "I plan to move my operation to Los Angeles county. Find me a site close to major highways but with a lot of green. It should have existing parking lots nearby and be close to city buildings. Hopefully, it will also be close to residential areas and schools."We will also be running a staged set of experiments for the technology of semantic mapping. The current experiments do space intersection, moving across spaces on common terms. The first stage past this will use neural nets to perform spreading activation, to find similarity matches across spaces, based on several traversals along the graphs of the concept spaces. The next stage will match clusters of terms to clusters of terms, rather than single to single terms. Finally, manual boosting of mapping across domains will be attempted in the medical domain, where extensive maps of terms across thesauruses exist (not all the terms but enough to give a good boost to the automatic matching). In the world of ten years hence, the user will be able to fly through abstract spaces, record paths of selected knowledge, and use the Interspace to find similar paths for examination. To test the feasibility of path matching on the user end, we will use the largest supercomputers at NCSA as time-shared machines to compute the semantic clustering on-the-fly. This will enable testing of dynamic categorization, where user preferences for classifying large materials from multiple collections can be indexed interactively. In contrast, the collections above use static human-generated category hierarchies or automatic category spaces generated off-line. Semantic Retrieval Algorithms Concept spaces are collections of abstract concepts which are generated from concrete objects. The concepts are typically labeled with text phrases and the collections have traditionally been text documents. However, except for the generation operations, the logical concepts are independent of the physical objects they represent. For example, in molecular biology, the same gene concept occurs in literature text, gene descriptions, map locations, sequence codes, lineage trees, developmental images, and physical clones. The difficulties in WCS in supporting federation across information sources were semantic, i.e., the ability to map the same concept across its occurrence in different objects, rather than syntactic, i.e., the ability to execute the same operation such as query across multiple sources. We will initially be testing semantic retrieval algorithms with text documents, since the technology for this medium is sufficiently mature that automatic generation of concepts can produce large enough collections to test scalability. Nonetheless, this is only due to the necessity to simulate future technology with current data. The algorithms for interactive retrieval, for example, are based on the frequency of objects occurring together in context. These would work just as well with images as with text, and in fact we are additionally experimenting with concept spaces for texture classification in spatial data (GIS) as part of a collaboration between the Illinois and the Santa Barbara DLI projects. (To see whether the term suggestion techniques which appear effective with text phrases [Schatz, Johnson, and Cochrane 1996] are also effective for other objects.) The major operations for concept spaces deal with categorization, the grouping of concepts particularly across subject domains and community boundaries. Intersection of concept spaces is the route towards vocabulary switching, mapping of terminology from one domain to another. Our preliminary results show that vocabulary switching is possible on an interactive basis, as with suggestion of alternative terms for information retrieval [Chen, et. al., and Schatz 1996a]. That is, future technology will support augmentation but not automation of human performance in crossing domain boundaries. We believe concept spaces are a significant step towards true semantic interoperability, the Grand Challenge of the digital library research agenda [IITA 1995]. Our implementation of the protocols will concentrate on testing the scalability and evaluating the effectiveness.The specific algorithms to be examined in this research include: concept spaces, vocabulary switching, and category maps. Selected techniques have been tested in the Illinois engineering testbed (e.g., concept space [Chen, Schatz, Martinez, and Ng 1994] [Chen, Schatz, Yim, and Fye 1995] [Chen and Ng 1995] [Schatz, Micho, Cole, Hardin, Bishop, and Chen 1996]) and in the Alexandria project (e.g., texture extraction [Manjunath and Ma 1995] [Smith 1996]). However, significant modification and experimentation will be needed for this large-scale interoperability experiment. Concept Spaces For each online document (text or image), we need first to identify the "concept indexes" to represent its content. In the Illinois DLI project, we adopted an automatic phrase indexing technique which includes dictionary look-up, stop-wording, and adjacent term phrase formation. The algorithm first identifies individual words. Then, a stop-word list is used to remove non-semantic bearing words (e.g., the, a, on, in). After removing the stop words, a similar procedure to remove verbs and adverbs was also adopted. Finally, term-phrase formation was accomplished by combining adjacent words. We found that our manually-crafted stop words, stop verbs, and stop adverbs can be used across different domains and that they help create terms that are predominantly noun phrases (usually one or two adjectives, followed by a noun). In addition, as mentioned in Section H, we have experimented with the MIT Media Lab Chopper (based on Princeton's WordNet), the Brill parser (University of Pennsylvania), and the NPTool parser (Helsinki University), and are evaluating these systems. For geo-referenced photos, maps, and images, we plan to use Gabor wavelet features, developed by the imaging group of the Alexandria project [Manjunath and Ma 1995] [Smith 1996], for texture analysis. The experimental results are promising. The Gabor features compare favorably with those of other existing texture analysis algorithms. After texture indexing, each texture will be treated as a unique identifier (similar to a noun phrase in a document) for subsequent analyses. After concepts (terms or textures) are identified in each document, we will proceed to perform concept co-occurrence and similarity analysis based on the asymmetric "Cluster Function" developed by Chen and Lynch [Chen and Lynch 1992]. We have shown that this asymmetric similarity function represents concept association better than the popular cosine function.
These two equations indicate the cluster weights, or similarity, from term Tj to term Tk (the first equation) and from term Tk to term Tj (the second equation). dij and dik are the product of term frequency and inverse document frequency and are defined in a similar manner (this is a variation of the popular tf*idf measure used mostly in text-based vector space modeling where tf is concept frequency and idf is inverse document frequency [Salton 1989]). dij, for example, is defined as:
where N represents the total number of documents in the collection, tfij is the frequency of occurrence of term Tj in document i, dfj is the number of documents (across the entire collection of N documents) in which term Tj occurs, and wj is the number of words in term Tj. dijk and dikj represent the combined weights of both terms Tj and Tk in document i and are also defined in a similar manner. dijk, for example, is defined as follows:
Here tfijk represents the minimum number of occurrences of term Tj and term Tk in document i. dfjk represents the number of documents (in a collection of N documents) in which terms Tj and Tk occur together. The final expression, wj, is the number of words in term Tj. In order to penalize general terms which appear in many places in the co-occurrence analysis, the authors develop a weighting scheme similar to the inverse document frequency function. Tk, for example, has the following weighting factor:
Terms with a higher value for dfj (i.e., more general terms) have a smaller weighting factor, which results in a lower similarity. Co-occurring terms are ranked in decreasing order of similarity, with the result that more general terms occur lower in the list of co-occurring terms. Vocabulary Switching (Concept Mapping) In addition to the user-controlled concept space and/or thesaurus browsing process, searchers can also invoke selected spreading activation algorithms for multiple-concept, multiple-link concept suggestions. The system-generated concept space and other human-generated knowledge structures (e.g., thesauri, semantic networks) can be merged as a large concept graph (akin to a human problem space as defined in human information processing theory [Card, Moran, and Newell 1983]). In [Chen, Lynch, Basu, and Ng 1992], we have proposed a general algorithmic solution to merging probabilistic (i.e., concept space) and symbolic (i.e., thesauri and semantic networks) graphs. We have also developed two spreading activation algorithms, based on the serial branch-and-bound algorithm and the parallel Hopfield net algorithm, respectively [Chen and Ng 1995]. The Hopfield algorithm, in particular, has been shown to be ideal for concept-based associative retrieval. Each concept (a phrase or a geo-referenced texture) in the integrated probabilistic/symbolic networks can be treated as a neuron and the asymmetric weight between any two concepts can be taken as the unidirectional, weighted connection between neurons. An ad hoc procedure of translating symbolic links to probabilistic links is needed as described in [Chen, Lynch, Basu, and Ng 1993]. Using user-supplied concepts as input patterns (e.g., "Find me information about orchards along Santa Cruz river in Arizona?"), the Hopfield algorithm activates their neighbors (strongly associated geo-phrases, e.g., Tempe, Phoenix, Tucson, and Wilcox, and related image textures of pecan farms and apple farms), combines weights from all associated neighbors (by adding collective association strengths), and repeats this process until convergence. During the process, the algorithm causes a damping effect, where concepts farther away from the initial concepts receive gradually decreasing activation weights until activation eventually "dies out." This phenomenon is consistent with the human memory spreading activation process [Card, Moran and Newell 1983]. The Hopfield net algorithm relies on an activate and iterative process, where
is the activation value of neuron (concept) j at iteration t+1, tij is the co-occurrence weight from neuron i to neuron j, and fs is the continuous SIGMOID transformation function, which normalizes any given value to a value between 0 and 1 [Knight 1990] [Dalton and Deshmane 1991]. This formula shows the parallel relaxation property of the Hopfield net. After the convergence process, the activated concepts reveal the semantically most relevant concepts for user queries. New concepts will be suggested to users for selection or for further concept spreading activation (an iterative process). Category Maps A scalable multi-layered, graphical self-organizing map (SOM) approach to textual and image (object) categorization was developed in our previous research [Chen, Schuffels, and Orwig] [Orwig, Chen, Nunamaker 1997] [Chen, Smith et. al. 1997]. In the algorithm's basic form, continuous-valued vectors are presented sequentially without specifying the desired output. After enough input vectors have been presented, network connection weights will specify cluster or vector centers that sample the input space such that the point density function of the vector centers tends to approximate the probability density function of the input vectors. In addition, the connection weights will be organized such that topologically close nodes are sensitive to inputs that are physically similar. The algorithm proceeds as follows:
Use the most significant N features from all objects (text or image) as the input vector and create a two-dimensional map (grid) of M output nodes (e.g., a 20-by-10 map of 200 nodes). Initialize weights from N input nodes to M output nodes to small random values.
Represent each object by a vector of N features and present to the system.
Compute distance dj between the input and each output node j using
where xi(t) is the input to node i at time t and wij(t) is the weight from input node i to output node j at time t.
Select winning node j* as that output node with minimum dj. Update weights for node j* and its neighbors to reduce their distances (between input nodes and output nodes).
After the network is trained through repeated presentation of all objects, submit each object to the trained network and assign it to its rightful place in the output map (i.e., the winning node). The resulting map thus represents regions of important objects and the assignment of objects to each region. Regions that are similar (conceptually) will appear in the same neighborhood.
For each map region which contains more than k objects, conduct a recursive procedure of generating another self-organizing map until each region contains no more than k objects. Real-time Semantic Retrieval on Parallel Computers A standard approach to semantic retrieval performs statistical correlations on each subject area to support deeper retrieval. However, true information retrieval involves entire sessions not individual queries. For example, a user first chooses subjects of interest and then retrieves pertinent documents. Next, the user dynamically repartitions the collection for further queries. An ideal retrieval system would thus support dynamic classification during the course of a search session. We have identified automatic techniques for clustering categories from a collection, and then indexing concepts from each of the identified categories. However, dynamic sessions involve identifying a collection on-the-fly as a combination result of choosing existing categories, grouping query results, and examining documents. Supporting automatic techniques such as these would enable users to factor user-identified semantic indices into an interactive retrieval session. For example, the refinement of a query response consisting of 100 to 300 items is considered an important milestone in real-time semantic retrieval [Schatz 1990]. Similarly, there is a growing demand for the application of semantic retrieval techniques on repositories ranging in size from a few thousand to several hundred thousand items [Schatz and Chen 1996]. As yet, the initial milestone has not been reached. However, recent research conducted by us demonstrates that real-time semantic indexing is within grasp. Employing the resources at the National Center for Supercomputing Applications (NCSA), semantic indexing for a data set consisting of 300 items was performed on a parallel multiprocessor in under ten seconds [Pottenger and Schatz 1997b]. In the same vein, it must be noted that the world is heading towards parallel computers and computations. Dr. Richard Wirt, Director of Intel's new Research Lab, emphasized this trend in a recent seminar in which he discussed future directions in microprocessors, multiprocessors, and applications \cite{Wirt96}. This trend is being echoed on the desktop as major providers of personal computer systems integrate multiple processors into workstations. According to Dr. Wirt, the personal computer of the year 2000 will have many times the performance of today's systems. Combined with the trend towards parallel computers, tomorrow's multiprocessor PCs will achieve performance comparable to today's supercomputers. With the advent of parallel supercomputers, semantic retrieval based on statistical techniques known for many years is now becoming computationally feasible in real-time. Experiments in semantic retrieval conducted on today's supercomputers enable the development and implementation of algorithms which will be widely applicable on PC multiprocessors in five to ten years. As such, today's supercomputers represent a "time-machine" which can be effectively employed in the task of developing real-time statistical information retrieval techniques. The development and implementation of real-time parallel semantic retrieval algorithms is, however, a challenging task. This can be seen from recent results in the computation of Concept Spaces on the HP/Convex Exemplar parallel supercomputer \cite{CSNY96}. In this study, an input set of approximately 60,000 abstracts each approximately two kilobytes in size required over 10 hours of computation time on 24 processors. After a careful reimplementation based on the theory of associative loop operators [Pottenger 1997], a computation of greater complexity was completed in less than an hour on only 16 processors in a recent experiment at NCSA. This same theory has been shown to apply to a wide range scientific computations including irregular Fortran codes [Pottenger et. al. 1997]. An ambitious program of research is necessary to compute automatic categorization in real-time to enable fully interactive information retrieval. The initial focus of our research will be on characterizing the computational requirements for several classes of semantic retrieval. This is a necessary first step in determining the requirements for retrieval in real time. In order to accomplish this, a study of the automatic categorization algorithms will first be conducted based on the theory of associative loop operators discussed in [Pottenger 1997]. Based on the results of this study, we plan to implement a parallel C++ version of this algorithm on a distributed, shared-memory (DSM) architecture such as the SGI/Cray Origin2000 at the National Computational Science Alliance (NCSA). This work builds on previous work discussed in [Pottenger and Schatz 1997b} in which algorithms from \cite{ChenLynch92} were analyzed for parallelism based on the model presented in [Pottenger 1997]. Once the analysis and implementation of the automatic categorization algorithms is complete, we will begin the characterization of the performance of the parallel implementations of both the Concept Space generation and Category Map computations in order to make a determination of the computational requirements needed for real-time semantic retrieval. Based on the projected performance of desktop multiprocessors in the next 3-5 years, we will make a determination as to whether special-purpose hardware will be needed to compute problems of this nature in real-time. Initial results indicate that special-purpose hardware will be necessary. Table 1 portrays the performance of cSpace, a semantic retrieval application benchmark [Pottenger and Schatz 1997a], for an input set of full-text documents from an electronic collection located at the Grainger Engineering Science Library at Illinois. The computation was completed in dedicated mode on a shared-memory SGI PowerChallenge multi-processor. The results are contrasted with a similar experiment conducted using abstracts. Table 1: Abstracts vs. Full-Text Concept Spaces for 300 Documents The abstracts average 2-3KB in size, and the full-text documents approximately 20-30KB. The columns record elapsed (wall-clock) execution time in hours, minutes and seconds (h:m:s) and speedup (Sp). The super-linear speedups in the full-text experiments are an indirect result of the poor performance of multi-threaded dynamic memory allocation in C++ on the SGI PowerChallenge. The parallel version of cSpace used in these experiments employs a customized memory manager which alleviated much of the overhead associated with multi-threaded dynamic memory allocation. However, this also provided an unexpected benefit in that the overhead of numerous calls to malloc (i.e., operator new) was entirely eliminated [Pottenger and Schatz 1996]. The results on four processors indicate that the computation of a Concept Space for full-text semantic indexing involves a factor of almost 300 increase in computational complexity over that required for abstracts. Given that more complete testing confirms this result, it will be necessary to design a new-style PC for semantic computation and visualization. We intend to produce the outline of a design to create the semantic equivalent of graphics and floating point chips in current processors, which will then be used in the machine of the future when semantic retrieval is a standard operation. Thus we will be collaborating closely with expert machine designers to generate the appropriate requirements. Scalability Tests for Semantic Retrieval These expeiments are tests of the range of algorithms just described on large-scale collections suitable for each type of algorithm. The significance is the scale. The Compendex Document Experiment A concept space generation experiment has been conducted recently. The testbed for the experiment was the 1992-1995 Compendex engineering abstracts, with about 2.6M abstracts and 4.1 GBs in size. The Compendex collection consists of 38 high-level engineering categories (domains, e.g., civil engineering) and about 600 engineering sub-categories (communities, e.g. tunnels and bridges). Using a data parallel strategy, we were able to partition the entire collection based the 600 sub-categories and thus produce about 600 concept spaces in various engineering sub-categories. Although sub-category collections and concept spaces vary in size, on an average each concept space contains about 5K concepts (2-word and 3-word noun phrases) and 200K weighted links. Using a divide-and-conquer approach, we were able to generate all 600 community concept spaces using about two days of the CPU cycle on a 24-node Convex Exemplar supercomputer (1 GB shared memory). The Inspec Document Experiment A similar experiment will be performed on a larger and more fine-grained 1989-1995 Inspec collection, which contains about 5M abstracts (7 GBs) in computer science, electrical engineering, and physics, classified by 3000 sub-categories. Based on the same divide-and-conquer approach, we plan to produce 3000 fine-grained computer engineering concepts using about five days of computing cycle on a 24-node Convex Exemplar. If our reformulation of the implementation to do pure shared-memory SMP computation is successful, we will be able to cut this time to a dedicated weekend on the Exemplar at NCSA. The Medical Vocabulary Switching Experiment A CancerLit concept space and category map generation experiment will be conducted in collaboration with the National Center Institute (NCI). The CancerLit collection consists of about 1.5M cancer-specific abstracts (about 4 GBs in size), grouped according to different types of cancer (e.g., breast, lung, etc.). An all-in-one strategy will be compared with the divide-and-conquer strategy to examine the concept space and category map results. Benchmarking with the two strategies on the NCSA SMP supercomputers will be performed. Several refined indexing techniques will also be adopted in this biomedical informatics experiment. Term-phrase formation automatic indexing techniques and NLP noun phrase extraction techniques will be experimented and compared with in the context of medical information retrieval. Concept spaces and category maps generated as a result of adopting automatic indexing and NLP noun phrase extraction will be evaluated by the NCI researchers and cancer clinicians. In addition, we plan to incorporate the NLM's Unified Medical Language System (UMLS) meta-thesaurus as the human-generated knowledge source to supplement our system-generated concept spaces and category maps for semantic indexing and vocabulary switching. The Personal Email Experiment Based on our previous research in collaborative computing and intelligent meeting agents, we plan to generate concept spaces and category maps to classify personal email information. Such a classification process is important for creating user-specific information space and concept space. Selected information visualization techniques (fisheye views and fractal views) will also be tested to represent personal email repositories. Automatic indexing and NLP noun phrase extraction techniques will be modified to accommodate the unique email structures and characteristics. Normalization will also be needed to deal with the non-uniform lengths of personal email. In addition, local passage weighting of terms will also be tested in order to best represent personal concept space. An efficient, real-time version of concept space and category map algorithms will be fully tested for this dynamic (and oftentimes noisy) application. The Alexandria Map Experiment In collaboration with the University of California at Santa Barbara DLI project, several Alexandria geo-referenced collections will be used in our concept protocol experiments. About 11K US aerial photos are currently available online at UCSB Map and Imagery Library (a partner of the Alexandria project). Each photo is about 55 MBs in size, with 6224-by-7687 pixels, 1-meter ground resolution, and is in JPEG PBMT storage, based on 1983 standards. All have been geo-rectified (with longitude/altitude). Each image will be evenly divided into about 40 1000-by-1000 pixels subimages, thereby producing about 440K subimages. At 1-meter resolution, fine-grained geo-concepts such as houses, schools, factories, parks, etc. can be identified. Texture/region extraction and labeling will be performed for content-based image indexing. Corresponding to each US aerial photo, topographical maps are also available from the Alexandria partners (11K+ maps). Each map contains 200+ well-defined topological map symbols, e.g., vegetation, railroad, mine, river, lake, based on the National Mapping Program. Texture/feature extraction based on these pre-defined map symbols will be performed (all symbols are already labeled). Several hundred thousand satellite images are also available from the Alexandria partners. These images are collected every 18-20 days (i.e., with a strong temporal dimension). Each image is 4557-by-6020 pixels, 10-meter ground resolution and the collection covers all world/US regions. Complementary to the finer-grained (but more out-dated) aerial photos, the 10-meter resolution could produce higher-level geo-concepts such as residential area, up-scale neighborhood, beach, etc. Similar texture extraction and labeling techniques will be adopted. The above-mentioned aerial photos, topological maps, and satellite images will be grouped by state, thereby producing about 150 texture-based "visual" concept spaces of relevance to each US state. Evaluation of Scalability Experiments For each research testbed, user evaluation based on concept recall/precision and document recall/precision will be conducted to validate and refine the concept space approach and results. At Year 1, user evaluation will be performed for the Compendex and Inspec experimental results (documents) and at Year 2, user evaluation will be performed for the Alexandria results (maps). At the end of Year 3, a comprehensive user evaluation based on the integrated concept protocols (documents and maps) for all collections will be conducted. A concept recall/precision experiment will be conducted to examine the correctness of the concept spaces and a document recall/precision experiment will be carried out to examine their usefulness (in assisting search). Concept Recall and Precision Experiment The goal of the first experiment will be to determine the recall and precision levels of the concept spaces generated by the system, as judged by human experts. Knowledgeable subjects will be recruited for each subject domain. We propose to adopt an experimental design similar to those used in human memory association experiments [Anderson 1985] and in thesaurus evaluation studies [Chen, Schatz, Martinez and Ng 1993]. A recall phase, followed by a recognition phase, will be performed in that order. In the recall phase, each subject will be asked to generate through a free association process as many related concepts (noun phrase or images) as possible in response to each test descriptor (previously selected by a panel of domain experts) presented. This phase of the experiment will call upon subjects' memory recall. In the recognition phase, experimenters will create lists of associated, system-generated concepts in random order for subjects to evaluate with regard to their relevance to the test descriptor. For analysis of results from the concept association experiment, we will utilize concept recall and concept precision for evaluation. Rather than examining the number of relevant documents retrieved, we will count the number of relevant concepts generated by the system (as judged by the knowledgeable subjects). They will be computed as follows: ConceptRecall = Number of Retrieved Relevant Concepts/ ConceptPrecision = Number of Retrieved Relevant/ Document Recall and Precision Experiment The goal of the second experiment will be to determine the recall and precision improvements of adopting concept spaces in information access (in comparison with non concept space assisted information access). After validation of the concept spaces generated for each collection, a standard document recall and precision experiment will be conducted. Performance comparison between groups using concept protocols for information access and groups not using concept protocols will be conducted. Knowledgeable subjects will be recruited and asked to perform a set of fixed tasks relevant to the collections. Retrieval results of each search session will be judged by a panel of domain experts to compare recall and precision levels of concept space assisted groups with those of non concept space assisted groups. Designing and Implementing the Interspace Prototype Once the underlying analysis algorithms are available, it is possible to build complete information analysis environments for distributed object repositories. This collaboration infrastructure will support semantic retrieval and vocabulary switching across community boundaries and subject domains. This is a concrete approach to the Grand Challenge of Digital Library Research which is "interoperability at a deep semantic level, providing digital library users with a coherent view of heterogeneous autonomously managed resources" [IITA, 1995]. Figure 2 Interspace conceptual architecture. We have already designed a preliminary architecture for the Interspace and are just beginning the first research prototype implementation. This proposal will enable us to move beyond research prototype into the exploratory development phase. Our basic assumption for the prototypes is that the Internet has completely evolved into a world of distributed objects and that an architecture is needed to construct a world of categorized concepts on top of this. Thus our prototypes assume the existence of an Internet-wide object-oriented operating system that we simulate by using distributed Smalltalk over limited wide-area networks. This conceptual architecture is shown in Figure 2 above. The algorithms will be incorporated in the prototype as a set of CORBA services comprising the semantic interoperability middleware layer. This in turn permits the generation of advanced semantic indexes. These indexes permit a conceptual overlay on the distributed heterogeneous object collections. The Interspace prototype will support community repositories consisting of objects with interlinks to form the information space. Indexing is performed on the collection of objects to generate a space of concepts that can be searched for interactive retrieval. Search across spaces can be performed by vocabulary switching. Intersecting spaces across similar concepts to provide suggestions for how different terms in different subjects correspond to the same concepts. Retrieval can thus be performed on a semantic basis and analysis can be supported by retrieval across multiple spaces. Figure 3 An example of the semantic index generation process. As a concrete example of the capabilities of the Interspace prototype, we illustrate the process of generating the various semantic indexes. The prototype includes standard full text indexing and meta-data searching. Our innovative approach based upon the semantic middleware services is to combine these standard techniques with advanced statistical methods to produce a set of higher level indexes. These indexes give the user a conceptual view of the raw objects. The category map and concept spaces are examples of these conceptual level indexes. The category maps create a hierarchical visual map of the underlying relationships between the objects in the collection. The concept space produces a series of terms and their relationships. These indexes can be used to correlate across different object types (e.g. images and text). Additionally, the user can create specialized categorizations of both personal and shared collections. The example above shows the synergistic effects of all of the indexes when combined into a single semantic interoperability framework. When the Interspace prototype is fully operational, the users will be able to routinely generate semantic indexes for their own personal collections, such as the email message illustrated in Figure 3, as well as access semantic indexes generated off-line for large archival collections, such as those in engineering and science. Semantic indexes will be available at the category, concept, and object levels on a production basis. By having the infrastructure automatically generate semantic indexes at multiple levels for all repositories and by having a transparent multiple view interface to play the indexes off against each other during interactive search sessions, a practical system for semantic interoperability will be constructed. Figure 4 Current Interspace prototype software configuration. We have developed a demonstration of the Interspace prototype using leading edge object technologies. The demonstration integrates a static version of the concept space, category map and full-text search CORBA services. This combination allows the user to quickly navigate the different semantic levels of the INSPEC, Compendex and a small e-mail based design patterns community. Note the use of the multiple CORBA 2.0 compliant ORBs using the IIOP (Internet Interoperability Protocol) as well as multiple object databases. This prototype was demonstrated at DARPA for ITO in October 1996. The underlying software infrastructure for this demonstration prototype illustrates our strategy of simulating the forthcoming Internet-wide operating system by using existing more local object-oriented commercial software. The custom infrastructure that we are developing enables users to interact with multiple views, i.e. multiple windows each displaying an index from a different document collection or a different semantic level. The current demonstration has the results from the Compendex/Inspec text scalability experiment, with 10 million abstracts across 1000 community repositories used to generate concept spaces for each repository. These are all currently loaded into the ObjectStores and available via CORBA across the network. The user can navigate within a space at the concept level or the document level then navigate across spaces as appropriate. For example, they can find desired terms in the concept space, by starting from broad terms then traversing into narrow terms specific to that document collection. Then they can move across into document space to perform a full-text search, by simply dragging the concept term into the document space search line. If they wish to search a subject domain they are not familiar with, they can navigate within one concept space among familiar terms, then choose another concept space for the unfamiliar domain and navigate across spaces, based on common terms, to locate desired terms to search for. This form of interactive vocabulary switching by space navigation is the reason behind calling the system the Interspace. The demonstration already illustrates for large text collections this form of semantic interoperability. Figure 5 shows a session of the current demonstration system. The upper window displays abstract indexes for categories and concepts, while the lower window display concrete indexes for document collections. The pane in the upper left of Figure 5 shows an integrated list of categories from the INSPEC, Compendex and personal collections. The user has navigated a concept space for the repository for Software Engineering Techniques - finding "revision control system" after several traversals through related terms from "object database". They then did a document search by dragging the term from the concept level to the object level in the bottom window. Looking through the retrieved documents, they found a more specific term "complex object" within one document and dragged that term back up to the concept level to locate related terms within Software Engineering Techniques.. The related terms for "complex object" are displayed within the middle pane. Note these include both broader and narrower terms, as well as names such as authors. Interactive vocabulary switching across subject domains is then performed by selecting another repository for Object-Oriented Programming, switching across domains on the common term "complex object", and listing the related terms for that term within the new repository. The new concept space can now be traversed to find more desired terms to search for. Such a fluid flow across levels and subjects supports semantic interoperability. PLAN We have already designed a preliminary architecture for the Interspace [Schatz, 1995] and are just beginning the first research demonstration. Funding this proposal will enable us to move from a feasibility demonstration into a fully-fledged research prototype suitable for experimental use. Our basic assumption for the prototypes is that the (future) Internet has completely evolved into a world of distributed objects and that an architecture is needed to construct a world of categorized concepts on top of this. Thus our prototypes assume the existence of an Internet-wide object-oriented operating system that we simulate by using distributed Smalltalk with CORBA over limited wide-area networks. The software environment for the Interspace prototype will be developed in ParcPlace VisualWorks, the industry standard Smalltalk environment. The repositories will be simulated using the Versant object-oriented database and distributed over wide-area networks using the CORBA implementations provided in Smalltalk by DNS and in C++ by Orbix. The project is organized as follows: Bruce Schatz will be the Principal Investigator responsible for all aspects of the project. His laboratory at the University of Illinois will perform the systems integration and produce the environment and the demonstrations. He is the lead for architecture specification. Schatz is an experienced manager of large research system projects, e.g. he is the PI of the Illinois NSF/ARPA/NASA DLI project ($1M/year for 4 years) which has done some of the foundational research for this project. Charles Herring will be technical lead on building the prototype Interspace environment. He has himself managed ARPA projects of this same scale and technical complexity. He will be responsible for the day-to-day activities (Herring will do the tactics and Schatz the strategy). Hsinchun Chen from the University of Arizona will be the technical lead on specifying and implementing the semantic algorithms. He has closely collaborated with Schatz for seven years on semantic algorithm design, across both the WCS and the DLI projects, as they together developed the concept space technology. New algorithms and new spaces have been transferred into the working research prototypes in both the WCS and the DLI projects, as proposed to continue doing here. Chen's lab will develop the specific semantic software that Schatz's lab directed by Herring integrates into the overall framework to produce complete systems prototypes. Each of these labs already has a team in place working on these problems, who are technically expert for each component. The purpose of this proposal is to guarantee them multi-year support so that a complete Interspace prototype can be built. The Illinois team is centered around building the system. Part of the team is building the semantic services by integrating the algorithms from Arizona into the kernel, e.g. algorithm optimization on parallel computers (Bill Pottenger), computational linguistics and document classifications (Nuala Bennett), neural networks for semantic mapping (Yi-ming Chung). Part of the team is building the kernel that supports multi-view user interfaces and semantic federation across sources, e.g. architecture framework and distributed objects (Kevin Powell), user interfaces and Smalltalk implementations (Les Tyrrell), network interfaces and object-oriented databases (Conrad Chang). Part of the team is displaying the images and maps (Dan Pape). The Arizona team is centered around developing the algorithms. Part of the team is working on object types particularly image analysis, e.g. texture concept spaces from maps (Wei Ma). Part of the team is working on category spaces, e.g. automatic maps and visualization (Marshall Ramsey). Part of the team is working on semantic mapping, e.g. vocabulary switching and spreading activation (Dorbin Ng). Part of the team is working on value filtering, e.g. genetic algorithms and multi-dimensional scaling (Dmitri Roussinov). Duncan Lawrie, Bob McGrath, and Bill Pottenger will work together on developing and implementing algorithms to accomplish real-time semantic retrieval in both software and hardware. These researchers are all at the University of Illiois and represent significant expertise in software and hardware design for distributed applications on parallel computers
|
4.
DESCRIPTION OF RESULTS, 7. TECHNICAL RATIONALE, APPROACH, AND PLAN 8. COMPARISONS TO OTHER RESEARCH
|
|
|
|
........ | ||