Concept Extraction in the Interspace Prototype

Nuala A. Bennett, Qin He, Conrad Chang, Bruce R. Schatz

Digital Library Initiative (DLI) Project
CANIS : Community Systems Laboratory
University of Illinois at Urbana-Champaign

704 S. Sixth Street, Champaign, IL 61820
E-mail: {nabennet, hqin, t-chang2, schatz}
http://www.canis.uiuc.edu

Abstract

This paper describes concept extraction in the Interspace Research Project. Four parsers for noun phrase extraction were compared: FastNPE, NPtool, Chopper, and AZ Phraser. FastNPE was found to be the fastest of the parsers, and NPtool the most accurate in extracting noun phrases. Both were subsequently built into the Concept Extractor module of the Interspace Prototype, which is described in detail. Future work on the Concept Extractor will include image concept extraction, which is described in the final section.

KEYWORDS: Interspace, concept space, noun phrase extraction, concept extraction

1. Introduction

    The Interspace Research Project [9] is developing a prototype environment for semantic indexing of real-world information in a testbed of real collections. The semantic indexing relies on statistical clustering for concepts and categories. Interactive navigation, based on semantic indexing, enables information retrieval at a deeper level than previously possible for diverse large collections. We are developing algorithms for computing Concept Spaces and Category Maps and for performing Concept Assignment, and are testing their utility on engineering literature, medical literature, and map images. The Interspace Prototype will thus enable scalable, interactive semantic interoperability across subject domain, media type, and collection size.

    The Interspace analysis environment seeks to unify disparate distributed information resources in one coherent model. It provides rich support for complex interoperable applications. Standard services include inter-object linking, remote execution, object persistence, and support for compound objects (usually referred to as compound documents).

    The Interspace Prototype serves as a testbed in which several large, real-world collections are available for experimental usage. To date, we have obtained the following collections: INSPEC (approx. 3M abstracts across Computer Science, Physics, and Electrical Engineering); Compendex (approx. 2.6M abstracts across all of engineering); CancerLit (approx. 600,000 abstracts from MEDLARS, dealing with cancer); and Patterns, a community repository of a software engineering discussion list.

    The Interspace Prototype is based upon a layered system design. Figure 1 outlines the Interspace System Architecture, and shows how the various services of the Interspace interact with one another. The service layer supports core functionality required by the kernel. The four major services are the Concept Extractor, the Concept Space service, the Category Map service, and the Concept Assignment service.

    The Concept Extractor (CE) interacts directly with the three other services listed, and these are summarized in the following sections. The CE uses a noun phrase extractor. Over many years, we have built heuristic parsers to extract phrases from text in many subject areas. Each parser was tailored to a particular subject area. These experiences lead us to believe that it is possible to build a generic noun phrase parser that will extract terms from collections, no matter what the field of study.

    Interspace Architecture Diagram

    Figure 1: Interspace System Architecture

In Section 2, we treat the evaluation of four noun phrase parsers in the context of the Interspace Prototype. Sections 3 and 4 present the implementation of the CE in the Interspace Prototype. At its current stage of development, the CE has been implemented fully for text documents. The final section of the paper discusses Future Improvements, and the imminent implementation of image documents into the CE.

1.1. Concept Space Service

The purpose of the Concept Space service is to automatically generate domain-specific thesaurus subsets, which represent the concepts and their associations in the underlying information corpus. Concept Space generation is based on a statistical co-occurrence analysis which captures the similarity between each pair of concepts [3]. The greater the similarity between concepts, the more relevant they are to one another. Concept Spaces are used in a retrieval environment to assist users in performing functions such as term suggestion [4], [11]. The Concept Extractor functions to extract concepts from corpora for which Concept Spaces are to be generated.
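As a rough illustration of co-occurrence analysis, the following Python sketch counts how often two concepts appear in the same document and turns the counts into a symmetric similarity score. The Jaccard-style weighting, the function name `cooccurrence_similarity`, and the toy documents are assumptions made for this example; the weighting actually used for Concept Space generation is the one described in [3].

```python
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(docs):
    """Return {(a, b): similarity} for every concept pair that co-occurs.

    docs: list of concept lists, one list per document.
    Similarity is Jaccard: docs containing both / docs containing either.
    """
    doc_freq = Counter()   # number of documents containing each concept
    pair_freq = Counter()  # number of documents containing both concepts
    for concepts in docs:
        unique = sorted(set(concepts))
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return {
        (a, b): n / (doc_freq[a] + doc_freq[b] - n)
        for (a, b), n in pair_freq.items()
    }


# Toy corpus, invented for illustration.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "protein"],
    ["cell", "tumor"],
]
sims = cooccurrence_similarity(docs)
```

Pairs are keyed in alphabetical order, so the score for "gene" and "protein" is found under `("gene", "protein")`. Concepts that always appear together score 1.0, and pairs that never co-occur are simply absent from the result.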

1.2. Category Map Service

The Category Map service classifies an information corpus into different conceptual categories using a variant of Kohonen's self-organizing feature map (SOM) [7]. The SOM is a neural network which serves as a vector quantizer to map high-dimensional feature vectors onto a two-dimensional grid. A multi-layer Category Map is used to form a hierarchical set of categorizations for large corpora. The CE functions to extract concepts from corpora for which Category Maps are to be generated. Several techniques have been incorporated in this service to assist the user in visualizing the hierarchical categorization. The service transforms two-dimensional output maps into three-dimensional terrains which users can navigate via information spaceflight to investigate clusters of semantic locality.

1.3. Concept Assignment Service

The Concept Assigner performs automatic subject indexing based on a variant of a Hopfield network [6]. The Concept Assigner first creates a Concept Space, which serves as the Hopfield network of concepts (nodes) and their associations (weights) within a corpus. To automatically index an individual document, the CE extracts concepts from the document to be indexed, and these concepts become the input pattern to the network. After completion of the Hopfield net parallel spreading activation process, the network outputs another set of concepts that are strongly related to the concepts of the input document. Because the initial Concept Space contains knowledge obtained from the entire collection, the system is able to find a set of global concepts drawn from the entire collection without being restricted to those concepts present in the given document. These concepts are analogous to concept descriptors (i.e., keywords) of the document.
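The idea of spreading activation from a document's concepts to related concepts can be sketched as follows. This is a deliberately simplified loop and not the Hopfield variant of [6]: the update rule (strongest path wins), the threshold, the function name `spread_activation`, and the toy weights are all assumptions made for illustration.

```python
def spread_activation(weights, seeds, threshold=0.3, max_iters=20):
    """Activate concepts related to `seeds` through weighted links.

    weights: {source: {target: weight in [0, 1]}} taken from a concept space.
    seeds:   concepts extracted from the document (activation fixed at 1.0).
    Returns the activated non-seed concepts, strongest first.
    """
    activation = {c: 1.0 for c in seeds}
    for _ in range(max_iters):
        updated = dict(activation)
        for source, level in activation.items():
            for target, w in weights.get(source, {}).items():
                # Keep the strongest path found so far to each concept.
                updated[target] = max(updated.get(target, 0.0), level * w)
        # Drop weakly activated concepts so the spread stays local.
        updated = {c: a for c, a in updated.items() if a >= threshold}
        if updated == activation:  # nothing changed: converged
            break
        activation = updated
    related = [c for c in activation if c not in seeds]
    return sorted(related, key=lambda c: -activation[c])


# Toy association weights, invented for illustration.
weights = {
    "tumor": {"cancer": 0.9, "cell": 0.4},
    "cancer": {"chemotherapy": 0.5, "tumor": 0.9},
    "cell": {"membrane": 0.5},
}
related = spread_activation(weights, ["tumor"])
```

In this toy run the document mentions only "tumor", yet "chemotherapy" is suggested because it is reachable through "cancer". This mirrors the point made above that the suggested concepts need not occur in the document itself.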

2. Noun Phrase Extraction

At the beginning of the Interspace implementation, we used a noun phrase extractor originally developed by our close collaborators in the AI Lab in the Department of MIS at the University of Arizona, which we have named FastNPE [6] (Fast Noun Phrase Extractor). FastNPE runs reasonably efficiently and gives results which initially appear acceptable to users of the Interspace system. In our endeavors to improve concept space generation, however, several other noun phrase extractors were evaluated against it. For concept space generation, the extracted terms should be as complete and accurate as possible. FastNPE is fine-tuned for a particular subject area each time a concept space is run for a particular collection of documents. It operates by using a stopword list, and this list is adjusted for each collection. If we were to have twenty collections, this approach might require twenty stoplists. It would be preferable to have a cross-discipline noun phrase extractor that would not need to be fine-tuned for any collection, so we looked for another noun phrase extractor that could handle cross-discipline collections.

In order to evaluate other possible noun phrase extractors, we first looked to see what natural language parsers were available that we could possibly use for our concept space generation. Three were thoroughly tested and compared with FastNPE. These were NPtool [14], which is commercially available from Lingsoft in Finland; Chopper from MIT's Machine Language Understanding Group; and the AZ Phraser [13], a new parser from the MIS group at the University of Arizona. The sample set of documents used for the parser evaluation was taken from the large testbed available for the Interspace project. It included ten full-text articles from the IEEE Computer Society journals, and forty abstracts, evenly distributed among INSPEC, Compendex, CancerLit, and Patterns. This data was taken as a representative sample of the types of texts that would normally be used for concept space generation, being from professional, technical fields, and covering several disciplines.

According to Quirk et al. [8], a noun phrase typically functions as the subject, object, and complement of a clause, and as the complement of a prepositional phrase. Using this definition of a noun phrase, each phrase found in the sample texts was marked. Despite Quirk's rigid definition of a noun phrase, some flexibility had to be given to each parser and its noun phrase extraction. For example, in the simple sentence, "He has red and blue books", "red and blue books" can be interpreted as an entire noun phrase describing books, each of which is of two colors, but also, two different groups of books - "red books" and "blue books". Most automatic parsers do not yet have the capability of the second interpretation, and would also in fact leave out the first adjective, so that "blue books" and its shorter phrase, "books", would be the only terms extracted. Therefore, when evaluating the parsers, none was penalized for not including the longest phrase in such cases.

2.1. FastNPE

In order to extract terms from a text document, FastNPE [6] operates by using a stopword list and a stemlist, combining these lists, and comparing the final list of words with all of the words found in the document. The stopword list contains words typically found in standard stopword lists, such as "a", "the", "by", or "to". It also contains almost 3,500 verbs, adverbs, and adjectives. The stemlist, which is a much shorter list, includes suffixes for verbs and adjectives, such as -ing, -es, or -ly. Each word found in the combined list is "removed" from the document, and the remaining words or phrases in the document are each assumed to be valid terms for concept space generation. From the remaining terms, multi-word terms are further processed to extract the various shorter terms that are contained within the longest term. Figure 2 shows the result of an example sentence processed by FastNPE.

HSNs                 dorsally placed
abnormal             dorsally placed HSNs
abnormal positions   e.g
anteriorally         head
distal               placed HSNs
distal tip           positions
dorsally             tip

Figure 2: Phrases extracted by FastNPE from the sample sentence, "This results in some very abnormal positions (e.g. anteriorally and dorsally placed HSNs, or distal tip cells in the head)."
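The stopword-driven approach described above can be sketched in a few lines of Python. The stoplist and suffix list here are tiny invented stand-ins for the real lists, the function names are hypothetical, and enumerating every contiguous sub-term is an assumption about how the shorter terms are derived. The sketch is not expected to reproduce the output in Figure 2.

```python
import re

STOPWORDS = {"this", "in", "some", "very", "and", "or", "the"}  # toy stoplist
STEM_SUFFIXES = ("ing", "ly")  # toy stemlist: words with these endings are removed


def is_removed(word):
    w = word.lower()
    return w in STOPWORDS or w.endswith(STEM_SUFFIXES)


def extract_terms(text):
    """Return maximal runs of kept words plus every shorter contiguous sub-term."""
    terms, run = set(), []
    # Punctuation also ends a run, so split into word and punctuation tokens.
    # The trailing "." makes sure the last run is flushed.
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text) + ["."]:
        if token.isalpha() and not is_removed(token):
            run.append(token)
            continue
        for i in range(len(run)):
            for j in range(i + 1, len(run) + 1):
                terms.add(" ".join(run[i:j]))
        run = []
    return terms


terms = extract_terms("distal tip cells in the head")
```

Because everything that is not removed counts as a term, the quality of the output depends entirely on the stoplist, which is why the list has to be tuned for each collection.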

2.2. NPtool

NPtool [14] was specifically designed as a noun phrase extractor. It has two main operations, the first being where each term of the text is given a context-free description and part-of-speech tag, or tags, and the second where the actual noun phrases are output. Testing was undertaken by electronic mail, using an automatic program that Lingsoft has made available for users to evaluate NPtool.

(this <*> PRON DEM SG @NH)

(result <SV> <P/in> V PRES SG3 VFIN @V)

(in PREP @AH)

(some <Quant> DET CENTRAL SG/PL @>N)

(very ADV AD-A> @AH)

(abnormal A ABS @>N)

(position N NOM PL @NH)

lparen

(e.g. ADV ADVL @AH)

(anteriorally <?> <Nominal> A ABS @NH)

(anteriorally <?> ADV @AH)

(and CC @CC)

(dorsal <DER:ly> ADV @AH)

(place <SVO> <SV> PCP2 (:OR @>N @V))

(hsns <*> <?> <NoBaseformNormalisation> N NOM SG/PL @NH)

comma

(or CC @CC)

(distal A ABS @>N)

(tip N NOM SG @>N)

(cell N NOM PL @NH)

(in PREP @AH)

(the <Def> DET CENTRAL ART SG/PL @>N)

(head N NOM SG/PL @NH)

rparen

fullstop

Figure 3: Sentence (from Fig. 2) tagged by NPtool

An example of the tagging, on the sample sentence from Figure 2, is shown in Figure 3. In the second operation of NPtool, unacceptable or illegitimate tags are eliminated, and the final output is made up of two lists of noun phrases which are marked either "ok" or "?". This second pass runs in two parallel stages, an NP-friendly parser and an NP-hostile parser. The results of these two parses are subsequently intersected. The "ok" noun phrases are those where both parses agree, while the phrases marked "?" are those where the two parsing stages did not concur, normally because of ambiguity in a particular phrase of the text. For example, in the example sentence where "anteriorally" is given two tags (see Figure 3), a choice would have to be made whether or not to keep it as part of a noun phrase. The final list of extracted noun phrases is shown in Figure 4.

ok: tip cell

ok: position

ok: hsns

ok: head

ok: distal tip cell

ok: cell

ok: abnormal position

?: placed hsns

 

Figure 4: Noun phrases extracted by NPtool from sample sentence shown in Fig. 2
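The "ok" and "?" labels can be viewed as a set comparison between the two parses. The sketch below illustrates only that comparison; the function name `label_noun_phrases` is hypothetical, and the two candidate lists are invented, since NPtool's intermediate NP-friendly and NP-hostile results are not shown here.

```python
def label_noun_phrases(np_friendly, np_hostile):
    """Label phrases 'ok' when both parses agree, '?' when only one proposes them."""
    friendly, hostile = set(np_friendly), set(np_hostile)
    labels = {phrase: "ok" for phrase in friendly & hostile}
    labels.update({phrase: "?" for phrase in friendly ^ hostile})
    return labels


# Invented candidate lists for the two parses.
friendly = ["abnormal position", "placed hsns", "hsns", "head"]
hostile = ["abnormal position", "hsns", "head"]
labels = label_noun_phrases(friendly, hostile)
```

Here "placed hsns" is proposed by only one parse, so it is marked "?", while the remaining phrases are marked "ok".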

2.3. Chopper

Chopper, developed by Dr Ken Haase at MIT, will parse a text, breaking it down into constituent sentences or phrases. The input document can be sent to the parser through MIT's World-Wide Web page. The output text is received back in sentence-reverse order, each term of the text also having been tagged with its part-of-speech tag. Figure 5 shows the example sentence (from Figure 2), having been processed by Chopper. The sentence has been broken down into two phrases, and the output is in reverse order of the phrases. In cases where a word might belong to two phrases, the word will be repeated in the output (see lines 1 and 2 in the Figure).

Figure 5: Output from Chopper

2.4. AZ Phraser

The University of Arizona has been striving to improve the existing method of generating noun phrases, and has developed a new parser, which is known as the AZ Phraser [13]. The Phraser is made up of three parts. Firstly, the text is tokenized: punctuation that is adjacent to text words is detected, and irrelevant punctuation marks are separated from the actual text of the document. Next, each word of the text is tagged with a part of speech, using the tagger developed by Dr. Eric Brill [1], [2]. Finally, using a set of grammatical rules for noun phrases, a list of noun phrases is output. The noun phrase rules used at this point were those also used by NPtool in its "NP-friendly" parser, and are listed in [14]. Figure 6 shows the terms extracted from the sample sentence by the AZ Phraser.

results              HSNs
abnormal positions   distal tip cells
e.g                  head

Figure 6: Output of parsed sentence (from Fig. 2) by AZ Phraser

2.5. Evaluation of Term Extraction

As well as actual term extraction, a number of other processes must be undertaken to generate a concept space, and these are discussed further in Section 4. Of these processes, however, term extraction has been found to take the most time. As its name implies, FastNPE has been found to be quite efficient in terms of processing speed. On SGML-tagged full-text articles from several IEEE magazines, such as "Computer" and "Annals of the History of Computing", term extraction with FastNPE was considerably faster than with NPtool, as shown in Figure 7.

Despite the efficiency of FastNPE in terms of processing speed, it is apparent that many of the phrases that are extracted are not correct noun phrases. In previous experiments, and depending on the type of text, this noise was removed by editing stopword lists. This has meant having professionals from a particular field of study check the stop lists to see if they are appropriate. However, as the number of documents in the collections grows, it would be preferable to automate the entire process of generating the phrases, and not have to do any manual checking.

The noun phrases extracted by each parser were compared against the marked phrases in each document. During the evaluation of the parsers on our texts, a number of points arose. Firstly, each text document must be tokenized correctly before the noun phrases are extracted. In pre-processing of a text, NPtool tokenizes much of the text, as does the AZ Phraser, but Chopper did not appear to tokenize as efficiently. Tokenization is a particularly important issue for the types of texts that are used for concept space generation. For example, in biology texts which might discuss "c.elegans", if that term were tokenized incorrectly into "c" and "elegans", the organism name would be destroyed.
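To make the tokenization issue concrete, here is a minimal sketch of a tokenizer that separates punctuation from words but leaves periods and hyphens inside a term intact. The regular expression and the function name `tokenize` are assumptions for illustration and do not describe how any of the four parsers tokenizes.

```python
import re

# A token is a run of letters or digits that may contain internal periods or
# hyphens (as in "c.elegans" or "real-time"); any other non-space character
# becomes a separate punctuation token.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[.\-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text):
    return TOKEN.findall(text)


tokens = tokenize("Mutants of c.elegans were studied.")
```

The sentence-final period is split off because nothing follows it, whereas the period inside "c.elegans" is kept. A rule this simple would wrongly join two sentences that are separated by a period with no following space, so it is only a starting point.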

Another important issue is normalization of the texts, where the root form of a word is extracted, or words which were capitalized because they were at the beginning of a sentence are changed to lower case. NPtool normalizes the text entirely, changing everything to lower case, reducing all nouns to their singular form, and reducing other parts of speech such as verbs to their root form. This is acceptable most of the time, but it causes a problem for acronyms, which should not be changed to lower case. Chopper does not normalize texts in any way, although for some terms, it attempts to also output the normalized form of a word in parentheses after the part-of-speech tag. The AZ Phraser does not currently normalize the texts, but it is understood that there are plans to include normalization in future development.

 

Number of articles   NPtool Processing Time (sec.)   FastNPE Processing Time (sec.)
50                   503.07                          20.23
75                   546.06                          26.21
103                  708.97                          34.54
268                  2176.5                          92.89
915                  7686.17                         296.08

Figure 7: Performance of FastNPE and NPtool on IEEE full-text articles on a single-processor 200MHz Sun Ultra Sparc 2 with 256 MB RAM
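The size of the gap in Figure 7 is easier to see as a ratio. The short Python calculation below uses only the timings reported in Figure 7; the variable names are ours.

```python
# (articles, NPtool seconds, FastNPE seconds), copied from Figure 7
timings = [
    (50, 503.07, 20.23),
    (75, 546.06, 26.21),
    (103, 708.97, 34.54),
    (268, 2176.5, 92.89),
    (915, 7686.17, 296.08),
]

for articles, nptool, fastnpe in timings:
    ratio = nptool / fastnpe
    print(f"{articles:4d} articles: NPtool took {ratio:.1f}x as long as FastNPE")

ratios = [nptool / fastnpe for _, nptool, fastnpe in timings]
```

Across the five runs, NPtool took roughly 20 to 26 times as long as FastNPE, and the ratio does not shrink as the number of articles grows.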

 

            FastNPE   NPtool   Chopper   AZ Phraser
Recall      50%       95%      97%       92%
Precision   80%       96%      90%       86%

Figure 8: Recall and precision results

 

Using well-known terms taken from information retrieval literature, the parsers were evaluated using "recall" and "precision" measures. For the purposes of the evaluation, recall was defined to be the number of noun phrases correctly identified by the parser, divided by the number of actual noun phrases. Precision was taken to be the number of words, which were identified as nouns by the parser, divided by the total number of nouns identified in the document.
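For readers who want to repeat this kind of comparison, the two measures can be computed from a parser's output and a manually marked list as sketched below. Recall follows the definition given above. Precision here follows the usual information-retrieval definition (correct phrases divided by all phrases the parser returned), which is an assumption and may differ in detail from the count described above. The function name and the two example lists are invented.

```python
def recall_precision(extracted, marked):
    """Compare a parser's phrases with the manually marked noun phrases."""
    extracted, marked = set(extracted), set(marked)
    correct = extracted & marked
    recall = len(correct) / len(marked) if marked else 0.0
    precision = len(correct) / len(extracted) if extracted else 0.0
    return recall, precision


# Invented example: four marked phrases, four extracted, three in common.
marked = ["abnormal positions", "HSNs", "distal tip cells", "head"]
extracted = ["abnormal positions", "HSNs", "head", "e.g"]
recall, precision = recall_precision(extracted, marked)
```

A parser that returns many spurious phrases keeps its recall but loses precision, while a parser that misses phrases loses recall, which is why both figures are reported.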

As can be seen in Figure 8, NPtool had the highest precision (96%), and its recall (95%) was within two percentage points of the best result (97%, for Chopper), giving it the best overall combination of recall and precision. Consequently, it was chosen to extract noun phrases for concept space generation in the Interspace. However, the main disadvantage of NPtool is that it is a commercial system, and its source code is not available, although a binary version is sold on a year-by-year basis for research purposes. The AZ Phraser showed great potential for development and use in generating good noun phrases with minimal noise in the output phrases. While it was not ready for production use this year, with some further development it could be improved to a point where it can be used instead of NPtool. Using NPtool, or the AZ Phraser, will allow generation of concept spaces for universal subject areas without the user having to adjust stop lists manually in the future.

As shown in Figure 7, NPtool is considerably slower than FastNPE. As a result, it was decided to give the user the option of saving processing time, when correctness of extracted phrases is not as important, and to have both FastNPE and NPtool available to the user who must ultimately decide which one to use. Both parsers were included in the CE module, which is described in the following sections.

3. Concept Extractor

The ultimate goal of the Concept Extractor is to extract concepts by subject and by media type, such as text, image, and audio. Thus far, the Concept Extractor (CE) has been implemented fully for text collections. It has been tested on many of the text collections available to the Interspace project. In the future, it will be implemented for image and other media. In Section 5, we have included some discussion of how we are planning to implement concept extraction for image collections. In this section, we discuss the implementation of the Concept Extractor, shown in Figure 9, specifically for text collections.

The main task of the CE is to extract concepts from documents, and then to create a concept space, which is a critical part of the Interspace. Each document from a collection, such as text, image, or audio, is considered to be a class of source unit. When a document, or source unit, is about to be processed by the CE, the user will have already specified the document attributes in an earlier module of the Interspace. The attributes of a source unit include Type, Policy and Setting. Type specifies the format of a source unit, for example SGML or HTML in the case of text documents. If the source unit were an image, the format might be gif, tif or jpeg. The Policy specifies the media parser for a particular source unit, according to its format type. In addition, a Policy specifies which parts, or fields, of a document (such as title, abstract, or author) would be useful for the concept extraction and computation, and the names of these fields are saved into a list. This information will be pertinent for different types of user.

The Setting specifies values for the extraction of concepts and computation of concept space. These values include document collection size, maximum concept size, option to create new concepts, quality, and so on. Setting the collection size is beneficial in helping the Concept Extractor determine an optimal result set. Maximum concept size is used to determine the maximum size (length, in the case of text) of the extracted concept. If the current source unit is a text document, then Quality will be used to specify the quality of the noun phrase extractor - FastNPE, or NPtool. The final output of the Concept Extractor includes a list of concepts.

Figure 9: Structure of the CE (superclasses and subclasses are linked with dashed arrows)

3.1. Concept Extraction in Text Documents

 

In order to process the document source unit, several functions have been implemented. The basic process is as shown in the diagram in Figure 10. According to the Policy, the system will call the appropriate media parser to pre-process the source unit. In the case of the media parser for text source units, the main function is to convert documents from a particular format, such as an SGML-tagged document, into a clear raw text. At the same time, the document is divided into several parts and saved into a list of parse nodes.

The next step is to extract noun phrases from the list of fields according to the Setting. In the Setting of the source unit, the user will specify the Quality required from the noun phrase extractor, and the CE will call the corresponding extractor according to the level of quality desired. For higher values, NPtool will be called. In addition to the extracted noun phrases, FastNPE outputs the frequency of each noun phrase in a document. NPtool does not automatically count the frequency of occurrence of the phrases extracted, so a separately developed function has been implemented for that purpose (see Section 4.3). The final output is a list of noun phrases and their frequency of occurrence.

The final step in this process is the actual concept extraction, which is done by the concept extractor. The system retains a list of global concepts, which is compared with the list of extracted noun phrases. If a noun phrase is found in the global concept list, it will be saved in the extracted concept list of the current source unit. If it is not found, a new concept will be created for it, or it will be inserted into a noun phrase list. This is determined by the value of the option Create in the setting of the current source unit. At this stage, the source unit will get a list of concepts, and a noun phrase list. The list of concepts is used for the final computation of the concept space.
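The comparison against the global concept list can be sketched as follows. The function name, the dictionary-based data structures, and the choice to add a newly created concept to the current source unit's concept list are assumptions made for this illustration; only the role of the Create option is taken from the description above.

```python
def extract_concepts(noun_phrases, global_concepts, create=True):
    """Split a source unit's noun phrases into concepts and leftover phrases.

    noun_phrases:    {phrase: frequency} produced by the noun phrase extractor.
    global_concepts: set of known concepts; extended in place when create=True.
    """
    concepts, leftover = {}, {}
    for phrase, freq in noun_phrases.items():
        if phrase in global_concepts:
            concepts[phrase] = freq
        elif create:
            global_concepts.add(phrase)  # promote the phrase to a new concept
            concepts[phrase] = freq
        else:
            leftover[phrase] = freq      # keep it only as a noun phrase
    return concepts, leftover


# Invented example data.
known = {"concept space", "noun phrase"}
phrases = {"concept space": 3, "stopword list": 1}
concepts, leftover = extract_concepts(phrases, known, create=False)
```

With `create=False` the unknown phrase "stopword list" stays in the noun phrase list. With `create=True` it would instead be added to the global list and returned as a concept.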

 

4. Concept Extraction Development

The development of the CE involved much experimentation and testing on collections. Based on experience during this development phase, the decisions for concept extraction discussed in the following sections were made.

Figure 10: The CE Process

4.1. Field Names

For text documents, when the media type is being parsed, not all fields are parsed fully. Firstly, the field names are specified in the policy as a list of strings, which is fixed for each Policy class. For example, for SGMLPolicy, the field list includes title, author, keyword (or thesaurus term), abstract, body and citation. For HTMLPolicy, only the title and body fields are kept. The fields that are kept are determined by the format of the documents in a particular collection, taking note of the field tags, for example, SGML tags such as <ti>, or <au>. While testing the Concept Extractor, several concept spaces were created, based on the different field tags. It has been found, from valuable user feedback, that some fields are more useful to the computation of concept space than others. Even within one field, different parts can have different effects on the quality of the concept space. For example, in the citation field, the titles of the cited references are retained, but no author or source journal name is kept.

4.2. Thesaurus Terms

When the noun phrase extraction process is being run, the author fields and the thesaurus terms are retained automatically, rather than having phrases extracted from them. For the author field, it is intuitively obvious that an author name is also a noun phrase, and not running a noun phrase extractor on author names saves some processing time. For thesaurus terms, we know that some terms, such as "object-oriented", or "real-time" (from the INSPEC thesaurus) are not actual noun phrases. If the noun phrase extractor were to be run on such phrases, they would not appear in the noun phrase list, yet they are evidently useful for generating a good concept. Therefore, the concept space will be more useful to the user if such phrases are kept, and skipping noun phrase extraction on such terms also saves processing time and computation.

4.3. Noun Phrase Frequency

NPtool does not give any information about the frequency of occurrence of the extracted noun phrases. This information had to be derived from the files of phrases and their part-of-speech tags created by NPtool. When NPtool is run on a document, there are two output files. The first is the .cgp file, which lists all of the words in the document with their part-of-speech tags. The second file is the one which actually contains the noun phrases.

From the information given in the .cgp file, a noun phrase can be created by linking each word with a "@NH" tag with the words before it which have a "@>" tag, and the words after it which have a "@<" tag. Based on the function of each word given in the .cgp file and the basic linking rules of NPtool, the noun phrases are constructed and their frequency is then counted. To ensure accuracy, these "calculated" noun phrases are compared with the noun phrases in the second file, and those phrases that are not found in the second file are discarded. This last step ensures that the integrity of NPtool is maintained. In the future, we believe that this method of calculating the noun phrase occurrence will also be useful when it is necessary to keep track of the position of a concept in a document, or to limit the size of a concept in future research. We have planned other improvements to the CE, and these are discussed in the following section.
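The linking rule can be sketched in Python as follows. The sketch assumes the .cgp file has already been read into (word, function tag) pairs, since the exact file layout is not reproduced here. The function names are hypothetical, the sample pairs are adapted from the tags in Figure 3 with one phrase repeated for illustration, and the shorter nested phrases that NPtool also reports (such as "tip cell") are not generated.

```python
from collections import Counter


def count_noun_phrases(tagged_words):
    """Build a noun phrase around each @NH head and count the phrases.

    tagged_words: list of (word, function_tag) pairs in text order.
    Premodifiers carry a tag starting with "@>", postmodifiers "@<".
    """
    counts = Counter()
    n = len(tagged_words)
    for i, (_, tag) in enumerate(tagged_words):
        if tag != "@NH":
            continue
        start = i
        while start > 0 and tagged_words[start - 1][1].startswith("@>"):
            start -= 1
        end = i
        while end + 1 < n and tagged_words[end + 1][1].startswith("@<"):
            end += 1
        counts[" ".join(w for w, _ in tagged_words[start:end + 1])] += 1
    return counts


def keep_accepted(counts, accepted_phrases):
    """Discard calculated phrases that are missing from NPtool's own phrase list."""
    return Counter({p: c for p, c in counts.items() if p in accepted_phrases})


tagged = [
    ("abnormal", "@>N"), ("position", "@NH"), ("in", "@AH"),
    ("distal", "@>N"), ("tip", "@>N"), ("cell", "@NH"), ("and", "@CC"),
    ("abnormal", "@>N"), ("position", "@NH"),
]
counts = count_noun_phrases(tagged)
```

Passing `counts` and the phrases from NPtool's second output file to `keep_accepted` corresponds to the final check described above.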

5. Future Improvements

The Concept Extractor is now fully operational for text collections, and it has been used to extract concepts from a collection of over 600,000 abstracts in CancerLit. We are continually working on many improvements to the design, some of which are discussed in Section 5.1. Section 5.2 deals with planned implementation of concept extraction from image collections.

5.1. Text Concepts

The noun phrase extraction is not run on author or thesaurus terms, as these are clearly terms which do not need to be parsed further (see Section 4.2). However, there exist documents where author-assigned terms are not in a standard thesaurus. This results in some pollution in the concept space, such as "cancer" and "cancers" appearing as different concepts in the concept space for CancerLit, when they should in fact be the same concept. Such cases will be pruned so that these discrepancies do not occur needlessly.

Finally for text collections, there is no limitation at present on the size of a concept, although it is possible to set a limit in the source unit Setting. All the concepts that are generated by the concept extractor are kept, no matter how long they are. This results in having many very long concepts, which are almost always too specific to be used in the search process by a user. Examples of such long phrases include "development of analytical design equation for gas pipeline" or "action shifting from computing to communication". Typically, a user does not use such a long concept to search in a collection, and it is probably not necessary to keep them in the concept space. The limitation of the size of a concept is not simply a limitation on the number of terms in one concept, and it will need much further research with users.

 

5.2. Image Concepts

The definition of a concept for an image collection is analogous to that for a text document collection. Just as in a text document, an image is a source unit processed by the CE. Instead of a set of noun phrases, however, the concepts for an image are a set of objects, which have been identified using both manual and automatic techniques. In the same sense that a concept (noun phrase) might include several words, each concept (object) in an image will include a set of visual features, such as the RGB values for color, and coarseness, contrast, and directionality for texture. All of these values will be stored as a feature vector associated with each concept.

The nature of the concepts and their associated feature vectors are related to the domain of the image collection. We are currently working with a collection of aerial photos, which includes approximately 11K US aerial photos at the University of California at Santa Barbara Map and Imagery Library. Each photo is about 55 MB in size, with 6224-by-7687 pixels, 1-meter ground resolution, and is in JPEG PBMT storage, based on 1983 standards. At 1-meter resolution, fine-grained geo-concepts such as houses, schools, factories, or parks, can be easily identified. The images will be segmented into feature regions containing texture tiles. Whereas text collections are segmented according to terms, based on noun phrases, image collections are segmented according to textures, or objects, based on orientation on pixel densities.

The concept extraction process for images is summarized as follows. Firstly, a concept (object) list is created manually. This step is to identify representative objects, which will form a part of the object list. Next, the visual feature vectors for each representative object must be computed, and these features are then linked with the location of the object in the image. The object list is stored for later use in object filtering of the actual images.

After this process, concepts are automatically extracted from images based on the object list created in the first step (described above). Each image is segmented into its constituent objects automatically using segmentation routines. The feature vectors of each such extracted object and its coordinates in the image are then saved. Next, each such extracted object is compared with the representative objects in the object list created at the beginning. This comparison consists of a calculation of the difference between two feature vectors. The extracted object is identified with the representative object to which it is most similar. This results in a list of concepts (objects) which have been extracted, identified and located in a given image. This list of objects with its location information will be used later to compute a Concept Space based on object proximity in the image. We will then have Concept Spaces for images, as well as for text documents.
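The identification step, in which an extracted object is matched to its most similar representative object, can be sketched as a nearest-neighbour lookup. Euclidean distance is used here as one plausible way to compute the difference between two feature vectors; it is an assumption, as are the function name and the three-value feature vectors.

```python
import math


def identify_object(features, representatives):
    """Return the name of the representative object closest to `features`.

    features:        feature vector of one segmented object.
    representatives: {object name: feature vector} from the manual object list.
    """
    return min(
        representatives,
        key=lambda name: math.dist(features, representatives[name]),
    )


# Invented feature vectors for two representative objects.
representatives = {
    "house": [0.8, 0.2, 0.1],
    "park": [0.1, 0.9, 0.3],
}
label = identify_object([0.7, 0.3, 0.2], representatives)
```

In practice the features would need to be scaled comparably before distances are taken, since color and texture values are measured in different units.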

The Concept Extractor has been applied to text documents, and is fully operational. It will shortly be implemented for image documents. We are optimistic that generic extraction procedures can successfully transform real-world heterogeneous objects into concepts. In the next few years, we plan to show that such procedures will scale across subject domains and media types for a wide variety of real-world materials. It will then be possible to build the Interspace, where users interact directly with spaces of concepts and categories, which are generated automatically from the normalized concepts produced by the extraction procedures. This new level of functionality will help bring information analysis to the Net.

ACKNOWLEDGEMENTS

This work is supported by the National Science Foundation, under NSF Cooperative Agreement No. IRI 94-11318. The authors would like to thank their colleagues at the AI Lab, Dept of MIS at the University of Arizona, Dr Hsinchun Chen, Dorbin Ng, and Kristin Tolle. They would also particularly like to thank Dr William M. Pottenger for his useful comments and help in preparing this paper.

REFERENCES

Brill, E., Marcus, M. Tagging an Unfamiliar Text with Minimal Human Supervision. Intelligent Probabilistic Approaches to Natural Language. Fall 1992 Symposium.

Brill, E. Some Advances in Transformation-based Part of Speech Tagging. Proc. Nat. Conf. on Artificial Intelligence, 1:722-727, 1994. AAAI, Menlo Park, CA.

Chen, H., Schatz, B.R., Yim, T., Fye, D. (1995) Automatic Thesaurus Construction for an Electronic Community System, Journal of the American Society for Information Science (JASIS), 46(3): 175-193.

Chen, H., Martinez, J., Kirchhoff, A., Ng, T., Schatz, B.R. (1997) Alleviating Search Uncertainty through Concept Associations: Automatic Indexing, Co-occurrence Analysis, and Parallel Computing. Journal of the American Society for Information Science (JASIS), to appear.

Chen, H., Ng, T.D., Martinez, J., Schatz, B.R. A Concept Space Approach to Addressing the Vocabulary Switching Problem in Scientific Information Retrieval: An Experiment on the Worm Community System. JASIS, 48(1): 17-31, 1997.

Chung, Y., Pottenger, W.M., Schatz, B.R. Automatic Subject Indexing using an Associative Neural Network. Submitted to DL Conference '98.

Kohonen, T. Self-Organization and Associative Memory. 3rd Edition. Springer-Verlag, Berlin Heidelberg. 1989.

Quirk, R. et al. A Comprehensive Grammar of the English Language. Longman, 1991.

Schatz, B.R., Mischo, W., Cole, T., Hardin, J., Bishop, A., Chen, H. Federating Diverse Collections of Scientific Literature, IEEE Computer 29: 28-36, 1996.

Schatz, B.R. Information retrieval in digital libraries: Bringing search to the Net. Science, 275 (5298) : 327-334, 1997.

Schatz, B.R., Johnson, E., Cochrane, P., Chen, H. (1996) Interactive Term Suggestion for Users of Digital Libraries: Using Subject Thesauri and Co-occurrence Lists for Information Retrieval, 1st Int. ACM Conference on Digital Libraries, Bethesda, MD, pp. 126-133.

Schatz, B.R., Chen, H. Building Large-Scale Digital Libraries. IEEE Computer, 29 (5): 22-26, 1996.

Tolle, K.M., Chen, H., Ng, T. Improving Concept Extraction from Text using Natural Language Processing Noun Phrasing Tools: An Experiment in Medical Information Retrieval. 1997. Submitted to Journal of the American Medical Informatics Association.

Voutilainen, A. NPtool: A Detector of English Noun Phrases. Proceedings of the Workshop on Very Large Corpora, Columbus, Ohio, 6/22/1993.