Yi-Ming Chung, Qin He, Kevin Powell and Bruce Schatz
CANIS - Community Architectures for Network Information Systems
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign, Champaign, IL 61820
{chung, hqin, powell, schatz}@canis.uiuc.edu
http://www.canis.uiuc.edu
As part of the Illinois Digital Library Initiative (DLI) project we developed ``scalable semantics'' technologies. These statistical techniques enabled us to index large collections for deeper search than word matching. Through the auspices of the DARPA Information Management program, we are developing an integrated analysis environment, the Interspace Prototype, that uses ``semantic indexing'' as the foundation for supporting concept navigation. These semantic indexes record the contextual correlation of noun phrases, and are computed generically, independent of subject domain.
Using this technology, we were able to compute semantic indexes for a subject discipline. In particular, in the summer of 1998, we computed concept spaces for 9.3M MEDLINE bibliographic records from the National Library of Medicine (NLM) which extensively covered the biomedical literature for the period from 1966 to 1997. In this experiment, we first partitioned the collection into smaller collections (repositories) by subject, extracted noun phrases from titles and abstracts, then performed semantic indexing on these sub-collections by creating a concept space for each repository. The computation required 2 days on a 128-node SGI/CRAY Origin 2000 at the National Center for Supercomputer Applications (NCSA). This experiment demonstrated the feasibility of scalable semantics techniques for large collections. With the rapid increase in computing power, we believe this indexing technology will shortly be feasible on personal computers.
Our major research goal has been pursuing ``scalable semantics'', a technology of indexing that would scale to large collections yet support deeper search than word matching [21]. We have been developing statistical algorithms for processing the term co-occurrences for disciplinary collections comprising millions of documents. These algorithms have been evolved to operate across different domains, with no special tuning required for each subject area.
Throughout the course of the Digital Libraries (DLI) project, we conducted several large-scale semantic indexing experiments for engineering disciplines [21][20][9]. The Compendex database was used to supply documents which provided broad coverage across all the engineering domains, with 800K abstracts chosen from 600 categories from the hierarchical subject classification. Documents from INSPEC were used to supply deep coverage for the core domains of physics, electrical engineering, and computer science, with 400K abstracts chosen from 300 categories. In the Engineering experiment, each abstract was classified into roughly 3 categories so there was overlap across repositories. This generated approximately 4M bibliographic abstracts across 1,000 repositories.
In the summer of 1998, we employed the semantic indexing technique to index the National Library of Medicine's (NLM) MEDLINE collection of 9.3 million bibliographic records covering from 1966 to 1997. This collection covers the entire medical discipline both in breadth and depth. In this experiment, we first partitioned the collection into sub-disciplinary repositories and then performed semantic indexing on these sub-collections by creating a concept space for each repository. On average, each document was placed into 4.5 partitions. Thus, the final computation involved 40M abstracts across 10,000 partitions and required 50 hours computation time using a 128-node SGI/CRAY Origin 2000, NCSA's largest supercomputer array at that time.
The purpose of this experiment is to test the scalability of the semantic indexing techniques on a large scale collection with the ultimate goal to create an unique large-scale testbed for ``semantic interoperability'' [17]. To test semantic interoperability, we need semantic indexes for large collections across multiple subjects. The current semantic indexing technology has shown effective primarily for text documents. Obtaining a suitable collection of text documents with breadth across subjects and depth within subjects mandates bibliographic abstracts. So partitioning the MEDLINE collection into subject repositories and indexing these sub-collections with concept spaces can provide a large testbed for future semantic interoperability experiments.
This paper is organized as follows. Section 2 describes the semantic indexing techniques including noun phrase parsing techniques and concept space algorithm. Section 3 describes the characteristics of the MEDLINE collection and the MeSH Tree classification hierarchy. Section 4 discusses our partitioning strategy for the MEDLINE collection using the MeSH subject classification hierarchy and the semantic indexing computation experiment using NCSA's SGI/CRAY Origin 2000. Section 5 describes the potential usage of the MEDLINE indexing experiment results, called MEDSPACE, as a testbed in the Interspace Prototype. Finally, Section 6 describes the future directions.
The concept space algorithm has been used in numerous experiments to generate and integrate multiple semantic indexes [6][9]. Previous experiments have shown that creating domain independent semantic indexes automatically is feasible and that these indexes can pave the way for cross-domain information retrieval [8].
The concept space algorithm is based upon statistical correlations of the context within documents. To create a concept space, first find the context of terms or phrases within documents using a noun phrase parser and then compute term (noun phrase) relationships using co-occurrence analysis. The algorithms used by this experiment are briefly summarized below.
AZ Noun Phraser [22], the noun phrase extractor developed in collaboration with the AI (Artificial Intelligence) group at the Department of Management Information Systems of University of Arizona, was used for this experiment. The phraser is based upon the Brill tagger [1][2] and the noun phrase identification rules of NPtool [23]. The program was chosen since it is generic-the trained lexicon was derived from several different sources including the Wall Street Journal and Brown corpora, hence the lexicon has a fairly general coverage of the English language. It can be applied across domains without further domain customization while maintaining a comparable parsing quality. The phraser operates in three distinct phases: tokenization, part-of-speech tagging and noun phrase identification.
At this point in the process, the exact part of speech of each token is still in question. To resolve this, the Brill tagger uses a number of contextual rules to disambiguate the term's part of speech. For each token, these rules examine the tokens immediately preceding and following the current token. With this information the tagger is finally able to determine the best part of speech for each token and eventually create a list of tagged tokens.
Noun phrases are recognized by a set of patterns, or rules, that are composed of different orderings of parts of speech. For our experiment, the limit for the longest noun phrase pattern recognizable is seven words in length. The shortest pattern is a noun phrase of length one (e.g. just the noun itself).
The identification of noun phrases is realized by moving a sliding window through the tagged token list. The window starts from the head of the list with a size of seven tokens, i.e., the size of the longest noun phrase recognizable. If the content of the window, the parts of speech of seven tokens, matches one of the noun phrase rules, a noun phrase is identified. If the content does not match any of the rules, the last token in the window will be truncated and the remaining content is compared to the noun phrase rules again. This process is repeated until the window content matches a rule, which could be a single word at the end. The window will then move on and start from the token following the noun phrase just identified, with a size reset to seven. Tokens for characters such as ``, ; : .'' and tokens tagged as verb are treated as noun phrase delimiters. They will truncate the window when they are encountered.
There are a number of limitations of the noun phraser in the current implementation. In scientific literature noun phrases of greater than seven words in length appear to be more common than in general text. This results in the tagger misidentifying long noun phrases as two separate shorter phrases. Additionally, the tagger cannot effectively differentiate between context sensitive noun phrases, which only have meaning in the context of use, and more general forms that have a useful degree of context free meaning. Finally, this type of noun phrase identification is computationally expensive when compared with more ad hoc techniques that rely solely on tokenization and concatenation of adjacent tokens to identify phrases.
The AZ Noun Phraser can also be customized to a particular domain by training the Brill tagger, editing stop words list or incorporating a domain specialized lexicon, e.g. SPECIALIST Lexicon from the Unified Medical Language System (UMLS) by NLM. According to a study conducted by by our collaborators, a SPECIALIST Lexicon enhanced Noun Phraser performed slightly better than the generic version on a collection of 630K CancerLit abstracts but the difference is not statistically significant [22]. The generic nature and ease of customization enable the parser to fulfill the range of noun phrase parsing needs.
During the noun phrase extraction process, we also keep track of noun phrase frequency information, which is used to compute weights for each noun phrase in the documents. The weight of noun phrase j in document i, dij, is computed based on the product of ``noun phrase frequency'' and ``inverse document frequency'' (a common technique adopted in vector space models of IR). Noun phrase frequency, tfij, represents the number of occurrences of noun phrase j in document i. Document frequency, dfj, represents the number of documents in a collection in which noun phrase j occurs. This is represented by equation (1).
where N represents the total number of documents in a collection and wj represents the number of tokens (words) in noun phrase j. Multiple-word terms are assigned heavier weights than single-word terms because multiple-word terms usually convey more precise semantic meaning than single-word terms.
The co-occurrence analysis is computed based on an asymmetric similarity function as follows:
The above equation indicates the similarity weight from term Tj to term Tk. dij is calculated by equation (1). dijk represents the combined weight of both term descriptors Tj and Tk in document i and is defined as follows:
where tfijk represents the smaller number of occurrences of term j and term k in document i. dfjk represents the number of documents in which terms j and k occur together. wjrepresents the number of words in descriptor Tj.
A weighting factor is also used to further penalize high collection frequency terms in the co-occurrence analysis:
Terms with a higher document frequency value, dfk, (possibly more general terms) have a smaller weighting factor value, which causes the co-occurrence probability to become smaller.
The resulting co-occurrence matrix represents a network of noun phrases and their probabilistic relationships. The relationships between noun phrases reflect the strengths of their context associations within a collection.
MEDLINE is a medical bibliographic database maintained by the National Library of Medicine (NLM). It covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and pre-clinical science. It contains bibliographic citations from over 3900 biomedical journals published in the United States and 70 foreign countries. In early 1998, we acquired all the bibliographic records in MEDLINE and its backup databases, BACKFILES, covering from 1966 to 1997. The size, 9.3 million abstracts, and the coverage of the corpus is comprehensive for the subject discipline of biomedicine.
Our first task toward indexing this large medical corpus was to partition it into segments by subject domains. The goal was to partition the corpus into a set of domains which would allow users to navigate across these domains.
We investigated several partitioning strategies, based on standard clustering algorithms (a completely automated approach) and on partitions derived from human assigned subject headings (a semi-automatic approach requiring subject categorization). For example, one simple approach is to partition the collection based upon the top (one thousand) most frequently occurring terms. This approach, while straightforward, produces a semantically weak set of clusters, i.e. the clusters produced are highly variable in their meaningfulness. In contrast, human generated classification and thesaurus systems have better properties semantically and can be used to produce more meaningful clusters.
We also ruled out the possibility of using automatic clustering algorithms in this particular experiment. Most of the clustering algorithms are computation extensive. With the collection size of 9.3M, the clustering task itself would not be computationally feasible within a reasonable time frame. These led us to adopt the semi-automatic approach based on an existing human classification scheme. The NLM's Medical Subject Headings (MeSH) was chosen in this experiment.
MeSH consists of a set of terms or subject headings that are arranged in both an alphabetic and a hierarchical structure known as MeSH Tree structure. It has the properties of a thesaurus and a classification system. The hierarchical structure helps users browsing the thesaurus to navigate from a broader concept, a broader term (BT), to narrow concepts, narrow terms (NT) and vice versa. This hierarchical structure is also useful as an aid for retrieval. For example, a user can specify the retrieval of all citations indexed to a particular heading, as well as those indexed to all the narrow terms of the heading.
The 1998 MeSH includes 32284 tree nodes in the hierarchical structure and 18934 main headings, among which 18849 headings were used as labels to MeSH Tree nodes. In the hierarchical tree structure, tree nodes are identified by a unique tree number named MeSH Tree Number (MN), an alphanumerical hierarchical representation of the terms in MeSH. The top level has 15 broad categories and each category can be up to nine levels in depth. Each tree node corresponds to a main heading. As a result, some main headings have duplicate occurrences in the tree structure.
To better understand MeSH subject headings and MeSH Tree structure, we use the MeSH heading ``Suppuration'' as an example to demonstrate the relationship between MeSH headings and MeSH Tree. Figure 1 shows the MeSH entry of Suppuration. The record shows Suppuration is a main heading (MH) and the heading is assigned with two MeSH numbers (MN): C01.539.830 and C23.739.487.85. The annotation (AN) defines suppuration as ``a type of abscess; NIM; TN 178: for suppurative dis & coord.''
![]() |
The MeSH Tree numbers (MN) allow us to browse the broad terms and narrow terms in the tree hierarchy. Since Suppuration is assigned with two tree numbers, we examine the both hierarchical branches in parallel. Figure 2 shows Suppuration: C01.539.830 is branched from C01 (Bacterial Infections and Mycoses) and has 6 sub-nodes. In contrast, C23.739.487.856, is branched from C23 (Symptoms and General Pathology) and has 3 sub-nodes.
Strictly speaking, since both tree nodes use the same MeSH heading, Suppuration, they should represent the same ``concept.'' However, the tree structure allows users to conceive the particular concept in different contexts (branches) of the concept hierarchy.
As mentioned earlier, the current partitioning of MEDLINE is based on MeSH headings, by using the subject hierarchy to specify the collections for the sub-disciplinary repositories. Each MEDLINE bibliographic record has an average of 12.36 MeSH headings assigned by the professional indexers at NLM. Of these headings, an average of 4.6 MeSH terms are indicated by the indexer as main concepts of the article.
Main concepts are those concepts (MeSH headings) which best describe the article. These main concepts were used to determine into which collections to place the particular abstract. Since some abstracts do not have main concepts assigned to them, about 1.4% of abstracts are dropped from the final partitions. Also, since each abstract has an average of 4.6 placements in the partitioned repositories, the multiple placements of each abstract caused an expansion of about 4.6 times which resulted in a total of 46,733,751 repository abstracts computed from raw 9,315,615 abstracts.
We also used the above results to create an ``inclusive'' partition set. Contrary to the above partition scheme, all the abstracts in the narrow-term nodes were propagated to their parent node. We refer the former partition scheme as the ``exclusive'' partition set and this scheme as the ``inclusive'' partition set. The inclusive partitioning caused an expansion of the data by a factor of about 19 and produced 178+ million of repository abstracts.
Table 1 shows both of the top 10 largest domains in the inclusive and exclusive partition sets. In the exclusive set, the largest domain is Liver with 123,082 abstracts while the largest domain in the inclusive set is Amino, Acid, Peptides, and Proteins with 1,373,592 abstracts.
| Inclusive partition | Exclusive Partition | ||
| Domain | # of abs | Domain | # of abs |
| Amino Acids, Peptides, and Proteins | 1373592 | Liver | 123082 |
| Proteins | 1169296 | Brain | 105614 |
| Neoplasms | 910074 | Kidney | 77040 |
| Organic Chemicals | 886533 | Neoplasms | 69721 |
| Immunologic and Biological Factors | 780897 | Hypertension | 66907 |
| Cells | 777939 | Muscles | 65341 |
| Symptoms and General Pathology | 678656 | Escherichia coli | 64652 |
| Cardiovascular Diseases | 655243 | Calcium | 63467 |
| Diagnosis | 606924 | DNA | 58535 |
| Enzymes, Coenzymes, and Enzyme Inhibitors | 602995 | Coronary Disease | 57662 |
In our experiment, the exclusive partition set was used to build the medical disciplinary repositories. Table 2 shows the statistics of the exclusive domains' collection sizes. Since the partitioning was based on the MeSH Tree hierarchical structure, it created 32,284 domains in total, including the domains with a collection size of zero. We computed concept spaces for each exclusive domain which had a collection size greater than 1000 abstracts. This resulted in 9894 disciplinary repositories.
| collection size | Num of Domains | Accum. Total |
10,000 + |
748 | 748 |
| 5,000 - 9,999 | 1,407 | 2,155 |
| 1,000 - 4,999 | 7,739 | 9,894 |
| 500 - 999 | 4,592 | 14,486 |
| 100 - 499 | 9,724 | 24,210 |
| 1 - 99 | 7,053 | 31,263 |
| = 0 | 1,021 | 32,284 |
The final computation involved 45,444,790 unique concepts (noun phrases) and about 19 billion concept co-occurrences within 40,628,964 abstracts of the 9894 repository collections. The rest of the small domains were not computed in this experiment based on the consideration that these domains were too small to be useful, even thought it would only take few seconds for each to be computed. In the inclusive partitions, these small domains were merged to their parent node.
The SGI/CRAY Origin 2000 by Silicon Graphics is a scalable shared memory multiprocessor (S2MP) designed to provide the benefits of both a shared memory multiprocessor approach and a distributed memory message-passing multiprocessor approach. The architecture uses physically distributed memories but treats them as a unified, global address space. This design preserves the benefits of ease of programming on a shared memory architecture without sacrificing the performance scalability of a distributed memory architecture.
The largest NCSA Origin 2000 machine available in March 1998 was used in this experiment. It had 128 processing nodes, 64GB of memory and 420GB scratch disk. Each processor was a 195 MHz MIPS R10000 processor with a peak performance of 390 MFLOPS.