
Validation: Towards An Experimental Methodology

A key problem to be addressed is how to evaluate Concept Spaces and Category Maps, particularly when we are experimenting with alternative algorithms and implementations for computing indexes of this nature.

Conventionally, the effectiveness of IR systems has been evaluated through a combination of subjective evaluation (of perceived "usefulness" or "goodness") and studies of precision and recall. Precision/recall studies compare the results of queries (either pre-selected or generated by human subjects) to oracular judgements of the "correct" results. For instance, in earlier work we compared an automatically generated Category Map with the human generated index from Yahoo! for the same Web pages [Chen, Houston, Sewell and Schatz 1997].

Our approach to precision/recall studies adopts an experimental design similar to those used in human memory association experiments [Anderson 1985] and in thesaurus evaluation studies [Chen, Schatz, Martinez and Ng 1995]. The experiment consists of a recall phase followed by a recognition phase. In the recall phase, each subject is presented with test descriptors (previously selected by a panel of domain experts) and asked to generate, through free association, as many related concepts (noun phrases or image textures) as possible for each one. This phase calls upon the subjects' memory recall. In the recognition phase, experimenters present lists of associated, system-generated concepts in random order, and subjects evaluate each concept's relevance to the test descriptor. For quantitative analysis of the results of such concept association experiments, concept recall and concept precision are used. Rather than counting the number of relevant documents retrieved, we count the number of relevant concepts generated by the system (as judged by the knowledgeable subjects). These two metrics are computed as follows:

  Concept Recall = (number of retrieved relevant concepts) / (total number of relevant concepts)

  Concept Precision = (number of retrieved relevant concepts) / (total number of retrieved concepts)
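As a minimal sketch of how the two measures could be scored for a single test descriptor, the following assumes the usual ratio definitions: relevant concepts retrieved by the system, divided by all relevant concepts (recall) or by all concepts the system retrieved (precision). The function name, the argument names, and the choice to pool subject-generated and subject-approved concepts into one "relevant" set are illustrative assumptions, not details taken from the cited studies.

```python
def concept_precision_recall(system_concepts, judged_relevant, subject_generated):
    """Score one test descriptor (hypothetical helper, for illustration only).

    system_concepts:   concepts the system associated with the descriptor
    judged_relevant:   system concepts that subjects accepted in the recognition phase
    subject_generated: concepts subjects produced themselves in the recall phase
    """
    system = set(system_concepts)
    # Assumption: anything a subject generated or accepted counts as relevant.
    relevant = set(judged_relevant) | set(subject_generated)
    retrieved_relevant = system & relevant
    precision = len(retrieved_relevant) / len(system) if system else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up data for one descriptor.
precision, recall = concept_precision_recall(
    system_concepts=["gene", "protein", "mutation", "car", "tree"],
    judged_relevant=["gene", "protein", "mutation"],
    subject_generated=["gene", "chromosome"],
)
```

Here three of the five system concepts are relevant, and the pooled relevant set has four members, so the sketch yields a precision of 0.6 and a recall of 0.75.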

Although we plan to conduct experiments in this vein in evaluating Concept Spaces, we believe that this methodology may be insufficient for our intended research campaign. First, the precision and recall methodology depends on the use of either a known corpus or an independent, human assessment of the "correct" results for the corpus used. As we will often be studying dynamically generated corpora, we do not have a priori knowledge of their structure. In addition, because we are dealing with large collections, it is very time consuming and may be infeasible to perform precision/recall judgements. For instance, the study cited above took many weeks to evaluate one Category Map for a single collection of 110,000 Web pages [Chen, Houston, Sewell and Schatz 1997].

Our research will necessarily explore alternative implementations and algorithms which, taken together, constitute a large design space. We need quick, automated methods to identify the most promising alternatives and to assure the correctness of optimizations. In particular, we need fast and automatic ways to answer three basic questions:

  • Is the result "correct"?
  • Is the result "reasonable" (i.e., not random or otherwise invalid)?
  • Is the result the "same" as the result of an alternative implementation? (E.g., does an optimized version of an algorithm give acceptably "correct" results compared to the original algorithm?)

If we can measure these effectively, we can determine cost/benefit estimates for alternative implementations.

Validating Concept Spaces

The "correctness" of a Concept Space may be measured in at least two ways. First, as discussed above, human memory association experiments can be conducted and the results analyzed using the metrics of concept precision and recall. Alternatively, if a thesaurus generated by domain experts exists for the collection under consideration, we hypothesize that a Concept Space can be compared computationally with the domain thesaurus using one of several quantitative metrics under investigation in the CANIS lab [Zelenko 1997].

To judge the general "reasonableness" of the results of computing a Concept Space, we can use synthetic input vectors with known structure. If the algorithm reproduces the synthetic input structure, this is evidence that it is operating as expected.
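One way such a check could look is sketched below. It plants known term groups in synthetic documents, builds associations from raw co-occurrence counts, and measures how often each term's strongest associates come from its own planted group. Raw co-occurrence is only a stand-in here for whatever Concept Space algorithm is under test, and all names and parameters (group sizes, noise rate, k) are assumptions chosen for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def make_synthetic_docs(groups, n_docs=300, terms_per_doc=3, noise=0.2, seed=0):
    """Each document draws its terms from one planted group, plus occasional noise."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        g = rng.randrange(len(groups))
        doc = set(rng.sample(groups[g], terms_per_doc))
        if rng.random() < noise:
            # Add one term from a different group to simulate noise.
            other = rng.choice([i for i in range(len(groups)) if i != g])
            doc.add(rng.choice(groups[other]))
        docs.append(doc)
    return docs


def cooccurrence(docs):
    """Symmetric count of documents in which two terms appear together."""
    counts = Counter()
    for doc in docs:
        for a, b in combinations(sorted(doc), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def structure_recovery(groups, counts, k=3):
    """Fraction of each term's top-k associates that share its planted group."""
    group_of = {t: i for i, g in enumerate(groups) for t in g}
    terms = sorted(group_of)
    hits = total = 0
    for t in terms:
        ranked = sorted((u for u in terms if u != t), key=lambda u: -counts[(t, u)])
        for u in ranked[:k]:
            hits += group_of[u] == group_of[t]
            total += 1
    return hits / total


groups = [[f"g{i}_t{j}" for j in range(4)] for i in range(3)]
docs = make_synthetic_docs(groups)
score = structure_recovery(groups, cooccurrence(docs))
```

A score near 1.0 indicates that the planted structure was reproduced. A score close to what random associations would give suggests the implementation is not behaving as expected.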

To demonstrate that an alternative algorithm/implementation produces the "same" result given the same input, the metrics referred to previously in [Zelenko 1997] may be used to compute the semantic difference or semantic distance between two Concept Spaces. These metrics offer hope for the needed "fast and automatic way" of determining the correctness of a given algorithm or implementation independent of the size or nature of the collection underlying the Concept Space. To date we have implemented and are testing automatic techniques based on multiple metrics for measuring the semantic distance between Concept Spaces [Zelenko 1997].
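The metrics of [Zelenko 1997] are not reproduced here. Purely as an illustration of what an automatic comparison between two Concept Spaces might look like, the sketch below represents each space as a mapping from a term to its weighted associates and reports the average dissimilarity (one minus Jaccard overlap) of the top-k associates per term. The representation, the function names, and the metric itself are assumptions.

```python
def top_neighbors(space, term, k):
    """Return the k most strongly associated terms (ties broken alphabetically)."""
    weights = space.get(term, {})
    ranked = sorted(weights, key=lambda u: (-weights[u], u))
    return set(ranked[:k])


def neighbor_distance(space_a, space_b, k=5):
    """Mean (1 - Jaccard overlap) of top-k neighbor sets: 0.0 identical, 1.0 disjoint."""
    terms = set(space_a) | set(space_b)
    if not terms:
        return 0.0
    total = 0.0
    for t in terms:
        a = top_neighbors(space_a, t, k)
        b = top_neighbors(space_b, t, k)
        union = a | b
        overlap = len(a & b) / len(union) if union else 1.0
        total += 1.0 - overlap
    return total / len(terms)


# Made-up output of an original and an optimized implementation.
original = {
    "gene": {"protein": 0.9, "mutation": 0.7},
    "protein": {"gene": 0.8, "enzyme": 0.6},
}
optimized = {
    "gene": {"protein": 0.85, "mutation": 0.75},
    "protein": {"gene": 0.8, "kinase": 0.5},
}
distance = neighbor_distance(original, optimized, k=2)
```

In this example the associates of "gene" agree exactly, while those of "protein" share one of three distinct terms, so the distance is (0 + 2/3) / 2 = 1/3. Comparing only ranked neighbors ignores small changes in the weights themselves. Whether that tolerance is acceptable depends on what "same" is taken to mean for the optimization being tested.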

Validating Category Maps

A particularly difficult problem is evaluating the output of a Category Map, because small changes in parameters or input data can yield a substantially different but still valid result. Furthermore, earlier work has shown that the categories generated by the self-organizing map are likely to be completely unrelated to human-generated categories for the same collection [Chen, Houston, Sewell and Schatz 1997]. Thus, there is no easy way to judge the inherent "correctness" of a self-organizing Category Map, except through subjective human judgements. However, we believe that there may be relatively quick and automatic ways to answer the other two questions above. To judge the general "reasonableness" of the results, we can use synthetic input vectors with known structure. If the algorithm fails to reproduce the synthetic input structure in the output Category Map, this is evidence that it is not a good candidate for detecting unknown structure.
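Once a map has been trained on synthetic vectors, one simple way to score whether the planted structure was reproduced is node purity: the fraction of inputs that fall on a map node dominated by their own planted cluster. The sketch below assumes the map output is available as a mapping from each input to the grid coordinates of its winning node. The names and the choice of purity as the measure are illustrative assumptions.

```python
from collections import Counter, defaultdict


def node_purity(planted_label, node_of):
    """Fraction of inputs whose planted label is the majority label on their node."""
    if not planted_label:
        return 1.0
    by_node = defaultdict(Counter)
    for item, label in planted_label.items():
        by_node[node_of[item]][label] += 1
    # For each node, count the inputs carrying its most common planted label.
    majority = sum(c.most_common(1)[0][1] for c in by_node.values())
    return majority / len(planted_label)


# Made-up example: two planted clusters, one input ("v6") lands on a mixed node.
planted = {"v1": "A", "v2": "A", "v3": "A", "v4": "B", "v5": "B", "v6": "B"}
winning_node = {
    "v1": (0, 0), "v2": (0, 0), "v3": (0, 1),
    "v4": (3, 3), "v5": (3, 3), "v6": (0, 1),
}
purity = node_purity(planted, winning_node)
```

Here five of the six inputs sit on a node where their own cluster is in the majority, giving a purity of 5/6. Purity is trivially 1.0 when every input occupies its own node, so it is only meaningful when the map has far fewer nodes than inputs.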

To demonstrate that an alternative algorithm/implementation produces the "same" result given the same input, we can proceed in a way similar to that described above for Concept Spaces. Given an algorithm and implementation that have been shown to be valid through methods such as proof of correctness, human studies, or long practice, we can use statistical metrics of comparison to determine the validity of results produced by variations of the algorithm/implementation.
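Because two valid maps may place the same categories at different grid positions, a comparison should not depend on node coordinates. One candidate statistic is the Rand index over document pairs, sketched below: it counts how often the two maps agree on whether a pair of documents shares a node. This is offered only as an example of a "statistical metric of comparison", and the data layout and names are assumptions.

```python
from itertools import combinations


def rand_index(assign_a, assign_b):
    """Share of document pairs on which two maps agree (same node vs. different nodes)."""
    docs = sorted(set(assign_a) & set(assign_b))
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 1.0
    agree = sum(
        (assign_a[x] == assign_a[y]) == (assign_b[x] == assign_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


# Made-up assignments of four documents to grid nodes by two implementations.
reference = {"d1": (0, 0), "d2": (0, 0), "d3": (2, 1), "d4": (2, 1)}
variant = {"d1": (3, 3), "d2": (3, 3), "d3": (0, 1), "d4": (1, 1)}
agreement = rand_index(reference, variant)
```

The two maps disagree only on the pair (d3, d4), which the reference groups together and the variant splits, so the agreement is 5/6. A map that merely relocates the same groups to other nodes scores 1.0. The plain Rand index is not corrected for chance agreement, so an adjusted variant may be preferable for large maps.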

References

[Anderson 1985] J. R. Anderson. Cognitive Psychology and Its Implications, 2nd Ed. W. H. Freeman and Company, New York, NY, 1985.

[Chen, Houston, Sewell and Schatz 1997] H. Chen, A. L. Houston, R. R. Sewell, and B. R. Schatz. Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. Journal of the American Society for Information Science, forthcoming.

[Chen, Schatz, Martinez and Ng 1995] H. Chen, B. Schatz, T. Yim, and D. Fye. Automatic Thesaurus Construction for an Electronic Community System. Journal of the American Society for Information Science, 46(3):175-193.

[Zelenko 1997] Dmitry Zelenko. Validation in Information Retrieval. Community Architectures for Network Information Systems, Graduate School of Library and Information Science, UIUC.

 

 

 

 
