This research is supported in part by NSF IRI-9015407
November 1992

The New Paradigm:
towards dry-lab biology

"The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis.

To use this flood of knowledge [the total sequence of the human and model organisms], which will pour across the computer networks of the world, biologists not only must become computer-literate, but also change their approach to the problem of understanding life.

The programs that display and analyse the material for us must be improved -- and we must learn how to use them more effectively. Like the purchased kits [of enzymes], they will make our life easier, but also like the kits, we must understand enough of how they work to use them effectively."

Walter Gilbert, Nature (Jan 1991)


Analysis Environments
towards pattern discovery in the large

Molecular Biology

fundamental applications for uninterpreted data
much data, databases
little theory, analysis
cf. neuroscience, where there are dry-lab experiments

Computer Science

fundamental technology for distributed environments
databases, federated heterogeneous
networks, interactive computation

Computational Biology

fundamental systems for pattern discovery
find and forge relationship connections across formal and informal data sources
using software tools of many types
interactive browsing and analysis
community sharing and publishing


TALK OUTLINE

Present Databases

Future Databases

Community Systems

Problems & Solutions


Present Databases
pattern discovery in the small

Status

large archives for standard sources
ubiquitous use by molecular biologists
current development projects
centrally maintained in national libraries
accessible with personal computer front ends
via modem, CD-ROM, file server

Publishing Cycle

retrieve -- submit -- edit
70s literature (MedLine)
80s sequences (GenBank)
90s maps (GenomeDatabase)


Literature Databases
automatic retrieval

MedLine (National Library of Medicine)

Administration

standard repository for abstracts from selected journals
professional indexers generate classification

On-line Retrieval

text-based searching (word matching, field restrictions)
front end on personal computer to central mainframe
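
Word matching with field restrictions can be sketched in a few lines. This is a minimal illustration only: the record layout, field names, and sample records are assumptions made up for the example and are not MedLine's actual schema or query syntax.

```python
from dataclasses import dataclass


@dataclass
class Abstract:
    """A toy bibliographic record; the field names are illustrative assumptions."""
    title: str
    journal: str
    text: str


def words(s):
    """Lowercase word set with surrounding punctuation stripped."""
    return {w.strip(".,;:()").lower() for w in s.split()} - {""}


def search(records, terms, field=None):
    """Return records containing every term, optionally restricted to one field."""
    wanted = {t.lower() for t in terms}
    hits = []
    for r in records:
        if field is None:
            haystack = words(r.title) | words(r.journal) | words(r.text)
        else:
            haystack = words(getattr(r, field))
        if wanted <= haystack:  # every query word must be present
            hits.append(r)
    return hits


# Made-up sample records, for illustration only.
RECORDS = [
    Abstract("Cell lineage of the nematode", "Dev Biol", "Lineage of C. elegans cells."),
    Abstract("Sperm motility", "Cell", "Nematode sperm crawl rather than swim."),
]
```

Searching for "cell" with no restriction matches both records, while restricting the same word to the journal field keeps only the second one.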


Sequence Databases
automatic submission

GenBank (Los Alamos National Laboratory/NLM)

Administration

standard repository for sequences
central curators check annotations
publishers require accession number for reference
central archive distributes database for retrieval

Electronic Data Publishing

direct author submissions via Authorin program front end for sequences and annotations
greater scale and higher quality


Map Databases
automatic editorship

GDB - Genome Database (Yale/Johns Hopkins)
Administration

standard repository for human mapping data
distributed curators receive and edit data
central archive distributes database for retrieval

Electronic Editorial Control

separate curator(s) for each chromosome
interface supports generation of consensus map
system tracks approval process and history


Analysis Package
shell-level invocation of separate programs

GCG package (Genetics Computer Group, Inc.)

Editing (sequences)
Fragment Assembly (gels)
Mapping (fingerprints)
Comparison (graphs)

Searching (similarity match)
Multiple Sequence Analysis (line up)

Pattern Recognition (codons)
RNA Secondary Structure (circles)
Protein Analysis (motifs)
Translation (proteins)
Manipulation (assembly)


After defining the locus entries you wish to retrieve,
you have several options for sorting that data:
Type <Control>K
IN <Return>
 
 Here the data are sorted by name:  
You also can choose among several options for displaying the data.

Future Databases
analysis of archival databases 

Status

Current research prototypes
Used experimentally by biologists
National Center for Biotechnology Information (NCBI)

Functionality

remote analysis servers (Blast)
archival database interconnections (Entrez)
proposed interchange standards (ASN.1)


Remote Analysis Server
dry-lab as part of experimental process

Fast Gene Identification via ESTs
C. Fields in C. Venter's lab at NIH

wet-lab
clone cDNA library
sequence EST (expressed sequence tag)
dry-lab
compare to known genes using software
start new collaboration to characterize

Blast server on NCBI machine for remote analysis
enables setup of production line
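
The dry-lab step, comparing a new EST against known genes, can be illustrated with a toy similarity score. This sketch is not BLAST: it swaps in a much simpler k-mer overlap (Jaccard) measure, and the function names and sequences are made up for the example.

```python
def kmers(seq, k=4):
    """All overlapping substrings of length k in an uppercased sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=4):
    """Jaccard overlap of k-mer sets: 1.0 for identical sets, 0.0 for disjoint ones."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0  # a sequence shorter than k has no k-mers to compare
    return len(ka & kb) / len(ka | kb)


def best_match(est, known_genes, k=4):
    """Return (name, score) of the known sequence most similar to the EST."""
    scored = ((name, kmer_similarity(est, seq, k)) for name, seq in known_genes.items())
    return max(scored, key=lambda pair: pair[1])


# Made-up sequences, for illustration only.
KNOWN = {
    "geneA": "ATGGCGTACGTTAGCCTAGGA",
    "geneB": "TTTTCCCCGGGGAAAATTTTC",
}
```

A production server would use a statistically grounded local alignment search; the point here is only the shape of the step: one query sequence in, a ranked candidate out.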



Interconnecting Archival Databases
standard-source analysis environment

Entrez: Relationships within/across Sources
J. Ostell in D. Lipman's lab at NCBI/NLM

standard archive sources (MedLine, GenBank)
indexers now generate links between related items

search within each source itself
find related set within source (item similarity)
find related set across sources (follow links)
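
The three operations above (search within a source, related set within a source, related set across sources) can be sketched over a toy linked store. This is an assumption-laden illustration and not how Entrez is implemented: the class, method names, similarity rule (shared index terms), and identifiers are all hypothetical.

```python
from collections import defaultdict


class LinkedArchive:
    """Toy store of items from several sources with precomputed links."""

    def __init__(self):
        self.terms = {}                # (source, id) -> set of index terms
        self.links = defaultdict(set)  # (source, id) -> linked (source, id) keys

    def add(self, source, item_id, terms):
        self.terms[(source, item_id)] = set(terms)

    def link(self, a, b):
        """Record a symmetric cross-reference between two items."""
        self.links[a].add(b)
        self.links[b].add(a)

    def search(self, source, term):
        """Search within one source by index term."""
        return sorted(k for k, t in self.terms.items() if k[0] == source and term in t)

    def neighbors(self, key, min_shared=1):
        """Related set within the same source: items sharing index terms."""
        mine = self.terms[key]
        return sorted(k for k, t in self.terms.items()
                      if k != key and k[0] == key[0] and len(mine & t) >= min_shared)

    def follow(self, key, target_source):
        """Related set across sources: follow stored links into another source."""
        return sorted(k for k in self.links.get(key, ()) if k[0] == target_source)


# Made-up identifiers and terms, for illustration only.
archive = LinkedArchive()
archive.add("literature", "lit-1", {"gene-x", "muscle"})
archive.add("literature", "lit-2", {"gene-x", "behavior"})
archive.add("sequence", "seq-1", {"gene-x"})
archive.link(("literature", "lit-1"), ("sequence", "seq-1"))
```

Links are stored once and followed cheaply at query time, which mirrors the slide's point that the linking work is done ahead of retrieval.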



Pattern Discovery in the Large
across sources and searches

NCBI Entrez moves towards
pattern discovery in the large for archival databases

implementation requires interchange standards:
unique ids for inter-database links (accession numbers)
common formats for inter-program passing (ASN.1 + type-semantics)


 

Scaling the Worm Community System
build a complete system to understand the technology and observe the sociology

System Functionality

    electronic library: collection for specialized community
    includes formal and informal data and information
    excludes laboratory notebook, real-time CSCW
    nationwide communication and publication medium

Current Technology

    inexpensive personal computers and workstations
    existing NSFNET with T1 loops and Ethernet LANs
    primary emphasis on software development
    custom interface and database software

Model System

    big enough to be interesting but small enough to be doable
    manageable size of data and number of users
    significant existing data sources in electronic format
    traditions of openness and completeness
    model for molecular biology and science in general

Worm Community System (WCS) for nematode worm C. elegans

 


Community Knowledge

Archival Data

Gene Descriptions, Genetic Map
Physical Map, DNA Sequences

Cell Anatomy List, Cell Lineage Tree (Complete)
Nervous System Wiring Diagram (Complete)

WCS:Sampler

Published Literature

    Bibliography, Abstracts
    Full-text, Figures/Tables
    Newsletter Articles, Conference Abstracts

Unpublished Material

Strain Lists
Restriction Maps, Methods
Lab Directory

External Software

Display, Analysis

Community Lore

Notes, History, Procedures


Knowledge in current WCS

  • Data
     Gene List J.Hodgkin (684 genes; 1988)
     Genetic Map M.Edgley (1008 genes)
     Physical Map J.Sulston (11321 clones)
     DNA Sequences GenBank (373 sequences)
  • Literature
    Abstracts MedLine/BIOSIS (933 of 1467)
    Newsletter complete (1627; Dec 1975 on)
    Meetings 8th Int C. elegans Conf (366)
  • Unpublished
    Strains Stock Center (3483)
    Persons M.Edgley (409)
    Methods Worm Book (27)
  • Software
    Acedb genetic/physical map display
    (R. Durbin/J. Thierry-Mieg)
    Gm exon map display
    (C. Fields/C. Soderlund)
  • Community
    Annotations sample (52)

WCS Status

Release 1 completed summer 1991

    GNU C++ Unix software under X Windows
    literature (WBG,abstracts), data (genes, maps, sequences)
    runs on Suns with display on Macs
    NSFNET distribution for annotations and updates
    have exchanged 1500 messages with users


Now in Labs of initial Users (25 sites)
 Univ Arizona, Tucson  S. Ward
 Univ Colorado, Boulder  W. Wood
 Simon Fraser/U British Columbia  D. Baillie, A. Rose
 Univ Washington, Seattle  J. Thomas
 Univ Texas, Dallas  L. Avery
 Washington Univ, St. Louis  R. Waterston
 MRC Lab Molec Biol, England R. Durbin
 Univ Missouri, Columbia D. Riddle 
 Univ Pittsburgh R. Russell 
 NINDS, NIH R. McCombie 
 Caltech P. Sternberg 
 MIT R. Horvitz 
 Univ Wisconsin, Madison J. Kimble, P. Anderson 
 Netherlands R. Plasterk 
 Harvard G. Ruvkun 
 Univ Illinois NCSA (collaborators) 
 Univ California, Berkeley B. Meyer 
 Univ California, San Francisco C. Kenyon 
 Indiana University T. Blumenthal 
 Georgia Tech D. Dusenbury 
 Univ Houston R. Hecht, D. Shakes 
 Univ Arkansas R. Reis 
 Northwestern J. Kramer 
 Europe, Japan, Australia  

System Functionality

Information Spaces

objects interconnected into one logical database
commands uniform across type and location
hyperdocuments with live references

Browsing

Search associative specification
Navigation link following

Filtering

Selection user creation of sets
Analysis pass into external program

Sharing

Annotation description & grouping
Publishing editorial/privacy control

WCS: Summary             WCS: Login
WCS: Search              WCS: Addgene
WCS: Thesaurus           WCS: Addlink
WCS: Thesaurus-related   WCS: Privacy 1
WCS: Lineage             WCS: Privacy 2
WCS: Lineage-links       WCS: Privacy 3

Information Space

a collection of interconnected information units
which support transparent manipulation of objects
from multiple heterogeneous distributed sources

federated object-oriented heterogeneous distributed

  • External Objects
    Heterogeneous items gathered from external sources
    Homogeneous transformed into IUs (information units)
    Interconnected relationship graph (information space)
  • Object Federation
    Search transparently across multiple sources
    Types different search/display for various types
    Navigation follow/make connection links
    Grouping commands can operate on sets of IUs
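
One way to picture an information space is as a graph of uniform wrappers around heterogeneous items. The sketch below is an assumption, not the WCS implementation: the `IU` fields, the `InformationSpace` methods, and the sample units are hypothetical names chosen to mirror the Search, Types, Navigation, and Grouping entries above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IU:
    """Information unit: a uniform wrapper around an item from any source."""
    handle: str  # unique, invariant identifier
    kind: str    # e.g. "gene", "paper"
    source: str  # where the item came from
    text: str    # searchable content


class InformationSpace:
    def __init__(self):
        self.units = {}     # handle -> IU
        self.links = set()  # undirected links stored as frozenset pairs
        self.displays = {}  # kind -> type-specific display function

    def add(self, iu):
        self.units[iu.handle] = iu

    def link(self, a, b):
        self.links.add(frozenset((a, b)))

    def search(self, word):
        """Search transparently across all sources; returns a set of handles."""
        w = word.lower()
        return {h for h, iu in self.units.items() if w in iu.text.lower()}

    def navigate(self, handle):
        """Follow links from one unit to its neighbours."""
        return {h for pair in self.links if handle in pair for h in pair if h != handle}

    def show(self, handles):
        """Grouping: apply the type-specific display to every unit in a set."""
        def default(iu):
            return f"{iu.kind}:{iu.handle}"
        return sorted(self.displays.get(self.units[h].kind, default)(self.units[h])
                      for h in handles)


# Made-up units, for illustration only.
space = InformationSpace()
space.add(IU("g1", "gene", "gene-list", "lin-12 cell fate"))
space.add(IU("p1", "paper", "newsletter", "Notes on lin-12 alleles"))
space.add(IU("p2", "paper", "abstracts", "Sperm motility"))
space.link("g1", "p1")
space.displays["paper"] = lambda iu: f"[{iu.source}] {iu.text}"
```

Because search and navigation both return plain sets of handles, set operations and grouped commands compose without regard to which source an item came from.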

?? what is universal: links, sets, publishing


Building Interconnections
transforming external data into interconnected information

  • Paper
    e.g. worm newsletter with links to genes
    scan/recognize text, display text/figures; parse text
  • File
    e.g. gene list with links to literature
    reformat, hand-correct; use database references
  • Database
    e.g. physical map with links to sequences
    re-format, live graphical display; use common names
  • Software
    e.g. sequence displayer invoked with gene names
    separate display/interaction; need common windows
  • Objects
    e.g. external analysis, external annotations
    requires canonical representations, unique invariant handles

?? how are links generated: automatic vs manual



Electronic Publishing
towards electronic knowledge in a national community library

  • mechanisms for user-entered data

-- Editorial Control

Refinement Levels and Tracking Mechanisms
 Posted  annotations e.g. notes 
 Moderated  bulletins e.g. newsletters
 Curated  archival e.g. maps
 Refereed  factual e.g. journals

 

-- Privacy Control

Permission Levels and Protection Mechanisms
 Personal  private
 Group  local
 Group  topic
 Community  everyone
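
The permission levels above suggest a simple visibility check. This is a minimal sketch under stated assumptions: the enum values, the `User` and `Annotation` fields, and the rule that an author can always read their own notes are illustrative choices, not a description of the WCS protection mechanism.

```python
from dataclasses import dataclass
from enum import Enum


class Privacy(Enum):
    PERSONAL = "personal"    # private to the author
    LOCAL_GROUP = "local"    # members of the author's lab
    TOPIC_GROUP = "topic"    # members of a named interest group
    COMMUNITY = "community"  # everyone


@dataclass
class User:
    name: str
    lab: str
    topics: frozenset = frozenset()


@dataclass
class Annotation:
    author: User
    text: str
    privacy: Privacy
    topic: str = ""


def can_read(user, note):
    """Decide whether a user may see an annotation at its permission level."""
    if note.privacy is Privacy.COMMUNITY or user.name == note.author.name:
        return True
    if note.privacy is Privacy.LOCAL_GROUP:
        return user.lab == note.author.lab
    if note.privacy is Privacy.TOPIC_GROUP:
        return note.topic in user.topics
    return False  # PERSONAL notes are visible to the author only


# Made-up users, for illustration only.
alice = User("alice", "lab1", frozenset({"sperm"}))
bob = User("bob", "lab1")
carol = User("carol", "lab2", frozenset({"sperm"}))
```

Editorial refinement levels (posted, moderated, curated, refereed) could be layered on the same record as a second, independent attribute.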


  • true hyperdocuments with objects and links
    documents become primarily embedded objects

?? variant displays of user-composed data sets


Network Caching

to provide interactive retrieval across wide-area networks
need to guarantee response time to user

perceptual immediacy is 1/4 second
so is nationwide roundtrip packet transmission

latency dominates bandwidth
so the concern is with
semantic caching rather than streamlined protocols

  • Caching Policies
    fetching dominates replacement for information spaces
    policies determine what and when to fetch

transmit only what needed for interaction
summary versus contents
demand subsets, incremental lookahead
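
A fetch policy of this kind (summaries before contents, contents on demand, one step of lookahead along links) might look like the sketch below. Everything here is an assumption for illustration: the `FakeServer` stand-in, the class and method names, and the use of a call counter as a proxy for wide-area round trips. A real client would prefetch asynchronously rather than inline.

```python
class FakeServer:
    """Stand-in for a remote archive; each item is (summary, body, linked keys)."""

    def __init__(self, items):
        self.items = items

    def summary(self, key):
        return self.items[key][0]

    def contents(self, key):
        _, body, links = self.items[key]
        return body, links


class SummaryFirstCache:
    """Toy client cache: contents on demand, summaries of linked items fetched ahead."""

    def __init__(self, server):
        self.server = server
        self.summaries = {}
        self.bodies = {}
        self.links = {}
        self.remote_calls = 0  # each call stands for one wide-area round trip

    def summary(self, key):
        if key not in self.summaries:
            self.remote_calls += 1
            self.summaries[key] = self.server.summary(key)
        return self.summaries[key]

    def open(self, key):
        """Fetch full contents on demand, then look ahead one link step."""
        if key not in self.bodies:
            self.remote_calls += 1
            self.bodies[key], self.links[key] = self.server.contents(key)
        for linked in self.links[key]:
            self.summary(linked)  # incremental lookahead: summaries only
        return self.bodies[key]


# Made-up items, for illustration only.
server = FakeServer({
    "a": ("summary of a", "full contents of a", ["b", "c"]),
    "b": ("summary of b", "full contents of b", []),
    "c": ("summary of c", "full contents of c", ["a"]),
})
cache = SummaryFirstCache(server)
```

Opening an item costs one round trip for its contents plus one per linked summary; after that, browsing the linked summaries is served locally, which is where the user perceives the response time.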

?? which policies make WANs approximate LANs


Network Protocols
applications level interconnection
  • TCP/IP pitched at level of data transmission
  • enabled interconnection of networks into Internet
  • NSFNET sufficient speed for interactive manipulation
  • new models support transparency across data sources
  • information spaces support interlinked uniform data
  • ITP pitched at level of information interconnection
  • will enable interconnection of information spaces physically residing in distributed heterogeneous sources

potential new information infrastructure:

community-specialized information spaces
interconnected by
standard protocols for information units


Information Retrieval
domain and type specific semantic matching
goal is to aid scientist in discovering patterns
  • Current Technology
    canonical word matching with booleans and proximity
    phrase matching with weighted word frequencies
    primarily useful with precise domain terminology
    fragile outside of domain experts
  • Automatic Generation of Thesaurus
    synonyms for commonly occurring phrases
    "sperm swim using actin" versus "sperm crawl using MSP"
    generate via term frequency co-occurrence on typical corpus
  • Case-Frame Analysis of Relationships
    use parts of speech to determine usage context
    "bombesin stimulates appetite in wolf pups"
    generate Planner-type relations via sentence templates and domain-specific grammar/terms
  • Non-textual Shape Matching
    homology search for DNA sequences once used only approximate string matching
    now uses molecular-level equivalence heuristics
    developmental lineage is represented by a tree of cells
    trees need semantic matching due to node equivalences and link weighting
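
Thesaurus generation from term co-occurrence can be sketched as counting which terms appear together across a corpus. This is a simplified illustration, not the method used in the project: the function name, the sentence-level window, the threshold, and the tiny corpus are assumptions, and raw co-occurrence yields related terms rather than strict synonyms.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_thesaurus(sentences, min_count=2, stopwords=frozenset()):
    """Map each term to the terms it co-occurs with in at least min_count sentences."""
    pair_counts = Counter()
    for s in sentences:
        # Sort so each unordered pair is always counted under the same key.
        terms = sorted({w.strip(".,").lower() for w in s.split()} - stopwords - {""})
        pair_counts.update(combinations(terms, 2))
    related = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


# Tiny made-up corpus, for illustration only.
CORPUS = [
    "sperm crawl using MSP",
    "nematode sperm crawl",
    "sperm swim using actin",
    "mammalian sperm swim",
]
related = cooccurrence_thesaurus(CORPUS, min_count=2, stopwords=frozenset({"using"}))
```

On this corpus "sperm" is linked to both "crawl" and "swim", while terms that appear with it only once, such as "msp", fall below the threshold. A realistic corpus would need frequency weighting so that very common terms do not link to everything.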

?? Feasibility of Large-Scale Semantic Matching


Community Systems Architecture
distributed server model for external object manipulation

  • Interface Server
  • Information Server
    type displays, information units
  • Network Server
  • Search Server
  • Object Store
  • Analysis Server
  • External Databases

Pattern Discovery
type-specific semantic matching
  • Sequences / Structures
  • Text
    words and phrases
    thesaurus and context
  • Maps
    ranges
    correspondences
  • Trees
    cell lineages
    neuron connections
  • Networks
    functions of genes
    heuristic links --

pattern discovery in the large


Foreseeable Future
analysis environments for dry-lab biology

Plug and Play Environments [data]

any organism / community
any data source
any analysis program

Interactive User Control [user]

complete search / analysis
complete variant displays
complete hyperset publishing

Universal Virtual Access [net]

any local platform
any remote server
any data connection

Building The Interspace

everything goes in with transparent manipulation
everyone gets credit with community sharing

merging information spaces from other communities:

molecular biology
(worms, flies, yeast, coli, mice, men)
neuroscience
oceanography
other sciences...
other domains...

TOWARDS THE WORLD NET

Today the Worm
Tomorrow the World
