This research is supported in
part by NSF IRI-9015407
November 1992
The New Paradigm:
towards dry-lab biology
"The new paradigm,
now emerging, is that all the 'genes' will be known (in the sense
of being resident in databases available electronically), and that
the starting point of a biological investigation will be theoretical.
An individual scientist will begin with a theoretical conjecture,
only then turning to experiment to follow or test that hypothesis.
To use this flood of knowledge [the total
sequence of the human and model organisms], which will pour across
the computer networks of the world, biologists not only must become
computer-literate, but also change their approach to the problem
of understanding life.
The programs that display and analyse
the material for us must be improved -- and we must learn how to
use them more effectively. Like the purchased kits [of enzymes],
they will make our life easier, but also like the kits, we must
understand enough of how they work to use them effectively."
Walter Gilbert, Nature
(Jan 1991)
Analysis Environments
towards pattern discovery in the large
Molecular Biology
fundamental applications for uninterpreted data
much data, databases
little theory, analysis
cf. neuroscience, where are dry-lab experiments
Computer Science
fundamental technology for distributed environments
databases, federated heterogeneous
networks, interactive computation
Computational Biology
fundamental systems for pattern discovery
find and forge relationship connections across formal and informal
data sources
using software tools of many types
interactive browsing and analysis
community sharing and publishing
TALK OUTLINE
Present Databases
Future Databases
Community Systems
Problems & Solutions
Present Databases
pattern discovery in the small
Status
large archives for standard sources
ubiquitous use by molecular biologists
current development projects
centrally maintained in national libraries
accessible with personal computer front ends
via modem, CD-ROM, file server
Publishing Cycle
retrieve -- submit -- edit
70s literature (MedLine)
80s sequences (GenBank)
90s maps (Genome Database)
Literature Databases
automatic retrieval
MedLine (National Library of Medicine)
Administration
standard repository for abstracts from selected journals
professional indexers generate classification
On-line Retrieval
text-based searching (word matching, field restrictions)
front end on personal computer to central mainframe
Sequence Databases
automatic submission
GenBank (Los Alamos National Laboratory/NLM)
Administration
standard repository for sequences
central curators check annotations
publishers require accession number for reference
central archive distributes database for retrieval
Electronic Data Publishing
direct author submissions via Authorin program front
end for sequences and annotations
greater scale and higher quality
Map Databases
automatic editorship
GDB - Genome Database (Yale/Johns Hopkins)
Administration
standard repository for human mapping data
distributed curators receive and edit data
central archive distributes database for retrieval
Electronic Editorial Control
separate curator(s) for each chromosome
interface supports generation of consensus map
system tracks approval process and history
Analysis Package
shell-level invocation of separate programs
GCG package (Genetics Computer Group, Inc.)
Editing (sequences)
Fragment Assembly (gels)
Mapping (fingerprints)
Comparison (graphs)
- Searching (similarity match)
- Multiple Sequence Analysis (line up)
Pattern Recognition (codons)
RNA Secondary Structure (circles)
Protein Analysis (motifs)
Translation (proteins)
Manipulation (assembly)
After defining the locus entries you wish to retrieve, you have several options for sorting that data:
Type <Control>K IN <Return>
Here the data are sorted by name:
You also can choose among several options for displaying the data.
Future Databases
analysis of archival databases
Status
Current research prototypes
Used experimentally by biologists
National Center for Biotechnology Information (NCBI)
Functionality
remote analysis servers (Blast)
archival database interconnections (Entrez)
proposed interchange standards (ASN.1)
Remote Analysis Server
dry-lab as part of experimental process
Fast Gene Identification via ESTs
C. Fields in C. Venter's lab at NIH
wet-lab
clone cDNA library
sequence EST (expressed sequence tag)
dry-lab
compare to known genes using software
start new collaboration to characterize
Blast server on NCBI machine for remote analysis
enables setup of production line
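The dry-lab step above ("compare to known genes using software") can be illustrated with a small sketch. This is not the Blast algorithm; it swaps in a much simpler shared k-mer count as a stand-in, and the gene names, sequences, and function names are all invented for illustration.

```python
from collections import Counter

def kmers(seq, k=4):
    """Count every overlapping substring of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmer_score(query, subject, k=4):
    """Number of k-mers the two sequences have in common (with multiplicity)."""
    return sum((kmers(query, k) & kmers(subject, k)).values())

def rank_known_genes(est, known, k=4):
    """Return (name, score) pairs for an EST, best match first."""
    scores = [(name, shared_kmer_score(est, seq, k)) for name, seq in known.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: names and sequences are invented, not real genes.
KNOWN_GENES = {"geneA": "ATGGCGTACGTTAGC", "geneB": "TTTTTTTTCCCCCCC"}
ranking = rank_known_genes("GCGTACGTTA", KNOWN_GENES)
print(ranking)
```

A high score only suggests a candidate match worth a closer look; a real similarity search would score alignments rather than count shared substrings.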
Interconnecting Archival Databases
standard-source analysis environment
Entrez:
Relationships within/across Sources
J. Ostell in D. Lipman's lab at NCBI/NLM
standard archive sources (MedLine, GenBank)
indexers now generate links between related items
search within each source itself
find related set within source (item similarity)
find related set across sources (follow links)
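The two kinds of relationship listed above (within a source, across sources) can be sketched as a small link graph. This is an illustrative model only, not how Entrez is implemented: the class, the item identifiers, and the choice to treat "related within a source" as "sharing a linked item in another source" are all assumptions.

```python
from collections import defaultdict

class LinkedArchive:
    """Items are (source, id) pairs; links are undirected."""

    def __init__(self):
        self.links = defaultdict(set)

    def link(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def related_across(self, item, target_source):
        """Follow links from an item into another source."""
        neighbours = self.links.get(item, set())
        return sorted(n for n in neighbours if n[0] == target_source)

    def related_within(self, item):
        """Assumed similarity rule: same source, shares a linked neighbour."""
        related = set()
        for neighbour in self.links.get(item, set()):
            for candidate in self.links.get(neighbour, set()):
                if candidate[0] == item[0] and candidate != item:
                    related.add(candidate)
        return sorted(related)

# Hypothetical identifiers, invented for illustration.
archive = LinkedArchive()
archive.link(("MedLine", "M1"), ("GenBank", "G1"))
archive.link(("MedLine", "M2"), ("GenBank", "G1"))
print(archive.related_across(("MedLine", "M1"), "GenBank"))
print(archive.related_within(("MedLine", "M1")))
```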
Pattern Discovery in the Large
across sources and searches
NCBI Entrez moves towards pattern discovery in the large for archival databases
implementation requires interchange standards:
unique ids for inter-database links (accession numbers)
common formats for inter-program passing (ASN.1 + type-semantics)
Scaling the Worm Community System
build a complete system to understand the technology and observe the sociology
System Functionality
Current Technology
Model System
Worm Community System (WCS)
for nematode worm C. elegans
Community Knowledge
Archival Data
Gene Descriptions, Genetic Map
Physical Map, DNA Sequences
Cell Anatomy List, Cell Lineage Tree (Complete)
Nervous System Wiring Diagram (Complete)
WCS: Sampler
Published Literature
Bibliography, Abstracts
Full-text, Figures/Tables
Newsletter Articles, Conference Abstracts
Unpublished Material
Strain Lists
Restriction Maps, Methods
Lab Directory
External Software
Display, Analysis
Community Lore
Notes, History, Procedures
Knowledge in current WCS
- Data
Gene List        J. Hodgkin (684 genes; 1988)
Genetic Map      M. Edgley (1008 genes)
Physical Map     J. Sulston (11321 clones)
DNA Sequences    GenBank (373 sequences)
- Literature
Abstracts        MedLine/Biosis (933 of 1467)
Newsletter       complete (1627; Dec 1975 on)
Meetings         8th Int C. elegans Conf (366)
- Unpublished
Strains          Stock Center (3483)
Persons          M. Edgley (409)
Methods          Worm Book (27)
- Software
Acedb            genetic/physical map display (R. Durbin/J. Thierry-Mieg)
Gm               exon map display (C. Fields/C. Soderlund)
- Community
WCS Status
Release 1 completed summer 1991
Gnu C++ Unix software under X-windows
literature (WBG,abstracts), data (genes, maps, sequences)
runs on Suns with display on Macs
NSFNET distribution for annotations and updates
have exchanged 1500 messages with users
Now in Labs of initial Users (25 sites)
Univ Arizona, Tucson               S. Ward
Univ Colorado, Boulder             W. Wood
Simon Fraser/U British Columbia    D. Baillie, A. Rose
Univ Washington, Seattle           J. Thomas
Univ Texas, Dallas                 L. Avery
Washington Univ, St. Louis         R. Waterston
MRC Lab Molec Biol, England        R. Durbin
Univ Missouri, Columbia            D. Riddle
Univ Pittsburgh                    R. Russell
NINDS, NIH                         R. McCombie
Caltech                            P. Sternberg
MIT                                R. Horvitz
Univ Wisconsin, Madison            J. Kimble, P. Anderson
Netherlands                        R. Plasterk
Harvard                            G. Ruvkun
Univ Illinois                      NCSA (collaborators)
Univ California, Berkeley          B. Meyer
Univ California, San Francisco     C. Kenyon
Indiana University                 T. Blumenthal
Georgia Tech                       D. Dusenbery
Univ Houston                       R. Hecht, D. Shakes
Univ Arkansas                      R. Reis
Northwestern                       J. Kramer
Europe, Japan, Australia
System Functionality
Information Spaces
objects interconnected into one logical database
commands uniform across type and location
hyperdocuments with live references
Browsing
Search: associative specification
Navigation: link following
Filtering
Selection: user creation of sets
Analysis: pass into external program
Sharing
Annotation: description & grouping
Publishing: editorial/privacy control
Information Space
a collection of interconnected information units
which support transparent manipulation of objects
from multiple heterogeneous distributed sources
federated object-oriented heterogeneous distributed
- External Objects
Heterogeneous: items gathered from external sources
Homogeneous: transformed into IUs (information units)
Interconnected: relationship graph (information space)
- Object Federation
Search: transparently across multiple sources
Types: different search/display for various types
Navigation: follow/make connection links
Grouping: commands can operate on sets of IUs
?? what is universal:
links, sets, publishing
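The three universals named above (links, sets, and uniform commands across sources) can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the WCS implementation: the class names, handle format, and sample records are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IU:
    """An information unit: one external item in homogeneous form."""
    handle: str   # assumed unique, invariant identifier
    kind: str     # e.g. "gene", "article", "sequence"
    source: str   # where the item came from
    text: str     # searchable content

class InformationSpace:
    def __init__(self):
        self.units = {}
        self.links = {}

    def add(self, iu):
        self.units[iu.handle] = iu
        self.links.setdefault(iu.handle, set())

    def connect(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def search(self, word):
        """Search transparently across every source; returns a set of handles."""
        word = word.lower()
        return {h for h, iu in self.units.items() if word in iu.text.lower()}

    def navigate(self, handles):
        """Follow links from a whole set of IUs at once."""
        found = set()
        for h in handles:
            found |= self.links[h]
        return found - set(handles)

    def select(self, handles, kind):
        """Filter a set of IUs down to one type."""
        return {h for h in handles if self.units[h].kind == kind}

# Hypothetical sample records, invented for illustration.
space = InformationSpace()
space.add(IU("gene:unc-22", "gene", "gene list", "unc-22 gene description"))
space.add(IU("news:1", "article", "newsletter", "Notes on unc-22 mapping"))
space.add(IU("seq:1", "sequence", "sequence database", "ACGT"))
space.connect("gene:unc-22", "seq:1")
hits = space.search("unc-22")
print(sorted(hits), sorted(space.navigate(hits)))
```

Every command takes or returns a set of handles, which is one way to read "commands can operate on sets of IUs".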
Building Interconnections
transforming external data into interconnected information
- Paper
e.g. worm newsletter with links to genes
scan/recognize text, display text/figures; parse text
- File
e.g. gene list with links to literature
reformat, hand-correct; use database references
- Database
e.g. physical map with links to sequences
re-format, live graphical display; use common names
- Software
e.g. sequence displayer invoked with gene names
separate display/interaction; need common windows
- Objects
e.g. external analysis, external annotations
requires canonical representations, unique invariant handles
?? how are links generated:
automatic vs manual
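One automatic route for the "Paper" case above (newsletter text with links to genes) is to scan the text for strings shaped like gene names and keep those that appear in a known gene list. The sketch below assumes the usual C. elegans naming shape of three or four lowercase letters, a hyphen, and a number; the function name and the sample sentence are invented for illustration.

```python
import re

# Assumed name shape: 3-4 lowercase letters, hyphen, digits (e.g. unc-22).
GENE_NAME = re.compile(r"\b[a-z]{3,4}-\d+\b")

def find_gene_links(text, known_genes):
    """Return known gene names mentioned in text, in order of first mention."""
    found = []
    for match in GENE_NAME.finditer(text):
        name = match.group(0)
        if name in known_genes and name not in found:
            found.append(name)
    return found

# Invented sample sentence; "xyz-999" is deliberately not in the gene list.
article = "We mapped unc-22 relative to lin-12 (see also xyz-999 and unc-22)."
links = find_gene_links(article, {"unc-22", "lin-12", "dpy-5"})
print(links)
```

Checking candidates against the gene list is what keeps pattern-shaped noise out; names missing from the list would still need manual linking.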
Electronic Publishing
towards electronic knowledge in a national community library
- mechanisms for user-entered data
-- Editorial Control
Refinement Levels and Tracking Mechanisms
Posted       annotations   e.g. notes
Moderated    bulletins     e.g. newsletters
Curated      archival      e.g. maps
Refereed     factual       e.g. journals
-- Privacy Control
Permission Levels and Protection Mechanisms
Personal     private
Group        local
Group        topic
Community    everyone
- true hyperdocuments with objects and links
documents become primarily embedded objects
?? variant displays of user-composed
data sets
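The refinement and permission levels listed under Editorial Control and Privacy Control can be sketched as two enumerations plus a tracked approval history. This is a minimal illustration, not the WCS design: treating the refinement levels as an ordered ladder, the group membership check, and all names are assumptions.

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Refinement(IntEnum):   # assumed ordering: least to most reviewed
    POSTED = 1
    MODERATED = 2
    CURATED = 3
    REFEREED = 4

class Privacy(IntEnum):
    PERSONAL = 1
    GROUP_LOCAL = 2
    GROUP_TOPIC = 3
    COMMUNITY = 4

@dataclass
class Entry:
    owner: str
    privacy: Privacy
    refinement: Refinement = Refinement.POSTED
    group: frozenset = frozenset()
    history: list = field(default_factory=list)

def can_read(user, entry):
    """Protection mechanism: who may see a user-entered item."""
    if entry.privacy is Privacy.COMMUNITY:
        return True
    if entry.privacy is Privacy.PERSONAL:
        return user == entry.owner
    return user == entry.owner or user in entry.group

def promote(entry, approver):
    """Tracking mechanism: move up one level and record who approved it."""
    if entry.refinement is not Refinement.REFEREED:
        entry.refinement = Refinement(entry.refinement + 1)
        entry.history.append((approver, entry.refinement.name))
    return entry

# Hypothetical users and entry, invented for illustration.
note = Entry("alice", Privacy.GROUP_LOCAL, group=frozenset({"bob"}))
promote(note, "editor")
print(can_read("bob", note), can_read("carol", note), note.history)
```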
Network Caching
to provide interactive retrieval across wide-area networks
need to guarantee response time to user
perceptual immediacy is 1/4 second
so is nationwide roundtrip packet transmission
latency dominates bandwidth
so the concern is semantic caching rather than streamlined protocols
- Caching Policies
fetching dominates replacement for information spaces
policies determine what and when to fetch
transmit only what needed for interaction
summary versus contents
demand subsets, incremental lookahead
?? which policies make WANs approximate LANs
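Two of the fetching policies above, "summary versus contents" and "incremental lookahead", can be sketched as a small client-side cache. This is an illustrative sketch only: the class, the callback signatures, and the lookahead size are assumptions, and the remote server is simulated by local functions.

```python
class SummaryFirstCache:
    """Fetch short summaries for browsing; fetch full contents only on demand."""

    def __init__(self, fetch_summary, fetch_contents, lookahead=2):
        self.fetch_summary = fetch_summary      # hypothetical remote call
        self.fetch_contents = fetch_contents    # hypothetical remote call
        self.lookahead = lookahead
        self.summaries = {}
        self.contents = {}
        self.remote_calls = 0

    def list_view(self, ids, start, count):
        """Return summaries for the visible window, prefetching a little past it."""
        for item in ids[start:start + count + self.lookahead]:
            if item not in self.summaries:
                self.summaries[item] = self.fetch_summary(item)
                self.remote_calls += 1
        return [self.summaries[item] for item in ids[start:start + count]]

    def open(self, item):
        """Transmit the full contents only when the user asks for them."""
        if item not in self.contents:
            self.contents[item] = self.fetch_contents(item)
            self.remote_calls += 1
        return self.contents[item]

# Simulated server: invented items standing in for remote information units.
ids = ["a", "b", "c", "d", "e"]
cache = SummaryFirstCache(lambda i: "summary of " + i, lambda i: "contents of " + i)
cache.list_view(ids, 0, 2)   # fetches a, b and prefetches c, d
cache.list_view(ids, 2, 2)   # c, d already cached; only e is fetched
cache.open("a")
cache.open("a")              # second open is served locally
print(cache.remote_calls)
```

Scrolling to the next window costs one remote call instead of two, which is the sense in which lookahead hides wide-area latency from the user.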
Network Protocols
applications level interconnection
- TCP/IP pitched at level of data transmission
- enabled interconnection of networks into Internet
- NSFNET sufficient speed for interactive manipulation
- new models support transparency across data sources
- information spaces support interlinked uniform data
- ITP pitched at level of information interconnection
- will enable interconnection of information spaces physically residing in distributed heterogeneous sources
potential new information infrastructure:
community-specialized information spaces
interconnected by standard protocols for information units
Information Retrieval
domain and type specific semantic matching
goal is to aid scientist in discovering patterns
- Current Technology
canonical word matching with booleans and proximity
phrase matching with weighted word frequencies
primarily useful with precise domain terminology
fragile outside of domain experts
- Automatic Generation of Thesaurus
synonyms for commonly occurring phrases
"sperm swim using actin" versus "sperm crawl using MSP"
generate via term frequency co-occurrence on typical corpus
- Case-Frame Analysis of Relationships
use parts of speech to determine usage context
"bombesin stimulates appetite in wolf pups"
generate Planner-type relations via sentence templates and domain-specific
grammar/terms
- Non-textual Shape Matching
homology search for DNA sequences used to use only approximate string matching
now use molecular level equivalence heuristics
developmental lineage is represented by tree of cells
need semantic matching on trees due to node equivalences and link weighting
?? Feasibility of Large-Scale
Semantic Matching
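The thesaurus idea in the list above (terms that co-occur with the same neighbours are candidates for related terms) can be sketched with context vectors and cosine similarity. This is one common way to use co-occurrence, assumed here for illustration; the two-sentence corpus is the slide's own example and is far too small to separate true synonyms from other co-occurring words.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences):
    """For each word, count the other words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related_terms(term, vectors, top=3):
    """Rank other terms by how similar their contexts are to the term's."""
    target = vectors[term]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top]

corpus = ["sperm swim using actin", "sperm crawl using MSP"]
vectors = context_vectors(corpus)
candidates = related_terms("swim", vectors)
print(candidates)
```

On a realistic corpus the candidate list would still need review by someone who knows the domain before entries go into a thesaurus.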
Community Systems
Architecture
distributed server model for external object manipulation
- Interface Server
- Information Server
type displays, information units
- Network Server
- Search Server
- Object Store
- Analysis Server
- External Databases
Pattern Discovery
type-specific semantic matching
- Sequences / Structures
- Text
words and phrases
thesaurus and context
- Maps
ranges
correspondences
- Trees
cell lineages
neuron connections
- Networks
functions of genes
heuristic links --
pattern discovery in the large
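Type-specific matching, as listed above, suggests dispatching each comparison to a matcher chosen by data type. The sketch below covers only two of the listed types with deliberately simple stand-ins: edit distance for sequences (plain approximate string matching, not a molecular-level heuristic) and interval overlap for map ranges. The dispatch table and all names are assumptions.

```python
def edit_distance(a, b):
    """Minimum number of single-character edits turning string a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def range_overlap(a, b):
    """Length of the overlap between two (start, end) map ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

# Hypothetical dispatch table: one matcher per data type.
MATCHERS = {"sequence": edit_distance, "map": range_overlap}

def match(kind, a, b):
    """Apply the matcher registered for this type of data."""
    return MATCHERS[kind](a, b)

print(match("sequence", "ACGT", "AGGT"), match("map", (10, 20), (15, 30)))
```

Trees and networks would need their own matchers registered in the same table.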
Foreseeable Future
analysis environments for dry-lab biology
Plug and Play Environments [data]
any organism / community
any data source
any analysis program
Interactive User Control [user]
complete search / analysis
complete variant displays
complete hyperset publishing
Universal Virtual Access [net]
any local platform
any remote server
any data connection
Building The Interspace
everything goes in with transparent manipulation
everyone gets credit with community sharing
merging information spaces from other communities:
molecular biology (worms, flies, yeast, coli, mice, men)
neuroscience
oceanography
other sciences...
other domains...
TOWARDS THE WORLD NET
Today the Worm
Tomorrow the World