Current Protocols Library
CURRENT PROTOCOLS IN BIOINFORMATICS
FRONT MATTER
PUBLICATION INFORMATION
EDITORIAL BOARD
Andreas D. Baxevanis (Editor-in-Chief)
National Human Genome Research Institute
National Institutes of Health
Bethesda, Maryland
Daniel B. Davison (Editor-in-Chief)
Bristol-Myers Squibb Pharmaceutical Research Institute
Hopewell, New Jersey
Roderic D. M. Page
University of Glasgow
Glasgow, Scotland
Gregory A. Petsko
Brandeis University
Waltham, Massachusetts
Lincoln D. Stein
Cold Spring Harbor Laboratory
Cold Spring Harbor, New York
Gary D. Stormo
Washington University School of Medicine
St. Louis, Missouri
SERIES EDITOR
Shonda Leonard
Rockville, Maryland
Copyright © 2002 by John Wiley & Sons, Inc.
All rights reserved.
Reproduction or translation of any part of this work beyond that
permitted by Section 107 or 108 of the 1976 United States Copyright Act
without the permission of the copyright owner is unlawful. Requests for
permission or further information should be addressed to the
Permissions Department, John Wiley & Sons, Inc.
While the authors, editors, and publisher believe that the specification
and usage of reagents, equipment, and devices, as set forth in this
book, are in accord with current recommendations and practice at the
time of publication, they accept no legal responsibility for any errors or
omissions, and make no warranty, express or implied, with respect to
material contained herein. Moreover, the information presented herein is
not a substitute for professional judgment. In view of ongoing research,
equipment modifications, changes in governmental regulations, and the
constant flow of information relating to the use of experimental reagents,
equipment, and devices, the reader is urged to review and evaluate the
information provided in the package insert or instructions for each
chemical, piece of equipment, reagent, or device for, among other
things, any changes in the instructions or indication of usage and for
added warnings and precautions. This is particularly important in regard
to new or infrequently employed chemicals or experimental reagents.
Library of Congress Cataloging in Publication Data:
Current protocols in bioinformatics / editorial board Andreas Baxevanis
(editor-in-chief) and Daniel B. Davison (editor-in-chief) [et al.].
v. ; cm.
Includes index.
ISBN 0-471-25093-7 (cloth : alk. paper)
From Current Protocols in Bioinformatics Online
Copyright © 2002 John Wiley & Sons, Inc. All rights reserved.
FOREWORD
During the last 25 years, computers have moved from being an esoteric
tool of the mathematicians and physicists into the mainstream of our
daily existence. Increasingly, they are an essential component of
modern living. Nowhere is this more apparent than in biology, where the
combination of vast databases of information and clever computer
programs to manipulate and mine that data now permeate the practice
of our science. The new discipline of bioinformatics has not only gained
credibility, but is being offered in courses throughout our colleges and
universities. In some forward-looking institutions, whole departments
dedicated to bioinformatics are springing up.
Despite this move to the mainstream, for many molecular biologists,
some of whom I will charitably call "more mature," bioinformatics
remains something of an enigma. Not quite sure what it means and
being unable or unwilling to tinker with a computer themselves, they
have nevertheless realized its importance for their research. They have
been happy to harness the computer-savvy graduate student in their
group, who prefers to sit behind a terminal rather than stand over a lab
bench. However, they have often been frustrated by their lack of ability
to either perform the analyses themselves or even to know the
limitations of the results. Fortunately, help is at hand. Now, anyone who
needs to know more about bioinformatics, and especially how to do it
themselves, should find this book, Current Protocols in Bioinformatics,
and its constant updates, to be especially valuable.
Because bioinformatics is very much a hands-on subject, this latest
addition to the Current Protocols series will be much welcomed. Both the
novice user and the more knowledgeable, but occasional, user will find
the information in this book to be well presented and most helpful. While
not a tutorial, the examples chosen for inclusion introduce the reader to
all of the essentials of bioinformatics in a format that will make it easy for
even the most mature professor to work through. When that eager
graduate student finally produces the sequence of your favorite gene,
you will be able to retreat to your office. There you will be able to consult
this book and undertake a comprehensive bioinformatics analysis
yourself, merely by following the protocols. If you are lucky you may
even be able to impress that graduate student with your own erudition,
when you discover some novel property of the gene that was predicted
by one of the tools illustrated.
Since the landmark publication in 1995 of the first complete sequence of
a free-living organism, the bacterium Haemophilus influenzae, genomic
biology has flourished. By using DNA sequence to serve as a framework
upon which to think about the workings of organisms, a rigor has
entered biology that had previously been reserved for the "hard"
sciences. Most remarkably, in the last seven years we have learned how
little we know about biology and just how much remains to be
discovered. Thanks to bioinformatics, we are beginning to make inroads
in our understanding of DNA sequences and are making progress in
predicting the biological properties of the organisms with which we share
this planet. Properly used, as illustrated in the protocols of this book,
bioinformatics can be a wonderful generator of hypotheses. As a
discovery tool it is unparalleled. To the biologists of the twenty-first
century, a good working knowledge of bioinformatics may be more
important than learning how to run a centrifuge. But do not abandon that
centrifuge just yet. The very best biologists will combine their knowledge
of bioinformatics with the skepticism that demands those hypotheses be
tested experimentally. In this way we can be assured that bioinformatics
and biological reality will keep in step.
Richard J. Roberts
New England Biolabs
Beverly, Massachusetts
PREFACE
INTRODUCTION
The field of bioinformatics has come into full view recently, primarily
because of the significant advances made by the Human Genome
Project and other systematic sequencing projects, and the necessity for
all biologists to be able to apply—at some level—these techniques to
their own research. It may come as a surprise to most readers that the
origins of the field of bioinformatics go well back into the 1960s, with the
pioneering work performed by Margaret Dayhoff and her colleagues,
who looked at a then limited number of protein sequences. The work
performed by Dayhoff and her colleagues set the stage for the field as
we know it today.
Bioinformatics occupies a unique niche amongst the sciences, lying at
the intersection of biology, genetics, biochemistry, computer science,
mathematics, statistics, and numerous other allied fields. The inherent
strength of the field of bioinformatics comes from the relationships
between investigators in these allied fields; collaborations between
these individuals have led to (and will continue to lead to) the
development of novel methods and approaches, furthering advances in
each of these areas. Such collaborations also set the stage for the
piloting of experiments on computers, followed by the verification of the
computational results in the laboratory.
The central role of bioinformatics has been highlighted by numerous
studies, including one by the Biomedical Information Science and
Technology Initiative (BISTI). This task force
underscored the importance of bioinformatics support and education and
its critical role in the advancement of modern science; without
bioinformatics-based techniques, the scientific community would not be
able to extract, view, or analyze the data being generated by any type of
large-scale study, whether it be at the genomic, transcriptomic, or
proteomic level. It becomes quite apparent that, regardless of the area
of expertise of any given biologist, a firm grasp of basic bioinformatic
techniques will become an essential—and indispensable—part of the
"scientific arsenal" in tackling biological problems from now on.
OVERVIEW AND PHILOSOPHY
Current Protocols in Bioinformatics is designed to provide the
experimentalist with insight into the types of data and protocols required
to perform basic tasks in the area of bioinformatics. More importantly, it
provides insight into understanding and properly interpreting the data
produced by these methods. The Current Protocols series is known for
its fast and timely publication of valuable and cutting-edge methods; this
book takes that mandate one step further. Initial online installments are
being offered in advance of the publication of the print manual. This
enables us to deliver much needed methods as soon as they are
available. The topics described below reflect the planned content for the
first year's worth of installments.
One of the most important things that the Editors and individual authors
contributing to this work can do is to drive home the importance of
manually inspecting the data produced by these methods—even though
a particular method may produce a result, that result may not actually
be biologically relevant or make any sort of sense in the context of the
experiment being performed. There is never any substitute for manual
inspection of results, with sophisticated users keeping their "biology hat"
on as they peruse the results provided by the computer.
The overall organization of Current Protocols in Bioinformatics is the
product of a significant amount of discussion between the Editors, who
have brought to bear their own individual experience from both research
and teaching in how to best convey a logical, workflow-based path
throughout the various concepts presented herein. Current Protocols in
Bioinformatics begins with a discussion of the most commonly used
sources of public data, giving the reader an appreciation for the types of
questions that can be answered using publicly available databases
(Chapter 1). With this as a basis, the book then marches through the
major topics within the field of bioinformatics. First, the reader is
introduced to methods allowing for the recognition of functional domains
(Chapter 2), both at the nucleotide and protein level. These concepts are
expanded upon in the following chapter, devoted to similarity searching
and the inference of homology, providing the reader useful information
regarding the differences between the types of available search
algorithms and the reasons for finding homologs (Chapter 3).
One of the major goals of the Human Genome Project is to identify all
genes within the genome, and Chapter 4 is devoted to methods on this
front, as well as to gene-finding strategies and cautions. Moving up in
complexity, Chapter 5 will cover topics related to molecular modeling,
including methods such as homology model building and visualization of
molecular models. Chapter 6 invokes the interrelationships between
proteins from an evolutionary standpoint, providing the reader with an
understanding of the concepts behind both conservation and evolution
of function within the cell. Chapters 7 and 8 will provide the reader with
an appreciation for the interrelatedness of molecular processes; in
Chapter 7, this is presented from the standpoint of gene expression and
the analysis of gene expression patterns, while in Chapter 8 it is
presented from the standpoint of intermolecular interactions.
Since so much of bioinformatics and computational biology is dependent
upon databases, a thorough treatment of the construction of databases
is included (Chapter 9). While this may seem outside the scope of what
some biologists would do themselves, more and more biologists are
actively involved in the creation of databases for the warehousing of
data generated by their own laboratories.
Chapters 10 and 11 will deal with large data sets, with respect to both
assembling massive amounts of sequence-based data and then
performing comparisons between such large data sets. Finally, we will
cover the computations behind the application of mass spectrometry to
relevant biological questions (Chapter 12), as well as the techniques that
can be used at the RNA level (Chapter 13), methods that are
unfortunately often overlooked.
HOW TO USE THIS MANUAL
Format and Organization
This publication, currently available online, will be published in the
traditional Current Protocols looseleaf and CD-ROM formats by the end
of the fourth installment.
Each chapter in this work represents a general subject area, with
individual protocols contained in units within each chapter. In general,
each unit describes a method and includes one or more protocols. Each
protocol provides information on required resources, steps and
annotations, data interpretation, and commentaries on the "hows" and
"whys" of the method. In addition, each chapter has an overview unit,
providing a broad perspective on the general subject area, as well as
any theoretical discussion that the reader will need as a foundation for
the material covered in the individual units within that chapter. Since this
field is Web-intensive, links to useful resources are provided in each
unit.
Introductory and Explanatory Information
Since this publication is, first and foremost, a compilation of techniques
in bioinformatics, explanatory information aimed at giving the reader an
intuitive grasp of the procedures is included. As stated above, chapters
begin with overview units that provide biological context for the
procedures that follow in that chapter. Each unit contains an Introduction
that describes how the protocols that follow connect to one another, and
annotations within the protocol itself describe the particulars of each
step in the method.
Where relevant, the unit authors have provided sample data sets that
the reader can use to reproduce the output presented in their units.
Readers are strongly encouraged to make use of these data sets (found
on the Current Protocols Web site), both to understand how to structure
their own raw data and to gain first-hand experience with the methods
themselves.
As one can imagine, none of this material is of any use in the absence of
an explanation of how one should interpret the output from any given
method. Each protocol-based unit provides a separate section on
Guidelines for Understanding Results. The individual authors, experts in
their respective fields, have taken great care to provide the user with a
basic understanding of how to interpret their results. In some cases,
examples of bad or misleading results are also given, thereby helping
the reader develop a critical perspective on the use of these methods.
Finally, each protocol-based unit closes with a Commentary, giving
background information regarding the historical and theoretical
development of the method, the importance of critical parameters used
in the protocol, and alternative approaches that could accomplish the
same end. All units contain
references to the primary literature, which the user is encouraged to
read to gain a better appreciation for the methods described in the
protocols.
Protocols
Many units in Current Protocols in Bioinformatics contain groups of
protocols, each presented as a discrete series of steps. The Basic
Protocol, presented first in each unit, is the generally recommended or
most universally applicable approach. Alternate Protocols are provided
where variations on the Basic Protocol can be employed to achieve
similar ends, or where requirements for the end result vary from those
for the Basic Protocol. Support Protocols describe additional steps that
are required to perform the Basic or Alternate Protocols and that stand
alone as "subroutines."
A series of appendices is provided, with information on concepts that are
applicable across the individual chapters and units. These appendices
include examples of common file formats, the interconversion between
common file formats, basic Unix commands, and the use of X-Windows.
In order to remain accessible to the typical biologist, a strong emphasis
has been placed on Web-based solutions. In many cases, though, a
Unix-based method may be described, either because it is the only type
of solution available, or because it provides distinct and significant
advantages over any available Web-based version of the same
program.
Most of the protocols included in this manual are used by our own
research groups as a routine part of our everyday work. As such, we
have learned many of the intricacies of the programs, and have made an
effort to share this information with the readers of Current Protocols in
Bioinformatics. Critical steps and parameters are annotated where this is
appropriate, providing the reader with a "troubleshooting guide" as well
as an insight into "tricks of the trade."
Reader Feedback
The successful evolution of this manual into a resource that meets the
needs of its readership depends not only upon the perspective and
expertise of our colleagues, but upon the observations, experiences,
and suggestions of our readership. A reader-response survey can be
found on the Current Protocols in Bioinformatics Web page, and we
strongly encourage our readers to use this survey to provide us with
their constructive comments.
Acknowledgements
There are many individuals whom we must thank, without whose efforts
this work would not have become a reality. First and foremost, our
thanks go to all of the authors whose individual contributions make up
this work. The expertise and professional viewpoints that these
individuals bring to bear go a long way in making this work's content as
strong as it is. We also thank our Senior Editor, Ann Boyle, as well as
our Developmental Editor, Shonda Leonard, for their wisdom, patience,
and support in helping to shape Current Protocols in Bioinformatics into
a strong, valuable resource for the biological community. We are
fortunate to have them on our team, and look forward to continuing our
work with them as this work continues to grow and evolve. Other skilled
members of the Current Protocols staff who contributed to the success
of this project include Scott Holmes, Tom Cannon Jr., Michael Gates,
and Joseph White. The extensive copyediting required to produce an
accurate protocols manual was ably handled by Allen Ranz, Tom
Downey, and Susan Lieberman.
Andreas D. Baxevanis, Daniel B. Davison, Roderic D. M. Page, Gregory
A. Petsko, Lincoln D. Stein, and Gary D. Stormo
CONTRIBUTORS
The listings below note the current affiliations of contributors to
Current Protocols in Bioinformatics (i.e., these affiliations
supersede those listed at the end of each protocol). The list will
be updated annually.
Timothy L. Bailey
University of Queensland
Brisbane, Australia
Andreas D. Baxevanis
National Human Genome Research Institute
National Institutes of Health
Bethesda, Maryland
Judith A. Blake
The Jackson Laboratory
Bar Harbor, Maine
Enrique Blanco
Universitat Pompeu Fabra
Barcelona, Spain
Andrew Conway
Silicon Genetics
Redwood City, California
Daniel B. Davison
Bristol-Myers Squibb Pharmaceutical Research Institute
Hopewell, New Jersey
Bjarte Dysvik
University of Bergen MolMine AS
Bergen, Norway
Olivier Gascuel
Equipe "Methodes et Algorithmes pour la Bioinformatique"
LRMM-CNRS
Montpellier, France
Toby J. Gibson
European Molecular Biology Laboratory
Heidelberg, Germany
Elizabeth A. Greene
Fred Hutchinson Cancer Research Center
Seattle, Washington
Roderic Guigo
Universitat Pompeu Fabra
Barcelona, Spain
Midori A. Harris
Wellcome Trust Genome Campus
Cambridge, United Kingdom
Matthew Healy
Bristol-Myers Squibb Pharmaceutical Research Institute
Wallingford, Connecticut
Jorja G. Henikoff
Fred Hutchinson Cancer Research Center
Seattle, Washington
Steven Henikoff
Fred Hutchinson Cancer Research Center
Seattle, Washington
Des G. Higgins
University College
Cork, Ireland
D. Curtis Jamison
George Mason University
Manassas, Virginia
Inge Jonassen
University of Bergen MolMine AS
Bergen, Norway
Istvan Ladunga
Celera Genomics
Foster City, California and
Research Group for Evolutionary Genetics
Hungarian Academy of Sciences
Eotvos University
Budapest, Hungary
Shonda Leonard
Rockville, Maryland
Juliane Murphy
National Human Genome Research Institute
National Institutes of Health
Bethesda, Maryland
Roderic D.M. Page
University of Glasgow
Glasgow, Scotland
Genis Parra
Universitat Pompeu Fabra
Barcelona, Spain
Mihaela Pertea
The Institute for Genomic Research
Rockville, Maryland
Shmuel Pietrokovski
Weizmann Institute of Science
Rehovot, Israel
Steven L. Salzberg
The Institute for Genomic Research
Rockville, Maryland
Lincoln D. Stein
Cold Spring Harbor Laboratory
Cold Spring Harbor, New York
Gary D. Stormo
Washington University School of Medicine
St. Louis, Missouri
Nick Taylor
Fred Hutchinson Cancer Research Center
Seattle, Washington
Julie D. Thompson
Institut de Genetique et de Biologie Moleculaire et Cellulaire
Illkirch Cedex, France
David Wheeler
Human Genome Center
Baylor College of Medicine
Houston, Texas
Michael Q. Zhang
Cold Spring Harbor Laboratory
Cold Spring Harbor, New York
CHAPTER 1 USING BIOLOGICAL
DATABASES
UNIT 1.1 The Importance of Biological Databases in
Biological Discovery
CONTRIBUTORS
Contributed by Andreas D. Baxevanis
National Human Genome Research Institute
National Institutes of Health
Bethesda, Maryland
Published Online: August 2002
INTRODUCTION
In April 2003, the biological community will celebrate the completion of
the Human Genome Project's major goal, the complete, accurate, and
high-quality sequencing of the human genome (Collins et al., 1998). The
attainment of this goal, which many have compared to landing a man on
the moon, will obviously have a profound effect on how biological and
biomedical research will be conducted in the future. The free availability
of not just human genome data, but human sequence variation data,
model organism sequence data, and information on gene structure and
function provides fertile ground for the biologist to better design and
interpret their experiments in the laboratory, fulfilling the promise of
bioinformatics in advancing and accelerating biological discovery.
The database that most biologists are familiar with is GenBank, the
annotated collection of all publicly available DNA and protein
sequences. This database, maintained by the National Center for
Biotechnology Information (NCBI) at the National Institutes of Health,
represents a collaborative effort between NCBI, the European Molecular
Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ).
At the time of this writing, GenBank contained >17 billion nucleotide
bases, representing >14 million sequences in 100,000 species. The
effect of the Human Genome Project and other systematic sequencing
projects on the accumulation of sequence data is best illustrated by the
growth of GenBank, as shown in Figure 1.1.1. The number of bases in
GenBank doubles every 14 months, and this exponential growth rate is
expected to continue for some time to come, even with the completion of
human genome sequencing.
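The doubling time quoted above can be turned into a rough projection. The
sketch below assumes idealized exponential growth, N(t) = N0 * 2^(t/14)
with t in months, and takes the figure of 17 billion bases given in the
text as its starting point. The function name and the time points are
illustrative only, and real growth will deviate from this simple model.

```python
# Idealized projection of GenBank size from a fixed doubling time.
# Assumptions (not from NCBI): growth is perfectly exponential, and the
# starting point is the ~17 billion bases quoted in the text.

DOUBLING_TIME_MONTHS = 14
BASES_AT_WRITING = 17e9


def projected_bases(months_elapsed, start=BASES_AT_WRITING,
                    doubling_time=DOUBLING_TIME_MONTHS):
    """Return the projected base count after `months_elapsed` months."""
    return start * 2 ** (months_elapsed / doubling_time)


if __name__ == "__main__":
    for months in (0, 14, 28, 60):
        print(f"{months:3d} months: {projected_bases(months):.3g} bases")
```

Under these assumptions the database would hold about 34 billion bases
after 14 months and about 68 billion after 28 months.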
The growth curve is included here to demonstrate the magnitude of the
data available to the user and, more importantly, the inherent potential in
being able to effectively and efficiently navigate through these data.
GenBank, or any other biological database for that matter, serves little
purpose unless the data can be easily searched and entries retrieved in
a usable, meaningful format. Otherwise, sequencing efforts such as
those described above have no useful end, since the biological
community as a whole cannot make use of the information hidden within
these millions of bases and amino acids. Much effort has gone into
making such data accessible to the biologist, and the programs and
interfaces resulting from these efforts are the focus of this chapter. The
chapter will provide coverage not only of GenBank and associated
databases, but of the major portals containing human and model
organism data as well. In most cases, the editors have called upon the
people actually involved in developing and maintaining these databases
in order to provide the readers with the most up-to-date view of the
content and functionality of these public resources.
While GenBank has been used as a specific example in this
introduction, the range of publicly available biological data goes far
beyond what is included in that one database. Since the major public
sequence databases need to be able to store data in a generalized
fashion, these databases often do not contain more specialized
types of information that would be of interest to specific segments within
the biological community. To address this, many smaller, specialized
databases have emerged, developed and curated by biologists "in the
trenches" to fulfill specific needs. These databases, which contain
information ranging from strain crosses to gene expression data, provide
a valuable adjunct to the more visible public sequence databases, and
the user is encouraged to make intelligent use of both types of
databases in their searches. An annotated list of such databases can be
found in the yearly Database Issue of Nucleic Acids Research
(Baxevanis, 2002), and references to these databases will be included
within this chapter as appropriate.
The position of this chapter at the beginning of Current Protocols in
Bioinformatics reflects the editors' belief that information retrieval from
biological databases provides the first step in being able to perform
robust and accurate bioinformatic analyses. The user is strongly
encouraged to work through the examples presented in this chapter and
understand how to find sequence data of interest as a basis for the more
advanced analyses presented in this work.
LITERATURE CITED
Baxevanis, A.D. 2002. The molecular biology database collection: 2002
update. Nucleic Acids Res. 30:1-12.
Collins, F.S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R.,
Walters, L., and Members of the DOE and NIH Planning Groups. 1998.
New goals for the U.S. Human Genome Project: 1998-2003. Science
282:682-689.
FIGURE(S)
Figure 1.1.1 Exponential growth of GenBank. Data obtained from the NCBI
Web site. Note that the period of accelerated growth after 1997 coincides
with the completion of the HGP's genetic and physical mapping goals,
setting the stage for systematic high-accuracy, high-throughput sequencing,
as well as the development of new sequencing technologies (cf. Collins
et al., 1998).
UNIT 1.2 Searching Online Mendelian Inheritance in
Man (OMIM) for Information for Genetic Loci
Involved in Human Disease
CONTRIBUTORS AND INTRODUCTION
Contributed by Andreas D. Baxevanis
National Human Genome Research Institute
National Institutes of Health
Bethesda, Maryland
Published Online: August 2002
Online Mendelian Inheritance in Man (OMIM) is a nonsequence-based
information resource that can be of tremendous use to genomics
researchers, physicians, and patients. OMIM is the electronic version of
the catalog of human genes and genetic disorders founded and
developed by Victor McKusick and colleagues at Johns Hopkins
University (McKusick, 1998; Hamosh et al., 2002). It provides concise
textual information from the literature on most human conditions having
a genetic basis, as well as pictures illustrating the condition or disorder
(where appropriate) and full citation information. Since the online version
of OMIM is housed at NCBI, links to Entrez are provided from all cited
references within each OMIM entry.
There are two main ways to search the OMIM database. One may search
OMIM directly from the NCBI home page (see Basic Protocol).
Alternatively, OMIM can be downloaded and run locally at sites where
users prefer not to submit data across the Web or where a local
installation would otherwise be advantageous (see Alternate Protocol).
BASIC PROTOCOL: SEARCHING OMIM OVER THE
INTERNET
OMIM may be accessed directly from the NCBI home page by clicking on
the OMIM link in the blue bar at the top of the page. This protocol
describes accessing the Web
site and entering search terms to retrieve OMIM records. It then briefly
reviews the format of an OMIM record and guides the user through the
numerous hyperlinks that are available.
The search term "synuclein" will be used as an example throughout this
protocol.
Necessary Resources
Hardware
Any Internet-connected computer
Software
Current Internet browser (e.g., Microsoft Internet Explorer, Netscape
Navigator)
Files
None required
Performing an OMIM search
1. Open the browser and go to the NCBI home page.
2. Change the search pull-down from GenBank to OMIM. Enter the search
term or terms into the text box; multiple terms may be combined with the
Boolean operators "AND," "OR," or "NOT." Each search term can, in turn,
be qualified so that it is compared only to particular parts of the OMIM
record. Once the search terms have been entered, submit the search by
pressing the Go button or by hitting Enter on the keyboard.
Consider the case where one wants to retrieve all of the entries involving
the SNCA gene in Parkinson's disease. Within the text box to the right,
one would simply type:
SNCA [GENE] AND PARKINSON [DIS]
The [GENE] qualifying the first term indicates to OMIM that this is a
gene name. The [DIS] qualifying the second term indicates that this is
the name of a Gene Map Disorder. A list of the qualifiers that can be
used in formulating a search is shown in Table 1.2.1.
At the time of this writing, this query returns two entries: one for
Parkinson Disease, Familial, Type 1 (#601508) and one for Synuclein,
Alpha (*163890), as shown in Figure 1.2.1. The search would produce the
same two entries regardless of the order of the search terms. The
numbers above each description are the OMIM accession numbers for
these entries; their significance is described below (see Understanding
the Database Record).
Several useful links can be found to the right of each of the accession
numbers in this view. The Nucleotide link takes the user to GenBank
(APPENDIX 1B), directly to the nucleotide entry for the sequence of the
gene of interest. The Protein link takes the user to the corresponding
protein entry for the gene of interest. Related Entries presents the user
with a list of all other OMIM entries that are related to the entry of
interest. The PubMed link takes the user directly to PubMed, showing all
relevant MEDLINE entries for the OMIM entry of interest.
Finally, there is a link under each found entry labeled Gene Map Locus.
This hyperlink takes the user directly to the OMIM Gene Map, which
presents the cytogenetic location and other relevant information about
each of the disease genes described within OMIM. The Gene Map is
described in a separate section, below.
3. Select the OMIM entry of interest by clicking on the corresponding
OMIM accession number. For this example, access the detailed OMIM
entry for alpha-synuclein by clicking on the hyperlinked accession
number (*163890). The top portion of the resulting detailed entry is
shown in Figure 1.2.2.
4. Select how to view the OMIM entry. The pull-down menu beside the
Display button will allow the user to change between views; once the
desired view is selected from the pull-down menu, the user should then
click Display. The default display is Detailed. Each of the options is
described in Table 1.2.2, but not all options will be available for each
OMIM entry. The user will immediately notice that there is some
redundancy built into the OMIM interface, in that certain pieces of
information can be found in more than one way. A complete description
of the detailed OMIM record can be found below (see Understanding the
Database Record section).
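The qualified Boolean query entered in step 2 can also be assembled
programmatically, which may be convenient when many gene and disorder
pairs need to be searched. The following is a minimal sketch in Python, not
part of OMIM or of any NCBI software: the function names qualify and
build_query are hypothetical, and only the [GENE] and [DIS] qualifiers and
the AND, OR, and NOT operators mentioned in this protocol are assumed.

```python
from typing import Iterable, Tuple

# Boolean operators named in step 2 of this protocol.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def qualify(term: str, qualifier: str = "") -> str:
    """Return 'term [QUALIFIER]', or the bare term if no qualifier is given.

    Hypothetical helper; the qualifier is upper-cased to match the style
    of the example query in this protocol.
    """
    term = term.strip()
    qualifier = qualifier.strip().upper()
    return f"{term} [{qualifier}]" if qualifier else term


def build_query(terms: Iterable[Tuple[str, str]], operator: str = "AND") -> str:
    """Join (term, qualifier) pairs with one Boolean operator."""
    operator = operator.upper()
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    parts = [qualify(term, qualifier) for term, qualifier in terms]
    if not parts:
        raise ValueError("at least one search term is required")
    return f" {operator} ".join(parts)


# The example from step 2: SNCA as a gene name, PARKINSON as a disorder.
query = build_query([("SNCA", "GENE"), ("PARKINSON", "DIS")])
print(query)  # SNCA [GENE] AND PARKINSON [DIS]
```

The resulting string is what would be typed into the OMIM text box; the
sketch only builds the text and does not contact NCBI.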
OMIM gene and morbid maps
5. From the detailed view of the entry (Fig. 1.2.2), click on the Gene map
locus 4q21 link that appears beneath the alternate titles and symbols
near the top of the OMIM record (or on the hyperlink marked Gene Map
in the navigation bar on the left-hand side of the page). The OMIM Gene
Map presents the cytogenetic locations of the genes described in OMIM
having a published map location. The list begins at the p telomere of
chromosome 1 and continues through to the q telomere of chromosome
22. This is then followed by the genes found on the X and Y
chromosomes. The resulting gene map is shown in Figure 1.2.3. The
header at the top of the table gives the gene range and the cytogenetic
range displayed in the chart.
For display purposes, the genes are shown in groups of 20. When two or
more genes share the same cytogenetic location, they are sorted by
primary symbol. When there is only a chromosomal location and no
cytogenetic band location, the gene is listed at the end of the
chromosome. In this case, as shown in Figure 1.2.3, the table begins
with the SNCA gene, and its cytogenetic location is listed as 4q21. The
table will always begin with the gene from which the OMIM gene map
was accessed. A complete description of the OMIM Gene Map can be
found below (see Understanding the Database Record).
6. From the OMIM Gene Map page, one can link to the OMIM Morbid
Map by clicking on the link at the top of the page. The basic feature that
differentiates the Morbid Map from the Gene Map is that the Morbid Map
presents all listed genes in alphabetical rather than chromosomal order.
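The ordering rules described in steps 5 and 6 can be made concrete with a
short sketch. The Python code below is an illustration only and is not how
OMIM itself is implemented: it assumes simple locus strings such as 4q21 or
1p36.3, treats anything without a parsable band as a chromosome-only
location, and sorts the Morbid Map view by gene symbol. All function names
are hypothetical, and every entry other than SNCA at 4q21 is a made-up
placeholder.

```python
import re
from typing import List, Tuple

Entry = Tuple[str, str]  # (primary symbol, cytogenetic locus)

# Chromosomes 1 through 22, followed by X and Y, as described in step 5.
CHROMOSOME_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y"]
# Assumed simplified locus format: chromosome, optional arm and band.
LOCUS_PATTERN = re.compile(r"^(\d+|X|Y)(?:([pq])(\d+(?:\.\d+)?))?")


def gene_map_key(entry: Entry) -> tuple:
    """Sort key: chromosome, then p telomere to q telomere, then symbol."""
    symbol, locus = entry
    match = LOCUS_PATTERN.match(locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    chromosome, arm, band = match.groups()
    rank = CHROMOSOME_ORDER.index(chromosome)
    if arm is None:
        # No band location: listed at the end of the chromosome.
        return (rank, 2, 0.0, symbol)
    if arm == "p":
        # Band numbers increase away from the centromere, so the
        # p telomere (highest band number) must sort first.
        return (rank, 0, -float(band), symbol)
    return (rank, 1, float(band), symbol)


def gene_map_order(entries: List[Entry]) -> List[Entry]:
    """Return entries in the Gene Map order described in step 5."""
    return sorted(entries, key=gene_map_key)


def display_from(entries: List[Entry], symbol: str, size: int = 20) -> List[Entry]:
    """Return one group of entries starting at the gene that was accessed."""
    ordered = gene_map_order(entries)
    start = [s for s, _ in ordered].index(symbol)  # ValueError if absent
    return ordered[start:start + size]


def morbid_map_order(entries: List[Entry]) -> List[Entry]:
    """Alphabetical order; sorting by symbol is an assumption of this sketch."""
    return sorted(entries, key=lambda entry: entry[0].upper())


# Placeholder data; only SNCA at 4q21 comes from this protocol.
entries = [("GENE_D", "Xq28"), ("SNCA", "4q21"), ("GENE_C", "4"),
           ("GENE_A", "1p36.3"), ("GENE_E", "4p16.3"), ("GENE_B", "4q21")]
print([symbol for symbol, _ in gene_map_order(entries)])
# ['GENE_A', 'GENE_E', 'GENE_B', 'SNCA', 'GENE_C', 'GENE_D']
```

Calling display_from(entries, "SNCA") returns the group that begins with
SNCA, mirroring the way the table opens at the gene from which the Gene
Map was accessed.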
OMIM hyperlinks available to the left of an OMIM record
7. Return to the page showing the Display view for the synuclein entry
(Fig. 1.2.2) by clicking the back button twice from the Morbid Map page.
In the left-hand frame, there are multiple hyperlinks that allow the user to
easily navigate through the detailed OMIM record. Specifically, the links
take users to the Description, Cloning, Gene Function, Mapping,
Molecular Genetics, Animal Model, Allelic Variants, References,
Contributors, Creation Date, and Edit History sections of the record.
Each of these subsections is described in the Understanding the
Database Record section below. A View List link just beneath the Allelic
Variants link takes users to a list of allelic variants, rather than a detailed
description of the variants.
8. The Gene map link offers users another route to the OMIM gene map
(see step 5).
LocusLink and LinkOut
9. Beneath the Gene map link is a hyperlink labeled LocusLink. Clicking
on this link brings the user to the relevant LocusLink page on the NCBI
Web site. LocusLink provides a single query interface to various types of
information regarding a given genetic locus, such as phenotypes, map
locations, and homologies to other genes. The LocusLink search space
currently includes information from humans, mice, rats, fruit flies, and
zebrafish. More information on LocusLink can be found in Baxevanis
and Ouellette (2001).
10. Return to the page showing the Display view for the synuclein entry
(Fig. 1.2.2).
11. At the bottom of the left-hand frame, there is a hyperlink labeled
LinkOut. Clicking on this link brings the user to the LinkOut resources
(Fig. 1.2.4). LinkOut is an NCBI utility designed to provide users with
direct connections to a wide variety of relevant external online
resources, including full-text publications, biological databases,
consumer health information, research tools, and more. The resulting
links are grouped into three categories: medical, molecular biology
databases, and "other." Some of the relevant links are discussed below.
The utility of LinkOut within the context of OMIM is best illustrated by
example.
a. Medical databases:
NCBI's Genes and Diseases. NCBI's Genes and Diseases database is
an extremely useful resource for physicians, researchers, and scientists
alike. It is part of an ongoing effort to map and characterize diseases
caused by a mutation in a single gene or by mutations in several genes,
such as asthma and diabetes. The Genes and Diseases site linked to
from the SNCA entry in OMIM (Fig. 1.2.5) is made up of two sections.
The first, in the large middle panel, contains an overview of the disease.
The second, on the left, contains additional links to information on SNCA
and Parkinson's disease. The most relevant source of information from a
clinical standpoint is found in the Information subsection, towards the
bottom of the left-hand sidebar. This section includes links to general
information for clinicians, physicians, and patients. It also includes the
Medline Plus feature (Fig. 1.2.6), which, when selected, provides a link to
the Clinical Trials page (Fig. 1.2.7).
b. Molecular biology databases:
Genome Database. The Genome Database (GDB) is the official central
repository for genomic mapping data resulting from the Human Genome
Initiative. The Human Genome Initiative is a worldwide research effort to
analyze the structure of human DNA and determine the location and
sequence of the human genes. In support of this project, GDB stores
and curates data generated worldwide by those researchers engaged in
the mapping effort of the Human Genome Project (HGP). The Synuclein
link to GDB displays all the information stored in GDB on SNCA (not
shown). The information displayed comprises alternate gene symbols,
the cytogenetic location of the gene and the resource used to map it,
nucleic acid links for the SNCA gene, protein links for the SNCA gene,
related amplimers and clones, polymorphisms, phenotype and homology
links, and additional external links.
Cardiff Human Gene Mutation Database. The Cardiff Human Gene
Mutation Database (HGMD) site represents an attempt to collate known
published gene lesions responsible for human inherited diseases into a
comprehensive reference source. The Cardiff Human Gene Mutation
Database provides information of practical diagnostic importance to
researchers and diagnosticians in human molecular genetics, physicians
interested in a particular inherited condition in a given patient or family,
and genetic counselors. For SNCA, the database documents two
nonsense mutations that contribute to Parkinson disease. The Web
site also offers hyperlinks to mutation maps, the cDNA native sequence,
and the SNCA entries in the Genome Database (GDB), GenAtlas, and
OMIM.
c. Other databases:
Jackson Laboratory Mouse Genome Database. The Jackson Laboratory
Mouse Genome Database (MGD) includes data on gene characterization
and nomenclature, mapping, gene homologies among mammals,
sequence links, phenotypes, allelic variants and mutants, and strain data.
Figure 1.2.8 shows the data displayed by following the link for the SNCA
gene from the OMIM database to MGD. This leads the user to the mouse
ortholog of SNCA. The Mouse Genome Database provides the
chromosomal location, alternate names, polymorphism information, and
mammalian homologies of the gene in the OMIM database.