Bioinformatics and Functional Genomics



Chapter 8: Protein analysis and proteomics


Web resources from Chapter 8
Website URL
InterPro http://www.ebi.ac.uk/interpro/
InterPro definitions and terms http://www.ebi.ac.uk/interpro/user_manual.html
SMART database glossary http://smart.embl-heidelberg.de/help/smart_glossary.shtml
The ExPASy sequence retrieval system (SRS) http://www.expasy.ch/srs5/
PROSITE http://www.expasy.org/prosite/
The Gene Ontology website http://www.geneontology.org/
Evidence codes for the Gene Ontology project http://www.geneontology.org/doc/GO.evidence.html
Fluorescence micrograph database from Kumar et al. http://ygac.med.yale.edu
The COGS database http://www.ncbi.nlm.nih.gov/COG/
Gavin et al. data http://yeast.cellzome.com
BIND (Biomolecular Interaction Network Database) http://www.bind.ca/
Ito et al. data http://genome.c.kanazawa-u.ac.jp/
The EcoCyc pathway database http://ecocyc.org
The MetaCyc pathway database http://metacyc.org/
KEGG http://www.genome.ad.jp/kegg/
The DRAGON database www.dragondb.org

 

Tables

Table 8-1. Definitions from the InterPro database of protein families and related terms (adapted from http://www.ebi.ac.uk/interpro/user_manual.html).
Term Definition
Family An InterPro family is a group of evolutionarily related proteins that share one or more domains/repeats in common. An InterPro entry of type=family may contain a signature for a small conserved region that is representative of the family, and therefore need not cover the whole protein.
Domain An InterPro domain is an independent structural unit which can be found alone or in conjunction with other domains or repeats. Domains are evolutionarily related. An InterPro entry of type=domain is diagnostic for a domain but does not necessarily define the domain boundaries exactly.
Repeat An InterPro repeat is a region that is not expected to fold into a globular domain on its own. For example, 6-8 copies of the WD40 repeat are needed to form a single globular domain. There are also many other short repeat motifs of type=repeat that probably do not form a globular fold.
Post-translational modification A post-translational modification includes, for example, an N-glycosylation site. The sequence motif is defined by the molecular recognition of this region in a cell. This may group together proteins that need not be evolutionarily related.
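
The N-glycosylation example in Table 8-1 can be made concrete with a short pattern search. The sketch below assumes the commonly used sequon N-X-[S/T], written in PROSITE style as N-{P}-[ST]-{P} (Asn, any residue except Pro, Ser or Thr, any residue except Pro). The function name and the test sequence are hypothetical, and a match only marks a candidate site, not a confirmed modification.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead lets overlapping candidate sites be reported.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(sequence):
    """Return 1-based positions of the Asn in each candidate sequon."""
    return [m.start() + 1 for m in N_GLYC.finditer(sequence.upper())]

# Hypothetical peptide: N-G-S-A and N-V-T-Q match, N-P-T-K does not.
print(find_n_glycosylation_sites("MNGSAANPTKNVTQ"))
```

Because the motif reflects molecular recognition rather than common ancestry, proteins sharing a match need not be evolutionarily related, as the table notes.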
 
Table 8-2. Definitions of protein domains and motifs from the SMART database (adapted from http://smart.embl-heidelberg.de/help/smart_glossary.shtml). SMART is a tool to allow automatic identification and annotation of domains in user-supplied protein sequences (see Chapter 10).
Term Definition
Domain Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn2+-binding or Ca2+-binding domains the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.
Domain composition Proteins with the same domain composition have at least one copy of each of the domains of the query.
Domain organization Proteins having all the domains of the query in the same order (additional domains are allowed).
Motif Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.
Profile A profile is a table of position-specific scores and gap penalties, representing an homologous family, that may be used to search sequence databases (Bork and Gibson, 1996).
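
The idea of a profile as a table of position-specific scores can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the three-column profile and its score values are invented for illustration, gap penalties are left out entirely, and the function names are hypothetical. It is not the scoring scheme of SMART or of any other tool named here.

```python
# Toy profile: one dict of residue scores per position (values invented).
PROFILE = [
    {"C": 5, "S": 1},
    {"G": 4, "A": 2},
    {"H": 6, "N": 1},
]
DEFAULT_SCORE = -2  # score for any residue not listed at a position

def score_window(window, profile=PROFILE):
    """Sum the position-specific scores for one ungapped window."""
    return sum(col.get(res, DEFAULT_SCORE) for col, res in zip(profile, window))

def best_match(sequence, profile=PROFILE):
    """Return (score, 0-based start) of the best ungapped window, or None."""
    width = len(profile)
    scored = [(score_window(sequence[i:i + width], profile), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored) if scored else None

print(best_match("MKCGHLL"))
```

A real profile also carries gap penalties for each position, which requires a dynamic-programming alignment rather than the simple sliding window used here.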
 
Table 8-3. Fifteen most common domains of Homo sapiens. From the European Bioinformatics Institute (EBI) proteome analysis site (http://www.ebi.ac.uk/proteome/)(August, 2002), based upon the InterPro database (http://www.ebi.ac.uk/interpro/index.html).
InterPro ID Matches per genome Number of proteins Name
IPR000822 30034 1093 Zn-finger, C2H2 type
IPR003006 2631 1032 Immunoglobulin/major histocompatibility complex
IPR000561 4985 471 EGF-like domain
IPR001841 1356 458 Zn-finger, RING
IPR001356 2542 417 Homeobox
IPR001849 1236 405 Pleckstrin-like
IPR000504 2046 400 RNA-binding region RNP-1 (RNA recognition motif)
IPR001452 2562 394 SH3 domain
IPR002048 2518 392 Calcium-binding EF-hand
IPR003961 2199 300 Fibronectin, type III
IPR001478 1398 280 PDZ/DHR/GLGF domain
IPR005225 261 261 Small GTP-binding protein domain
IPR000210 583 236 BTB/POZ domain
IPR001092 713 226 Basic helix-loop-helix dimerization domain bHLH
IPR002126 5168 226 Cadherin
 
Table 8-4. Some physical properties of proteins. Abbreviations: G protein, GTP-binding protein; nAChR, nicotinic acetylcholine receptor.
property Classical method Example
amino acid motifs   PDZ domain (e.g. nitric oxide synthase), coiled coil domain (e.g. hemagglutinin, syntaxin, SNAP-25, myosin)
isoelectric point (pI) derived from isoelectric focusing  
molecular weight derived from Stokes radius and sedimentation coefficient  
posttranslational modifications: phosphorylation Enzymatic analyses synapsin
posttranslational modifications: glycosylation Enzymatic analyses nerve growth factor, neural cell adhesion molecule
posttranslational modifications: isoprenylation   lamin B, G protein γ subunits, rab3A
posttranslational modifications: palmitoylation   β-adrenergic receptor, GAP-43, insulin receptor, rhodopsin, nAChR
posttranslational modifications: myristoylation   PKA, Giα-subunit, MARCKS protein, calcineurin
posttranslational modifications: GPI-anchored proteins Enzymatic analyses alkaline phosphatase, thy-1, prion protein, 5’-nucleotidase, uromodulin
sedimentation coefficient derived from sucrose density gradients  
Stokes radius derived from gel filtration  
transmembrane domain derived from subcellular fractionation  
 
Table 8-5. Participating organizations and databases in the Gene Ontology Consortium. Adapted from http://www.geneontology.org/.
Database or organization Organism Common name
Berkeley Drosophila Genome Project Drosophila melanogaster Fly
Compugen    
DictyBase Dictyostelium discoideum slime mold
European Bioinformatics Institute (EBI) Various  
FlyBase Drosophila melanogaster Fly
Genome Knowledge Base (GKB) at Cold Spring Harbor Laboratory Various  
Gramene Oryza sativa; other grains, monocots Rice
Mouse Genome Database (MGD) & Gene Expression Database (GXD) Mus musculus Mouse
Pathogen Group at the Wellcome Trust Sanger Institute Various  
PomBase Schizosaccharomyces pombe Fission yeast
Rat Genome Database (RGD) Rattus Rat
Saccharomyces Genome Database (SGD) Saccharomyces cerevisiae Baker’s yeast
The Arabidopsis Information Resource (TAIR) Arabidopsis thaliana Thale cress
The Institute for Genomic Research (TIGR) Various  
WormBase Caenorhabditis elegans Worm

 

Table 8-6. Web sites useful to access Gene Ontology data.
Browser Description
AmiGO from BDGP A GO browser from the Gene Ontology Consortium
MGI GO Browser From the Mouse Genome Informatics web site at the Jackson Laboratories
QuickGO at EBI From the EMBL and European Bioinformatics Institute; integrated with InterPro (Chapter 10)
EP GO Browser A Gene Ontology browser and analysis tool that is part of the Expression Profiler suite at the European Bioinformatics Institute
CGAP GO Browser From the Cancer Gene Anatomy Project at the NIH
 
Table 8-7. Evidence codes for the Gene Ontology project. Adapted from http://www.geneontology.org/.
Abbreviation Evidence code Example(s)
IC Inferred by curator A protein is annotated as having the function of a “transcription factor.” A curator may then infer that the localization is “nucleus”
IDA Inferred from direct assay An enzyme assay (for function); immunofluorescence microscopy (for cellular component)
IEA Inferred from electronic annotation Annotations based on “hits” in searches such as BLAST (but without confirmation by a curator; compare ISS)
IEP Inferred from expression pattern Transcript levels (e.g. based on Northern blotting or microarrays) or protein levels (e.g. from Western blots)
IGI Inferred from genetic interaction Suppressors; genetic lethals; complementation assays; experiments in which one gene provides information about the function, process, or component of another gene
IMP Inferred from mutant phenotype Gene mutation; gene knockout; overexpression; antisense assays
IPI Inferred from physical interaction Yeast two-hybrid assays; copurification; co-immunoprecipitation; binding assays
ISS Inferred from sequence or structural similarity Sequence similarity; domains; BLAST results that are reviewed for accuracy by a curator
NAS Non-traceable author statement Database entries such as a SwissProt record that does not cite a published paper
ND No biological data available Corresponds to “unknown” molecular function, biological process, or cellular component
TAS Traceable author statement Information in a review article or dictionary
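
The evidence codes in Table 8-7 lend themselves to a simple lookup, which is useful when filtering annotations by how they were derived. The sketch below is illustrative only: the annotation tuples and gene names are hypothetical, and the only distinction drawn is the one the table itself makes, namely that IEA annotations have not been confirmed by a curator.

```python
# Evidence codes and names as listed in Table 8-7.
EVIDENCE_CODES = {
    "IC": "Inferred by curator",
    "IDA": "Inferred from direct assay",
    "IEA": "Inferred from electronic annotation",
    "IEP": "Inferred from expression pattern",
    "IGI": "Inferred from genetic interaction",
    "IMP": "Inferred from mutant phenotype",
    "IPI": "Inferred from physical interaction",
    "ISS": "Inferred from sequence or structural similarity",
    "NAS": "Non-traceable author statement",
    "ND": "No biological data available",
    "TAS": "Traceable author statement",
}

# Per the table, IEA is assigned without confirmation by a curator.
NOT_CURATOR_REVIEWED = {"IEA"}

def describe(code):
    """Return the full name of an evidence code (KeyError if unknown)."""
    return EVIDENCE_CODES[code.upper()]

def curated_only(annotations):
    """Keep (gene, go_id, code) tuples whose code was reviewed by a curator."""
    return [a for a in annotations if a[2].upper() not in NOT_CURATOR_REVIEWED]

# Hypothetical annotations; GO:0005634 is the cellular component "nucleus".
example = [("geneA", "GO:0005634", "IDA"), ("geneB", "GO:0005634", "IEA")]
print(curated_only(example))
```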
 
Table 8-8. Functional assignment of proteins based upon their enzymatic activity: partial list of the Enzyme Commission (EC) classification system (release 27.0, October 2001). From http://kr.expasy.org/cgi-bin/enzyme-search-cl.
EC number Description of class # enzymes Example of subclass
1.-.-.- Oxidoreductases 1,003  
1.1.-.-     Acting on the CH-OH group of donors
1.2.-.-     Acting on the aldehyde or oxo group of donors
2.-.-.- Transferases 1,076  
2.1.-.-     Transferring one-carbon groups
3.-.-.- Hydrolases 1,125  
4.-.-.- Lyases 356  
5.-.-.- Isomerases 156  
6.-.-.- Ligases 126  
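
EC numbers are hierarchical: each of the four fields narrows the classification, and a dash marks a level left unspecified, as in the 1.1.-.- subclass above. The following sketch shows one way to parse such numbers and to test whether an enzyme falls inside a partial group. The function names are hypothetical, and only the six top-level classes listed in Table 8-8 are included.

```python
# Top-level classes as listed in Table 8-8.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

def parse_ec(ec):
    """Split an EC number into four levels; '-' becomes None."""
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields in EC number: {ec!r}")
    return [None if p == "-" else int(p) for p in parts]

def top_level_class(ec):
    """Return the name of the first-level class for an EC number."""
    first = parse_ec(ec)[0]
    if first is None:
        raise ValueError("top-level class is unspecified")
    return EC_CLASSES[first]

def is_within(ec, group):
    """True if ec falls inside a (possibly partial) group such as '1.1.-.-'."""
    return all(g is None or g == e
               for e, g in zip(parse_ec(ec), parse_ec(group)))

print(top_level_class("1.1.1.1"), is_within("1.1.1.1", "1.1.-.-"))
```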
 
Table 8-9. Functional classification of proteins in the Clusters of Orthologous Groups (COGs) database (http://www.ncbi.nlm.nih.gov/COG/)(September, 2002).
General category Function COGs domains
Information storage and processing      
  Translation, ribosomal structure and biogenesis 217 6,449
  Transcription 133 5,442
  DNA replication, recombination and repair 184 5,337
Cellular processes      
  Cell division and chromosome partitioning 32 842
  Posttranslational modification, protein turnover, chaperones 109 3,155
  Cell envelope biogenesis, outer membrane 155 4,079
  Cell motility and secretion 133 3,110
  Inorganic ion transport and metabolism 160 5,112
  Signal transduction mechanisms 96 3,623
Metabolism      
  Energy production and conversion 223 5,584
  Carbohydrate transport and metabolism 170 5,257
  Amino acid transport and metabolism 233 8,383
  Nucleotide transport and metabolism 85 2,364
  Coenzyme metabolism 154 4,057
  Lipid metabolism 75 2,609
  Secondary metabolites biosynthesis, transport and catabolism 62 2,754
Poorly characterized      
  General function prediction only 449 11,948
  Function unknown 752 6,431
 
Table 8-10. Main categories of metabolic and regulatory pathways in the KEGG database (http://www.genome.ad.jp/kegg/). Hundreds of pathway maps are available within these categories.
Metabolic Pathways  
  Carbohydrate Metabolism
  Energy Metabolism
  Lipid Metabolism
  Nucleotide Metabolism
  Amino Acid Metabolism
  Metabolism of Other Amino Acids
  Metabolism of Complex Carbohydrates
  Metabolism of Complex Lipids
  Metabolism of Cofactors and Vitamins
  Biosynthesis of Secondary Metabolites
  Biodegradation of Xenobiotics
Regulatory Pathways: Genetic Information Processing  
  Transcription
  Translation
  Sorting and Degradation
  Replication and Repair
Regulatory Pathways: Environmental Information Processing  
  Membrane Transport
  Signal Transduction
  Ligand-Receptor Interaction
Regulatory Pathways: Cellular Processes  
  Cell Motility
  Cell Growth and Death
  Cell Communication
  Development
  Behavior
Regulatory Pathways: Human Diseases  
  Neurodegenerative Disorders

 

Table 8-10. Tools to analyze protein motifs. From ExPASy (http://www.expasy.org/tools/)
Program Comment
InterProScan At EBI
ppsearch At EBI
PRATT At EBI
ProfileScan Server At ISREC
PROSCAN (PROSITE SCAN) At PBIL
ScanProsite tool At ExPASy
SMART At EMBL
TEIRESIAS At IBM
 
Table 8-11. Web-based programs for the prediction of protein localization.
Program Comment
ChloroP predicts the presence of chloroplast transit peptides (cTP) in protein sequences
MITOPROT calculates the N-terminal protein region that can support a mitochondrial targeting sequence and the cleavage site
PSORT prediction of protein sorting signals and localization sites
SignalP predicts the presence and location of signal peptide cleavage sites in prokaryotes and eukaryotes.
TargetP predicts the subcellular location of eukaryotic protein sequences.
 
Table 8-12. Web servers for the prediction of transmembrane domains in protein sequences. Source: ExPASy web server. K. Hofmann & W. Stoffel (1993) TMbase - A database of membrane spanning protein segments. Biol. Chem. Hoppe-Seyler 374, 166.
Program Comment/source
DAS server  
HMMTOP Prediction of transmembrane helices and topology of proteins
PredictProtein server Prediction of transmembrane helix location and topology
SOSUI Classification and Secondary Structure Prediction of Membrane Proteins
TMpred  
TMHMM (v. 2.0) Center for Biological Sequence Analysis, Technical University of Denmark
TopPred2 Topology prediction of membrane proteins
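
As a rough illustration of what transmembrane prediction involves, the sketch below computes a sliding-window hydropathy profile using the Kyte-Doolittle scale. This is an assumption-laden teaching example, not the algorithm of any server in Table 8-12: the window of 19 residues and the cutoff of 1.6 are conventional choices for this scale, the function names are hypothetical, and residues outside the 20 standard amino acids will raise a KeyError.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    values = [KD[res] for res in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """1-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, score in enumerate(profile) if score >= threshold]

# Hypothetical sequences: a poly-Leu stretch versus a poly-Asp stretch.
print(candidate_tm_windows("L" * 19), candidate_tm_windows("D" * 25))
```

The dedicated servers listed above go well beyond this, for example by modeling helix topology, so a simple profile is best treated as a first look only.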
 
Table 8-13. Tools to analyze primary structure features of proteins. From ExPASy (http://www.expasy.org/tools/)
Program Source/comment
COILS Prediction of coiled coil regions in proteins
Compute pI/Mw From ExPASy
drawhca Hydrophobic Cluster Analysis plot
Helical Wheel draws a helical wheel, i.e. an axial projection of a regular alpha-helix, for a given sequence
M.M., pI, composition, titrage MW, pI, Titration curve
Paircoil Prediction of coiled coil regions in proteins
PeptideMass From ExPASy
REP Searches a protein sequence for repeats
SAPS Statistical Analysis of Protein Sequences
 
Table 8-14. Web resources for the characterization of glycosylation sites on proteins.
Program Comment/source
DictyOGlyc 1.1 Prediction Server neural network predictions for GlcNAc O-glycosylation sites in Dictyostelium discoideum proteins
NetOGlyc Prediction of mucin-type O-glycosylation sites in mammalian proteins
YinOYang 1.2 produces neural network predictions for O-β-GlcNAc attachment sites in eukaryotic protein sequences
 
Table 8-15. Tools to analyze posttranslational modifications. From ExPASy (http://www.expasy.org/tools/)
Program Comment
big-PI Predictor GPI Modification Site Prediction
DGPI Detection/prediction of GPI cleavage site (GPI-anchor) in a protein
NetPhos 2.0 Prediction Server produces neural network predictions for serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins
Sulfinator Prediction of tyrosine sulfation sites
 
Table 8-16. Examples of proteins with unusually high occurrences of specific amino acids (modified from Ponting, 2001). The hydrophobic residues characteristic of transmembrane helices are from Tanford (1980).
Amino acid(s) Proteins
C Disulfide-rich proteins; metallothioneins; zinc finger proteins
D, E Acidic proteins
G Collagens
H Hisactophilin; histidine-rich glycoprotein
W, L, P, Y, L, V, M, A Transmembrane domains
K, R Nuclear proteins (nuclear localization signals)
N Dictyostelium proteins
P Collagens; filaments; SH3/WW/EVH1 binding sites
Q Proteins encoded by genes mutated in triplet repeat disorders (Chapter 17)
S, R Some RNA-binding motifs
S, T Mucins; oligosaccharide attachment sites
abcdefg Heptad coiled coils (a and d are hydrophobic residues) e.g. myosins
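
Compositional bias of the kind listed in Table 8-16 can be checked with a few lines of code. The sketch below computes amino acid fractions, flags residues above a cutoff, and labels positions with the abcdefg heptad register. The 20% cutoff is an arbitrary illustrative value, the function names are hypothetical, and the heptad labels depend entirely on the assumed starting position rather than on any prediction.

```python
from collections import Counter

def composition(sequence):
    """Fraction of each residue in the sequence (empty dict if no residues)."""
    sequence = sequence.upper()
    if not sequence:
        return {}
    counts = Counter(sequence)
    return {aa: n / len(sequence) for aa, n in counts.items()}

def enriched(sequence, threshold=0.2):
    """Residues whose fraction is at or above an arbitrary cutoff."""
    return sorted(aa for aa, frac in composition(sequence).items()
                  if frac >= threshold)

def heptad_register(sequence, start="a"):
    """Label each position a-g, assuming the first residue is at `start`."""
    letters = "abcdefg"
    offset = letters.index(start)
    return "".join(letters[(i + offset) % 7] for i in range(len(sequence)))

# Hypothetical glutamine-rich peptide.
print(enriched("QQQQAQQQKL", threshold=0.5), heptad_register("QQQQAQQQKL"))
```

In a coiled coil, the residues labeled a and d would be the ones expected to be hydrophobic.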
 
Table 8-17. Web-based databases of protein-protein and protein-ligand interactions.
http://dip.doe-mbi.ucla.edu/dip/Search.cgi. Some of these databases are listed in Schächter (2002).
Database Comment
BIND (The Biomolecular Interaction Network Database) database designed to store full descriptions of interactions, molecular complexes and pathways.
Cellzome  
DIP (The Database of Interacting Proteins)  
DLRP (Database of Ligand-Receptor Partners) database of protein ligand and protein receptor pairs that are known to interact with each other
FlyBase See Jacq (2001)
FlyNets See Jacq (2001)
KEGG (Kyoto Encyclopedia of Genes and Genomes)  
GeNet (Gene Networks database) See Jacq (2001)
ProNet From Doubletwist, Inc. and Myriad Genetics
STKE (Signal Transduction Knowledge Environment) See Jacq (2001)
Transfac (Transcription factor database)  
YPD, PombePD, WormPD Proteome, Inc. databases
 
 
 
