Bioinformatics and Functional Genomics

Chapter: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | App 1 | App 2


Appendix 2: GCG for protein and DNA analysis


Outline summary

TOPIC 1: INTRODUCTION TO GCG

TOPIC 2: GENERAL COMMANDS

TOPIC 3: EDITING SEQUENCES

TOPIC 4: PAIRWISE ALIGNMENT

TOPIC 5: MULTIPLE SEQUENCE ALIGNMENT

TOPIC 6: PHYLOGENY

TOPIC 7: SEQUENCE ANALYSIS


One-page summary sheet of helpful commands

General commands (UNIX stuff):
cat x shows the document x
cat | more shows the document, one page at a time (spacebar to advance a page)
ls lists the files in a directory
ls *.pep lists all the files in a directory that end with the suffix .pep
^c interrupt what you’re doing and return to the command line
^s pause the output on the screen
^q resume the output

 

General commands (GCG):
mkdir create a new directory
mkdir a.dir create a new directory named a.dir
down a.dir move from the home directory down to a.dir
over b.dir move over to the directory called b.dir
genhelp get help on all the 120 programs within GCG
genhelp gap get help on the program called gap
rm x delete a file called x
rm *z* delete all the files with a z in the name
-che check parameters

 

Entering and editing sequences:
seqed x.pep make (or edit) a protein file called x
seqed y.seq
make (or edit) a DNA file called y
exit end a seqed session and save your changes ^D
quit end a seqed session without saving changes
include paste in portion of another DNA sequence
help get help on sequed
onecase make sequences all upper or lowercase
onecase *.pep make all protein sequences one case

 

 

Comparisons
bestfit compare two proteins or DNA sequences; focus on a local alignment
gap compare two proteins or DNA sequences; focus on a global alignment
tfasta compares one protein to a DNA query translated in all 6 possible frames
pileup compare up to hundreds of sequences in a multiple sequence alignment

 

Sequence analysis
prime design oligonucleotide primers
map make a map of a DNA sequence; show restriction enzymes
map -six make a map showing restriction enzyme six-cutters
motifs x.pep scan the prosite dictionary for motifs and domains present in protein x

Main web pages

TOPIC 1: INTRODUCTION TO GCG [top]

The GCG package is a collection of >130 interrelated software programs used to analyze protein, RNA, and DNA sequences. It is a commercial product that is similar in scope to Vector NTI (by Informax) and to the freely available, UNIX-based tools of Emboss (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/).


“GCG® (Genetics Computer Group) serves molecular biologists by building practical tools that implement the most important techniques of mathematical biology. Founded in the Department of Genetics at the University of Wisconsin-Madison in 1982, GCG continued as a service of the UW Biotechnology Center from 1985 to 1989. Now a subsidiary of Oxford Molecular, GCG remains committed to the development and support of sequence analysis software as well as relational database management tools for storing, mining, and visualizing sequence data.

GCG's flagship product, the Wisconsin Package is the industry standard for sequence analysis, containing over 130 interrelated software programs to analyze nucleic acid and protein sequences. It is used by more than 40,000 scientists at over 650 institutions around the world. Since the product's inception, scientists from all over the world have collaborated to develop and refine the Wisconsin Package.”

 

GCG topics include:  
* Comparison * Importing and Exporting
* Database Searching * Mapping
* DNA/RNA Secondary Structure * Primer Selection
* Editing and Publication * Protein Analysis
* Evolution * SeqLab
* Fragment Assembly * Translation
* Gene Finding and Pattern Recognition * Utilities

TOPIC 2: GENERAL COMMANDS [top]

General commands (UNIX)
When you use the GCG suite of programs you are in a UNIX environment. Some useful commands follow:

> cat x shows the document x
> more x shows the document x, one page at a time (press spacebar to advance)
> ls lists the files in a directory
> ls *.pep lists all the files in a directory that end with the suffix .pep
> ^C interrupt what you’re doing and return to the command line
> ^S pause the output on the screen
> ^Q resume the output
> man on-line help manual

 

General commands (GCG)

Within the world of GCG, there are several important commands to know.

Getting help. For any program you’re working on, it’s very useful to read through the help menu. Typically, this menu provides features such as

• an overview of what the program does

• an example of how to actually use it

• a description of the algorithm used, together with literature references

• a comparison of what this program does to what related GCG programs can do

> genhelp get help on all the 130 programs within GCG

> genhelp gap get help on the program called gap

Organizing your files into directories. One can quickly accumulate dozens (or hundreds) of GCG files. A good way to organize your files is to make a separate directory for each project you are working on. When you want to make a directory simply type mkdir followed by the name fo the new directory. Then to get to that directory in order to work in it, type “down a.dir.” You can move between directories laterally by typing “over any.dir.” To get back to your main, starting directory you can always type “home.” And if you want to confirm where you are at any time, you can always type “ls” to list whatever files you have and to see the name of your current directory or subdirectory.

> mkdir create a new directory home
> mkdir a.dir create a new directory called a.dir
> down a.dir move from the home directory down to a.dir
> over b.dir move over to the directory called b.dir
> up move up to a higher directory
> home go to your main, home directory

Naming files.
A handy rule is to name any protein file you work with something with the suffix “.pep” (for peptide; for example, actin.pep). Name any DNA file with the suffix “.seq” (for sequence; for example, actin.seq). You’ll typically get the chance to name a file when you first create it in the main sequence editor, seqed (see below). The GCG programs like to see names such as a.pep or a.seq so they know that the appropriate data are being entered into a program (e.g. they want .seq files for restriction enzyme analyses, or .pep files for hydropathy profiles). There are some exceptions; for example, a DNA sequence that has been flipped to show you bottom strand sequence has the suffix something.rev.

Deleting, renaming, and copying files.
When you are deleting files, be very careful! It’s easy to delete, and there’s no turning back. Type > rm x to remove a file called x. (rm stands for remove)

> rm x delete a file called x
> rm *z* delete all the files with a z in the name
> rm *.pep delete all your peptide/protein files with the suffix .pep

 

To rename a file, use the mv (move) command.
> mv somename.pep bettername.pep will rename a file ‘somename.pep’ as a file called ‘bettername.pep’

To copy a file, use the cp (copy) command.
> cp *.pep ~/somedirectory/ will copy all your files containing the suffix .pep to your a directory called “somedirectory”

Note that the tilda ~ represents your home directory.


TOPIC 3: ENTERING AND EDITING SEQUENCES [top]


The main feature of GCG is its ability to manipulate sequences. In most cases, you need to begin by importing some sequence into GCG. Typical examples of what you’d import include the following:
• you get some DNA sequence from the Core Facility on your computer, and you want to know how well it matches a known DNA sequence or a known protein sequence.
• you directly sequence a protein and want to analyze its properties.
• someone sends you a recombinant fusion protein and you want to know its properties.
• you need to know if two proteins are significantly homologous or not.

In many cases, you will begin by finding a protein or DNA sequence through the National Center for Biotechnology Information (NCBI) or SWISS-PROT (Chapter 2).

Copy your DNA or protein sequence, then paste it into GCG using the sequence editor seqed.
> seqed x.pep make (or edit) a protein file called x.pep
> seqed y.seq make (or edit) a DNA file called y.seq
You find yourself now in the world of seqed. On the page, there are three levels: a header up top, a sequence buffer in the middle, and a command line editor down below. Type ^D (control D) to move down to the sequence buffer, then “paste” in your sequence. Once this
: exit end a seqed session and save your changes
: quit end a seqed session without saving changes
: include paste in portion of another DNA sequence
: help get help on seqed

Here is an example of how to enter human retinol-binding protein (RBP) into GCG.

In GCG, type the following (hsrbp stands for Homo sapiens retinol-binding protein):
> seqed hsrbp.pep <return>

Note that the GCG cursor is now in the top header region:

In Netscape, open the GenBank protein record for RBP. Copy the accession number and any other relevant information into the header region. When you are finished, press ^D to enter the main sequence buffer. Copy and paste the amino acid sequence of RBP from Netscape into GCG (note that GCG will ignore numbers that are associated with the letters of protein or nucleic acid sequences):

You have successfully created the file hsrbp.pep.
Type > ls to confirm that this file has appeared in your directory.
Type > cat hsrbp.pep to see this file displayed:

> cat hsrbp.pep

Optionally, you can use the following commands:

> onecase make sequences all upper or lowercase
> onecase *.pep make all protein sequences one case

In this handout, we will also use the example of syntaxins. These are proteins that bind vesicle proteins and coordinate the docking of vesicles with the appropriate target membrane.


TOPIC 4: PAIRWISE ALIGNMENT [top]

bestfit compare two proteins or DNA sequences; focus on a local alignment
gap compare two proteins or DNA sequences; focus on a global alignment
tfasta compare one protein to the six proteins potentially encoded from one DNA sequence

Example of how to use the bestfit program

In this example, compare two proteins (human syntaxin 1a and human syntaxin 7) by typing “bestfit” then then names of the proteins to be compared (with a space between them).

What do you need to know about this output?

• the alignment uses the one letter amino acid code; memorize it!
• the lines indicate amino acid identities
• the dots indicate similarity; this is much less informative than identity
• the gap weight and gap length penalties are parameters you can adjust to optimize the alignment you are performing. For example, if you believe that two proteins are distantly related but they don’t align well, try reducing the gap penalty to see if a less stringent alignment criterion yields a better Bestfit (at the cost of adding more gaps).

Is the homology of my two proteins statistically significant?

When you do a Bestfit, you can ask the program to measure the statistical significance of a protein comparison.

Type:
> bestfit humsyn1a.pep humsyn7.pep -ran=50

These data of the average quality based on 50 randomizations let you answer the key question, “when I get a quality score from the comparison of my two proteins (Q = 194 in this case), is that score much better than if I had just randomly shuffled one of the proteins 50 times (Q = 35.2 +/- 4.0)?” The answer is yes, because the Z score for the authentic comparison is more than three standard deviations greater than the randomized score. For this example, Z = (194 - 35.2)/ 4.0 = 39.7. If Z < 3, your two proteins are not significantly related.

GCG documentation on the bestfit program (> genhelp bestfit)

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

DESCRIPTION

BestFit inserts gaps to obtain the optimal alignment of the
best region of similarity between two sequences, and then
displays the alignment in a format similar to the output from
Gap. The sequences can be of very different lengths and have
only a small segment of similarity between them. You could take
a short RNA sequence, for example, and run it against a whole
mitochondrial genome.

SEARCHING FOR SIMILARITY

BestFit is the most powerful method in the Wisconsin
Package(TM) for identifying the best region of similarity
between two sequences whose relationship is unknown.

Example of how to use the Gap program

GAP (global alignment) versus Bestfit (local alignment; BLAST-like)

Note that while Bestfit and GAP are similar--each allows you to compare two sequences--they do have distinct differences. Bestfit is optimized for local alignments, finding the best region of homology between two proteins (or DNA sequences). GAP is optimized for global alignments, scanning the entire sequences. In the above example of two syntaxins, bestfit showed a better percent amino acid identity (27% over a length of 216 residues), while GAP showed only 24% identity but over a longer span of 292 residues.

GCG documentation on the gap program (> genhelp gap)

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

DESCRIPTION

Gap considers all possible alignments and gap positions and creates the alignment with the largest number of matched bases and the fewest gaps. You provide a gap creation penalty and a gap extension penalty in units of matched bases. In other words, Gap must make a profit of gap creation penalty number of matches for each gap it inserts. If you choose a gap extension penalty greater than zero, Gap must, in addition, make a profit for each gap inserted of the length of the gap times the gap extension penalty. Gap uses the alignment method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970).

Example of how to use the TFastA program

When do you use TFastA?
• If you obtain a DNA sequence (e.g. from a BLAST search or from the core facility following a yeast 2-hybrid result), a TFastA search will tell you its relation to any given protein. If there are frameshifts due to sequencing errors, the TFastA search will show this. Using the GCG map program, errors can be corrected.
• TFastA is an important tool for new gene discovery (a process called “database mining,” “in silico genomics,” or sometimes “in silico hallucinations”). If you are studying a gene family and want to find a new member, go to GenBank and conduct a series of tblastn searches with your proteins of interest against all the databases of expressed sequence tags. Download all DNA database entries relevant to your protein queries into GCG. Then perform TFastA searches to assign each EST (i.e. DNA sequence) to a protein. If an EST does not match any protein in your family, you have identified a novel gene.


TOPIC 5: MULTIPLE SEQUENCE ALIGNMENT [top]

GCG documentation on the PileUp program (> genhelp pileup)

Function of PileUp
PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

Description of PileUp
PileUp creates a multiple sequence alignment using a simplification of the progressive alignment method of Feng and Doolittle (Journal of Molecular Evolution 25; 351-360 (1987)). The method used is similar to the method described by Higgins and Sharp (CABIOS 5; 151-153 (1989)).
The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment.
Before alignment, the sequences are first clustered bysimilarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments. PileUp can plot this dendrogram so that you can see the order of the pairwise alignments that created the final alignment.
As a general rule, PileUp can align up to 500 sequences, with any single sequence in the final alignment restricted to a maximum length of 7,000 characters (including gap characters inserted into the sequence by PileUp to create the alignment). However, if you include long sequences in the alignment, the number of sequences PileUp can align decreases. See the RESTRICTIONS topic, below, for a more complete discussion of
sequence number and size limitations.

Example of a PileUp

There are two easy ways to do a PileUp.
(1) On the command line, type > PileUp *.pep
All files with the suffix .pep will be made into the pileup
(2) If your directory has lots of files in it, you can create a special file called “anyname” that includes the specific proteins (or DNA sequences) you want to have included.


TOPIC 6: PHYLOGENY [top]

What is a phylogenetic tree?

A phylogenetic tree is a graphical representation of the evolutionary history of a protein family (or other family such as a gene family). Phylogeny reconstructs the branching pattern of a group of proteins; the topology of a tree shows the history (ancestors) of the protein over time.

Why are trees useful?

A phylogenetic tree provides a visual demonstration of how a group of proteins are related. Suppose you are studying a particular kinase cloned from a mouse cDNA library. A tree can help answer the following questions.
• does your kinase have closely related homologs? For example, if you knock out its gene, are there closely related kinases? If you alter the kinase activity in cells with a drug, are there other kinases likely to be affected as well?
• how should you name it?
• has it been relatively conserved throughout evolution, or has its sequence changed rapidly? This may give clues as to its function.
• what individual amino acid residues are critical for the function of my kinase?

How do I make a tree?

To make a tree, begin by doing a PileUp of your group of proteins (or nucleic acid sequences). The output of PileUp can be a multiple sequence alignment, with a suffix pileup.msf (multiple sequence format). Alternatively, you can specify the output to be a simple tree by choosing output option B.

We will describe how to make a tree in detail in the week of September 25. The key tree-making programs in GCG are distances, which makes a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions.

In this course we will use PAUP (Phylogeny Analysis Using Parsimony), a superb tree-building program. PAUP has also been incorporated into GCG. PAUPSearch reconstructs
phylogenetic trees from a multiple sequence alignment using parsimony, distance, or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees output by PAUPSearch and can also plot the trees output by GrowTree.


TOPIC 7: SEQUENCE ANALYSIS [top]

prime design oligonucleotide primers
map make a map of a DNA sequence; show restriction enzymes
map -six make a map showing restriction enzyme six-cutters
motifs x.pep scan the prosite dictionary for motifs and domains present in protein x

 

GCG documentation on the Prime program (> genhelp prime)

FUNCTION

Prime selects oligonucleotide primers for a template DNA sequence. The primers may be useful for the polymerase chain reaction (PCR) or for DNA sequencing. You can allow Prime to choose primers from the whole template or limit the choices to a particular set of primers listed in a file.

DESCRIPTION

Prime analyzes a template DNA sequence and chooses primer pairs for the polymerase chain reaction (PCR) and primers for DNA sequencing. For PCR primer pair selection, you can choose a target range of the template sequence to be amplified. For DNA
sequencing primers, you can specify positions on the template that must be included in the sequencing.

GCG documentation on the Map program (> genhelp map)

FUNCTION

Map maps a DNA sequence and displays both strands of the mapped sequence with restriction enzyme cut points above the sequence and protein translations below. Map can also create a peptide map of an amino acid sequence.

DESCRIPTION

Map displays a sequence that is being assembled or analyzed intensively. Map asks you to select the enzymes whose restriction sites should be marked individually by typing their
names. If you do not answer this question, Map selects a representative isoschizomer from all of the commercially available enzymes. You can choose to have your sequence
translated in any or all of the six possible translation frames. You can also choose to have only the open reading frames translated.

After running Map, you may create a new sequence file with the peptide sequence from any frame of DNA translation by using the ExtractPeptide program with the Map output file.

Example of how to use the Map program

Here we input a DNA sequence (human syntaxin 7) and specify that we want the output to show only enzymes that are six-cutters (158 enzymes including EcoRI, PvuII, BamHI, etc.). In all, the program can analyze sites for over 500 restriction enzymes. Note that you can specify an output display ranging from no predicted proteins to all six potential reading frames.

 

FUNCTION

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

DESCRIPTION

Motifs looks for protein sequence motifs by checking your protein sequence for every sequence pattern R in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Currently, Motifs can only search for patterns in protein sequences. There is a very informative abstract on every motif in the PROSITE Dictionary. These abstracts are displayed next to any motif found in your sequence.

Example of how to use the Motifs program