Web document 5.16. The Find-a-Gene Project applied to finding a novel globin gene/protein.
[1] Choose a protein that interests you, and record its name, species, and accession number. As an example, we will select human beta globin (NP_000509).
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
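Before the record is used as a query, it can be worth checking that it was copied intact. The following is a minimal sketch rather than part of the original exercise: the helper names `parse_fasta` and `refseq_accession` are made up for illustration, and the header layout is assumed to be the older `gi|...|ref|...|` style shown above.

```python
# Sanity-check a single FASTA record copied from NCBI.
# Assumption: the header uses the older "gi|number|ref|accession|" layout.

BETA_GLOBIN_FASTA = """\
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
"""


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Everything after the header is sequence; join the wrapped lines.
    return lines[0][1:], "".join(lines[1:])


def refseq_accession(header):
    """Pull the accession.version field out of a gi|...|ref|...| header."""
    fields = header.split("|")
    if len(fields) < 4 or fields[2] != "ref":
        raise ValueError("header is not in gi|number|ref|accession| form")
    return fields[3]


header, sequence = parse_fasta(BETA_GLOBIN_FASTA)
print(refseq_accession(header), len(sequence))  # NP_000509.1 147
```

A length of 147 residues matches the three sequence lines above (70 + 70 + 7).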
[2] Perform a tblastn search against a DNA database consisting of genomic DNA or ESTs. The BLAST server can be at NCBI or elsewhere. Include the output of that BLAST search in your document. We can search an EST database at NCBI, restricting the output to plants.
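For readers who prefer the command line, a roughly equivalent search can be assembled for the BLAST+ `tblastn` program. This is only a sketch under stated assumptions: the function name `tblastn_command` and the file name `beta_globin.fasta` are hypothetical, the database name `est` and its availability for remote searches should be checked against current NCBI documentation, and the Entrez filter `Viridiplantae[Organism]` is one plausible way to restrict the search to plants. The code only builds the argument list; it does not contact NCBI.

```python
# Build (but do not run) a BLAST+ tblastn command line.
# Assumptions: database name "est", plant filter "Viridiplantae[Organism]",
# and a query file named beta_globin.fasta are all illustrative choices.

def tblastn_command(query_fasta, db="est",
                    entrez_query="Viridiplantae[Organism]", evalue=10.0):
    """Return an argument list suitable for subprocess.run()."""
    return [
        "tblastn",
        "-query", query_fasta,          # protein query in FASTA format
        "-db", db,                      # nucleotide database to translate
        "-remote",                      # run the search on NCBI servers
        "-entrez_query", entrez_query,  # taxonomic restriction (needs -remote)
        "-evalue", str(evalue),         # report hits up to this E value
        "-outfmt", "7",                 # tabular output with comment lines
    ]


command = tblastn_command("beta_globin.fasta")
print(" ".join(command))
# To execute it (requires BLAST+ and network access):
#   import subprocess; subprocess.run(command, check=True)
```

The search in this document appears to have been run through the NCBI web interface instead.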
The result is shown here as a screen capture.

Here is the same output in text format:
Sequences producing significant alignments:  Score (Bits)  E Value
gb|DV752714.1|  joct57 Jojoba Subtracted Library Simmondsia ch...  295  5e-79
gb|DV753025.1|  jo10E07 Jojoba Subtracted Library Simmondsia c...  199  7e-50
dbj|CI144093.1|  CI144093 Oryza sativa (japonica cultivar-grou...  103  3e-21
dbj|CI226068.1|  CI226068 Oryza sativa (japonica cultivar-grou...  88.6  1e-16
gb|DR377150.1|  992651 CERES-131 Arabidopsis thaliana cDNA clo...  84.3  2e-15
gb|DR297241.1|  33830 CERES-131 Arabidopsis thaliana cDNA clon...  72.8  7e-12
gb|DR378263.1|  11191627 CERES-147 Arabidopsis thaliana cDNA c...  71.2  2e-11
gb|DR382882.1|  11447626 CERES-148 Arabidopsis thaliana cDNA c...  61.6  2e-08
gb|DN482671.1|  root1_a5.y1.abd tef (Kaye Murri) root Eragrost...  60.1  5e-08
gb|DY617223.1|  AC2977 NOLLY Medicago truncatula cDNA 5', mRNA se  38.1  0.20
gb|DT731130.1|  EST1164980 Aquilegia cDNA library Aquilegia fo...  38.1  0.20
gb|DY616484.1|  AC1987 NOLLY Medicago truncatula cDNA 5', mRNA se  37.4  0.33
gb|AW980553.1|  EST391706 GVN Medicago truncatula cDNA clone p...  37.0  0.44
gb|BG583691.1|  EST485444 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|BG583110.1|  EST484860 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|BG582920.1|  EST484666 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|BG582766.1|  EST484512 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|BG582462.1|  EST484207 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|BG581019.1|  EST482748 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|BG448590.1|  NF045D01NR1F1010 Nodulated root Medicago trunc...  36.6  0.57
gb|BE124659.1|  EST393694 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|BE124510.1|  EST393545 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|AW981135.1|  EST392288 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|AW980702.1|  EST391855 GVN Medicago truncatula cDNA clone p...  36.6  0.57
gb|AW686749.1|  NF042B09NR1F1000 Nodulated root Medicago trunc...  36.6  0.57
gb|DY617481.1|  AC3491 NOLLY Medicago truncatula cDNA 5', mRNA se  35.8  0.97
gb|BI139647.1|  IP1_47_C11.g1_A002 Immature pannicle 1 (IP1) S...  35.8  0.97
gb|BI139553.1|  IP1_45_G04.g1_A002 Immature pannicle 1 (IP1) S...  35.8  0.97
gb|BI098056.1|  IP1_31_E03.g1_A002 Immature pannicle 1 (IP1) S...  35.8  0.97
gb|BI076380.1|  IP1_26_F03.g1_A002 Immature pannicle 1 (IP1) S...  35.8  0.97
gb|BI075530.1|  IP1_21_F06.g1_A002 Immature pannicle 1 (IP1) S...  35.8  0.97
gb|BI074288.1|  IP1_13_E04.g1_A002 Immature pannicle 1 (IP1) S...  35.8  0.97
gb|BG948156.1|  IP1_10_B01.g1_A002 Immature pannicle 1 (IP1) S...  35.8  0.97
gb|DR581742.1|  WS0303.B21_K18 WS-SE-A-16 Picea glauca cDNA cl...  35.4  1.3
gb|DW003524.1|  KR3B.106B03F.051109T7 KR3B Nicotiana tabacum c...  35.0  1.7
gb|CO479818.1|  GQ018M13.T24_G02 GQ018: Clean ROOTS systems - ...  35.0  1.7
gb|CO221073.1|  WS01011.B21_O21 SS-R-N-A-11 Picea sitchensis c...  35.0  1.7
gb|CO216075.1|  WS0043.B21_N16 SS-R-A-5 Picea sitchensis cDNA ...  35.0  1.7
gb|EG009764.1|  SSBT006F20x SSBT Solanum tuberosum cDNA clone ...  34.7  2.2
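A hit list in this form is easy to screen programmatically. The sketch below is illustrative only: it uses a hand-copied subset of the rows above, and the tuple `SEQUENCED_GENOMES` and function `candidate_hits` are hypothetical names. It sets aside hits from Arabidopsis thaliana and Oryza sativa, two plants with fully sequenced genomes, and lists what remains. It does not choose among the remaining candidates; that is still a judgment call.

```python
# Screen a tblastn hit list for organisms without a fully sequenced genome.
# The rows are a hand-copied subset of the hit list in this document.

HITS = [
    # (accession, truncated description, bit score, E value)
    ("DV752714.1", "joct57 Jojoba Subtracted Library Simmondsia ch...", 295.0, 5e-79),
    ("CI144093.1", "CI144093 Oryza sativa (japonica cultivar-grou...", 103.0, 3e-21),
    ("DR377150.1", "992651 CERES-131 Arabidopsis thaliana cDNA clo...", 84.3, 2e-15),
    ("DN482671.1", "root1_a5.y1.abd tef (Kaye Murri) root Eragrost...", 60.1, 5e-08),
    ("DY617223.1", "AC2977 NOLLY Medicago truncatula cDNA 5', mRNA se", 38.1, 0.20),
    ("DT731130.1", "EST1164980 Aquilegia cDNA library Aquilegia fo...", 38.1, 0.20),
]

# Assumption: organisms to set aside because their genomes are fully sequenced.
SEQUENCED_GENOMES = ("Arabidopsis thaliana", "Oryza sativa")


def candidate_hits(hits, skip_organisms):
    """Keep hits whose description mentions none of the skipped organisms."""
    return [
        hit for hit in hits
        if not any(name in hit[1] for name in skip_organisms)
    ]


for accession, description, bits, evalue in candidate_hits(HITS, SEQUENCED_GENOMES):
    print(f"{accession}\t{bits}\t{evalue:g}\t{description}")
```

Matching on the description string is crude, because BLAST truncates descriptions; a more careful version would look up the organism for each accession.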
Inspecting the hits, we see that some are from well-characterized plants (such as the thale cress Arabidopsis thaliana and rice, Oryza sativa) whose entire genomes have been sequenced; these are less likely to contain “novel” genes that have not yet been annotated. We will select a sequence from Aquilegia (accession DT731130, highlighted in yellow in the original output) even though its E value (0.20) is marginal. The pairwise alignment is as follows (as a screen capture, then as pasted text):

>gb|DT731130.1| EST1164980 Aquilegia cDNA library Aquilegia formosa x Aquilegia pubescens cDNA clone CO1T402, mRNA sequence.
Length=878
Score = 38.1 bits (87), Expect = 0.20
Identities = 33/146 (22%), Positives = 63/146 (43%), Gaps = 11/146 (7%)
Frame = +2

Query  5    TPEEKSAVTALWG--KVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVK  62
            T ++++ V   W   K N+ E+  +    +L + P  +  F    D  T +    NPK+K
Sbjct  95   TEQQEALVKESWEIMKQNIPELSLQFFTTILEIAPAAKGLFSFLKD--TDEVPQNNPKLK  268

Query  63   AHGKKVLGAFSDGLAHL------DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLA  116
            AH  KV     +    L      D  + T   L  +H  K  +DP +F ++   L+  +
Sbjct  269  AHAVKVFKMTCEAAVQLREKGAVDLPESTLKYLGAVHVKKGVIDP-HFEVVKEALLRTIK  445

Query  117  HHFGKEFTPPVQAAYQKVVAGVANAL  142
               G++++  +  A+ +    +A A+
Sbjct  446  DGVGEKWSEELCGAWSEAYDQLATAI  523
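The summary line of the alignment above can be checked against the aligned sequences themselves. The following is a small sketch (the function name `alignment_stats` is made up for illustration) that recounts identities and gaps from the Query and Sbjct rows. It does not recount positives, which would require a scoring matrix such as BLOSUM62. Integer division is used for the percentages because that reproduces the reported 22% and 7% here; it is an assumption about how BLAST rounds.

```python
# Recount identities and gaps from the aligned Query and Sbjct rows.
# The three alignment blocks are concatenated into one string each.

QUERY = (
    "TPEEKSAVTALWG--KVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVK"
    "AHGKKVLGAFSDGLAHL------DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLA"
    "HHFGKEFTPPVQAAYQKVVAGVANAL"
)
SBJCT = (
    "TEQQEALVKESWEIMKQNIPELSLQFFTTILEIAPAAKGLFSFLKD--TDEVPQNNPKLK"
    "AHAVKVFKMTCEAAVQLREKGAVDLPESTLKYLGAVHVKKGVIDP-HFEVVKEALLRTIK"
    "DGVGEKWSEELCGAWSEAYDQLATAI"
)


def alignment_stats(query, sbjct):
    """Count alignment columns, identical residues, and gapped columns."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(query, sbjct))
    identities = sum(1 for q, s in columns if q == s and q != "-")
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    length = len(columns)
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "identity_pct": 100 * identities // length,  # truncated, not rounded
        "gap_pct": 100 * gaps // length,
    }


stats = alignment_stats(QUERY, SBJCT)
print(stats)
```

The counts agree with the reported values: 33 identities and 11 gaps over 146 aligned columns.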
[3] Gather information about this “novel” protein. At a minimum, identify the protein sequence of the “novel” protein as displayed in the BLAST results from step [2]. In some cases, you will be able to do further BLAST searches to obtain even more sequence of your novel gene.
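We can tentatively name this protein Aquilegia globin. We can gather its protein sequence from the above pairwise alignment, taking the amino acids from the subject lines (the hyphens are alignment gaps, not residues, and should be removed before the sequence is used as a query):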
TEQQEALVKESWEIMKQNIPELSLQFFTTILEIAPAAKGLFSFLKD--TDEVPQNNPKLK
AHAVKVFKMTCEAAVQLREKGAVDLPESTLKYLGAVHVKKGVIDP-HFEVVKEALLRTIK
DGVGEKWSEELCGAWSEAYDQLATAI
We also know that the DNA sequence (878 base pairs in accession DT731130) encodes this protein.
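Two quick consistency checks tie this protein fragment back to the EST. The sketch below (function names are hypothetical) removes the gap characters from the subject rows and confirms that each row's nucleotide coordinates cover exactly three bases per residue and begin in reading frame +2, as reported in the alignment.

```python
# Assemble the ungapped Aquilegia protein fragment and check that it is
# consistent with the nucleotide coordinates and frame from the alignment.

# (start nucleotide, aligned subject row, end nucleotide) on DT731130
SUBJECT_ROWS = [
    (95, "TEQQEALVKESWEIMKQNIPELSLQFFTTILEIAPAAKGLFSFLKD--TDEVPQNNPKLK", 268),
    (269, "AHAVKVFKMTCEAAVQLREKGAVDLPESTLKYLGAVHVKKGVIDP-HFEVVKEALLRTIK", 445),
    (446, "DGVGEKWSEELCGAWSEAYDQLATAI", 523),
]
FRAME = 2  # "Frame = +2": codons start at nucleotides 2, 5, 8, ...


def ungapped_protein(rows):
    """Join the subject rows and drop alignment gap characters."""
    return "".join(row for _, row, _ in rows).replace("-", "")


def coordinates_consistent(rows, frame):
    """Check each row spans 3 nt per residue and starts on a codon boundary."""
    for start, row, end in rows:
        residues = len(row.replace("-", ""))
        if end - start + 1 != 3 * residues:
            return False
        if (start - frame) % 3 != 0:
            return False
    return True


protein = ungapped_protein(SUBJECT_ROWS)
print(len(protein), coordinates_consistent(SUBJECT_ROWS, FRAME))  # 143 True
```

The fragment is 143 residues long, which corresponds to nucleotides 95 through 523 (429 bases) of the 878-base EST.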
[4] Is this Aquilegia protein “novel”? We will try two approaches.
First, use the DNA accession DT731130 as a query in a blastx search of the nonredundant database, using default parameters. The best matches are as follows:

In terms of pairwise alignments, the best two matches are to proteins from Arabidopsis and from Brassica:

Since these best matches are not from Aquilegia, we can conclude that no one has previously identified our Aquilegia DNA as encoding a globin. It is “novel” and we can study it further.
As a second approach, take the Aquilegia protein sequence and use it as a query against the nonredundant database:

Again, there is no match to a known Aquilegia protein. We have identified a novel globin. Note that all the hits from this search are plant proteins (including some from mosses) that have been annotated as hemoglobin, globin, or leghemoglobin.
For the find-a-gene project, the remaining steps are as follows:
[5] Generate a multiple sequence alignment with your novel protein, your original query protein, and a group of other members of this family. A typical multiple sequence alignment uses a minimum of 5 or 10 proteins and a maximum of 30, although the exact number is up to you. We cover multiple sequence alignment in Chapter 6.
[6] Create a phylogenetic tree, using either a parsimony or distance-based approach. Bootstrapping and tree rooting are optional. Use any program such as MEGA3, PAUP, or Phylip. We describe phylogeny in Chapter 7.
[7] Optional: compare the predicted structure of your protein to that of a known structure. We discuss protein structure prediction in Chapter 11.
[8] Optional: show whether this gene is under positive or negative evolutionary selection. We present tools for this type of analysis in Chapter 7.
[9] Optional: discuss the significance of your novel gene. What have you learned about this gene/protein family?