The rationale behind searching a database of blocks is that information from
multiply aligned sequences is present in a concentrated form, reducing
background and increasing sensitivity to distant relationships. This
information is represented in a position-specific scoring table or "profile"
(4),
in which each column of the alignment is converted to a column of a table
representing the frequency of occurrence of each of the 20 amino acids. For
searching a database of blocks, the first position of the sequence is aligned
with the first position of the first block, and a score for that amino acid is
obtained from the profile column corresponding to that position. Scores are
summed over the width of the alignment, and then the block is aligned with the
next position. This procedure is carried out exhaustively for all positions
of the sequence for all blocks in the database, and the best alignments
between a sequence and entries in the BLOCKS database are noted. If a
particular block scores highly, it is possible that the sequence is related to
the group of sequences the block represents. Typically, a group of proteins
has more than one region in common and their relationship is represented as a
series of blocks separated by unaligned regions. If a second block for a
group also scores highly in the search, the evidence that the sequence is
related to the group is strengthened, and is further strengthened if a third
block also scores highly, and so on.
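The scanning procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `build_profile` and `scan` and the toy block are hypothetical, and raw column frequencies are used directly as scores, whereas the actual searcher uses a more refined position-specific scoring matrix (4).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(block):
    """Convert each alignment column into a table of amino acid frequencies.

    Assumption: plain frequencies stand in for the real scoring scheme.
    """
    width = len(block[0])
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in block)
        profile.append({aa: counts.get(aa, 0) / len(block) for aa in AMINO_ACIDS})
    return profile

def scan(sequence, profile):
    """Slide the profile along the sequence, summing column scores over the
    block width at each offset. Returns (best_score, best_offset), 0-based."""
    width = len(profile)
    best_score, best_offset = float("-inf"), -1
    for offset in range(len(sequence) - width + 1):
        score = sum(profile[i].get(sequence[offset + i], 0.0) for i in range(width))
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset

# Hypothetical toy block of three aligned segments, four columns wide.
block = ["GKST", "GKTT", "GRST"]
profile = build_profile(block)
best_score, best_offset = scan("MAAGKSTLL", profile)
```

Repeating `scan` for every block in the database and keeping the top-scoring alignments corresponds to the exhaustive search described above.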
Optional search parameters for DNA queries:
Submitting multiple searches
If you have a large query sequence or several queries,
here is a sample Perl script
to do bulk searches.
The Blocks Email Searcher currently runs on a Sun E250 workstation.
It uses a simple first-in, first-executed queuing scheme and
can complete one typical search of a 350-amino-acid query
about every two minutes. A DNA search takes longer because the
sequence is translated in all six frames. A 1000-base-pair DNA query
therefore takes about as long as six average amino acid
queries, or about 12 minutes, and a contig of 10,000 base pairs takes
about two hours. Consequently, if you have more than five searches
to do, we ask that you space them at reasonable intervals depending
on the type and size of your sequences so other people can get searches
processed between yours. The Blocks Searcher is least busy on weekends
and between about 20:00 and 04:00 Pacific coast (USA) time. We
appreciate your considerate use of this service.
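The timing figures above can be turned into a rough estimate for planning how far apart to space your searches. The sketch below is not part of the service. It assumes that search time scales linearly with the number of residues scanned, and the function name `estimate_minutes` is hypothetical.

```python
TYPICAL_PROTEIN_LENGTH = 350        # amino acids, figure quoted in the text
MINUTES_PER_TYPICAL_SEARCH = 2.0    # figure quoted in the text

def estimate_minutes(length, is_dna=False):
    """Rough search time in minutes, assuming time scales linearly with the
    number of amino acid residues scanned (an assumption, not a guarantee)."""
    residues = length
    if is_dna:
        # Six-frame translation: approximately length // 3 residues per frame.
        residues = 6 * (length // 3)
    return MINUTES_PER_TYPICAL_SEARCH * residues / TYPICAL_PROTEIN_LENGTH

protein_minutes = estimate_minutes(350)               # typical protein query
dna_minutes = estimate_minutes(1000, is_dna=True)     # 1000-base DNA query
contig_minutes = estimate_minutes(10000, is_dna=True) # 10,000-base contig
```

Under these assumptions the estimates come out close to the 12 minutes and two hours quoted above. Actual times also depend on how many other searches are queued.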
Interpreting results of a search
Example using a protein query
Example using a DNA query
Heading
Query=Description line from query sequence
Size=Number of amino acids for protein query or base pairs for DNA query.
Be sure this number is correct before interpreting your results.
Blocks searched=Number of blocks searched with query.
Alignments done=Number of alignments done between query and blocks searched.
This number is used to determine the expected value for each hit.
Cutoff expected value=Maximum combined E-value reported.
This is the number of matches expected to be found merely by chance.
Summary
One line is printed per hit, where a hit consists of blocks that belong
to a protein family represented in the database of blocks searched and
that have a combined E-value less than or equal to the cutoff.
Details
Detailed information is printed for each hit, including alignments with
the most similar sequence in each block.
Note: For searches using DNA queries, "Location" refers to the position in the query in base pairs from 5' to 3' on the forward strand, whereas the map and alignments show the translated position in amino acid residues.
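To relate an amino acid position in a translated frame to a base-pair location on the forward strand, a conversion along the following lines can be used. This is a sketch under assumed conventions, which the searcher's output may not share exactly: positions are 1-based, frames 1 to 3 are read from the forward strand, and frames -1 to -3 are read from the reverse complement starting at its 5' end. The function name `codon_span` is hypothetical.

```python
def codon_span(aa_pos, frame, dna_length):
    """Return (start, end) in forward-strand base pairs, 1-based inclusive,
    of the codon encoding residue aa_pos in the given translation frame.

    Assumed convention: frames 1..3 are forward, -1..-3 are reverse.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    if aa_pos < 1:
        raise ValueError("aa_pos is 1-based")
    # Position of the codon on the strand that was translated.
    start = (abs(frame) - 1) + 3 * (aa_pos - 1) + 1
    end = start + 2
    if end > dna_length:
        raise ValueError("codon lies beyond the end of the sequence")
    if frame > 0:
        return start, end
    # Map reverse-strand coordinates back onto the forward strand.
    return dna_length - end + 1, dna_length - start + 1

forward_first = codon_span(1, 1, 30)    # first codon, frame 1
forward_shift = codon_span(2, 3, 30)    # second codon, frame 3
reverse_first = codon_span(1, -1, 30)   # first codon, frame -1
```

For a 30-base-pair query, the first residue of frame -1 is encoded by the last three base pairs of the forward strand, which is why the map and the reported "Location" can look unrelated at first glance.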
Getting documentation for blocks
Following up a potentially interesting hit is often aided by examining the
full set of blocks for a group.
Hits are linked to Get blocks.
If you find the Blocks Searcher useful, please cite:
Henikoff S, Henikoff JG: Protein family classification based on searching a database of blocks. Genomics 1994, 19:97-107. [Postscript PDF]
Other references for this work are:
1. Henikoff S, Henikoff JG: Automated assembly of protein blocks for database searching. Nucleic Acids Res. 1991, 19:6565-6572. [Postscript PDF]
2. Bairoch A: PROSITE: A dictionary of sites and patterns in proteins. Nucleic Acids Res. 1992, 20:2013-2018. [Prosite page]
3. Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 1992, 20:2019-2022. [Swiss-Prot page]
4. Henikoff JG, Henikoff S: Using substitution probabilities to improve position-specific scoring matrices. CABIOS 1996, 12:135-143. [Postscript PDF]
5. Wallace JC, Henikoff S: PATMAT: a searching and extraction program for sequence, pattern, and block queries and databases. CABIOS 1992, 8:249-254.
6. Henikoff S: Detection of Caenorhabditis transposon homologs in diverse organisms. New Biol. 1992, 4:382-388.
7. Oliver SG et al.: The complete DNA sequence of yeast chromosome III. Nature 1992, 357:38-46.
8. Bork P, Ouzounis C, Sander C, Scharf M, Schneider R, Sonnhammer E: What's in a genome? Nature 1992, 358:287.
9. Henikoff S, Henikoff JG: A protein family classification method for analysis of large DNA sequences. Proc. 27th HICSS 1994, p. 265-274. [Postscript PDF]
10. Henikoff S, Henikoff JG: Position-based sequence weights. J. Mol. Biol. 1994, 243:574-578. [Postscript PDF]
11. Tatusov RL, Altschul SF, Koonin EV: Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. PNAS 1994, 91:12091-12095.
12. Henikoff JG, Henikoff S: Using substitution probabilities to improve position-specific scoring matrices. CABIOS 1996, 12:135-143.
13. Henikoff S, Henikoff JG: Embedding strategies for effective use of multiple sequence alignment information. 1996, submitted for publication.
14. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. NAR 1994, 22:4673-4680. [FTP CLUSTAL W]
15. Saitou N, Nei M: The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987, 4:406-425.
16. Felsenstein J: Cladistics 1989, 5:164-166. [Phylip page]
17. McLachlan A: J. Mol. Biol. 1983, 169:15-30.
18. Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics 1998, 14:48-54. [Postscript]