Data Set | String Type | Mean Length | Database Count | Query Count | Alphabet Size | k-mer Length | Total Database k-mers |
--- | --- | --- | --- | --- | --- | --- | --- |
16S<sup>a</sup> | DNA | 1350 | 188,073 | 2000 | 4 | 7 | 16,384 |
Pyro<sup>b</sup> | DNA | 150 | 501,532 | 500 | 4 | 6 | 4,096 |
ITS<sup>c</sup> | DNA | 627 | 212,367 | 2000 | 4 | 6 | 4,096 |
Shuffle<sup>d</sup> | DNA | 687 | 1,000,000 | 1000 | 4 | 7 | 16,384 |
gpI<sup>e</sup> | RNA | 398 | 20,085 | 5000 | 4 | 7 | 16,360 |
GP120<sup>f</sup> | Protein | 175 | 68,119 | 2000 | 20 | 4 | 98,695 |
Institutes<sup>g</sup> | Text | 121 | 23,768 | 1000 | 47/61 | 4 | 67,287 |
- a Greengenes 16S rRNA gene collection (DeSantis, 2006)
- b Roche-454 pyrosequences from gastrointestinal contents (Ochman, 2010)
- c Internal Transcribed Spacer region from eukaryotic ribosomal genes.
- d Derived from random repetitive shuffling of Ralstonia solanacearum strain UW486 endoglucanase precursor, DQ657652 (Castillo and Greenberg, 2007)
- e Group I catalytic introns RFAM RF00028 (Griffiths-Jones et al., 2003)
- f HIV Envelope glycoprotein PFAM PF00516 (Finn, 2008)
- g Institute names as displayed in GenBank records. For BLAST and SSAHA2, all non-alphanumeric characters were interpreted as a space, for a total alphabet size of 47. For Simrank, no substitution was performed for any of the 61 unique characters.
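
The alphabet size and k-mer length columns bound the k-mer count: an alphabet of size s admits at most s^k distinct k-mers. The DNA rows with k = 7 and k = 6 report exactly 4^7 = 16,384 and 4^6 = 4,096, while the gpI, GP120, and Institutes rows report fewer than that bound. The sketch below illustrates the arithmetic, assuming the last column counts distinct k-mers observed in the database; the function names `max_kmers` and `distinct_kmers` are hypothetical and are not taken from Simrank.

```python
def max_kmers(alphabet_size: int, k: int) -> int:
    """Upper bound on the number of distinct k-mers over an alphabet."""
    return alphabet_size ** k


def distinct_kmers(sequences, k: int) -> set:
    """Collect every distinct substring of length k found in the sequences."""
    seen = set()
    for seq in sequences:
        # A sequence of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


# (alphabet size, k-mer length, reported k-mer count) copied from the table.
# For Institutes, the Simrank alphabet size of 61 is used.
table_rows = {
    "16S": (4, 7, 16_384),
    "Pyro": (4, 6, 4_096),
    "gpI": (4, 7, 16_360),
    "GP120": (20, 4, 98_695),
    "Institutes": (61, 4, 67_287),
}

for name, (alphabet_size, k, reported) in table_rows.items():
    bound = max_kmers(alphabet_size, k)
    # Fraction of the theoretical k-mer space that the reported count covers.
    print(f"{name}: {reported} of {bound} possible ({reported / bound:.2%})")
```

Under that assumption, the DNA sets with large databases saturate the k-mer space, whereas the protein and text sets cover only part of a much larger space.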