Table 2 Datasets used for performance evaluation

From: Simrank: Rapid and sensitive general-purpose k-mer search tool

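| Data Set | String Type | Mean Length | Database Count | Query Count | Alphabet Size | k-mer Length | Total Database k-mers |
|---|---|---|---|---|---|---|---|
| 16S [a] | DNA | 1350 | 188,073 | 2000 | 4 | 7 | 16,384 |
| Pyro [b] | DNA | 150 | 501,532 | 500 | 4 | 6 | 4,096 |
| ITS [c] | DNA | 627 | 212,367 | 2000 | 4 | 6 | 4,096 |
| Shuffle [d] | DNA | 687 | 1,000,000 | 1000 | 4 | 7 | 16,384 |
| gpI [e] | RNA | 398 | 20,085 | 5000 | 4 | 7 | 16,360 |
| GP120 [f] | Protein | 175 | 68,119 | 2000 | 20 | 4 | 98,695 |
| Institutes [g] | Text | 121 | 23,768 | 1000 | 47/61 | 4 | 67,287 |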

  a. Greengenes 16S rRNA gene collection (DeSantis, 2006)
  b. Roche-454 pyrosequences from gastrointestinal contents (Ochman, 2010)
  c. Internal Transcribed Spacer region from eukaryotic ribosomal genes.
  d. Derived from random repetitive shuffling of Ralstonia solanacearum strain UW486 endoglucanase precursor, DQ657652 (Castillo and Greenberg, 2007)
  e. Group I catalytic introns, RFAM RF00028 (Griffiths-Jones, et al., 2003)
  f. HIV Envelope glycoprotein, PFAM PF00516 (Finn, 2008)
  g. Institute names as displayed in GenBank records. For BLAST and SSAHA2, all non-alphanumeric characters were interpreted as a space, giving an alphabet size of 47; for Simrank, no substitution was performed for any of the 61 unique characters.
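
The relationship between the alphabet size, k-mer length, and total database k-mers columns can be checked with a little arithmetic. The sketch below assumes that "total database k-mers" counts the distinct k-mers observed in each database. The table does not state this explicitly, but it is consistent with the DNA rows, where the value equals the alphabet size raised to the k-mer length (4^7 = 16,384 and 4^6 = 4,096). The function names are hypothetical, the data values are copied from the table, and the Institutes row uses the 61-character alphabet reported for Simrank.

```python
# Upper bound on the number of distinct k-mers: alphabet_size ** k.
# ASSUMPTION: the table's "total database k-mers" column counts distinct
# k-mers observed in the database; the source does not say so explicitly.

# name -> (alphabet size, k-mer length, total database k-mers), from Table 2
datasets = {
    "16S": (4, 7, 16384),
    "Pyro": (4, 6, 4096),
    "ITS": (4, 6, 4096),
    "Shuffle": (4, 7, 16384),
    "gpI": (4, 7, 16360),
    "GP120": (20, 4, 98695),
    "Institutes": (61, 4, 67287),  # 61-character alphabet used for Simrank
}


def max_distinct_kmers(alphabet_size: int, k: int) -> int:
    """Number of possible strings of length k over the given alphabet."""
    return alphabet_size ** k


def saturation(alphabet_size: int, k: int, observed: int) -> float:
    """Fraction of the possible k-mer space that the database covers."""
    return observed / max_distinct_kmers(alphabet_size, k)


if __name__ == "__main__":
    for name, (alphabet_size, k, observed) in datasets.items():
        bound = max_distinct_kmers(alphabet_size, k)
        frac = saturation(alphabet_size, k, observed)
        print(f"{name:<11} {observed:>7,} / {bound:>10,} = {frac:.4f}")
```

Under this reading, the four DNA databases cover the whole k-mer space, gpI falls just short of it, and the protein and text databases cover only part of their much larger spaces (20^4 = 160,000 and 61^4 = 13,845,841 possible 4-mers).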