Simrank: Rapid and sensitive general-purpose k-mer search tool

BMC Ecology

Table 2 Datasets used for performance evaluation

Data Set	String Type	Mean Length	Database Count	Query Count	Alphabet Size	k-mer Length	Total Database k-mers
16S^a	DNA	1350	188,073	2000	4	7	16,384
Pyro^b	DNA	150	501,532	500	4	6	4,096
ITS^c	DNA	627	212,367	2000	4	6	4,096
Shuffle^d	DNA	687	1,000,000	1000	4	7	16,384
gpI^e	RNA	398	20,085	5000	4	7	16,360
GP120^f	Protein	175	68,119	2000	20	4	98,695
Institutes^g	Text	121	23,768	1000	47/61	4	67,287

^a Greengenes 16S rRNA gene collection (DeSantis, 2006)
^b Roche-454 pyrosequences from gastrointestinal contents (Ochman, 2010)
^c Internal Transcribed Spacer region from eukaryotic ribosomal genes.
^d Derived from random repetitive shuffling of Ralstonia solanacearum strain UW486 endoglucanase precursor, DQ657652 (Castillo and Greenberg, 2007)
^e Group I catalytic introns RFAM RF00028 (Griffiths-Jones, et al., 2003)
^f HIV Envelope glycoprotein PFAM PF00516 (Finn, 2008)
^g Institute names as displayed in GenBank records. For BLAST and SSAHA2, all non-alphanumeric characters were interpreted as a space, for a total alphabet size of 47; for Simrank, no substitution was performed for any of the 61 unique characters.

ISSN: 1472-6785