Table 1 Simrank database binary file structure and storage requirements.

From: Simrank: Rapid and sensitive general-purpose k-mer search tool

| File Segment | File Element | Storage Requirement (bytes) |
| --- | --- | --- |
| 1 | F, string ID field size | 10 |
| 2 | K, k-mer length | 10 |
| 3 | N, string count | 10 |
| 4 | string ID array | FN |
| 5 | offset arrays (a) | 4 Σ s_i (i = 1 to k) |
| 6 | k-mer array (b) | Kk |
| 7 | offsets index array (c) | 4k |
| 8 | offsets lengths array (d) | 4k |
| 9 | unique k-mers per string array (e) | 4N |
| 10 | k, unique k-mer count | 10 |
| 11 | file position of segment 6 | 10 |

  a. Each k-mer generates a vector of string indices, encoded as an integer array of the offsets required to "visit" each string index containing the k-mer. Here k is the count of unique k-mers, and s_i is the count of strings containing the i-th k-mer. Each offset is stored as a 4-byte integer.
  b. Lexically sorted ASCII text strings of each unique k-mer, stored as one byte per character.
  c. 4-byte integer list of file positions for the start of each k-mer's list of offsets.
  d. 4-byte integer list of the byte length of each k-mer's list of offsets.
  e. 4-byte integer list of the count of unique k-mers in each string.
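
The layout above can be illustrated with a minimal Python sketch. It is not the Simrank implementation: the function names are hypothetical, the offsets are assumed to be simple differences between consecutive sorted string indices (the table only says they are the offsets needed to "visit" each index), and little-endian unsigned 4-byte integers are assumed because the table does not state a byte order. The size calculation just sums the per-segment byte counts from the table, taking segment 5 as 4 bytes per offset, as footnote a describes.

```python
import struct
from itertools import accumulate


def encode_offsets(string_indices):
    """Turn sorted string indices into offsets (assumed: deltas from the previous index)."""
    offsets = []
    previous = 0
    for index in string_indices:
        offsets.append(index - previous)
        previous = index
    return offsets


def decode_offsets(offsets):
    """Recover the string indices by accumulating the offsets."""
    return list(accumulate(offsets))


def pack_offsets(offsets):
    """Serialize offsets as 4-byte unsigned integers (little-endian is an assumption)."""
    return struct.pack(f"<{len(offsets)}I", *offsets)


def database_size(F, K, N, strings_per_kmer):
    """Total bytes implied by the table.

    strings_per_kmer[i] is s_i, the number of strings containing the i-th
    unique k-mer, so k = len(strings_per_kmer).
    """
    k = len(strings_per_kmer)
    segments = [
        10,                         # 1: F, string ID field size
        10,                         # 2: K, k-mer length
        10,                         # 3: N, string count
        F * N,                      # 4: string ID array
        4 * sum(strings_per_kmer),  # 5: offset arrays, 4 bytes per offset
        K * k,                      # 6: k-mer array, one byte per character
        4 * k,                      # 7: offsets index array
        4 * k,                      # 8: offsets lengths array
        4 * N,                      # 9: unique k-mers per string array
        10,                         # 10: k, unique k-mer count
        10,                         # 11: file position of segment 6
    ]
    return sum(segments)


if __name__ == "__main__":
    indices = [2, 5, 6, 11]  # strings containing one hypothetical k-mer
    offsets = encode_offsets(indices)
    print(offsets, decode_offsets(offsets), len(pack_offsets(offsets)))
    print(database_size(F=8, K=7, N=3, strings_per_kmer=[2, 1, 3]))
```

Under the delta assumption the offsets stay small when a k-mer occurs in many strings, although the fixed 4-byte width in the table means the file size depends only on the number of offsets, not on their values.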