Integer vectors

Datasets of integer vectors. The vectors are computed from 100 MB (104,857,600 symbols) of four datasets from the Pizza&Chili Corpus: dblp, cere, kernel and eins. For each dataset, three types of vectors are computed: the Burrows-Wheeler Transform (bwt), the longest common prefix array (lcp) and the function Ψ used in compressed suffix arrays (psi). For more details on the vectors, see Table 2 in the following publication.
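As an illustration of the three vector types, the following is a minimal sketch that computes them for a tiny text. It is not the code used to build the datasets: the naive suffix sorting, the unique smallest terminator `$`, the cyclic treatment of the first and last positions, and all function names are assumptions made for this example.

```python
def suffix_array(text):
    """Naive suffix array by sorting suffixes; suitable for tiny inputs only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text, sa):
    """BWT[i] is the symbol preceding the i-th smallest suffix (cyclically)."""
    return "".join(text[i - 1] for i in sa)

def lcp_from_sa(text, sa):
    """LCP[i] is the longest common prefix length of suffixes sa[i-1] and sa[i]; LCP[0] = 0."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = text[sa[r - 1]:], text[sa[r]:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp[r] = k
    return lcp

def psi_from_sa(sa):
    """psi[i] is the rank of the suffix that starts one position after sa[i] (cyclically)."""
    n = len(sa)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    return [rank[(pos + 1) % n] for pos in sa]

text = "banana$"              # '$' is assumed to be unique and smaller than every other symbol
sa = suffix_array(text)       # [6, 5, 3, 1, 0, 4, 2]
bwt = bwt_from_sa(text, sa)   # "annb$aa"
lcp = lcp_from_sa(text, sa)   # [0, 0, 1, 3, 0, 0, 2]
psi = psi_from_sa(sa)         # [4, 0, 5, 6, 3, 1, 2]
```

For inputs of 100 MB, a linear-time suffix array construction would be needed instead of the naive sort shown here.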

DBLP-bwt (11 MB)

Burrows-Wheeler Transform of the dataset dblp of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

DBLP-lcp (66 MB)

Longest common prefix array of the dataset dblp of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

DBLP-psi (142 MB)

Ψ function of the dataset dblp of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

cere-bwt (8.9 MB)

Burrows-Wheeler Transform of the dataset cere of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

cere-lcp (200 MB)

Longest common prefix array of the dataset cere of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

cere-psi (141 MB)

Ψ function of the dataset cere of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

kernel-bwt (4.2 MB)

Burrows-Wheeler Transform of the dataset kernel of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

kernel-lcp (280 MB)

Longest common prefix array of the dataset kernel of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

kernel-psi (141 MB)

Ψ function of the dataset kernel of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

eins-bwt (362 KB)

Burrows-Wheeler Transform of the dataset einstein.en of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

eins-lcp (260 MB)

Longest common prefix array of the dataset einstein.en of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

eins-psi (139 MB)

Ψ function of the dataset einstein.en of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.
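The entries above state that each integer is encoded using 1 or 4 bytes. A minimal sketch of how such a file might be decoded is shown below. The byte order and signedness are not stated on this page, so the sketch assumes unsigned little-endian integers, no header, and a file that has already been decompressed; `decode_vector` and `read_vector` are hypothetical helper names.

```python
import struct

# Assumed layouts: unsigned, little-endian, fixed width (not confirmed by the source).
FORMATS = {1: "<B", 2: "<H", 4: "<I"}

def decode_vector(raw, width):
    """Decode raw bytes into a list of unsigned integers of `width` bytes each."""
    if len(raw) % width != 0:
        raise ValueError("input length is not a multiple of the integer width")
    return [value for (value,) in struct.iter_unpack(FORMATS[width], raw)]

def read_vector(path, width):
    """Read a whole file into memory and decode it (hypothetical helper)."""
    with open(path, "rb") as handle:
        return decode_vector(handle.read(), width)

# Two 4-byte integers, then two 1-byte integers.
print(decode_vector(bytes([1, 0, 0, 0, 0, 1, 0, 0]), 4))  # [1, 256]
print(decode_vector(b"ab", 1))                            # [97, 98]
```

If the decoded values look implausible (for example, lcp values larger than the text length), the byte-order assumption is the first thing to revisit.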

Balanced parenthesis sequences

In this section we provide the balanced parentheses representation of several trees. The balanced parentheses representation of a tree consists of a depth-first preorder traversal of the tree, writing an opening parenthesis when visiting a forward edge (parent to child), and a closing parenthesis when visiting a backward edge (child to parent). Thus, a balanced parentheses representation of a tree with n nodes has 2n parentheses.
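The traversal described above can be sketched as follows. This is an illustrative example, not the code used to produce the datasets. It assumes the convention in which every node, including the root, contributes one opening and one closing parenthesis, which matches the 2n count; the dictionary-based tree and the function name are hypothetical.

```python
def balanced_parentheses(children, root):
    """Preorder depth-first traversal: '(' when a node is entered, ')' when it is left."""
    out = []
    stack = [(root, False)]            # (node, already_entered)
    while stack:
        node, entered = stack.pop()
        if entered:
            out.append(")")            # returning to the parent
            continue
        out.append("(")                # descending into the node
        stack.append((node, True))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return "".join(out)

# Tree with 4 nodes: a has children b and c; b has child d.
tree = {"a": ["b", "c"], "b": ["d"]}
bp = balanced_parentheses(tree, "a")
print(bp, len(bp))  # ((())()) 8
```

An explicit stack is used instead of recursion so that deep trees, such as suffix trees of repetitive texts, do not exceed the recursion limit.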

To compute new balanced parenthesis sequences based on complete binary trees and suffix trees, and to read the datasets as bitarrays, use this code.
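Independently of the code referenced above, the idea can be sketched in a few lines. The example below generates the sequence of a small complete binary tree and packs it into a bit array. The bit layout of the published files is not specified on this page, so the mapping of `(` to 1 and `)` to 0, the least-significant-bit-first order, and every function name are assumptions of this sketch.

```python
def complete_binary_tree_bp(levels):
    """Parentheses of a complete binary tree with `levels` levels (root included)."""
    if levels <= 0:
        return ""
    subtree = complete_binary_tree_bp(levels - 1)
    return "(" + subtree + subtree + ")"

def pack_bits(bp):
    """Pack '(' as 1 and ')' as 0, eight per byte, first parenthesis in the lowest bit."""
    packed = bytearray((len(bp) + 7) // 8)
    for i, ch in enumerate(bp):
        if ch == "(":
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def unpack_bits(packed, length):
    """Inverse of pack_bits; `length` is the number of parentheses stored."""
    return "".join(
        "(" if (packed[i // 8] >> (i % 8)) & 1 else ")" for i in range(length)
    )

bp = complete_binary_tree_bp(2)
print(bp)                                   # (()())
packed = pack_bits(bp)
print(packed)                               # b'\x0b'
print(unpack_bits(packed, len(bp)) == bp)   # True
```

The recursive string construction is only practical for small depths; a tree as large as the one in the Complete tree dataset would have to be written out in a streaming fashion.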

Wikipedia (13 MB)

Balanced parentheses representation of the XML of a Wikipedia dump (January 12, 2015). This dataset has 498,753,914 parentheses.

Proteins (82 MB)

Balanced parenthesis representation of the suffix tree of the protein dataset from the Pizza&Chili corpus. This dataset has 670,721,006 parentheses.

DNA (135 MB)

Balanced parenthesis representation of the suffix tree of the DNA dataset from the Pizza&Chili corpus. This dataset has 1,154,482,174 parentheses.

Complete tree (18 MB)

Balanced parentheses representation of a complete binary tree of depth 30. This dataset has 2,147,483,644 parentheses.

OpenStreetMap (76 MB)

Balanced parentheses representation of the XML of an OpenStreetMap dump (January 10, 2015). This dataset has 4,675,776,358 parentheses.

Cere (70 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset cere from the Pizza&Chili corpus. This dataset has 1,815,764,284 parentheses.

CoreUtils (22 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset coreutils from the Pizza&Chili corpus. This dataset has 774,329,686 parentheses.

Einstein.de (3.1 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset einstein.de from the Pizza&Chili corpus. This dataset has 367,324,468 parentheses.

Einstein.en (11 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset einstein.en from the Pizza&Chili corpus. This dataset has 1,828,736,484 parentheses.

Escherichia (41 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset escherichia from the Pizza&Chili corpus. This dataset has 434,860,542 parentheses.

Influenza (39 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset influenza from the Pizza&Chili corpus. This dataset has 603,704,964 parentheses.

Kernel (36 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset kernel from the Pizza&Chili corpus. This dataset has 1,016,126,816 parentheses.

Para (125 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset para from the Pizza&Chili corpus. This dataset has 1,692,193,346 parentheses.

World Leaders (4.1 MB)

Balanced parenthesis representation of the suffix tree of the repetitive dataset world_leaders from the Pizza&Chili corpus. This dataset has 179,236,696 parentheses.

Repetitive sequences

Datasets of repetitive sequences with long runs. All the sequences were generated by computing the Burrows-Wheeler Transform (BWT) of repetitive sequences.
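The number of runs reported for each dataset below is the number of maximal blocks of equal consecutive symbols. A minimal sketch of that count, with a hypothetical function name:

```python
from itertools import groupby

def count_runs(sequence):
    """Number of maximal runs of equal consecutive symbols."""
    return sum(1 for _ in groupby(sequence))

# The BWT of "banana$" is "annb$aa": runs are a, nn, b, $, aa.
print(count_runs("annb$aa"))     # 5
print(count_runs([7, 7, 7, 7]))  # 1
```

The function accepts any iterable, so it works equally on strings and on lists of decoded integers.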

Wiki (538 MB)

BWT of the edit history of some Wikipedia pages (see pages), using words as symbols. The Wikipedia Extractor was used to transform the XML file of the pages into a simpler text file. Then, this script was used to convert words into a contiguous integer alphabet. The final BWT can be obtained by using any suffix array algorithm for integer alphabets. This dataset has 140,990,835 symbols, an alphabet size of 174,796 and 2,586,752 runs. Each symbol was encoded using 4 bytes.

Wiki (319 MB)

BWT of the edit history of some Wikipedia pages (see pages), using words as symbols. The Wikipedia Extractor was used to transform the XML file of the pages into a simpler text file. Then, this script was used to convert words into a contiguous integer alphabet. The final BWT can be obtained by using any suffix array algorithm for integer alphabets. This dataset has 83,374,477 symbols, an alphabet size of 188,932 and 2,694,892 runs. Each symbol was encoded using 4 bytes.
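The word-based pipeline used for the Wiki datasets can be illustrated on a toy sentence. This sketch is not the script referenced above: it assumes whitespace tokenization, identifiers assigned in sorted word order starting at 1, and 0 reserved as a terminator. The naive suffix sort stands in for a proper integer-alphabet suffix array algorithm.

```python
def words_to_integers(text):
    """Map each distinct word to an integer in 1..sigma (assumed: sorted word order)."""
    words = text.split()
    alphabet = {word: i + 1 for i, word in enumerate(sorted(set(words)))}
    return [alphabet[word] for word in words], alphabet

def bwt_of_integers(sequence):
    """BWT of an integer sequence, with 0 appended as a unique smallest terminator."""
    s = sequence + [0]
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive; tiny inputs only
    return [s[i - 1] for i in sa]

sequence, alphabet = words_to_integers("to be or not to be")
print(alphabet)                   # {'be': 1, 'not': 2, 'or': 3, 'to': 4}
print(sequence)                   # [4, 1, 3, 2, 4, 1]
print(bwt_of_integers(sequence))  # [1, 4, 4, 3, 1, 2, 0]
```

The order in which identifiers are assigned changes the suffix order and therefore the resulting BWT, so reproducing the published files exactly would require the original script.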

World leaders 1B (45 MB)

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus. This dataset has 46,968,182 symbols, an alphabet size of 90 and 573,487 runs. Each symbol was encoded using 1 byte.

World leaders 2B (90 MB)

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus. For this dataset, instead of taking the previous symbol during the BWT construction, the two previous symbols were taken. This dataset has 46,968,182 symbols, an alphabet size of 2,528 and 875,406 runs. Each symbol was encoded using 2 bytes.
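The two-symbol variant can be sketched as follows: for every suffix in sorted order, the two symbols that precede it are emitted as a single 16-bit value. This is an illustration under stated assumptions, not the construction code for the 2B datasets. The packing order (earlier symbol in the high byte), the cyclic wrap at the start of the text, and the function name are all hypothetical.

```python
def two_symbol_bwt(text):
    """For each suffix in sorted order, pack the two preceding symbols into 16 bits."""
    data = text.encode("latin-1")                          # one byte per symbol
    sa = sorted(range(len(data)), key=lambda i: data[i:])  # naive; tiny inputs only
    # Negative indices wrap around, giving the cyclic predecessors of positions 0 and 1.
    return [(data[i - 2] << 8) | data[i - 1] for i in sa]

values = two_symbol_bwt("banana$")
pairs = [bytes([v >> 8, v & 0xFF]).decode("latin-1") for v in values]
print(pairs)  # ['na', 'an', 'an', '$b', 'a$', 'na', 'ba']
```

Because each output symbol is a pair, the alphabet grows (here from 4 distinct symbols to 5 distinct pairs), which is consistent with the larger alphabet sizes reported for the 2B datasets.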

Kernel 1B (247 MB)

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus. This dataset has 257,961,617 symbols, an alphabet size of 161 and 2,791,368 runs. Each symbol was encoded using 1 byte.

Kernel 2B (493 MB)

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus. For this dataset, instead of taking the previous symbol during the BWT construction, the two previous symbols were taken. This dataset has 257,961,617 symbols, an alphabet size of 7,124 and 4,194,799 runs. Each symbol was encoded using 2 bytes.
