## DBLP-bwt (11 MB)

Burrows-Wheeler Transform of the dataset dblp of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

Datasets of integer vectors. The vectors are computed from 100 MB (104,857,600 symbols) of four datasets from the Pizza&Chili corpus: dblp, cere, kernel and eins (einstein.en). For each dataset, three types of vectors are computed: the Burrows-Wheeler Transform (bwt), the longest common prefix array (lcp) and the function Ψ used in compressed suffix arrays (psi). For more details on the vectors, see Table 2 in the following publication.
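As a rough illustration of the three vector types, the following sketch computes them for a toy string. It uses a naive suffix array construction that is only suitable for tiny inputs, and the function names are hypothetical; it is not the code used to generate the datasets.

```python
def suffix_array(text):
    # Naive construction by sorting all suffixes; toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text, sa):
    # BWT[i] is the symbol that precedes the i-th smallest suffix (cyclically).
    return [text[i - 1] for i in sa]

def lcp_from_sa(text, sa):
    # LCP[i] is the length of the longest common prefix of the suffixes
    # at positions sa[i-1] and sa[i]; LCP[0] is set to 0.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        h = 0
        while a + h < len(text) and b + h < len(text) and text[a + h] == text[b + h]:
            h += 1
        lcp[k] = h
    return lcp

def psi_from_sa(sa):
    # psi[i] is the rank of the suffix that starts one position after
    # the suffix of rank i (wrapping around at the end of the text).
    n = len(sa)
    isa = [0] * n
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return [isa[(pos + 1) % n] for pos in sa]

text = "banana$"  # "$" acts as a terminator smaller than every letter
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
lcp = lcp_from_sa(text, sa)
psi = psi_from_sa(sa)

# The bwt vectors store symbols as their ASCII integer values.
bwt_as_integers = [ord(c) for c in bwt]
```

For inputs of 100 MB, a linear-time suffix array construction would be needed instead of the sorting step shown here.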

Longest common prefix array of the dataset dblp of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

Ψ function of the dataset dblp of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

Burrows-Wheeler Transform of the dataset cere of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

Longest common prefix array of the dataset cere of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

Ψ function of the dataset cere of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

Burrows-Wheeler Transform of the dataset kernel of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

Longest common prefix array of the dataset kernel of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

Ψ function of the dataset kernel of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

Burrows-Wheeler Transform of the dataset einstein.en of the Pizza&Chili corpus. The symbols were converted into integer values using the ASCII table. Each integer was encoded using 1 byte.

Longest common prefix array of the dataset einstein.en of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.

Ψ function of the dataset einstein.en of the Pizza&Chili corpus. Each integer was encoded using 4 bytes.
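The fixed-width encodings above (1 byte per bwt symbol, 4 bytes per lcp or psi value) suggest a simple way to load the vectors. The sketch below assumes headerless files of little-endian unsigned integers; the byte order and the absence of a header are assumptions that should be checked against the actual files, and `read_vector` is a hypothetical helper name.

```python
import os
import struct
import tempfile

def read_vector(path, bytes_per_integer):
    # Assumption: raw, headerless file of fixed-width little-endian
    # unsigned integers. Adjust the format codes if the files differ.
    fmt = {1: "<B", 2: "<H", 4: "<I"}[bytes_per_integer]
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % bytes_per_integer != 0:
        raise ValueError("file size is not a multiple of the integer width")
    return [value for (value,) in struct.iter_unpack(fmt, data)]

# Round trip on a tiny temporary file standing in for an lcp or psi vector.
values = [0, 0, 1, 3, 0, 0, 2]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<%dI" % len(values), *values))
loaded = read_vector(tmp.name, 4)
os.remove(tmp.name)
```

A Python list is memory-hungry for vectors with over 100 million entries; for the full files, an array-based reader such as `numpy.fromfile(path, dtype="<u4")` is likely a better fit under the same format assumptions.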

In this section we provide the balanced parentheses representation of
several trees. The balanced parentheses representation of a tree
consists of a depth-first preorder traversal of the tree, writing an
opening parenthesis when visiting a forward edge (parent to child), and
a closing parenthesis when visiting a backward edge (child to
parent). Thus, a balanced parentheses representation of a tree
with *n* nodes has *2n* parentheses.
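The traversal can be sketched as follows. This is a minimal illustration, not the code used to build the datasets: it uses the common node-based convention (an opening parenthesis when a node is entered and a closing one when it is left), which yields exactly *2n* parentheses. The `children` dictionary and the function name are hypothetical.

```python
def balanced_parentheses(children, root):
    # children maps each node to the ordered list of its children;
    # leaves may be omitted from the dictionary.
    out = []
    stack = [(root, False)]
    while stack:
        node, leaving = stack.pop()
        if leaving:
            out.append(")")
            continue
        out.append("(")
        stack.append((node, True))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return "".join(out)

# A tree with 5 nodes: a has children b and c, b has children d and e.
tree = {"a": ["b", "c"], "b": ["d", "e"]}
bp = balanced_parentheses(tree, "a")

# One possible bit encoding (an assumption): 1 for "(", 0 for ")".
bits = [1 if p == "(" else 0 for p in bp]
```

An explicit stack is used instead of recursion so that deep trees do not hit Python's recursion limit.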

To compute new balanced parenthesis sequences based on complete binary trees and suffix trees, and to read the datasets as bitarrays, use this code

Balanced parentheses representation of the XML of a Wikipedia dump (January 12, 2015). This dataset has 498,753,914 parentheses.

Balanced parenthesis representation of the suffix tree of the protein dataset from the Pizza&Chili corpus. This dataset has 670,721,006 parentheses.

Balanced parenthesis representation of the suffix tree of the DNA dataset from the Pizza&Chili corpus. This dataset has 1,154,482,174 parentheses.

Balanced parentheses representation of a complete binary tree of depth 30. This dataset has 2,147,483,644 parentheses.

Balanced parentheses representation of the XML of an OpenStreetMap dump (January 10, 2015). This dataset has 4,675,776,358 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset cere from the Pizza&Chili corpus. This dataset has 1,815,764,284 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset coreutils from the Pizza&Chili corpus. This dataset has 774,329,686 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset einstein.de from the Pizza&Chili corpus. This dataset has 367,324,468 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset einstein.en from the Pizza&Chili corpus. This dataset has 1,828,736,484 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset escherichia from the Pizza&Chili corpus. This dataset has 434,860,542 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset influenza from the Pizza&Chili corpus. This dataset has 603,704,964 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset kernel from the Pizza&Chili corpus. This dataset has 1,016,126,816 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset para from the Pizza&Chili corpus. This dataset has 1,692,193,346 parentheses.

Balanced parenthesis representation of the suffix tree of the repetitive dataset world_leaders from the Pizza&Chili corpus. This dataset has 179,236,696 parentheses.

Datasets of repetitive sequences with long runs. All the sequences were generated by computing the
*Burrows-Wheeler Transform (BWT)* of repetitive sequences.
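For reference, a run is a maximal block of consecutive equal symbols. The sketch below shows one way to compute the statistics quoted for each dataset (number of symbols, alphabet size and number of runs) on a toy sequence; the helper names are hypothetical.

```python
from itertools import groupby

def count_runs(seq):
    # groupby yields one group per maximal block of equal adjacent symbols.
    return sum(1 for _ in groupby(seq))

def alphabet_size(seq):
    # Number of distinct symbols that occur in the sequence.
    return len(set(seq))

toy_bwt = "annb$aa"  # BWT of "banana$"
stats = (len(toy_bwt), alphabet_size(toy_bwt), count_runs(toy_bwt))
```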

BWT of the edit history of some Wikipedia pages (see pages), using words as symbols. The Wikipedia Extractor was used to transform the XML file of the pages into a simpler text file. Then, this script was used to convert words into a contiguous integer alphabet. The final BWT can be obtained with any suffix array algorithm for integer alphabets. This dataset has 140,990,835 symbols, an alphabet size of 174,796 and 2,586,752 runs. Each symbol was encoded using 4 bytes.

BWT of the edit history of some Wikipedia pages (see pages), using words as symbols. The Wikipedia Extractor was used to transform the XML file of the pages into a simpler text file. Then, this script was used to convert words into a contiguous integer alphabet. The final BWT can be obtained with any suffix array algorithm for integer alphabets. This dataset has 83,374,477 symbols, an alphabet size of 188,932 and 2,694,892 runs. Each symbol was encoded using 4 bytes.
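The conversion of words into a contiguous integer alphabet can be sketched as below. The original script is not reproduced here, so the details are assumptions: words are split on whitespace, identifiers start at 0, and they are assigned in sorted order of the words. The function name is hypothetical.

```python
import struct

def words_to_integers(text):
    # Split on whitespace and give each distinct word an identifier in
    # 0..k-1, so that the resulting alphabet has no gaps.
    words = text.split()
    ids = {word: i for i, word in enumerate(sorted(set(words)))}
    return [ids[word] for word in words], ids

sequence, ids = words_to_integers("the cat saw the other cat")

# Each symbol stored in 4 bytes (little-endian order is an assumption).
encoded = struct.pack("<%dI" % len(sequence), *sequence)
```

The integer sequence can then be passed to a suffix array algorithm for integer alphabets to obtain the BWT.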

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus. This dataset has 46,968,182 symbols, an alphabet size of 90 and 573,487 runs. Each symbol was encoded using 1 byte.

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus. For this dataset, instead of taking the previous symbol during the BWT construction, the two previous symbols were taken. This dataset has 46,968,182 symbols, an alphabet size of 2,528 and 875,406 runs. Each symbol was encoded using 2 bytes.
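One possible reading of the two-symbol variant is sketched below: for each suffix in sorted order, the pair formed by its two preceding symbols is emitted, and each distinct pair is mapped to an integer identifier. The cyclic wrap-around at the start of the text and the mapping of pairs to contiguous identifiers are assumptions, and the function name is hypothetical.

```python
def bwt_two_previous(text):
    # Naive suffix sorting; toy inputs only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # For each suffix, take the two symbols that precede it (cyclically).
    pairs = [(text[i - 2], text[i - 1]) for i in sa]
    # Map each distinct pair to an identifier in 0..k-1 (assumed encoding).
    ids = {pair: k for k, pair in enumerate(sorted(set(pairs)))}
    return [ids[pair] for pair in pairs], pairs

symbols, pairs = bwt_two_previous("banana$")
distinct_pairs = len(set(pairs))
```

On this toy input the pair-based sequence has 5 distinct symbols, compared with 4 for the standard BWT of the same string, which mirrors the larger alphabet reported for the two-symbol datasets.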

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus. This dataset has 257,961,617 symbols, an alphabet size of 161 and 2,791,368 runs. Each symbol was encoded using 1 byte.

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus. For this dataset, instead of taking the previous symbol during the BWT construction, the two previous symbols were taken. This dataset has 257,961,617 symbols, an alphabet size of 7,124 and 4,194,799 runs. Each symbol was encoded using 2 bytes.