## Wiki (538 MB)

BWT of the edit history of some Wikipedia pages (see pages), using words as
symbols. The Wikipedia Extractor was used to transform the XML file of the
pages into a simpler text file. Then, this script was used to convert words
into a contiguous integer alphabet. The final BWT can be obtained with any
suffix array algorithm for integer alphabets.
This dataset has 140,990,835 symbols, an alphabet size of 174,796 and
2,586,752 runs. Each symbol is encoded using 4 bytes.
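
The word-to-integer step and the BWT construction can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the function names are hypothetical, identifiers are assumed to be assigned in order of first appearance, `0` is assumed to be a terminator smaller than every word identifier, and the suffix array is built by naive sorting, which is only practical for tiny inputs.

```python
def words_to_integers(text):
    """Map each distinct word to an integer 1..sigma (assumed: first-appearance order)."""
    ids = {}
    seq = []
    for word in text.split():
        if word not in ids:
            ids[word] = len(ids) + 1  # 0 is reserved for the terminator
        seq.append(ids[word])
    return seq, ids


def bwt_from_suffix_array(seq):
    """Return the BWT of seq followed by a terminator 0 (assumed convention)."""
    s = seq + [0]
    # Naive suffix array: sort suffix start positions by the suffix itself.
    # A real pipeline would use a linear-time integer-alphabet algorithm.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol of each suffix is the symbol preceding it (cyclically).
    return [s[i - 1] for i in sa]


seq, ids = words_to_integers("a b a b a")
bwt = bwt_from_suffix_array(seq)
```

On real data the naive sort should be replaced by a suffix array construction algorithm that supports integer alphabets.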

## Wiki (319 MB)

BWT of the edit history of some Wikipedia pages (see pages), using words as
symbols. The Wikipedia Extractor was used to transform the XML file of the
pages into a simpler text file. Then, this script was used to convert words
into a contiguous integer alphabet. The final BWT can be obtained with any
suffix array algorithm for integer alphabets.
This dataset has 83,374,477 symbols, an alphabet size of 188,932 and
2,694,892 runs. Each symbol is encoded using 4 bytes.
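
The reported figures (number of symbols, alphabet size, number of runs) can be recomputed from a file with fixed-width symbols. The sketch below is only an illustration: the function names are hypothetical and little-endian byte order is an assumption, since the byte order of the files is not stated here.

```python
def read_symbols(data, width, byteorder="little"):
    """Decode a bytes object into integers of `width` bytes each (byte order assumed)."""
    if len(data) % width != 0:
        raise ValueError("data length is not a multiple of the symbol width")
    return [
        int.from_bytes(data[i:i + width], byteorder)
        for i in range(0, len(data), width)
    ]


def bwt_stats(symbols):
    """Return (number of symbols, alphabet size, number of runs)."""
    runs = 0
    prev = None
    for s in symbols:
        if s != prev:  # a new run starts whenever the symbol changes
            runs += 1
            prev = s
    return len(symbols), len(set(symbols)), runs


# Tiny in-memory example with 4-byte symbols: 1 1 2 2 2 1
data = b"".join(v.to_bytes(4, "little") for v in [1, 1, 2, 2, 2, 1])
stats = bwt_stats(read_symbols(data, 4))
```

For the full files, reading in chunks or memory-mapping would avoid holding every symbol as a Python integer.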

## World leaders 1B (45 MB)

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus.
This dataset has 46,968,182 symbols, an alphabet size of 90 and
573,487 runs. Each symbol is encoded using 1 byte.

## World leaders 2B (90 MB)

BWT of the repetitive dataset World leaders of the Pizza&Chili corpus. For
this dataset, instead of taking the single preceding symbol during the BWT
construction, the two preceding symbols were taken.
This dataset has 46,968,182 symbols, an alphabet size of 2,528 and
875,406 runs. Each symbol is encoded using 2 bytes.
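
One way to read the two-symbol variant is sketched below: for every suffix in sorted order, the two symbols preceding it are packed into a single 2-byte value. This is an interpretation, not the construction actually used. The function name is hypothetical, and the zero-byte terminator, the cyclic wrap-around at the start of the text, and the packing order (farther symbol in the high byte) are all assumptions.

```python
def bwt_two_previous(text):
    """Emit, per sorted suffix, the two preceding bytes packed as one 16-bit value."""
    if 0 in text:
        raise ValueError("text must not contain the terminator byte 0")
    s = text + b"\x00"  # assumed terminator, smaller than every text byte
    n = len(s)
    # Naive suffix array by direct sorting; fine for a toy example only.
    sa = sorted(range(n), key=lambda i: s[i:])
    # High byte: symbol two positions back; low byte: symbol one position back.
    return [(s[(i - 2) % n] << 8) | s[(i - 1) % n] for i in sa]


pairs = bwt_two_previous(b"abab")
# The low byte of each value is the ordinary one-symbol BWT.
ordinary = bytes(v & 0xFF for v in pairs)
```

Under these assumptions the output has as many symbols as the one-byte BWT, which is consistent with both World leaders datasets reporting 46,968,182 symbols while the alphabet grows from 90 to 2,528.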

## Kernel 1B (247 MB)

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus.
This dataset has 257,961,617 symbols, an alphabet size of 161 and
2,791,368 runs. Each symbol is encoded using 1 byte.

## Kernel 2B (493 MB)

BWT of the repetitive dataset Kernel of the Pizza&Chili corpus. For
this dataset, instead of taking the single preceding symbol during the BWT
construction, the two preceding symbols were taken.
This dataset has 257,961,617 symbols, an alphabet size of 7,124 and
4,194,799 runs. Each symbol is encoded using 2 bytes.