Parallel Construction of Succinct Data Structures

PhD Thesis of José Fuentes-Sepúlveda
Advisor: Leo Ferres, PhD
Co-advisor: Meng He, PhD
Department of Computer Science
University of Concepción, Chile

Introduction

As of 31st December, 2013, the total number of accessible Web pages amounted to 14.3 trillions, of which only 48 billions were indexed by Google (∼ 0.0034% of the total) and 14 billions were indexed by Microsoft’s Bing (∼ 0.00098% of the total) [2]. Around the same time, the total size of accessible data was calculated to be approximately 672EB (exabytes), not including very large private databases such as Facebook, Twitter, the stock exchange, human genome, among others. It is safe to assume that, barring some unexpected catastrophe, the same trend will persist in the coming years: the size of the data available on the Internet will remain growing exponentially. Thus, it now has become imperative to find ways to solve the problem of reading, processing and storing those enormous amounts of data. To date, two main approaches have been proposed to solve the problem: the traditional increase in the machines’ processing power (led by clock speed up until the beginning of the millenium, and superseded by adding processors and cores as of 2004), and the more modern, algorithm-based minimization of the space required to store data.

After their introduction in the mid-2000s, multicore computers have become pervasive. In fact, it is hard nowadays to find a single-core desktop, let alone a high-end-server. The argument for multicore systems is simple [4, 5]: thermodynamic and material considerations prevent chip manufacturers from increasing clock frequencies beyond 4GHz. Since 2005, clock frequencies have stagnated at around 3.75GHz for commodity computers, and even in 2013 4GHz computers are rare. With more processing power, we can speed up algorithms that process large data and, accordingly, process more data in less time. The first approach delineated in the previous paragraph aims at taking advantage of these new multi-core architectures.

The second approach, the minimization of the space needed by data, can be further sub-divided into two categories of algorithms: those reducing the space needed to store the data and those reducing space considering some operations of interest. The algorithms in the first category reduce space by exploiting regularities in the data. This approach is known as compression. Huffman code [1] and Lempel-Ziv [6] belong to this category. Operations on the compressed data are not always possible, requiring users to decompress the data either partially or totally. The second category of algorithms use the information-theoretic minimum number of bits to represent data while supporting operations in ideally optimal time, that are of interest to the problem at hand. This approach is known as succinct data structures [3]. In general, compression techniques use less space than succinct data structures, but succinct data structures support operations directly without requiring decompression. Succinct data structures have constant or logarithmic time complexity in most of their primitive operations. Therefore, in context where the data will be queried constantly, succinct data structures are a better choice.

A weak point of succinct data structures is their construction time, which is generally quite slow compared to other operations of the data structure. We integrate both approaches by solving the problem of construction time, using the capabilities of multicore machines. This thesis will focus on improving the design of succinct data structures over multicore architectures, obtaining good theoretical results that are also practical. In this work, practical results means results that can be implemented in commodity architectures, results that scale on time over the number of cores on a machine and results where their implementations use memory space accordingly with the theoretical results. Improving the construction time using multicore architectures allows us to design succinct data structures with competitive querying time, efficient space usage and fast/scalable construction time.

In particular, three succinct data structures will be addressed: wavelet trees, succinct ordinal trees and triangulated plane graphs. For wavelet trees, we present two construction algorithms that achieves $O (n)$ and $O (\lg n)$ time using $O (n, \lg σ + σ, \lg n)$ and $O (n, \lg σ + p, σ \frac{\lg n}{\lg σ})$ bits of space, respectively, where $n$ is the size of the input, $σ$ is the alphabet size and $p$ is the number of available threads. For succinct trees, we introduce a practical construction algorithm that takes $O (\lg n)$ time and $O (n, \lg n)$ bits of space for a tree on $n$ nodes. For triangulated plane graphs, we present a parallel algorithm that computes the succinct representation of a triangulated plane graph, with $n$ vertices and a canonical ordering, in $O (\lg n)$ time and $O (n, \lg n)$ bits of space.

PDF document

This thesis is available here.

Code and datasets

The code and datasets used in this thesis are available in the following sections.

[1] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9):1098-1101, September 1952.
[2] Facts Hunt. Total number of websites & size of the internet as of 2013. Last accessed: March 04, 2015. URL: Link
[3] Guy Joseph Jacobson. Succinct Static Data Structures. PhD thesis, School of computer science, Carnegie Mellon University, Pittsburgh, PA, USA, 1988.
[4] Paul Otellini. Keynote speech at intel developer forum, 2003. Last accessed: March 05, 2015. URL: Link
[5] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software, 2005. Last accessed: March 05, 2015. URL: Link
[6] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor., 23(3):337-343, May 1977.

This work was supported the doctoral scholarship of CONICYT (21120974) and in part by the Emerging Leaders in the Americas scholarship programme

We would like to thank to Free DNS and HighCharts for their services