Since their introduction in the mid-2000s, multicore computers (computers with more than one processing unit, called cores, and shared memory) have become pervasive. In fact, it is hard nowadays to find a single-processor desktop, let alone a high-end server. The argument for multicore systems is simple: thermodynamic and material considerations prevent chip manufacturers from increasing clock frequencies indefinitely. Since 2005, clock frequencies have stagnated at around 3.75GHz for commodity computers, and even in 2015, 4GHz computers are still high-end (the Intel Core i7-4790K is a prime example). Thus, one natural next step for performance is to take advantage of multicore computers. To do this, algorithms and data structures have to be adapted so that they perform well on parallel architectures.
Along with higher-speed processing, much has been written in the past few years about succinct data structures. This is due, in part, to the fact that the data to be processed has grown large enough that the ability to keep it close to the processor is vital. One such structure that has benefited from thorough research is the wavelet tree (wtree). Although the wtree was originally devised as a data structure for encoding a reordering of the elements of a sequence, it has been successfully used in many applications: for example, to index documents, grids and even sets of rectangles, to name a few.
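To make the discussion concrete, the following is a minimal, non-succinct C sketch of the idea behind a single wavelet tree node: bit `level` of each symbol is written to a bitmap, and the node's subsequence is stably partitioned by that bit so that its two children can recurse on each half. This is only an illustration of the general technique, not the code used in this work; a real implementation would pack the bitmap and augment it with rank/select support.

```c
#include <stddef.h>

/* Illustrative sketch (not the paper's code): one wavelet tree node.
 * bitmap[i] receives bit `level` of seq[i] (one byte per bit here, for
 * clarity), and `next` receives the stable partition of seq: all symbols
 * whose bit is 0 first, then all symbols whose bit is 1. */
void wt_node(const unsigned char *seq, size_t n, int level,
             unsigned char *bitmap, unsigned char *next)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)        /* count symbols whose bit is 0 */
        if (!((seq[i] >> level) & 1))
            zeros++;

    size_t lo = 0, hi = zeros;            /* next write positions for 0s / 1s */
    for (size_t i = 0; i < n; i++) {
        unsigned char bit = (seq[i] >> level) & 1;
        bitmap[i] = bit;
        if (bit) next[hi++] = seq[i];
        else     next[lo++] = seq[i];
    }
}
```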
A problem that frequently arises is that these succinct data structures are quite expensive (in time) to build, particularly as the size of the alphabet and the size of the input increase, as is the case now with the so-called “big data” revolution. Thus, sizeable contributions to the state of the art would involve algorithms with good theoretical running times that are also practical on modern commodity architectures with more than one core. Unfortunately, (practical) parallel computing suffers from several difficulties that make such high-performance algorithms hard to come by: on the one hand, maintaining thread independence, and on the other, keeping clear of “thrashing” the memory hierarchy. If such algorithms are found, however, a positive side effect is that they would also help speed up processing of the target data structures in distributed systems: if each node is faster, the whole cluster is faster as well.
In this work we propose two parallel algorithms for the most expensive operation on wtrees: their construction. The first algorithm, pwt, has time complexity and uses bits of space, including the space of the final wtree and excluding the input. The second algorithm, dd, has time complexity and uses bits of space, using p threads. The pwt algorithm improves the memory consumption of [1], while the dd algorithm improves the time complexity of our previous work [2] and the time complexity of [1] by a factor of . Additionally, we report experiments showing that the algorithms are practical for large datasets, achieving good speedup.
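The precise algorithms are described in the paper; as a rough illustration only (this is neither pwt nor dd), the OpenMP sketch below shows one common way to parallelize the stable-partition step from the previous sketch across p threads: each thread counts the 0-bits in its chunk, an exclusive prefix sum over those counts yields per-thread write offsets, and the threads then fill the bitmap and the partitioned output without further synchronization. The function name and chunking scheme are assumptions made for the example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <omp.h>

/* Illustrative only: parallel stable partition of one wavelet tree node. */
void wt_node_parallel(const unsigned char *seq, size_t n, int level,
                      unsigned char *bitmap, unsigned char *next, int p)
{
    size_t *zeros = malloc(((size_t)p + 1) * sizeof *zeros);
    size_t chunk = (n + (size_t)p - 1) / (size_t)p;       /* ceil(n / p) */

    #pragma omp parallel num_threads(p)
    {
        int t = omp_get_thread_num();
        size_t beg = (size_t)t * chunk;
        size_t end = beg + chunk < n ? beg + chunk : n;
        if (beg > n) beg = n;                             /* trailing empty chunks */

        size_t z = 0;                                     /* 0-bits in my chunk */
        for (size_t i = beg; i < end; i++)
            if (!((seq[i] >> level) & 1)) z++;
        zeros[t] = z;

        #pragma omp barrier
        #pragma omp single
        {
            /* Exclusive prefix sum: zeros[t] becomes the number of 0-bits in
             * all chunks before chunk t; zeros[p] holds the total. */
            size_t sum = 0;
            for (int k = 0; k <= p; k++) {
                size_t c = (k < p) ? zeros[k] : 0;
                zeros[k] = sum;
                sum += c;
            }
        }   /* implicit barrier: offsets are ready for every thread */

        size_t lo = zeros[t];                             /* my 0-run start */
        size_t hi = zeros[p] + (beg - zeros[t]);          /* total 0s + 1s before me */
        for (size_t i = beg; i < end; i++) {
            unsigned char bit = (seq[i] >> level) & 1;
            bitmap[i] = bit;
            if (bit) next[hi++] = seq[i];
            else     next[lo++] = seq[i];
        }
    }
    free(zeros);
}
```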
The author's version of our work can be downloaded here. The link to the official version is here.
The code for this work is available on GitHub.
The datasets used in the experiments are listed below.

Dataset | n | σ |
---|---|---|
rna.512MB | 536,870,912 | 4 |
rna.1GB | 1,073,741,824 | 4 |
rna.2GB | 2,147,483,648 | 4 |
rna.3GB | 3,221,225,472 | 4 |
rna.4GB | 4,294,967,296 | 4 |
rna.5GB | 5,368,709,120 | 4 |
rna.6GB | 6,442,450,944 | 4 |
rna.13GB | 14,570,010,837 | 4 |
prot | 1,184,051,855 | 27 |
src.200MB | 210,866,607 | 230 |
src.98MB | 25,910,717 | 2,446,383 |
src.512MB | 134,217,728 | 2,446,383 |
src.1GB | 268,435,455 | 2,446,383 |
src.2GB | 536,870,911 | 2,446,383 |
en.x.27 | 134,217,728 | |
en.x.28 | 268,435,456 | |
en.x.29 | 536,870,912 | |
en.x.30 | 1,073,741,824 | |
The experiments were carried out on a 4-chip (8 NUMA nodes) AMD Opteron 6278 machine with 8 physical cores per NUMA node, each clocked at 2.4GHz, with one 64KB L1 instruction cache shared by two cores, one 16KB L1 data cache per core, a 2MB L2 cache shared between two cores, and 6MB of L3 cache shared among the 8 cores of each NUMA node. The machine had 192GB of DDR3 RAM, clocked at 1333MHz, with 24GB per NUMA node. Running times were measured using the high-resolution (nanosecond) C functions in time.h. Memory usage was measured using the tools provided by malloc_count. We repeated each trial five times and recorded the median time.
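For reference, below is a minimal sketch of the kind of timing harness described above, assuming the POSIX clock_gettime function from time.h with CLOCK_MONOTONIC; the actual measurement code may differ.

```c
#include <time.h>

/* Returns the wall-clock time, in seconds, taken by build(arg).  As stated
 * above, each configuration was run five times and the median was reported. */
double elapsed_seconds(void (*build)(void *), void *arg)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* nanosecond-resolution timestamp */
    build(arg);                            /* e.g., wavelet tree construction */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```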
Construction times (in seconds), using 1 and 64 threads for the parallel implementations:

Dataset | libcds | sdsl | pwt (1) | pwt (64) | dd (1) | dd (64) | shun (1) | shun (64)
---|---|---|---|---|---|---|---|---
rna.512MB | 23.42 | 32.41 | 11.83 | 7.00 | 12.65 | 0.40 | 12.63 | 0.67 |
rna.1GB | 47.38 | 65.30 | 23.89 | 16.19 | 25.30 | 0.62 | 25.36 | 1.32 |
rna.2GB | 100.13 | 131.86 | 46.98 | 27.62 | 50.80 | 1.20 | 50.89 | 2.64 |
rna.3GB | 142.90 | 220.11 | 71.09 | 41.00 | 75.37 | 2.17 | 66.35 | 3.79 |
rna.4GB | - | 198.10 | 94.39 | 55.04 | 101.44 | 2.84 | - | - |
rna.5GB | - | 329.27 | 117.13 | 68.24 | 126.66 | 3.57 | - | - |
rna.6GB | - | 389.25 | 141.59 | 81.80 | 152.57 | 4.35 | - | - |
rna.13GB | - | 881.41 | 314.86 | 330.44 | 333.14 | 10.75 | - | - |
prot | 104.40 | 142.67 | 58.54 | 21.81 | 68.19 | 2.17 | 64.06 | 3.54 |
src.200MB | 24.81 | 31.41 | 14.68 | 2.67 | 17.70 | 0.52 | 16.73 | 1.06 |
src.98MB | 7.92 | 9.52 | 5.28 | 0.77 | 5.73 | 3.94 | 5.07 | 0.75 |
src.512MB | 37.77 | 49.21 | 28.94 | 5.07 | 28.98 | 5.36 | 25.52 | 3.07 |
src.1GB | 75.48 | 99.95 | 57.99 | 8.87 | 55.36 | 9.60 | 49.52 | 6.17 |
src.2GB | 150.67 | 205.41 | 112.78 | 25.30 | 110.83 | 15.11 | 98.11 | 11.77 |
en.4.27 | 8.78 | 14.24.82 | 5.75 | 1.82 | 6.50 | 0.28 | 6.98 | 0.38 |
en.4.28 | 15.82 | 28.53 | 11.44 | 3.67 | 12.88 | 0.40 | 12.34 | 0.77 |
en.4.29 | 35.43 | 57.11 | 23.01 | 7.22 | 25.51 | 0.84 | 24.68 | 1.57 |
en.4.30 | 70.00 | 113.88 | 46.10 | 14.40 | 51.06 | 1.63 | 55.56 | 3.06 |
en.6.27 | 12.44 | 19.10 | 7.98 | 1.78 | 9.58 | 0.36 | 10.46 | 0.61 |
en.6.28 | 22.65 | 38.37 | 15.92 | 3.33 | 19.35 | 0.52 | 18.38 | 1.17 |
en.6.29 | 50.28 | 76.91 | 31.78 | 7.08 | 37.90 | 1.18 | 41.86 | 2.36 |
en.6.30 | 99.66 | 153.72 | 63.62 | 15.90 | 76.59 | 2.20 | 83.29 | 4.68 |
en.8.27 | 15.87 | 26.00 | 11.48 | 1.87 | 13.15 | 0.46 | 14.10 | 0.88 |
en.8.28 | 29.06 | 52.15 | 22.86 | 3.71 | 26.52 | 0.78 | 28.28 | 1.58 |
en.8.29 | 64.84 | 105.01 | 45.79 | 7.57 | 52.53 | 1.56 | 56.68 | 3.14 |
en.8.30 | 128.65 | 209.54 | 91.83 | 14.65 | 105.00 | 3.13 | 113.13 | 6.26 |
en.10.27 | 21.32 | 33.25 | 14.61 | 2.26 | 13.94 | 1.66 | 17.26 | 1.39 |
en.10.28 | 43.55 | 68.00 | 30.32 | 6.43 | 29.05 | 2.18 | 33.15 | 2.78 |
en.10.29 | 89.96 | 136.67 | 60.69 | 9.25 | 58.55 | 4.59 | 67.16 | 5.67 |
en.10.30 | 183.57 | 281.53 | 123.88 | 17.70 | 119.14 | 8.93 | 214.2 | 10.77 |
en.12.27 | 24.38 | 39.09 | 17.97 | 2.52 | 17.33 | 2.61 | 20.33 | 1.64 |
en.12.28 | 50.17 | 80.22 | 37.66 | 7.62 | 36.36 | 2.66 | 38.97 | 3.25 |
en.12.29 | 103.39 | 161.96 | 75.09 | 10.41 | 72.46 | 5.73 | 128.35 | 6.71 |
en.12.30 | 211.66 | 333.32 | 150.02 | 20.33 | 145.04 | 9.66 | 259.21 | 12.99 |
en.14.27 | 27.44 | 43.61 | 21.92 | 3.10 | 21.39 | 2.43 | 22.51 | 1.84 |
en.14.28 | 56.44 | 90.05 | 45.85 | 6.11 | 44.70 | 2.94 | 44.53 | 3.67 |
en.14.29 | 116.15 | 182.46 | 90.41 | 12.50 | 88.37 | 6.97 | 91.53 | 7.79 |
en.14.30 | 238.36 | 377.77 | 184.83 | 22.31 | 178.58 | 10.50 | 302.14 | 15.98 |
This work was supported by the doctoral scholarship of CONICYT (21120974) and in part by the Emerging Leaders in the Americas scholarship programme.