Seminario
Interdipartimentale di Algoritmica
Dipartimento di Informatica (ex-Scienze dell'Informazione) - D(S)I
via Salaria 113, piano terzo
Aula Seminari
Abstract:
We study the problem of compressing massive tables within the
partition-training paradigm introduced by Buchsbaum et al.\ [SODA'00],
in which a table is partitioned by an off-line training procedure into
disjoint intervals of columns, each of which is compressed separately
by a standard, on-line compressor like gzip. We provide a new theory
that unifies previous experimental observations on partitioning and
heuristic observations on column permutation, all of which are used to
improve compression rates. Based on the theory, we devise the first
on-line training algorithms for table compression, which can be
applied to individual files, not just continuously operating sources;
and also a new, off-line training algorithm, based on a link to the
asymmetric traveling salesman problem, which improves on prior work by
rearranging columns prior to partitioning. We demonstrate these
results experimentally. On various test files, the on-line algorithms
provide 35--55\% improvement over gzip with negligible slowdown; the
off-line reordering provides up to 20\% further improvement over
partitioning alone. We also show that a variation of the table
compression problem is MAX-SNP hard.
JOINT WORK with A.L. Buchsbaum and G. Flowlers -- AT&T Research.