Title: Compressing Multisets with Large Alphabets

Abstract: Current methods that optimally compress multisets are not suitable for high-dimensional symbols, as their compute time scales linearly with alphabet size. Compressing a multiset as an ordered sequence with off-the-shelf codecs is computationally more efficient, but has a sub-optimal compression rate, as bits are wasted encoding the order of the symbols. We present a method that can recover those bits, assuming symbols are i.i.d., at the cost of an additional $\mathcal{O}(|\mathcal{M}|\log M)$ in average time complexity, where $|\mathcal{M}|$ and $M$ are, respectively, the total number of symbols and the number of unique symbols in the multiset. Our method is compatible with any prefix-free code. Experiments show that, when paired with efficient coders, our method can efficiently compress high-dimensional sources such as multisets of images and collections of JSON files.
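
To make the "wasted bits" concrete: when symbols are i.i.d., every ordering of a given multiset is equally likely, so the order carries $\log_2\left(|\mathcal{M}|! / \prod_i m_i!\right)$ bits, where $m_i$ is the multiplicity of the $i$-th unique symbol. The sketch below only computes that quantity; it is an illustration and not the method of the paper, and the function name `order_bits` is a hypothetical helper introduced here.

```python
from collections import Counter
from math import lgamma, log


def order_bits(symbols):
    """Bits needed to single out one ordering of a multiset.

    Computes log2(n! / prod(m_i!)), where n is the total number of
    symbols and m_i are the multiplicities of the unique symbols.
    Hypothetical helper for illustration only.
    """
    counts = Counter(symbols)
    n = sum(counts.values())
    # lgamma(k + 1) == ln(k!), which avoids huge factorials for large n.
    log_perms = lgamma(n + 1) - sum(lgamma(m + 1) for m in counts.values())
    return log_perms / log(2)  # convert nats to bits


if __name__ == "__main__":
    # A multiset of 1000 distinct symbols: all 1000! orderings are distinct.
    print(order_bits(range(1000)))
    # A multiset with one repeated symbol has a single ordering.
    print(order_bits("aaaa"))
```

The symbols only need to be hashable, so the same helper applies to strings, tuples, or byte strings standing in for larger objects.
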
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Cite as: arXiv:2107.09202 [cs.IT]
  (or arXiv:2107.09202v1 [cs.IT] for this version)

Submission history

From: Daniel Severo
[v1] Thu, 15 Jul 2021 16:54:38 GMT (559kb,D)
