Optimizing Data Transfer Using Lossless Compression with NVIDIA nvcomp

Originally published at: https://developer.nvidia.com/blog/optimizing-data-transfer-using-lossless-compression-with-nvcomp/

One of the most interesting applications of compression is optimizing communications in GPU applications. GPUs are getting faster every year. For some apps, transfer rates for getting data in or out of the GPU can’t keep up with the increase in GPU memory bandwidth or computational power. Often, the bottleneck is the interconnect between GPUs…

Hello CUDA developers!
Hope you enjoyed our blog about compression, and can find time to play with the library. Let us know if you have any questions or comments. Also feel free to submit issues directly to the GitHub page!

@nsakharnykh, very interesting new feature and blog entry!

From the intro:

Often, the bottleneck is the interconnect between GPUs or between CPU and GPU.

Does this mean we can use nvcomp to compress data on the CPU and decompress it on the GPU (or vice versa)?

Currently, nvcomp provides only GPU implementations of the compressors and decompressors. Because the compression format is fully open and explained in the docs, CPU variants can be implemented outside of nvcomp. The main use case highlighted in the blog is compressing GPU-to-GPU communications, which needs only GPU-side compressors and decompressors. In the near future, we plan to improve compatibility with standard LZ4, so that existing CPU LZ4 libraries can compress on the CPU and nvcomp can decompress on the GPU; this is tracked at https://github.com/NVIDIA/nvcomp/issues/20. General CPU implementations are tracked at https://github.com/NVIDIA/nvcomp/issues/12, but they are not on our roadmap at the moment.

Impressive work! Thank you for the blog post.

Is there an overlap between the compression (or decompression) and the computation happening on the GPUs?

@Alturkestani Thanks for the great question. In this example, we are not overlapping compression/decompression with other computations/operations.

Overlapping CPU computations with compression/decompression is relatively easy, as both are implemented asynchronously in our current API, so you could initiate compression/decompression, and then perform computations on the CPU while the GPU is busy.
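
To make the launch-then-compute pattern concrete, here is a minimal host-only sketch in Python. It does not use nvcomp or a GPU: as an assumption for illustration, `zlib.compress` running in a worker thread stands in for an asynchronous GPU compression call, and `cpu_side_work` and `compress_while_computing` are hypothetical names, not part of any library.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def cpu_side_work(n):
    # Hypothetical placeholder for host work that does not need the compressed data.
    return sum(i * i for i in range(n))


def compress_while_computing(data, n):
    # Assumption: the worker thread plays the role of the GPU, and zlib plays
    # the role of the compressor. Neither is the nvcomp API.
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Launch the compression; submit() returns immediately.
        pending = pool.submit(zlib.compress, data)
        # The host carries on with unrelated work while compression runs.
        host_result = cpu_side_work(n)
        # Wait only at the point where the compressed bytes are needed.
        compressed = pending.result()
    return compressed, host_result


payload = bytes(range(256)) * 4096
compressed, host_result = compress_while_computing(payload, 1000)
```

With a real GPU the same shape would apply: enqueue the compression, do the CPU work, and synchronize only before consuming the result.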

Overlapping data transfer with compression/decompression requires splitting the data into smaller chunks, so that while one chunk is being transferred, another can be compressed/decompressed.
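
The chunking idea can be sketched on the host as a two-stage pipeline. Again this is only an illustration under stated assumptions, not nvcomp code: `zlib` stands in for the compressor, a bounded `queue.Queue` stands in for the interconnect, and `CHUNK_SIZE`, `split_chunks`, and `pipelined_send` are hypothetical names chosen for this example.

```python
import queue
import threading
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size; a real value would need tuning


def split_chunks(data, size=CHUNK_SIZE):
    # Cut the buffer into fixed-size pieces; the last one may be shorter.
    return [data[i:i + size] for i in range(0, len(data), size)]


def pipelined_send(data, size=CHUNK_SIZE):
    # Assumption: the bounded queue models the link between sender and receiver.
    link = queue.Queue(maxsize=2)
    received = []

    def receiver():
        # Receive and decompress chunks as they arrive, in order.
        while True:
            item = link.get()
            if item is None:  # sentinel: no more chunks
                break
            received.append(zlib.decompress(item))

    worker = threading.Thread(target=receiver)
    worker.start()
    for chunk in split_chunks(data, size):
        # While the receiver handles chunk i, the sender compresses chunk i + 1.
        link.put(zlib.compress(chunk))
    link.put(None)
    worker.join()
    return b"".join(received)
```

Smaller chunks allow more overlap between the stages but typically compress less well and add per-chunk overhead, so the chunk size is a trade-off.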