CPU->CUDA copy with the help of compression

Has anybody had any success reducing the time taken for a CPU->CUDA copy by compressing on the host before the actual copy (so that only the compressed bytes are copied) and then decompressing in parallel on the device?

I’ve been using nvcomp (with its LZ4 compressor) for this, but haven’t had any success finding the right balance between the compression overhead and the speedup in the copy itself. I have been playing with the chunk size: the smaller the chunk, the more chunks you have to compress for each tensor, but the better you can parallelize the decompression.
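To make the trade-off concrete, here is a rough back-of-the-envelope model of the serial (non-overlapped) case. It is only a sketch: the function names and the throughput figures in the usage lines are hypothetical, not measurements of nvcomp or of any particular hardware.

```python
def transfer_times(size_bytes, link_gbps, ratio, host_compress_gbps, gpu_decompress_gbps):
    """Return (plain_seconds, compressed_seconds) for one serial transfer.

    All throughputs are in GB/s of *uncompressed* data, except link_gbps,
    which is the host-to-device copy bandwidth. Stages are assumed not to overlap.
    """
    gb = size_bytes / 1e9
    plain = gb / link_gbps
    compressed = (
        gb / host_compress_gbps      # compress on the host
        + (gb / ratio) / link_gbps   # copy the smaller payload
        + gb / gpu_decompress_gbps   # decompress on the device
    )
    return plain, compressed


def break_even_ratio(link_gbps, host_compress_gbps, gpu_decompress_gbps):
    """Smallest compression ratio at which the compressed path ties with a plain copy.

    Returns None when no ratio can win, i.e. when compression plus decompression
    alone already take longer than the plain copy.
    """
    # plain == compressed  =>  1/(r*B) = 1/B - 1/C - 1/D
    slack = 1 / link_gbps - 1 / host_compress_gbps - 1 / gpu_decompress_gbps
    if slack <= 0:
        return None
    return 1 / (link_gbps * slack)


# Hypothetical numbers: 1 GB tensor, 10 GB/s link, 2x ratio,
# 5 GB/s host compression, 100 GB/s device decompression.
plain, compressed = transfer_times(1e9, 10, 2, 5, 100)
print(f"plain: {plain:.3f} s, compressed: {compressed:.3f} s")
print("break-even ratio:", break_even_ratio(10, 5, 100))
```

Under this model the compressed path cannot win at any ratio unless the host compresses faster than the link copies, which may explain why the balance is hard to find. Overlapping the stages in a pipeline changes the totals, but the slowest stage still sets the throughput.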

I also tried several other compression algorithms, but decided to stick with nvcomp/LZ4 since I found it to be faster than the rest.
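When comparing codecs and chunk sizes, it can help to measure the compression ratio per configuration on representative data before involving the GPU at all. The sketch below uses Python's standard-library zlib purely as a stand-in (it is not LZ4 and not nvcomp), and `chunked_ratio` is a hypothetical helper name. It shows how compressing each chunk independently tends to lower the ratio as chunks shrink, because every chunk carries its own overhead and cannot reference data in other chunks.

```python
import zlib


def chunked_ratio(data: bytes, chunk_size: int, level: int = 1) -> float:
    """Compress data in independent chunks and return the overall compression ratio."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    compressed = [zlib.compress(chunk, level) for chunk in chunks]
    # Sanity check: every chunk must decompress independently to the original bytes.
    assert b"".join(zlib.decompress(c) for c in compressed) == data
    return len(data) / sum(len(c) for c in compressed)


# Synthetic 1 MiB buffer with a repeating 256-byte pattern (illustration only;
# real tensors will behave differently).
data = bytes(range(256)) * 4096
for size in (256, 4096, 65536):
    print(f"chunk size {size:6d}: ratio {chunked_ratio(data, size):.2f}")
```

The same loop can be pointed at real tensor bytes and a different codec to see where the ratio stops improving as the chunk size grows.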

nvcomp didn’t ring a bell for me; I assume this is about the compression described in this article on the NVIDIA developer blog:

https://developer.nvidia.com/blog/optimizing-data-transfer-using-lossless-compression-with-nvcomp/

Are you looking for generic on-the-fly compression, or do you have a specific use case in mind? I recall coming across various publications about application-specific on-the-fly compression schemes for GPUs over the past decade or so, which should be easy to find with Google Scholar.