Has anybody had any success lowering the time taken for a CPU->CUDA copy by compressing on the host before the actual copy (so only compressed bytes cross the bus) and decompressing in parallel on the device?
I’ve been using nvcomp (on top of LZ4) for this, but haven’t found the right balance between the compression overhead and the speed-up in the copy itself. I was playing with the chunk size: a smaller chunk size means more compression work per tensor, but the decompression side parallelizes better.
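For anyone weighing the same trade-off, here is a back-of-the-envelope model of when the pipeline pays off: compress + copy-compressed + decompress must beat a plain raw copy. All throughput numbers and the function names below are illustrative assumptions (not nvcomp measurements), and the model is serial, so it is pessimistic if you overlap stages with CUDA streams:

```python
# Break-even model for host-side compression before a host->device copy.
# All bandwidth figures below are assumed, illustrative numbers.

def transfer_time_s(nbytes: float, bandwidth_gbps: float) -> float:
    """Seconds to move nbytes through a stage with the given throughput (GB/s)."""
    return nbytes / (bandwidth_gbps * 1e9)

def compressed_pipeline_wins(
    nbytes: float,
    ratio: float,            # compression ratio, e.g. 2.0 => half the bytes copied
    pcie_gbps: float,        # host->device copy bandwidth
    compress_gbps: float,    # host-side compression throughput
    decompress_gbps: float,  # device-side decompression throughput
) -> bool:
    """True if compress + copy-compressed + decompress beats the raw copy."""
    raw = transfer_time_s(nbytes, pcie_gbps)
    pipelined = (
        transfer_time_s(nbytes, compress_gbps)        # compress on host
        + transfer_time_s(nbytes / ratio, pcie_gbps)  # copy fewer bytes
        + transfer_time_s(nbytes, decompress_gbps)    # decompress on device
    )
    return pipelined < raw

# 1 GB tensor, 2x ratio, 25 GB/s PCIe, 5 GB/s host LZ4, 100 GB/s device decode:
# host compression dominates and the pipeline loses.
print(compressed_pipeline_wins(1e9, ratio=2.0, pcie_gbps=25,
                               compress_gbps=5, decompress_gbps=100))  # → False
```

With numbers like these, single-threaded host compression eats the copy saving, which matches what I'm seeing; the condition flips only with a much faster (multi-threaded) compressor or a much better ratio.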
I also tried several other compression algorithms, but decided to stick with nvcomp/LZ4 after finding it faster than the rest.