bsc - program for data compression with CUDA acceleration

Hi Everyone,

Just wanted to let you know that I have finished a new major release of my high-performance block-sorting data compression library:

    GPU acceleration using NVIDIA CUDA technology.

    The source code is available under LGPL license.

    64 bit and multi-core systems support.

    Highly optimized code and C++ interface for superior performance.

    Adjustable block size and multiple algorithms that allow fine-tuning for maximum speed or compression efficiency.

    CRC-32 calculation routine for data integrity verification.

    In-place compression and decompression to save memory.

You can download the latest version from my web site http://libbsc.com/default.aspx; the source code is available in the GitHub repository https://github.com/IlyaGrebnov/libbsc.

Thanks,

Ilya Grebnov

Hi Everyone,

bsc is currently being discussed in the encode.ru forum http://encode.ru/threads/586-bsc-new-block-sorting-compressor/page7, where people are reporting compression of the 1 GB enwik9 test file in 8 seconds. This is about 125 MB/s on an Intel Core i7-2600K with an Nvidia GeForce 560 Ti. I am very impressed. On my Core 2 Duo E8400 with an Nvidia 8800 GTS 512, compression takes 30 seconds.

Thanks,
Ilya

Congratulations on this work!

I’ve been hoping that someone would take on the data-parallel compression problem for years.

I have a few questions about your implementation:

  1. I noticed that your code makes heavy use of Duane Merrill's fast radix sort. Do you have a sense of how much time you spend in the radix sort kernel?
  2. Am I correct that you use a variation of the BWT, and that you implement the rotations with byte permutes and moves in shared memory?
  1. I did not measure the time spent in the CUDA kernels (100-300 ms, I assume), because it is insignificant compared to the transfer and the post coder. Most of the time is spent in context modeling and entropy coding.

  2. Yes, this variant of the block-sorting algorithm is known as the Schindler Transform, or Sort Transform (ST). ST sorts contexts only up to a specific order; you can think of BWT as ST with unlimited order. The forward transform is very simple and can be easily implemented on CUDA. The reverse transform is harder, because you need to reconstruct all the contexts, unlike with BWT, but I think it is a good tradeoff.
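To make the ST/BWT relationship concrete, here is a minimal CPU sketch in Python (not the actual libbsc CUDA kernels; function names are mine, for illustration only). The forward Sort Transform stably sorts all cyclic rotations of the block by only their first `order` bytes and emits the byte preceding each rotation; with `order` equal to the block length, it degenerates into the full BWT.

```python
def forward_st(data: bytes, order: int) -> bytes:
    """Forward Sort Transform of the given order: stable-sort all cyclic
    rotations by their first `order` bytes only, then emit the byte that
    precedes each rotation (i.e. the rotation's last byte)."""
    n = len(data)
    # Key for rotation starting at i: its first `order` bytes, taken cyclically.
    key = lambda i: bytes(data[(i + j) % n] for j in range(order))
    rotations = sorted(range(n), key=key)  # Python's sort is stable
    return bytes(data[(i - 1) % n] for i in rotations)

def bwt(data: bytes) -> bytes:
    """BWT is the limiting case: ST with order equal to the block length."""
    return forward_st(data, len(data))

print(forward_st(b"banana", 2))  # order-2 ST
print(bwt(b"banana"))            # full BWT
```

Because the forward ST only needs a fixed-depth radix-style sort of the rotation keys rather than a full suffix sort, it maps naturally onto a GPU radix sort, which matches the implementation strategy described above.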