Using char instead of float: performance improvement

Hi everyone,

I’m performing an image convolution operation using CUDA. Since the image pixel values are in the range [0, 255], I thought of using the char data type instead of float for storing the pixels and operating on them.

By doing this I’m getting a speedup of almost 1.5x. The kernel contains only multiplication and addition operations. I’m also using static shared memory of BLOCKSIZE*BLOCKSIZE bytes (x4 for floats), with BLOCKSIZE*BLOCKSIZE threads per block.
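For reference, the shared-memory setup described above might look roughly like this (a hypothetical sketch; the kernel name, BLOCKSIZE value, and the halo-free tiling are my assumptions, not the actual code):

```cuda
#define BLOCKSIZE 16

// Each block stages a BLOCKSIZE x BLOCKSIZE tile in shared memory.
// With unsigned char this is BLOCKSIZE*BLOCKSIZE bytes per block;
// with float it would be 4x that, reducing occupancy headroom.
__global__ void load_tile_char(const unsigned char *img, int width)
{
    __shared__ unsigned char tile[BLOCKSIZE][BLOCKSIZE];

    int x = blockIdx.x * BLOCKSIZE + threadIdx.x;
    int y = blockIdx.y * BLOCKSIZE + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = img[y * width + x];
    __syncthreads();

    // ... the convolution multiply-add arithmetic on the tile follows here
}
```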

Two questions about this speedup:

  1. Is this because the number of active blocks per multiprocessor increases, since the shared-memory requirement of each block is lower?

  2. Or is it due to some vector operations, i.e., more chars can be added at a time than floats (since more chars fit into a vector register)? Maybe I sound foolish here, because I don’t know the low-level details of instruction execution.

Or is there some other concept coming into play? (I can upload the code if you want.)

I’ve also heard that when performing operations on a char, the compiler implicitly converts it to an integer.

Please help!

As far as I can tell from the description, the kernel is not computationally limited but rather memory-bandwidth limited. It is not surprising that you would see a speedup when cutting bandwidth requirements by a factor of four (8-bit char instead of 32-bit float). You may want to try a vector type like uchar4 to further optimize memory throughput, if you do not already do that (it is not clear from the description). Note that the use of vector types increases the alignment requirements.
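A minimal sketch of the uchar4 idea, assuming the pixel buffer is 4-byte aligned and its length is a multiple of 4 (the kernel name and the per-channel operation are made up for illustration):

```cuda
// Each thread moves 4 pixels with a single 32-bit load and store,
// instead of 4 separate 8-bit transactions.
__global__ void halve_pixels(const uchar4 *src, uchar4 *dst, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 p = src[i];      // one 32-bit load fetches 4 pixels
        p.x >>= 1;              // example op: halve each pixel value
        p.y >>= 1;
        p.z >>= 1;
        p.w >>= 1;
        dst[i] = p;             // one 32-bit store writes 4 pixels
    }
}
```

Note that `src` and `dst` must be aligned to 4 bytes for uchar4 accesses to be valid, which is why alignment was mentioned above; buffers from cudaMalloc satisfy this, but pointers offset into them by a non-multiple of 4 do not.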