I’m performing an image convolution operation using CUDA. Since the image pixel values lie in [0, 255], I thought of using the char data type instead of float for storing the pixels and performing the operations.
By doing this I’m getting a speedup of almost 1.5x. The kernel contains only multiplication and addition operations. I’m also using static shared memory of BLOCKSIZE*BLOCKSIZE bytes (x4 for floats), with BLOCKSIZE*BLOCKSIZE threads per block.
Two questions about this speedup:
Is this because the number of active blocks per multiprocessor increases, since the shared memory requirement of each block is lower?
Or is it due to some vector operations, i.e., more chars can be added at a time than floats (since more chars fit into a vector)? Maybe I sound foolish here, because I don’t know the low-level details of instruction execution.
Or is there some other concept coming into play? (I can upload the code as well, if you want.)
I’ve also heard that when performing arithmetic on a char, the compiler implicitly converts it to an integer first (the usual C/C++ integer promotion rules).