using vectors in GPU kernel

Dear Experts,

I just dove into the CUDA environment on Linux, read the tutorial, and understood the matrix multiplication example. In that example, the dimensions of the result matrix C are multiples of 16, which matches the block size and keeps things simple. In my project I need to process an image and produce an output image whose size is not a multiple of 16. For example, the width of the resulting image is 71, so I can only fit four blocks of 16 and there will be 7 pixels left to process. So my question is: how should I define the block size? Can I make it something other than 16? I guess 16 is like a magic number for getting the most out of the GPU. If I have to use a 16x16 block size, how should I process the remaining pixels?
Thanks in advance for the reply.

16 is not really a magic number. Performance depends on a sensible block layout that keeps warp divergence low and avoids shared-memory bank conflicts. See this forum about these topics and check the occupancy calculator.

For “odd” sizes, leaving excess threads idle probably has the least impact on performance.
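To make the "idle threads" idea concrete, here is a minimal sketch in plain C++ that imitates a guarded launch on the CPU, so it can be tried without a GPU. The names blocksFor and processImage are made up for illustration, and the "+1 per pixel" is placeholder work. In a real kernel the four loops disappear: x and y come from blockIdx, blockDim and threadIdx, and the out-of-range test simply returns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blocks needed to cover n elements with blocks of blockSize: ceiling division.
int blocksFor(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// CPU stand-in for a guarded 2D launch (hypothetical helper, not CUDA API).
// The loops over (by, bx) and (ty, tx) play the roles of blockIdx and threadIdx.
// Adds 1 to every pixel as placeholder work and returns how many "threads"
// fell outside the image and therefore did nothing.
int processImage(std::vector<int>& image, int width, int height, int blockSize = 16) {
    assert(image.size() == static_cast<std::size_t>(width) * height);
    const int gridX = blocksFor(width, blockSize);   // grid rounded up in x
    const int gridY = blocksFor(height, blockSize);  // grid rounded up in y
    int idle = 0;
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockSize; ++ty)
                for (int tx = 0; tx < blockSize; ++tx) {
                    const int x = bx * blockSize + tx;  // global column index
                    const int y = by * blockSize + ty;  // global row index
                    // Bounds guard: excess threads in the edge blocks stay idle.
                    if (x >= width || y >= height) { ++idle; continue; }
                    image[static_cast<std::size_t>(y) * width + x] += 1;
                }
    return idle;
}
```

For a width of 71 and 16-wide blocks, blocksFor gives 5 blocks per row: four full ones plus one edge block in which 7 threads do work and 9 stay idle.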


Just allocate more memory (rounded up to a multiple of 16) and fill the unused elements with zeroes.

In a matrix multiplication, the zero values will not affect the result matrix.
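A rough host-side sketch of that padding step, in plain C++, assuming a row-major float image. roundUp and padWithZeros are hypothetical helper names, not part of CUDA. The padded buffer is what you would then copy to the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of `multiple` (e.g. 71 -> 80 for 16).
int roundUp(int n, int multiple) {
    assert(n >= 0 && multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

// Copy a row-major width x height image into a zero-filled buffer whose
// dimensions are both rounded up to a multiple of `multiple`.
std::vector<float> padWithZeros(const std::vector<float>& src,
                                int width, int height, int multiple = 16) {
    assert(src.size() == static_cast<std::size_t>(width) * height);
    const int paddedW = roundUp(width, multiple);
    const int paddedH = roundUp(height, multiple);
    // Every element starts at zero; only the original region is overwritten.
    std::vector<float> dst(static_cast<std::size_t>(paddedW) * paddedH, 0.0f);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[static_cast<std::size_t>(y) * paddedW + x] =
                src[static_cast<std::size_t>(y) * width + x];
    return dst;
}
```

Note that the row stride changes from width to the padded width, so the kernel has to index with the padded width, and the result has to be cropped back to the original size when it is copied out.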

I had actually thought of that; I was wondering whether there were any other solutions. I guess I will do it that way.

Thanks for the reply.