I just dived into the CUDA environment on Linux, read the tutorial and understood the matrix multiplication example. In that example, the dimensions of the resultant matrix C are multiples of 16, to match the block size and keep things simple. In my project I need to process an image and obtain an image whose size is not a multiple of 16. For example, the width of the resultant image is 71. Therefore I can only fit four blocks of 16, and there will be 7 pixels left to process. So my question is, how should I define the block size? Can I make it something other than 16? I guess 16 is like a magic number for getting the most out of the GPU. If I have to use a 16x16 block size, how should I process the remaining pixels?
Thanks in advance for the reply.
16 is not really a magic number. Performance depends on laying out the blocks sensibly, so that warps diverge as little as possible and memory bank accesses do not collide. See this forum for more on these topics, and check the occupancy calculator.
For “odd” sizes, launching enough blocks to cover the whole image and leaving the excess threads idle probably has the least impact on performance.
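To make the idle-thread idea concrete, here is a minimal sketch, not a definitive implementation. It rounds the grid size up so that 16x16 blocks cover a 71-pixel-wide image, and lets every thread that falls outside the image return immediately. The kernel `invertKernel`, the helper `blocksNeeded`, the height of 50 and the 8-bit grayscale layout are all assumptions made for illustration; only the width of 71 and the 16x16 block come from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: blocks needed to cover n elements (ceiling division).
int blocksNeeded(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical per-pixel kernel: inverts an 8-bit grayscale image.
__global__ void invertKernel(const unsigned char* in, unsigned char* out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads past the right or bottom edge have no pixel; they stay idle.
    if (x >= width || y >= height) return;

    int i = y * width + x;
    out[i] = static_cast<unsigned char>(255 - in[i]);
}

int main()
{
    const int width = 71;   // from the question
    const int height = 50;  // assumed for this example
    const size_t bytes = static_cast<size_t>(width) * height;

    std::vector<unsigned char> hostIn(bytes, 10), hostOut(bytes, 0);

    unsigned char* devIn = nullptr;
    unsigned char* devOut = nullptr;
    if (cudaMalloc(&devIn, bytes) != cudaSuccess ||
        cudaMalloc(&devOut, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(devIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    // Round the grid up: the last block in each row is only partly used.
    dim3 block(16, 16);
    dim3 grid(blocksNeeded(width, block.x), blocksNeeded(height, block.y));
    invertKernel<<<grid, block>>>(devIn, devOut, width, height);

    // This copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hostOut.data(), devOut, bytes,
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("grid = %u x %u blocks\n", grid.x, grid.y);

    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}
```

For a width of 71, the ceiling division gives 5 blocks per row, i.e. 80 threads, so 9 threads per row fail the bounds check and do nothing. Those idle threads sit together at the edge of the image, so only the blocks along the right and bottom borders are affected.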