I have just dived into CUDA on Linux, read the tutorial, and understood the matrix multiplication example. In that example, the dimensions of the result matrix C are multiples of 16, which matches the block size and keeps things simple. In my project I need to process an image and produce an output image whose dimensions are not multiples of 16. For example, the width of the output image is 71, so I can fit only four 16-pixel blocks across it, leaving 7 pixels unprocessed. So my question is: how should I define the block size? Can I make it something other than 16? My guess is that 16 is a kind of magic number for getting the most out of the GPU. If I have to use a 16x16 block size, how should I process the remaining pixels?
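To make the arithmetic concrete, here is a small host-side sketch of the width-71 case. The helper names (`fullBlocks`, `leftoverPixels`, `blocksToCover`, `inBounds`) are made up for illustration and are not part of the CUDA API; the only value taken from the tutorial is the block edge of 16.

```cpp
// Hypothetical helpers for illustration only; none of these names come from CUDA.
constexpr int kBlockDim = 16;  // block edge length used in the tutorial example

// Number of complete 16-wide blocks that fit across `width` pixels.
constexpr int fullBlocks(int width) { return width / kBlockDim; }

// Pixels left over once the complete blocks are accounted for.
constexpr int leftoverPixels(int width) { return width % kBlockDim; }

// Number of blocks needed to cover every pixel, rounding up.
constexpr int blocksToCover(int width) {
    return (width + kBlockDim - 1) / kBlockDim;
}

// With a rounded-up block count, a thread at global column `x`
// corresponds to a real pixel only when x is inside the image.
constexpr bool inBounds(int x, int width) { return x < width; }
```

For a width of 71 this gives 4 full blocks and 7 leftover pixels. Rounding up to 5 blocks would cover all 71 columns but leave 9 threads in the last block with no pixel to work on, so I assume those threads would need some kind of bounds check such as `inBounds`. Whether that is the right way to do it is part of what I am asking.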
Thanks in advance for the reply.