Grid and block launch configuration


I would like to ask whether the configuration of the grid and block affects the performance of a CUDA program.

My program maps one thread to each pixel. To keep things simple, I made my blocks 1D (i.e. dim3 block(1,N)). However, while trying to optimize, I made them 2D (i.e. dim3 block(M,M), where M is a multiple of 16), and it turned out to run faster.
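For concreteness, the two launch shapes described above could be sketched as follows. This is only a minimal sketch, not the actual program: the kernel `scale_pixels`, the helpers `grid_for` and `time_launch`, and the row-major image layout are all assumptions made for illustration. One thing it makes visible: CUDA numbers the threads of a block with threadIdx.x varying fastest, so with block(1,N) the 32 threads of a warp land on 32 different rows, while with block(16,16) a warp covers two rows of 16 adjacent pixels. Under the assumed row-major layout, that difference in memory access pattern might account for part of the speedup.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel kernel (an assumption, not the poster's code):
// one thread handles one pixel of a row-major image.
__global__ void scale_pixels(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;   // neighbors in x are neighbors in memory
        out[idx] = 2.0f * in[idx];
    }
}

// Grid that covers the whole image for a given block shape (ceiling division).
static dim3 grid_for(dim3 block, int width, int height)
{
    return dim3((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y);
}

// Launches the kernel with the given block shape and returns the elapsed
// time in milliseconds, measured with CUDA events, or -1 on error.
static float time_launch(dim3 block, const float* d_in, float* d_out,
                         int width, int height)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 grid = grid_for(block, width, height);
    cudaEventRecord(start);
    scale_pixels<<<grid, block>>>(d_in, d_out, width, height);
    cudaError_t launch_status = cudaGetLastError();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (launch_status == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling `time_launch(dim3(1, 256), ...)` and `time_launch(dim3(16, 16), ...)` on the same device buffers would give a rough side-by-side comparison of the two shapes; the first call in a process usually includes warm-up overhead, so it is worth repeating each measurement.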

It really puzzles me how the launch configuration improves the performance. Does it have to do with the architecture? Anyone? Thanks!

Maybe it is just your code. Could you share the kernel as well?

Without the code (or at least a description of how you access the data and what you do with it), we can only give you some hints.

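- Block size can affect the occupancy (the number of blocks that can be resident on a multiprocessor at the same time).
- If you read the data from a texture, it is cached and prefetched by 2D neighborhood, so a 2D block whose threads touch neighboring pixels can make better use of that cache.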
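To check the occupancy point for a particular block shape, the runtime can be asked how many blocks of a kernel fit on one multiprocessor. The sketch below is a hedged illustration: `scale_pixels` is a hypothetical stand-in for the real kernel, `occupancy_for` is a helper invented here, and device 0 with no dynamic shared memory is assumed. Only `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and `cudaGetDeviceProperties` are actual CUDA runtime calls.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-pixel kernel (an assumption).
__global__ void scale_pixels(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        out[y * width + x] = 2.0f * in[y * width + x];
    }
}

// Returns the theoretical occupancy (resident warps divided by the maximum
// number of warps per multiprocessor) for the given block shape on device 0,
// or -1 if a runtime call fails.
static float occupancy_for(dim3 block)
{
    int threads = (int)(block.x * block.y * block.z);

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, scale_pixels, threads, 0) != cudaSuccess) {
        return -1.0f;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return -1.0f;
    }

    int warps_per_block = (threads + prop.warpSize - 1) / prop.warpSize;
    int max_warps_per_sm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (float)(blocks_per_sm * warps_per_block) / (float)max_warps_per_sm;
}
```

Comparing `occupancy_for(dim3(1, N))` with `occupancy_for(dim3(16, 16))` for the N actually used would show whether the two shapes differ in occupancy at all. If they have the same thread count, the theoretical occupancy should come out the same, which would point toward the memory access pattern rather than occupancy.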

Those are the main aspects that come to mind.