Does anyone know whether the configuration of the grid and blocks affects the performance of a CUDA program?
My program maps one thread to each pixel. To keep things simple, I made my blocks one-dimensional (i.e. dim3 block(1,N)). However, while trying to optimize, I switched to a 2D block (i.e. dim3 block(M,M), where M is a multiple of 16), and it turned out to run faster.
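A minimal sketch of the two launch configurations being compared might look like the following. The kernel `process`, the helper `time_launch`, the 1024x768 image, and the values N = 256 and M = 16 are all hypothetical stand-ins for illustration, not the actual program, and the row-major pixel indexing is an assumption.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: one thread handles one pixel.
// Assumes a row-major, one-byte-per-pixel image.
__global__ void process(unsigned char* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];  // placeholder work: invert
}

// Launches the kernel with the given block shape and a grid rounded up to
// cover the image. Returns the kernel time in ms, or -1.0f on any error
// or if the result is wrong.
float time_launch(dim3 block, int width = 1024, int height = 768)
{
    std::vector<unsigned char> host(static_cast<size_t>(width) * height);
    for (size_t i = 0; i < host.size(); ++i)
        host[i] = static_cast<unsigned char>(i % 251);

    unsigned char* dev = nullptr;
    if (cudaMalloc(&dev, host.size()) != cudaSuccess) return -1.0f;
    cudaMemcpy(dev, host.data(), host.size(), cudaMemcpyHostToDevice);

    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    process<<<grid, block>>>(dev, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaEventElapsedTime(&ms, start, stop) == cudaSuccess);

    // Copy back and check that every pixel was inverted exactly once.
    cudaMemcpy(host.data(), dev, host.size(), cudaMemcpyDeviceToHost);
    for (size_t i = 0; ok && i < host.size(); ++i)
        ok = (host[i] == 255 - static_cast<int>(i % 251));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ok ? ms : -1.0f;
}

int main()
{
    time_launch(dim3(16, 16));  // warm-up so one-time startup cost is not timed
    std::printf("block(1,256): %f ms\n", time_launch(dim3(1, 256)));   // 1D, N = 256
    std::printf("block(16,16): %f ms\n", time_launch(dim3(16, 16)));   // 2D, M = 16
    return 0;
}
```

Both shapes use 256 threads per block, so only the arrangement of threads within a block differs between the two runs.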
It really puzzles me how the block configuration alone can improve performance. Does it have to do with the architecture? Anyone? Thanks!