Hi,
Does anyone know whether the grid and block configuration affects the performance of a CUDA program?
My program maps one thread to each pixel. To keep things simple, I made the block one-dimensional (i.e. dim3 block(1,N)). However, while optimizing, I tried a 2D block (i.e. dim3 block(M,M), where M is a multiple of 16), and it turned out to run faster.
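To make the two configurations concrete, here is a small host-side C++ sketch (not my actual kernel) of how I understand each block shape maps the threads of the first warp onto pixels. It assumes a row-major image of 4-byte floats whose base address is segment-aligned, the usual mapping of threadIdx.x to the column and threadIdx.y to the row, warps of 32 threads numbered with x varying fastest, and 128-byte memory segments. The function name segmentsTouchedByFirstWarp and the image width are made up for illustration. If this model is right it might be part of the explanation, but I am not sure.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed hardware parameters for this sketch (not queried from a device).
constexpr int kWarpSize = 32;              // threads per warp
constexpr std::size_t kSegmentBytes = 128; // assumed memory segment size

// Hypothetical helper: counts the distinct 128-byte segments that the first
// warp of block (0,0) would touch if each thread reads one float pixel at
// image[row * imageWidth + col], with col = threadIdx.x and row = threadIdx.y.
std::size_t segmentsTouchedByFirstWarp(int blockDimX, int blockDimY, int imageWidth) {
    assert(blockDimX > 0 && blockDimY > 0 && imageWidth > 0);

    std::set<std::size_t> segments;
    const int threadsInBlock = blockDimX * blockDimY;

    for (int lane = 0; lane < kWarpSize && lane < threadsInBlock; ++lane) {
        // Linear thread ID within a block is threadIdx.x + threadIdx.y * blockDim.x,
        // so the first warp holds linear IDs 0..31.
        const int tx = lane % blockDimX;   // threadIdx.x -> column
        const int ty = lane / blockDimX;   // threadIdx.y -> row

        const std::size_t pixel =
            static_cast<std::size_t>(ty) * static_cast<std::size_t>(imageWidth) +
            static_cast<std::size_t>(tx);
        const std::size_t byteOffset = pixel * sizeof(float);

        segments.insert(byteOffset / kSegmentBytes);
    }
    return segments.size();
}
```

With a hypothetical 1024-pixel-wide image, segmentsTouchedByFirstWarp(1, 256, 1024) returns 32, while segmentsTouchedByFirstWarp(16, 16, 1024) returns 2. So in this model the block(1,N) layout spreads one warp over 32 separate rows, whereas block(16,16) keeps it within two short runs of neighbouring pixels.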
It really puzzles me how the block configuration improves performance. Does it have to do with the architecture? Anyone? Thanks!