I am working on a GPU project and have run into a tricky problem. I want to increase the grid size and reduce the block size, because the kernels in the benchmark suite I am testing do not launch enough blocks.
For example, if the block size is 256 threads/block and I shrink it to 128, the grid size doubles automatically. However, when I run the modified CUDA code, it always fails with this error message:
CUDA error: unspecified launch failure
I also tried making the block size larger. That does not trigger the error, but the output is incorrect.
Has anybody had a similar experience? How can I manually increase the number of blocks in a kernel launch without affecting the final output? Do I need to modify the raw input data? I was told that the block size is the programmer's choice, so why can't I change it?
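For what it's worth, a common cause of both symptoms (the launch failure at a smaller block size, and wrong results at a larger one) is a kernel that silently assumes a fixed block size: a statically sized `__shared__` tile of 256 entries, index arithmetic with a hardcoded 256, or reduction logic tuned to one block width. Here is a minimal sketch of a block-size-agnostic kernel using a grid-stride loop; the kernel and variable names are illustrative, not taken from any particular benchmark:

```cuda
#include <cstdio>

// Illustrative kernel: scales a vector in place. Every index is derived
// from blockDim/gridDim, so any block size produces the same result.
__global__ void scale(float *x, int n, float a)
{
    // Grid-stride loop: correct for any launch configuration,
    // even when gridDim.x * blockDim.x != n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int block = 128;                      // halved from 256
    int grid  = (n + block - 1) / block;  // ceiling division: block count doubles
    scale<<<grid, block>>>(x, n, 2.0f);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

By contrast, if a kernel hardcodes something like `__shared__ float tile[256];` and fills it with `tile[threadIdx.x]`, halving the block size leaves half the tile uninitialized, and doubling it makes threads write past the end, which would match both the `unspecified launch failure` and the wrong output. Checking `cudaGetLastError()` right after the launch can help narrow down where it goes wrong.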
Thank you so much!