Is thread size limited by functions?

We are trying to run a fairly complicated financial model on a Tesla. The designed block size is 320 and the grid size is 300. When we run the program, we always get the error message “unspecified launch failure” when the program reaches the call to “cudaThreadSynchronize()”. The program only executes successfully after we reduce the grid size to 10, but then we don't get the performance we expected. Can anybody tell us whether this is a limitation of the machine or whether there is something we can do in our program? Thanks a lot.

You will never get good performance with a grid of only 10 blocks… you should aim for 100 to 1000 blocks as a very rough guide. Certainly not fewer than 50. So your goal of a grid size of 300 is just fine.

Your “unspecified launch failure” is almost certainly just a bug in your kernel… it’s failing, probably by a memory access error like reading past the end of a device array.
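To illustrate the kind of bug meant here, below is a minimal sketch rather than anything taken from the original poster's model. The kernel scaleKernel and the helpers blocksFor and scaleOnDevice are hypothetical names made up for this example. It shows two common precautions: guarding every access with an index check so that extra threads in the last block do not read or write past the end of the array, and checking for errors both right after the launch and after synchronizing. It uses cudaDeviceSynchronize(), which in current toolkits replaces the deprecated cudaThreadSynchronize().

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: scales each element of data in place.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: when n is not a multiple of the block size, the last block
    // has threads with i >= n. Without this check they access memory
    // past the end of the array.
    if (i < n) {
        data[i] *= factor;
    }
}

// Number of blocks needed to cover n elements (ceiling division).
inline int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Copies n floats to the device, scales them, and copies them back.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t scaleOnDevice(float* host, int n, float factor)
{
    if (n <= 0) return cudaErrorInvalidValue;  // a grid of 0 blocks is invalid

    float* dev = nullptr;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 320;  // block size from the question
        scaleKernel<<<blocksFor(n, threads), threads>>>(dev, n, factor);
        err = cudaGetLastError();  // reports invalid launch configurations
    }
    if (err == cudaSuccess) {
        // Errors raised while the kernel runs (such as a launch failure
        // caused by a bad memory access) are reported here.
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}
```

With 96000 elements and 320 threads per block, blocksFor gives exactly 300 blocks; with 96001 elements it gives 301, and the guard keeps the 319 surplus threads in the last block from touching memory. If scaleOnDevice returns anything other than cudaSuccess, passing the result to cudaGetErrorString() gives a readable message.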

Time to fire up cuda-gdb or Nexus to start debugging!

You could also try cuda-memcheck, since this is the Linux forum.

Just run your application with:
cuda-memcheck a.out

It will report the location of the out-of-bounds access, for example:
========= Invalid read of size 1
========= at 0x000000e8 in mycuda_kernel
========= by thread 3 in block 0
========= Address 0x00111e00 is out of bounds

========= ERROR SUMMARY: 1 errors