Device Query and the Programming Reference say that my 9600GT has a maximum block size of 512, but if I set it anywhere above 64 my program doesn’t terminate, whereas with a block size of 64 it finishes in about 1 second. I’m not sure whether it’s not working at all or just running really slowly, or why. Please help!
Hardware limit for block size is 512.
However, this doesn’t mean that you can always have that number of threads in a block: you have to make sure your kernel doesn’t run out of resources. Inspect your .cubin file (use the --keep or --ptxas-options=-v option for nvcc) and find out how many registers and how many bytes of shared memory your kernel requires. Keeping in mind that there are 8192 registers per block and slightly less than 16 KB of shared memory per block, you can then calculate the maximum number of threads you can run in a block.
-cubin output is:
ptxas info : 64 bytes lmem, 20 bytes smem, 1242688 bytes cmem, 65 registers
Surely that’s not too high?
65 regs means you can have blocks of up to floor(8192/65)=126 threads.
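If you’d rather not hard-code the 8192, here is a minimal sketch of the same arithmetic done at run time. The helper names (maxThreadsByRegisters, printBlockLimit) are made up for illustration; regsPerBlock and maxThreadsPerBlock are fields of cudaDeviceProp. Treat the result as an upper bound only: as far as I know the hardware hands out registers in coarser units than one thread at a time, so the real limit can be lower.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads per block from the register
// budget alone. Integer division truncates, i.e. floor() for positive values.
static int maxThreadsByRegisters(int regsPerBlock, int regsPerThread)
{
    return regsPerBlock / regsPerThread;
}

// Hypothetical helper: queries the device and prints the resulting bound.
// regsPerThread is the "registers" figure reported by ptxas for the kernel.
static bool printBlockLimit(int device, int regsPerThread)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "could not query device %d\n", device);
        return false;
    }

    int byRegs = maxThreadsByRegisters(prop.regsPerBlock, regsPerThread);
    // The register bound can never exceed the hardware block-size limit.
    int limit = byRegs < prop.maxThreadsPerBlock ? byRegs
                                                 : prop.maxThreadsPerBlock;

    printf("regsPerBlock=%d maxThreadsPerBlock=%d -> at most %d threads\n",
           prop.regsPerBlock, prop.maxThreadsPerBlock, limit);
    return true;
}
```

Calling printBlockLimit(0, 65) from main() (device 0 is an assumption) should report the same bound as the hand calculation when regsPerBlock is 8192.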
BTW, what’s the return code of the failing kernel (you can obtain it by calling cudaThreadSynchronize() after the kernel launch)?
You may also have some problems with reading/writing past the end of allocated memory…
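A rough sketch of that kind of check, in case it helps. myKernel, cudaOk and launchAndCheck are placeholder names, not anything from your code, and I’m using cudaDeviceSynchronize(), which is the name newer toolkits use for cudaThreadSynchronize(); swap it back if your toolkit only has the older call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true on cudaSuccess, otherwise prints
// the runtime's error string and returns false.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Placeholder kernel standing in for the real one.
__global__ void myKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

// Launches the placeholder kernel and checks both stages.
// d_data must hold at least blocks * threadsPerBlock floats.
static bool launchAndCheck(float *d_data, int blocks, int threadsPerBlock)
{
    myKernel<<<blocks, threadsPerBlock>>>(d_data);

    // Problems with the launch itself (bad configuration, not enough
    // registers or shared memory) are reported here.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return false;

    // Errors raised while the kernel runs (e.g. out-of-bounds accesses)
    // only surface once the host waits for the kernel to finish.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

If the block is too big for the kernel’s register usage, the first check is the one that should fire.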
The kernel is not returning at all when the block size is greater than 64. I want to add some debugging output to see why, but I need to add -deviceemu to the build command and don’t know where to put it. I’m using Visual Studio and my current build command is:
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -o $(ConfigurationName)\$(InputName).obj $(InputFileName)
When I put it in place of -ccbin I get an error.
It works fine in emulation mode but does not return in normal mode, and I can’t get any error codes anywhere.
Got an error from the kernel launch:
error: too many resources requested for launch
That’s exactly what I’ve been talking about. Your kernel requires too many registers and/or too much shared memory to launch.