I have recently been posting on the general CUDA forum when I think I should have been posting on this one. This post is about emulation and GPU execution producing different output.
Having added the following line to my code after the call to the kernel
printf("\n\n%s\n", cudaGetErrorString(cudaGetLastError()));
the following is output
“too many resources requested for launch”
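For anyone hitting the same thing, here is a minimal sketch of how that check could be wrapped so that launch errors and execution errors are reported separately. The names `dummyKernel` and `checkCuda` are hypothetical and are not from my real code. It also assumes a toolkit that has `cudaDeviceSynchronize` (older toolkits used `cudaThreadSynchronize` instead). As far as I know, the message above corresponds to the error code `cudaErrorLaunchOutOfResources`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to launch.
__global__ void dummyKernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Returns true when err is cudaSuccess; otherwise prints the error
// string together with a label saying where it was seen.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    int *d_out = NULL;
    if (!checkCuda(cudaMalloc((void **)&d_out, 128 * sizeof(int)), "cudaMalloc"))
        return 1;

    dummyKernel<<<1, 128>>>(d_out);

    // Launch-configuration failures such as "too many resources requested
    // for launch" are reported here, right after the launch.
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");

    // Errors raised while the kernel is running only show up after a sync.
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

When the launch itself fails, the kernel never runs, so the output buffer keeps whatever it held before. That alone could make GPU results differ from emulation.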
Rispek’ to E.D. Riedijk for nailing that!
Appendix A.1.1 lists the restrictions on memory, threads, registers, etc.
I have calculated the amount of memory I am using (all in global at the moment but that will change as soon as I get this problem sorted out) and I am well within the max for global memory.
Am I using too many register variables? By my count I have 74 register variables per thread in total, spread over one kernel and nine device functions, the largest of which has 25. According to A.1.1 there are 8192 registers in total. If all 74 were live at once on 128 threads, that would require 74 × 128 = 9472 registers, which is over the limit (the largest function on its own would need only 25 × 128 = 3200). However, I do not get an insufficient-resource error when I execute on 128 threads, i.e. 4 warps.
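To make that arithmetic concrete, here is a rough sketch that repeats the hand calculation and then asks the runtime what the compiler actually assigned. The kernel `myKernel` and the helper `registersNeeded` are hypothetical stand-ins. It assumes a toolkit recent enough to provide `cudaFuncGetAttributes`. Counting declared variables is only a guess on my part, since the compiler decides the real per-thread register count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void myKernel(float *data)
{
    data[threadIdx.x] *= 2.0f;
}

// Registers one block needs if every thread uses regsPerThread registers.
// Allocation granularity is ignored, so this is a lower bound.
static int registersNeeded(int regsPerThread, int threadsPerBlock)
{
    return regsPerThread * threadsPerBlock;
}

int main()
{
    // The two hand counts: all 74 variables, or only the largest function's 25.
    printf("74 regs x 128 threads = %d\n", registersNeeded(74, 128));
    printf("25 regs x 128 threads = %d\n", registersNeeded(25, 128));

    // Per-thread register count the compiler actually assigned to the kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Register budget of device 0.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int threads = 128;
    printf("compiler-assigned registers per thread: %d\n", attr.numRegs);
    printf("registers for %d threads: %d (device limit per block: %d)\n",
           threads, registersNeeded(attr.numRegs, threads), prop.regsPerBlock);
    printf("largest block this kernel can launch with: %d threads\n",
           attr.maxThreadsPerBlock);
    return 0;
}
```

Compiling with `nvcc --ptxas-options=-v` should also print the per-kernel register and memory usage at build time, which can be compared against the limits in A.1.1.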
So ???
Does CUDA report or give a clue as to which resource or resource type is insufficient?