I ran into a strange limit: a CUDA global function does not get executed if too many parameters are passed to it. Having a limit is reasonable, since the parameter list cannot have infinite space. But I didn’t expect it to be so small, and the launch does not return any error (just success) to help me identify the problem. Furthermore, nvcc still generates the non-runnable code without any warning. It took me several hours to track this down, and honestly it’s pretty depressing to hit such a bug or limit.
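For anyone trying to detect this kind of silent failure, here is a minimal sketch of the check I would use. It does not rely on the returned status alone, since in my case that was success; it also has the kernel overwrite a sentinel value so the host can see whether the kernel body really ran. The names `markRan` and `kernelReallyRan` are made up for illustration, and `cudaDeviceSynchronize()` assumes a newer toolkit (on a release as old as the 2.0 beta the equivalent call is `cudaThreadSynchronize()`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: overwrites a flag so the host can tell that it ran.
__global__ void markRan(int *flag)
{
    *flag = 1;
}

// Returns true only if the launch reported no error AND the kernel
// actually wrote to device memory.
bool kernelReallyRan()
{
    int *d_flag = 0;
    int h_flag = 0;  // sentinel: stays 0 if the kernel never executes

    if (cudaMalloc((void **)&d_flag, sizeof(int)) != cudaSuccess)
        return false;
    cudaMemcpy(d_flag, &h_flag, sizeof(int), cudaMemcpyHostToDevice);

    markRan<<<1, 1>>>(d_flag);

    // Launch-time errors are reported by cudaGetLastError(),
    // not by the launch statement itself.
    cudaError_t launchErr = cudaGetLastError();
    // Launches are asynchronous; wait so that execution errors surface too.
    cudaError_t syncErr = cudaDeviceSynchronize();

    if (launchErr != cudaSuccess)
        fprintf(stderr, "launch error: %s\n", cudaGetErrorString(launchErr));
    if (syncErr != cudaSuccess)
        fprintf(stderr, "execution error: %s\n", cudaGetErrorString(syncErr));

    // Read the sentinel back: 1 means the kernel body really executed.
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);

    return launchErr == cudaSuccess && syncErr == cudaSuccess && h_flag == 1;
}
```

Calling `kernelReallyRan()` from `main` and printing the result is enough to try it. The sentinel check is what would have saved me time here, because it does not depend on the runtime reporting the problem.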
In my experiments, the size limit of the parameter list is only 256 bytes. Since each device pointer takes 8 bytes on a 64-bit system, I can pass at most 32 device pointers to a global function (fewer if there are any other arguments), which is very limiting. I couldn’t find this information in the programming guide, so I would like to suggest that you include it.
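One way around the limit, sketched below, is to put all the pointers into a struct that lives in device memory and pass the kernel a single pointer to it. This is only an illustration under my own assumptions: `KernelArgs`, `touchAll`, `runPackedExample` and the count of 40 buffers are made-up names and values, and error checking is left out to keep it short. With 40 pointers at 8 bytes each, the arguments would take 320 bytes if passed directly, which is over the 256 bytes I observed.

```cuda
#include <cuda_runtime.h>

// 40 pointers * 8 bytes = 320 bytes on a 64-bit system.
#define NUM_BUFFERS 40

// All buffer pointers grouped in one struct that is copied to device memory.
struct KernelArgs {
    float *buf[NUM_BUFFERS];
};

// The kernel receives one pointer (plus an int) instead of 40 pointers.
__global__ void touchAll(const KernelArgs *args, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int b = 0; b < NUM_BUFFERS; ++b)
            args->buf[b][i] = (float)b;  // buffer b is filled with the value b
    }
}

// Allocates the buffers, runs the kernel, and returns the sum of
// element 0 of every buffer. Error checking omitted for brevity.
float runPackedExample(int n)
{
    KernelArgs h_args;
    for (int b = 0; b < NUM_BUFFERS; ++b)
        cudaMalloc((void **)&h_args.buf[b], n * sizeof(float));

    // Copy the struct of device pointers into device memory.
    KernelArgs *d_args = 0;
    cudaMalloc((void **)&d_args, sizeof(KernelArgs));
    cudaMemcpy(d_args, &h_args, sizeof(KernelArgs), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    touchAll<<<blocks, threads>>>(d_args, n);
    cudaDeviceSynchronize();

    float sum = 0.0f;
    for (int b = 0; b < NUM_BUFFERS; ++b) {
        float v = 0.0f;
        cudaMemcpy(&v, h_args.buf[b], sizeof(float), cudaMemcpyDeviceToHost);
        sum += v;
        cudaFree(h_args.buf[b]);
    }
    cudaFree(d_args);
    return sum;
}
```

If the kernel runs, `runPackedExample(256)` should return 0 + 1 + ... + 39 = 780. The cost of this approach is one extra global memory read per pointer inside the kernel, so it trades a little speed for not having to count argument bytes.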
On the other hand, since device functions are inlined, that is, assembled into the same PTX program as the calling kernel, I think the limit does not apply to them. Is that right?
I observed the limit with the CUDA 2.0 beta on 64-bit Ubuntu.