OK, I admit that an error pops up after I added getLastCudaError("Kernel execution failed").
It reports too many resources requested for launch.
um_rand.cu(93) : getLastCudaError() CUDA error : Kernel execution failed : (7) too many resources requested for launch.
It is a simple kernel: one block of 1024 threads is launched, and only 32 of those threads are actively running. When I compiled the program, there were no warnings at all. It is very interesting!
Then I checked the resource usage by adding the -res-usage option.
ptxas info : 77712 bytes gmem, 72 bytes cmem[3]
ptxas info : Compiling entry function ‘_Z21kern_rng_using_cuRandPfji’ for ‘sm_70’
ptxas info : Function properties for _Z21kern_rng_using_cuRandPfji
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 88 registers, 368 bytes cmem[0], 80 bytes cmem[2]
I found that such a small kernel needs 88 registers per thread! The limit on registers per SM is 65536 on V100. Therefore, if the driver statically partitions the resources, there should be register spilling (at most 65536 / 88 ≈ 744 threads are allowed per SM). If the driver dynamically partitions the resources (N = 32), there should be enough resources.
There is no shared memory usage for this kernel.
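For what it is worth, the numbers can be sanity-checked with a few lines of host-side arithmetic. This is only a sketch under stated assumptions: registers are reserved for every thread in the block at launch (whether or not the thread does any work), the 65536 figure is also the per-block register budget, and the hardware's per-warp allocation granularity is ignored. The helper names are made up for illustration and are not part of any CUDA API.

```cpp
#include <cassert>

// Registers needed to launch one block, ASSUMING every thread in the block
// gets its own registers regardless of how many threads do useful work.
// This ignores per-warp allocation granularity, so the real requirement
// can only be equal or higher.
constexpr long regs_needed_per_block(int threads_per_block, int regs_per_thread) {
    return static_cast<long>(threads_per_block) * regs_per_thread;
}

// Largest block size, rounded down to whole warps of 32 threads, that fits
// in the given register budget.
constexpr int max_threads_for_budget(int reg_budget, int regs_per_thread) {
    return (reg_budget / regs_per_thread) / 32 * 32;
}

// Does a block of this size fit in the register budget?
inline bool block_fits(int threads_per_block, int regs_per_thread, int reg_budget) {
    assert(threads_per_block > 0 && regs_per_thread > 0 && reg_budget > 0);
    return regs_needed_per_block(threads_per_block, regs_per_thread) <= reg_budget;
}
```

With the values from the ptxas output, regs_needed_per_block(1024, 88) is 90112, which is over the 65536 budget, so block_fits(1024, 88, 65536) is false. At 54 registers per thread the block needs 55296 registers and fits. Under the same assumptions, max_threads_for_budget(65536, 88) gives 736 threads as the largest whole-warp block size at 88 registers per thread.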
After I changed curandStateMRG32k3a back to curandState, the kernel used far fewer registers!
Register usage dropped from 88 to 54, and the program now runs as expected without any issues.
ptxas info : 77712 bytes gmem, 72 bytes cmem[3]
ptxas info : Compiling entry function ‘_Z21kern_rng_using_cuRandPfji’ for ‘sm_70’
ptxas info : Function properties for _Z21kern_rng_using_cuRandPfji
6456 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 54 registers, 368 bytes cmem[0]
Again, I think this is bizarre behavior.
Thanks for the suggestion!