CUDA 10.1
CentOS 7.7
V100-SXM2-16GB
Error: “too many resources requested for launch”
Question 1: Is there any way to get more information on what the problem is? Specifically: which resource, how much of that resource the launch is requesting, and how much of it the runtime thinks is available? Are there any switches that might give me some more (helpful) information?
Question 2: What might the issue be? Unfortunately I can’t include all the code, but I think this should provide enough information (if not, what else should I be looking at?):
ptxas info : 443 bytes gmem, 288 bytes cmem[3]
ptxas info : Compiling entry function '_Z6kernelPhS_S_S_S_PjPm' for 'sm_70'
ptxas info : Function properties for _Z6kernelPhS_S_S_S_PjPm
7440 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 113 registers, 408 bytes cmem[0], 8 bytes cmem[2]
ptxas info : 443 bytes gmem, 288 bytes cmem[3]
In my host code I do all of the cudaMalloc calls (one of which allocates a 4 MB/thread table) and then check how much memory is available:
memory free= 5738463232 total= 16914055168
I then call my kernel, i.e. kernel<<<80,32>>>(…). That call fails with the “too many resources requested for launch” error.
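To make the flow concrete, here is roughly the shape of the host side; the buffer names and sizes are placeholders, the kernel body is elided, and the signature is just my reading of the mangled name above, not the real code:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder matching the mangled name _Z6kernelPhS_S_S_S_PjPm; real body elided.
__global__ void kernel(unsigned char*, unsigned char*, unsigned char*,
                       unsigned char*, unsigned char*,
                       unsigned int*, unsigned long*) {}

int main() {
    const int blocks = 80, threadsPerBlock = 32;

    // The big per-thread table (4 MB/thread in the failing case).
    unsigned char* table = nullptr;
    cudaMalloc(&table, (4ull << 20) * blocks * threadsPerBlock);
    // ... the other six cudaMalloc calls elided ...

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("memory free= %zu total= %zu\n", freeBytes, totalBytes);

    kernel<<<blocks, threadsPerBlock>>>(table, nullptr, nullptr, nullptr,
                                        nullptr, nullptr, nullptr);

    // This is where "too many resources requested for launch" comes back.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));
    return 0;
}
```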
Looking at some potential issues as I understand them:
Shared memory: I’m not using any of my own. The kernel arguments are just 7 pointers (56 bytes), which should be well under the limit on kernel parameters (the programming guide says __global__ parameters are passed via constant memory and limited to 4 KB).
Constant memory: I’m not sure exactly how to add up the ptxas numbers, but even summing all the cmem banks (408 + 8 + 288 = 704 bytes) I’m nowhere near 64 KB.
Registers: 113 used per thread, under the 255/thread limit, and at 32 threads/block that’s 3,616 registers/block (well under the 64 K/block limit).
Local memory (stack): 7440 bytes/thread × 2048 max resident threads/SM × 80 SMs = 1,218,969,600 bytes, which is far less than the 5.7 GB I supposedly have free. (I’d like to understand this limit better: given the memory requirements of each of my threads I’ll never reach 2048 threads/SM, so it seems unrealistic to be charged for that worst case.) A small sanity-check sketch of this arithmetic is right below.
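This is the little sanity check I’ve been running against the device limits; the numbers are hard-coded from the ptxas output and the <<<80,32>>> launch, so it’s only a back-of-the-envelope pass, not anything authoritative:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    const int    regsPerThread   = 113;   // from ptxas -v
    const int    threadsPerBlock = 32;    // my launch configuration
    const size_t stackPerThread  = 7440;  // "bytes stack frame" from ptxas -v

    printf("registers/block: %d used vs %d limit\n",
           regsPerThread * threadsPerBlock, p.regsPerBlock);
    printf("shared mem/block limit: %zu bytes (I request none)\n",
           p.sharedMemPerBlock);

    // Worst-case local-memory reservation if every SM were fully occupied.
    size_t worstCaseLocal = stackPerThread
                          * (size_t)p.maxThreadsPerMultiProcessor
                          * (size_t)p.multiProcessorCount;
    printf("worst-case local memory: %zu bytes\n", worstCaseLocal);
    return 0;
}
```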
A couple of extra notes…

- If I halve that 4 MB/thread table to 2 MB/thread (changing nothing else), then I seem to have enough resources! (I get the wrong answers, but at least it will run ;^()
- The default stack size comes back as 1024 (cudaLimitStackSize), but setting it to 8192 (i.e. > 7440) does not seem to make any difference (a snippet of what I’m doing is below).
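For completeness, this is how I query and bump the stack limit (plain cudaDeviceGetLimit/cudaDeviceSetLimit calls, done before the launch):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default stack size: %zu\n", stackSize);     // comes back as 1024

    cudaDeviceSetLimit(cudaLimitStackSize, 8192);        // > 7440-byte stack frame
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("stack size after set: %zu\n", stackSize);    // no effect on the launch error
    return 0;
}
```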
So… I’m not seeing the problem.
I am interested in this particular case, but I don’t want to just twiddle things until they work. I’d like to understand the broader question that library creators face when developing a function that might be called by users with different parameters. I have the ptxas info, my understanding (for a particular usage) of the internal memory requirements (new/delete, malloc/free), and the launch parameters used, but I’m having trouble putting that all together to set the stack/heap limits and ensure I stay within the allowable resource constraints. Something along the lines of the sketch below is what I have in mind.
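This hypothetical helper is just to illustrate the kind of recipe I’m after, not something I claim is correct: size the device stack from the ptxas stack-frame figure and the malloc heap from the library’s own per-thread bookkeeping, before launching. perThreadHeapBytes is purely a placeholder for whatever new/malloc usage the function knows about.

```
#include <cuda_runtime.h>

// Hypothetical helper: set the device stack and malloc-heap limits before a
// launch, from the ptxas "bytes stack frame" figure and the library's own
// estimate of per-thread heap (new/delete, malloc/free) usage.
void configureLimits(size_t stackFrameBytes,     // from ptxas -v
                     size_t perThreadHeapBytes,  // library's own bookkeeping
                     int blocks, int threadsPerBlock) {
    cudaDeviceSetLimit(cudaLimitStackSize, stackFrameBytes);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                       perThreadHeapBytes * (size_t)blocks * threadsPerBlock);
}
```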
I apologize that this is kind of open ended. I Googled and found some discussion of these issues, but nothing complete or comprehensive. If you know of something, please follow up with a link.
Thanks.