I am using custom CUDA code written for MEX compilation in MATLAB. This code has worked without any issue on GeForce GTX 660-680 cards. I have upgraded to a GeForce GTX 1060 and am now getting the CUDA error ‘too many resources requested for launch’. I have tried changing the number of threads per block, but the error persists even when I reduce it to 1. I am sure that the data is within the memory limits of the GPU, and I am very confused about what is going on. Pastebin is down, so here is a link via tinypaste: http://pasted.co/6be70001. I apologize in advance for the formatting. I require the determinant and inverse of 6x6 matrices, which can get quite verbose.
I did not follow the provided link and did not look at the code.
Have you checked register usage? That may be the resource that is oversubscribed. The backend compiler PTXAS has architecture-specific code generation (including register allocation), so a different number of registers may be used on different architectures. Also, different GPU architectures provide a different granularity of register allocation, so even if the code requires the same number of registers, what is actually allocated may differ.
I’m not sure how to do that because I’m compiling with MATLAB’s mex compiler. I’m trying to find out how so I can answer your question.
I cannot find a way to output the number of registers being allocated for this task. Assuming that this is the problem, is the solution to rewrite the code to reduce the number of local variables being used? When I wrote this initially I tried to make it as conservative in memory as possible, and I am worried that I would not be able to reduce the number of variables I am using.
Are you using mex or mexcuda (released in R2015b) to compile that code?
If mexcuda, what are the contents of your mex gpu options file?
If you’re not sure what that is, try googling for help. The location of that file will vary depending on whether you are on Windows or Linux and on which version of MATLAB you are using, none of which you’ve indicated.
I have zero experience with MATLAB. While you are at it, it would probably be best to extract the entire build log which shows how exactly nvcc was invoked by MATLAB, to double-check that other than specifying a different target architecture (6.1 instead of 3.0) no other compilation switch changes were introduced inadvertently.
What were you looking for in the mex gpu options file? Here is a link to the file: http://pastebin.com/WFBsqP7T. I am running Windows 7 SP1 with MATLAB 2017a. I have the same issues regardless of whether I run mex or mexcuda.
Make a copy of that file and place it in the directory where you are running mexcuda from.
Modify that new file with the following change:
COMPFLAGS="--ptxas-options=-v --compiler-options=/Zp8,/GR,/W3,/EHs,/nologo,/MD $ARCHFLAGS"
Rerun your mexcuda command with -v:
mexcuda -v …
and see if the modification to COMPFLAGS shows up in the correct place in the nvcc compilation sequence emitted by mexcuda (with the verbose switch). You may need to experiment with this approach to find the correct variable (instead of COMPFLAGS) to get this to work.
The goal is to issue the ptxas command to nvcc that will cause the registers used and other info to be displayed. This info may point out the resource issue that is preventing your code from running properly.
Thank you, I was able to get nvcc to display the registers being utilized. The output can be found here http://pastebin.com/xBjfstZK. This isn’t to say I have any idea where to go from here.
I am confused by the log. You said your platforms were GTX 6x0 and GTX 1060, so I would have expected the target architecture to be specified as sm_30 for GTX 680 and sm_61 for GTX 1060. That is not what I am seeing. It is a bit of work to get a side-by-side overview of resource usage for a given kernel from these raw logs, and I haven’t put in the effort to do the extraction. Such a comparison might be quite telling.
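The extraction described above could be scripted rather than done by hand. The following is only a minimal sketch in host-side C++: the function names are hypothetical, and it assumes the log lines look like the "Used N registers, M bytes smem" and "N bytes stack frame" lines that ptxas prints with -v. Each function returns -1 when the line does not contain the field.

```cpp
#include <regex>
#include <string>

// Returns the first capture group of `pattern` found in `line` as an int,
// or -1 when the pattern does not occur in the line.
static int first_number(const std::string& line, const std::regex& pattern) {
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return -1;
    }
    return std::stoi(match[1].str());
}

// Registers per thread, from a line such as
// "ptxas info : Used 253 registers, 1296 bytes smem, ..."
int ptxas_registers(const std::string& line) {
    static const std::regex re(R"(Used (\d+) registers)");
    return first_number(line, re);
}

// Static shared memory per block, from the same kind of line.
int ptxas_smem_bytes(const std::string& line) {
    static const std::regex re(R"((\d+) bytes smem)");
    return first_number(line, re);
}

// Stack frame per thread, from a line such as
// "648 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads"
int ptxas_stack_bytes(const std::string& line) {
    static const std::regex re(R"((\d+) bytes stack frame)");
    return first_number(line, re);
}
```

Running these over the log once per target architecture and printing the three numbers per kernel would give the side-by-side view.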
I am not completely sure, but the high stack usage of some of the kernels suggests that they may be exceeding the default stack allocation (which I think is 2KB per thread). On the other hand, I thought that the driver would automatically grow the stack as necessary when loading a kernel, as long as the stack usage for a kernel was known at compile time (not always the case, e.g. when recursive functions are used). As an experiment, try increasing the stack size with cudaDeviceSetLimit().
Your GTX 1060 is a compute capability 6.1 device. As an aside, you might want to modify your options file again as follows:
ARCHFLAGS="-gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_30,code=sm_30 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_60,code=\"sm_60,compute_60\" $NVCC_FLAGS"
but this is an aside and is not essential to your question. I wouldn’t do it first, I would do it later, as an experiment, for amusement purposes.
The relevant output for your cc 6.1 device should be this:
ptxas info : Compiling entry function '_Z9localize9PdS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_iS_S_S_i' for 'sm_60'
ptxas info : Function properties for _Z9localize9PdS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_iS_S_S_i
648 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 253 registers, 1296 bytes smem, 500 bytes cmem, 600 bytes cmem
ptxas info : Compiling entry function '_Z9localize7PdS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_iS_S_S_i' for 'sm_60'
ptxas info : Function properties for _Z9localize7PdS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_iS_S_S_i
392 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 253 registers, 784 bytes smem, 500 bytes cmem, 600 bytes cmem
ptxas info : Compiling entry function '_Z10localize11PdS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_iS_S_S_i' for 'sm_60'
ptxas info : Function properties for _Z10localize11PdS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_S_iS_S_S_i
968 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 253 registers, 1936 bytes smem, 500 bytes cmem, 600 bytes cmem
Based on the above, I’m a bit skeptical that the code you originally showed in the paste link is what is actually being compiled. That code contains only a localize7 kernel, but the above output indicates the presence of localize9 and localize11 kernels as well. So you will have to interpret the comments below accordingly.
Usage of 253 registers per thread on a cc6.1 device means that you could support a maximum kernel launch of 256 threads per block: 65536/253 is about 259 threads, and since threads are scheduled in whole warps of 32, that leaves 8 warps, or 256 threads. You’ve already stated you cannot get it to work with even 1 thread per block, but you might want to scrub that and see if you may have made some sort of mistake during that test (for example, did you modify all kernel launches to use no more than 256 threads per block?).
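The register arithmetic above can be sketched in a few lines of host-side C++. This is an illustration under stated assumptions, not a substitute for the occupancy calculator: the function name is hypothetical, and the constants (65536 registers available to a block, 1024 threads per block, registers allocated per warp in units of 256) are assumed values for a compute capability 6.x device rather than something taken from this thread.

```cpp
#include <cassert>

// Assumed values for a compute capability 6.x device.
constexpr int kWarpSize = 32;
constexpr int kRegistersPerBlockLimit = 65536;  // registers available to one block
constexpr int kRegisterAllocUnit = 256;         // assumed per-warp allocation granularity
constexpr int kMaxThreadsPerBlock = 1024;       // upper limit on block size

// Largest block size, in threads, whose warps fit in the register limit
// when every thread uses `regs_per_thread` registers.
int max_threads_for_registers(int regs_per_thread) {
    assert(regs_per_thread > 0);
    // Registers needed by one warp, rounded up to the allocation unit.
    int regs_per_warp = regs_per_thread * kWarpSize;
    regs_per_warp = (regs_per_warp + kRegisterAllocUnit - 1)
                    / kRegisterAllocUnit * kRegisterAllocUnit;
    // Only whole warps can be launched.
    int warps = kRegistersPerBlockLimit / regs_per_warp;
    int threads = warps * kWarpSize;
    return threads < kMaxThreadsPerBlock ? threads : kMaxThreadsPerBlock;
}
```

With 253 registers per thread, one warp needs 8096 registers, which rounds up to 8192, so 8 warps (256 threads) fit in 65536 registers.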
If you’re unsuccessful with that, then one of the other reported resource utilizations is the culprit. Normally shared memory is not an issue, but I haven’t studied your code carefully. Are you attempting to use more than 48 KB of shared memory? It didn’t appear that you were for the localize7 kernel.
The stack frame, which is essentially local memory usage, is also something to check. 1024 bytes of stack frame does not seem out of whack to me, however. A quick calculation for GTX 1060 6GB (do you have the 3GB version?):
10 SMs * 2048 threads * 1024 bytes/thread = 20MB
which should not be a problem from stack/lmem usage standpoint, I don’t think.
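The same back-of-the-envelope estimate can be written as a small host-side C++ helper, which makes it easy to rerun with other stack frame sizes from the ptxas log. The function names are hypothetical, and the figures of 10 SMs and 2048 resident threads per SM are the ones used in the calculation above.

```cpp
#include <cstdint>

// Worst-case stack/local memory footprint if every resident thread on
// every SM gets `stack_bytes_per_thread` bytes.
std::uint64_t total_stack_bytes(std::uint64_t sm_count,
                                std::uint64_t threads_per_sm,
                                std::uint64_t stack_bytes_per_thread) {
    return sm_count * threads_per_sm * stack_bytes_per_thread;
}

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
double bytes_to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For example, total_stack_bytes(10, 2048, 1024) is 20971520 bytes, which is the 20 MB quoted above and a tiny fraction of a 6 GB card.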
I thought I saw kernels with ~2200 bytes of stack frame in the logs? Did I mis-read? I think the default stack is only 2KB, so 2200 bytes would exceed that.
Other than usage of registers, shared memory, and stack space, what could lead to an “out of resources” error during kernel launch? Nothing comes to mind right now.
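One way to rule out the usual suspects before a launch is a host-side sanity check such as the sketch below. It is illustrative only: the function and enum names are hypothetical, the limits (1024 threads, 65536 registers, and 48 KB of shared memory per block) are assumed values for a compute capability 6.1 device, and the register test is a simple product that ignores allocation granularity.

```cpp
#include <cassert>

// Assumed per-block limits for a compute capability 6.1 device.
constexpr int kMaxThreadsPerBlock = 1024;
constexpr int kMaxRegistersPerBlock = 65536;
constexpr int kMaxSharedBytesPerBlock = 48 * 1024;

enum class LaunchCheck { Ok, BadBlockSize, TooManyRegisters, TooMuchSharedMemory };

// Reports the first per-block limit that a launch would exceed, given the
// per-kernel numbers printed by ptxas and the intended block size.
LaunchCheck check_launch(int regs_per_thread, int shared_bytes, int threads_per_block) {
    assert(regs_per_thread > 0 && shared_bytes >= 0);
    if (threads_per_block < 1 || threads_per_block > kMaxThreadsPerBlock) {
        return LaunchCheck::BadBlockSize;
    }
    // Widen before multiplying so the product cannot overflow an int.
    long long regs_needed = static_cast<long long>(regs_per_thread) * threads_per_block;
    if (regs_needed > kMaxRegistersPerBlock) {
        return LaunchCheck::TooManyRegisters;
    }
    if (shared_bytes > kMaxSharedBytesPerBlock) {
        return LaunchCheck::TooMuchSharedMemory;
    }
    return LaunchCheck::Ok;
}
```

A check like this also catches a block size that arrives on the host side as garbage (zero, negative, or far above 1024), which is worth excluding before looking for more exotic causes.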
Again, sorry for the confusion. The original code, which was the one compiled with the flags, contains functions for different image sizes: 7x7, 9x9, and 11x11. As most of that code was redundant, I removed the 9x9 and 11x11 versions for posting. I have been compiling both, and the same issues appear with both versions. I just verified that the output for the reduced code I posted is identical to the localize7 part of the output I posted above. I’m including a link to the reduced output for reference: http://pastebin.com/ixexEihE.
As for the gpus. The GTX 660 is not installed on this rig. I mentioned it to help explain that this code was working and the change of GPU was what caused these issues.
There is only 1 kernel call in this code and I am sure that the threadsperblock was what I thought it was (ranging from 1024 to 1)
Local memory should be 784 bytes. My only thought is that I might be allocating that for every thread, which could possibly cause memory issues, but if that were the problem I’d expect it to go away when I use only 1 thread (or 1 set of data), and the problem still exists.
I am using the 6 GB version of the 1060.
I apologize for the easily avoidable confusion. I very much appreciate your patience in teaching me to resolve these issues.
Njuffa, to respond to your experiment, I tried doing that, nothing seemed to happen.
Yes, some were as high as ~2200 bytes, but they were for sm_20 objects. I don’t think those should be relevant on a cc6.1 device. There is no cc 2.x PTX in the ARCHFLAGS. The only embedded PTX is compute_60 AFAICT.
I’m pretty much out of ideas too. JIT compilation can modify resource usage under the hood, but I don’t think sm_60 sass JITs on a cc6.1 device, although I could be wrong. I think cc6.1 supports sm_60 sass natively, but I haven’t run the experiment to verify that. This particular sequence does appear to include compute_60 PTX.
As a last gasp, perhaps you should add a mexPrintf statement to your mex file to print out the number of threads per block you are using on the kernel launch, to verify that you are correctly passing this parameter from MATLAB to the mexFunction.
To eliminate as much uncertainty as possible (including from JIT compilation), it would be best to ensure that the build delivers SASS for the sm_61 architecture when running with the GTX 1060.
OK, so this problem is now fixed. Txbob was closest without going over. It turns out my threadsperblock variable was being initialized as an integer. I changed it to a size_t variable and all problems, in both versions of the code, completely disappeared. It is still important to keep threadsperblock < 253, but the code is now working as expected, for whatever reason.
Thank you both, txbob and njuffa, for taking the time to help me figure this out. I am extremely grateful, as I was just completely lost.
I will readily admit that this makes no sense to me. Since signed 32-bit integers can hold values up to 2**31-1 (about 2 billion), a threads_per_block count cannot overflow an int, so nothing here should require the use of a size_t variable.
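The range argument is easy to confirm with a short host-side C++ sketch. The helper name is hypothetical, and the example assumes a platform where int is 32 bits wide, which is the case on the Windows and Linux hosts that CUDA supports.

```cpp
#include <cstddef>
#include <limits>

// True when `value` can be stored in an int without overflow.
constexpr bool fits_in_int(std::size_t value) {
    return value <= static_cast<std::size_t>(std::numeric_limits<int>::max());
}

// The largest block size discussed in this thread is 1024 threads,
// which is far below the largest value an int can represent.
static_assert(fits_in_int(1024), "1024 threads per block fits in an int");
```

So whatever changed when the variable became a size_t, it was not the representable range. Differences in how the value is read, converted, or passed along the way from MATLAB to the kernel launch would be a more plausible place to look, although the thread does not establish the actual cause.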