cudaLaunch returned (0x2)

Hi Everyone,

I’m a bit new, so I’m sorry in advance. :)

I’ve encountered something weird, and am unsure whether it is a bug, misuse of the hardware or just a misunderstanding. I’ve searched for similar problems, but no suggestion seemed to help.

When attempting to run a kernel from host, we get this:
CUDA error while running kernel: /home/Velo/History/2013/2013-01-14.dat , err: out of memory

Now, I have no idea how that path ended up as the kernel's name, but that seems beside the point.
When debugging, I also see the following error:
warning: Cuda API error detected: cudaLaunch returned (0x2)
According to the documentation, Error 0x2 means the API call failed because it was unable to allocate enough memory to perform the requested operation.
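For what it's worth, here is a minimal sketch of how a launch failure like this surfaces on the host side (the kernel name and launch configuration are placeholders, not our real code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void SomeKernel(int n) { }

int main() {
    SomeKernel<<<1, 512>>>(0);
    // cudaGetLastError() reports launch-time failures, e.g.
    // cudaErrorMemoryAllocation (0x2) returned by cudaLaunch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    // cudaDeviceSynchronize() reports failures during kernel execution.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```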

I’m not at liberty to post the contents of the program, and it is very long. However I can say the following:

  • The error persists even when the amount of free memory, per cudaMemGetInfo, is up to 900 MB.
  • All memory for the kernel is pre-allocated before the call. No cudaMalloc calls inside.
  • The system worked prior to a few changes. Outside of processing, the changes affected the pre-allocated memory and added one statically-defined boolean inside the kernel.
  • The point where the code fails is upon entry to the kernel itself, not within it. The following line fails without ever reaching the first line of code inside the kernel: `CalculationKernell<<<…>>>( arrLen, arr, kernelData, results, log);`

I find it hard to believe our kernel managed to gobble up nearly a GB worth of space. Therefore, I dug a little deeper and noticed that error (0x2) might stem from the kernel’s shared memory requirements, but I haven’t found any way to determine how much shared memory my kernel actually uses.
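One runtime way to inspect a kernel's static resource usage (shared, local, constant memory and registers) is `cudaFuncGetAttributes`; a minimal sketch, with a placeholder kernel standing in for the real one:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real kernel.
__global__ void CalculationKernel(int n) { }

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, (const void *)CalculationKernel);
    printf("static shared mem per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("local mem per thread       : %zu bytes\n", attr.localSizeBytes);
    printf("registers per thread       : %d\n", attr.numRegs);
    printf("constant mem               : %zu bytes\n", attr.constSizeBytes);
    return 0;
}
```

Note this reports only statically known usage; dynamic shared memory requested at launch time is not included.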

Setup info:
OS: Ubuntu 14.04.2 (Without X server)
GPU: 4xGeForce GTX 780 Ti
Cuda Release: 6.5
Using Nsight (Eclipse) for Linux.

Thanks in advance.

“The system worked prior to a few changes. Outside of processing, the changes affected the pre-allocated memory and added one statically-defined boolean inside the kernel.”

show the shared memory you allocate inside the kernel perhaps

and what is the size of ‘threads’ - your block dimension?

“show the shared memory you allocate inside the kernel perhaps”
I think that is exactly what I’m missing. How do I show the shared memory I allocate?

“and what is the size of ‘threads’ - you block dimension?”
1 block, 512 threads.

you note:

a) x in: `kernel<<<dGx, dBx, x, s>>>`
b) any allocation inside the kernel declared with `__shared__`

for you to be using shared memory, you need at least one of a) or b)
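put differently, a kernel only consumes shared memory through one of those two mechanisms; a minimal sketch (function names invented):

```cuda
// a) dynamic shared memory: sized by the third launch parameter
__global__ void withDynamicShared(float *out) {
    extern __shared__ float buf[];      // size set at launch: <<<g, b, bytes>>>
    buf[threadIdx.x] = 0.0f;
    out[threadIdx.x] = buf[threadIdx.x];
}

// b) static shared memory: declared with __shared__ inside the kernel
__global__ void withStaticShared(float *out) {
    __shared__ float buf[512];          // size fixed at compile time
    buf[threadIdx.x] = 0.0f;
    out[threadIdx.x] = buf[threadIdx.x];
}

// launches (one block of 512 threads assumed):
//   withDynamicShared<<<1, 512, 512 * sizeof(float)>>>(out);
//   withStaticShared<<<1, 512>>>(out);
```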

“A summary on the amount of used registers and the amount of memory needed per
compiled device function can be printed by passing option -v to ptxas :”

in eclipse, you can set the -v flag as part of the settings
i would not know about win… that other pseudo os

compile your code with the additional compiler switch -Xptxas -v

If you can post the output generated by that, it may be useful.

I don’t think this issue (0x02 error from cudaLaunch) has anything to do with shared memory usage. My first guess would be local/stack usage. The -Xptxas -v output would help to quickly confirm or rule that out.

@little_jimmy, I use 0 shared memory. None of my variables have the `__shared__` prefix, nor do I use the third kernel launch argument.

Here’s the ptxas output:

ptxas info    : 73 bytes gmem
ptxas info    : Function properties for _Z21SignalAUR_ProcessTickR14GpuAURSignal_tRK9GpuTick_t
    32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
ptxas info    : Function properties for _Z26Base_UpdateSignalsR17GpuBase_tRK9GpuTick_tRK15TConfig_t
    32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
ptxas info    : Function properties for cudaMalloc
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z9UpdateStatusR15GpuStatManager_tRK9GpuTick_tRK12IConfig_t
    8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info    : Function properties for _Z21SignalAUF_ProcessTimeR14GpuAUFSignal_td
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN4dim3C1Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessor
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z28StatManager_ProcessEventR13GpuEvtData_tR15GpuStatManager_tRK12IConfig_tP14GpuEvtList_tRK9GpuTick_t
    48 bytes stack frame, 48 bytes spill stores, 48 bytes spill loads
ptxas info    : Function properties for _Z19SignalAUD_ProcessTimeR12GpuAUDSignal_td
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z25Base_ProcessRequestR17GpuBase_tRK9GpuTick_t11ReqAction
    16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info    : Function properties for cudaDeviceGetAttribute
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z23CalcActiveR10GpuReq_tRK9GpuTick_tdP13GpuEvtData_t
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z19SignalAG_ProcessTimeR12GpuAGSignal_td
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z6OnEventR13GpuEvtData_tR17GpuBase_tRK12IConfig_tP14GpuEvtList_tRK9GpuTick_t
    16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info    : Function properties for _Z19SignalBF_ProcessTickR12GpuBFSignal_tRK9GpuTick_t
    96 bytes stack frame, 92 bytes spill stores, 92 bytes spill loads
ptxas info    : Function properties for _Z5roundf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z19SignalBUG_ProcessTickR12GpuBUGSignal_tRK9GpuTick_t
    8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info    : Compiling entry function '_Z11DerivedjPK9GpuTick_tP16GpuDerived_tP15GpuStatManager_tP14GpuEvtList_t15TConfig_t17SConfig_t12IConfig_t' for 'sm_35'
ptxas info    : Function properties for _Z11DerivedjPK9GpuTick_tP16GpuDerived_tP15GpuStatManager_tP14GpuEvtList_t15TConfig_t17SConfig_t12IConfig_t
    7728 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 76 registers, 8000 bytes cumulative stack size, 448 bytes cmem[0]
ptxas info    : Function properties for cudaGetDevice
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z26CalcPassiveR10GpuReq_tRK9GpuTick_tdP13GpuEvtData_t
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z19SignalDD_ProcessTickR12GpuDDSignal_tRK9GpuTick_t
    24 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
ptxas info    : Function properties for _Z20ProcessEventsR17GpuBase_tRK9GpuTick_tS3_RK17SConfig_tRK12IConfig_tP14GpuEvtList_t
    208 bytes stack frame, 64 bytes spill stores, 64 bytes spill loads
ptxas info    : Function properties for _Z19SignalDF_ProcessTickR12GpuDFSignal_tP11AnySignal_t
    32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
ptxas info    : Function properties for _Z3maxdd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZSt3absd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z19SignalDS_ProcessTickR12GpuDSSignal_tP11AnySignal_t
    40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
ptxas info    : Function properties for _Z23Derived_ProcessTickR16GpuBase_tRK9GpuTick_tS3_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaFuncGetAttributes
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z8ClosePendingEventsR17GpuBase_tRK9GpuTick_t
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z13IsTimedOutdRK17SConfig_t
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z16OnEvtRejectR17GpuBase_t
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z21SignalDUG_ProcessTickR14GpuDUGSignal_tRK9GpuTick_t
    24 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
ptxas info    : Function properties for _ZN4dim3C2Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

i would think that you are running out of registers and local memory, particularly with the block size you are using

ptxas info : Function properties for _Z11DerivedjPK9GpuTick_tP16GpuDerived_tP15GpuStatManager_tP14GpuEvtList_t15TConfig_t17SConfig_t12IConfig_t
7728 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

“I’m not at liberty to post the contents of the program, and it is very long.”

do you call any device functions from within your device kernel(s)?
what are the a) average and b) highest parameter counts of your device kernels/functions?
can you show your device kernel/function declarations?

Double post. Edited away.

I’ve managed to squeeze what I need, so the problem is resolved.

A few questions out of curiosity:

  • I've noticed you didn't quote the cumulative stack size line (`Used 76 registers, 8000 bytes cumulative stack size, 448 bytes cmem[0]`). Isn't that number useful?
  • What can be done in such cases, except for offloading data to L1/Texture/RO memory?
  • My code is mostly within device functions, averaging 5 arguments each, almost exclusively pointers. What is the effect of that on the kernel's memory requirements?

Anyway, big thanks for the help (I’ve learned about ptxas).

rather good questions; i am not sure i know all the answers

“I’ve noticed you didn’t quote the cumulative stack size”

i would think that ‘cumulative stack’ means stack accumulation, or stack already accumulated
from such a viewpoint, and given the excessive value, it would be a second trip line, not the first

“What can be done in such cases, except for offloading data to L1/Texture/RO memory?”

i think this generally occurs because of at least 2 things:
significant register use by functions/ code
significant kernel/ function parameters

hence, anything that works on this should be rewarding
code sections ({}) help to recycle registers i hope, and i thus favour these, particularly when the kernel or function becomes lengthy and takes a number of functional turns
and i normally limit kernel parameters, particularly when calling device functions from kernels, as these tend to end up consuming registers
i like to roll up parameters and pass them via shared or global memory, one way or the other; instead of passing pointers, passing along a pointer to an array of pointers
parameters are more static, and should thus cache well, i would hope, particularly when directed to a separate cache
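one way to sketch the 'roll up parameters' idea (struct, field and function names are invented for illustration):

```cuda
#include <cuda_runtime.h>

// Replace a long kernel parameter list with a single pointer to a
// struct that lives in global memory.
struct KernelArgs {
    int    arrLen;
    float *arr;
    float *results;
};

__global__ void Calculation(const KernelArgs *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < a->arrLen)
        a->results[i] = a->arr[i];      // placeholder body
}

// Host side: fill a KernelArgs with device pointers, copy it to the
// device once, then launch with the single pointer:
//   cudaMemcpy(dArgs, &hArgs, sizeof(KernelArgs), cudaMemcpyHostToDevice);
//   Calculation<<<blocks, threads>>>(dArgs);
```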

“My code is mostly within device functions, averaging 5 arguments being almost exclusively pointers”
“What is the effect of that on the kernel memory requirements”
i think that kernel parameters are passed via constant memory, which sounds helpful as constant memory is quick to access
but threads need ‘handles’ to the constant memory; either way, many times you would note that this seems to increase register use/ register pressure; note the changes in ptxas when increasing/ decreasing the number of kernel parameters
this is further exacerbated when the kernel is lengthy and ends up calling numerous device functions, as the union or collection of kernel/ function parameters ends up as the parameters of the parent kernel, again potentially placing enormous pressure on register use

i think nvidia is well aware of this; i recall gtc made mention of improvements in this regard with future architectures

Great info, thank you.