Weird CUDA problem: changing += to /= in a loop causes a variable not to be set

I have a loop in a CUDA kernel as follows:

__global__ void CreateModelsfromStatsKernel(Buffers buf, const CudaFrameInfo cu)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int start_row = row * cu.rectHeight;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int start_col = col * cu.rectWidth;
    int x_offset, y_offset;
    uint32_t offset;
    uint32_t box_offset = (row * cu.numRectsX) + col;

    for(x_offset = 0; x_offset < cu.rectWidth; x_offset++)
    {
        for(y_offset = 0; y_offset < cu.rectHeight; y_offset++)
        {
            offset = (start_row + y_offset) * cu.fbWidth + start_col + x_offset;
            buf.IavgF[offset] /= buf.Icount[box_offset];

            buf.boxMinMean[box_offset] = 37;
        }
    }
}

buf.IavgF and buf.Icount are allocated with cudaMalloc, while buf.boxMinMean is allocated with cudaMallocManaged.
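Roughly speaking, the setup looks like the sketch below. This is a simplification: the element types and sizes here are placeholders, not the exact ones from my code.

// Placeholder sizes; the real ones come from the frame dimensions and rectangle grid.
size_t frame_pixels = 1280 * 720;
size_t num_boxes    = 160 * 12;

// Device-only allocations: not directly readable from host code.
cudaMalloc(&buf.IavgF,  frame_pixels * sizeof(float));
cudaMalloc(&buf.Icount, num_boxes * sizeof(uint32_t));

// Managed (unified) allocation: readable from host code after a device sync.
cudaMallocManaged(&buf.boxMinMean, num_boxes * sizeof(uint8_t));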

In the calling host code, after cudaDeviceSynchronize(), I see that this line has no effect:

buf.boxMinMean[box_offset] = 37;

i.e. calling the kernel does NOT set this value.
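For clarity, the host-side check is roughly this (launch configuration omitted; index and expected value are just as above):

CreateModelsfromStatsKernel<<<grid, block>>>(buf, cu);
cudaDeviceSynchronize();

// boxMinMean is managed memory, so the host can read it directly after the sync.
printf("boxMinMean = %u\n", (unsigned)buf.boxMinMean[0]);   // never shows 37 with /= in place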

Now, if I remove the following line or change /= to +=:

buf.IavgF[offset] /= buf.Icount[box_offset];

then all of a sudden I see buf.boxMinMean[box_offset] set to 37 in the host code, as expected. What on earth is going on? I have spent most of the day trying to work out why this is happening.

I thought division by zero might be the problem, but I used cudaMemset to set the array buf.Icount to 1s and it did not change things.

What is going wrong and how can I debug it?

Many thanks.

I wouldn’t be able to answer what is going wrong without a complete test case. A kernel, by itself, is not a complete test case.

  1. Use proper CUDA error checking (google that, take the first hit, and apply it to your code); there is a minimal sketch after this list.
  2. Run your code under cuda-memcheck or compute-sanitizer.
  3. Use in-kernel printf to confirm that the kernel is running and that the values written are as expected.
  4. Use a debugger such as cuda-gdb.
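For item 1, the gist is to check the return value of every CUDA runtime call; a minimal sketch (the macro name is arbitrary):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every cuda*() call; print and abort on any failure.
#define cudaCheck(call)                                                  \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Usage:
//   cudaCheck(cudaMalloc(&ptr, nbytes));
//   cudaCheck(cudaDeviceSynchronize());

Note that a kernel launch itself returns nothing, so you also need a cudaGetLastError() check immediately after the <<<...>>> launch.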

Hi, thanks for that. I tried to write a test case using the same buffer sizes etc., but I could not reproduce the problem. It only exists in the complicated setup that I have.

I was hoping that the really weird change of /= to += might mean something…

I have made sure all my cuda*() calls are returning cudaSuccess.

printf() in the CUDA kernel only produces output when this line:

buf.IavgF[offset] /= buf.Icount[box_offset];

is changed to

buf.IavgF[offset] += buf.Icount[box_offset];

cuda-memcheck just printed this message repeatedly:

========= Program hit cudaErrorDevicesUnavailable (error 46) due to "all CUDA-capable devices are busy or unavailable" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2fdb04]
=========     Host Frame:/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcudart.so.10.2 (cudaFree + 0x13c) [0x3bdd4]
=========     Host Frame:./video_dec_cuda [0x4f5c4]
=========     Host Frame:./video_dec_cuda [0x9544]
=========     Host Frame:./video_dec_cuda [0x44f0c]
=========     Host Frame:/lib/aarch64-linux-gnu/libpthread.so.0 [0x7088]
=========
cuGraphicsEGLRegisterImage failed: 201, cuda process stop
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuGraphicsEGLRegisterImage. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuGraphicsEGLRegisterImage + 0x11c) [0x1f573c]
=========     Host Frame:./video_dec_cuda [0x4f5d8]
=========     Host Frame:./video_dec_cuda [0x9544]
=========     Host Frame:./video_dec_cuda [0x44f0c]
=========     Host Frame:/lib/aarch64-linux-gnu/libpthread.so.0 [0x7088]

I tried using cuda-gdb, but the application did not run properly; it should have popped up an overlay showing the decoded H264 video:

NVIDIA (R) CUDA Debugger
10.2 release
Portions Copyright (C) 2007-2020 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./video_dec_cuda...done.
(cuda-gdb) run ~/Training+detection\ 1280x720.264 H264
Starting program: /nfs/mmapi_samples/samples/02_video_dec_cuda/video_dec_cuda ~/Training+detection\ 1280x720.264 H264
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
[New Thread 0x7fb1adcdb0 (LWP 18933)]
[New Thread 0x7fb12dbdb0 (LWP 18934)]
[New Thread 0x7fb0adadb0 (LWP 18935)]
NvMMLiteBlockCreate : Block : BlockType = 261 
[New Thread 0x7fab685db0 (LWP 18936)]
[New Thread 0x7faae84db0 (LWP 18937)]
Starting decoder capture loop thread
Video Resolution: 1280x720
[INFO] (NvEglRenderer.cpp:110) <renderer0> Setting Screen width 1280 height 720
[New Thread 0x7faa683db0 (LWP 18938)]
libv4l2_nvvidconv (0):(802) (INFO) : Allocating (26) OUTPUT PLANE BUFFERS Layout=1
libv4l2_nvvidconv (0):(818) (INFO) : Allocating (26) CAPTURE PLANE BUFFERS Layout=0
[New Thread 0x7f9b7fedb0 (LWP 18939)]
[New Thread 0x7f9affddb0 (LWP 18940)]
[New Thread 0x7f9a7fcdb0 (LWP 18941)]
Query and set capture successful
fatal:  The CUDA driver initialization failed. (error code = CUDBG_ERROR_INITIALIZATION_FAILURE(0x14)
(cuda-gdb) [Thread 0x7f9affddb0 (LWP 18940) exited]
[Thread 0x7f9a7fcdb0 (LWP 18941) exited]
[Thread 0x7fb1adcdb0 (LWP 18933) exited]
[Thread 0x7fb538fa10 (LWP 18929) exited]
[Thread 0x7fb0adadb0 (LWP 18935) exited]
[Thread 0x7faae84db0 (LWP 18937) exited]
[Thread 0x7faa683db0 (LWP 18938) exited]
[Thread 0x7fab685db0 (LWP 18936) exited]
[Thread 0x7fb12dbdb0 (LWP 18934) exited]

If the kernel is not running (which seems to be the case, given the lack of printf output), then I find it hard to believe you are doing proper CUDA error checking and getting no error output. But of course I can’t prove that.

You may get better help with Jetson-specific issues by asking on the Jetson forum corresponding to your device.


Thanks, I did not realise that I should call cudaGetLastError() after launching my own kernel as well.
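The check I added is along these lines (a sketch, not my exact lines; launch configuration and file name are as in my code):

CreateModelsfromStatsKernel<<<grid, block>>>(buf, cu);

cudaError_t err = cudaGetLastError();              // catches launch/configuration errors
if (err != cudaSuccess)
    printf("CUDA error: %s %s\n", cudaGetErrorString(err), "NvAnalysis.cu");

err = cudaDeviceSynchronize();                     // catches errors from the kernel itself
if (err != cudaSuccess)
    printf("CUDA error: %s %s\n", cudaGetErrorString(err), "NvAnalysis.cu");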

I received this error:

CUDA error: too many resources requested for launch NvAnalysis.cu

If I do not make use of the 1920-byte unified memory buffer (buf.boxMinMean) in my kernel, I can successfully launch with a 160 x 6 x 2 configuration (x-threads x y-threads x blocks).

If I do make use of buf.boxMinMean in the kernel, I can only successfully launch 128 x 6 x 2.

I am only passing a total of 94 bytes of kernel arguments, which is below the 256-byte limit, so the problem must lie elsewhere.

How can I find out which resource limit I am exceeding?

The usual problem here is breaking the limit on registers per thread. There is a fairly high limit on this (usually something like 255), but the effective limit may be lower when you consider the total registers used by a thread block and compare that to the total registers available on an SM. There are many questions about this on these and other forums (here is an example).
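If you want to sanity-check the budget yourself, the relevant limits can be queried at runtime; a rough sketch (plug in your own register count and block size):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Hard limits for the device.
    printf("registers per thread block: %d\n", prop.regsPerBlock);
    printf("registers per SM:           %d\n", prop.regsPerMultiprocessor);

    // "too many resources requested for launch" typically means the block as a
    // whole exceeds a per-block budget like this one:
    int regs_per_thread   = 32;        // substitute what the compiler reports
    int threads_per_block = 160 * 6;   // substitute your launch configuration
    printf("registers requested per block: %d\n",
           regs_per_thread * threads_per_block);
    return 0;
}

Shared memory and other per-block resources count against similar budgets.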

You can find out how many registers your code is using (one way, at least) by asking the compiler for that information with a compile switch on the command line:

-Xptxas=-v
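For example, just add it to your existing compile command (the file name here is illustrative):

nvcc -Xptxas=-v NvAnalysis.cu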

Thanks. I added this, and the entry for the function that is causing me problems is as follows:

ptxas info    : Compiling entry function '_Z27CreateModelsfromStatsKernel13WebcamBuffers13CudaFrameInfott' for 'sm_72'
ptxas info    : Function properties for _Z27CreateModelsfromStatsKernel13WebcamBuffers13CudaFrameInfott
    104 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 104 bytes cumulative stack size, 446 bytes cmem[0]

It doesn’t look like a registers-per-thread issue based on that output. Did you compile with /= instead of += ? I don’t think I would be able to proceed further without an actual test case. Something doesn’t add up here.


Would you accept a patch to the mmapi samples from L4T?

I am unable to get my standalone program to show the same behaviour, probably because it does not allocate a CUeglFrame from an EGLImageKHR.

This is for the Jetson Nano platform.

Edit: yes, I compiled with the problem case of /=.

If you have a kernel that is not running and reports the error "too many resources requested for launch", that does not depend on anything else in your code. You should be able to build a test case around that kernel (a complete code whose only job is to launch that kernel) and reproduce the problem. There’s almost enough information in this thread already for me to do that, except that I don’t have the definitions of Buffers and CudaFrameInfo, and it would anyway be inconvenient for me to reverse-engineer the dimensions of the embedded data, etc.

If you still want help, I suggest you follow that path: build a complete test case that only launches CreateModelsfromStatsKernel but otherwise duplicates the data sizes/dimensions, kernel launch configuration, etc. You should be able to run that and reproduce the "too many resources requested for launch" error. At that point the problem should be solvable.


If you are running on a Jetson Nano, why are you showing compilation output for cc 7.2?

The Jetson Nano is compute capability 5.3.

Sorry, here is the requested entry:


ptxas info    : Compiling entry function '_Z27CreateModelsfromStatsKernel13WebcamBuffers13CudaFrameInfott' for 'sm_53'
ptxas info    : Function properties for _Z27CreateModelsfromStatsKernel13WebcamBuffers13CudaFrameInfott
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 22 registers, 48 bytes cumulative stack size, 366 bytes cmem[0]

I don’t happen to have a Jetson Nano to run on, so if you are able to create a standalone test case, you might want to post that question on the Jetson Nano forum.


Okay, good news. I have a test case file. It does exhibit the problem but I forgot to compile it with the Nano-specific command line.

/usr/local/cuda-10.2/bin/nvcc -gencode arch=compute_53,code=sm_53 test_code2.cu

test_code2.cu (5.1 KB)

So would you like me to take this over to the Nano forum now?

Yes. When I run it on a non-Jetson GPU, I don’t have any trouble with it.
