Consuming an EGLStream from CUDA causes memory bloat

Hi,
I am trying to reply but my message gets blocked by the forum security rules (Incapsula) and my access is then denied for some time.
Will keep trying …

Hi,

If your message is blocked for too much log, you can attach a file instead.
Thanks.

1.
I used a blank array as input instead of the camera and it does not crash, even with a 1s sleep in the loop.

2.
I already did this in #15, there is no crash without the CUDA to OpenGL copy.

3.
This is what I use in acquire_stream_frame (src/cuda/core.cpp). I also already used the cudaHistogram sample as a reference.

(I removed all code samples from this reply, I think it is flagged as code injection attack, it was not big)

Hi,

Sorry that we don’t have enough time to check your source code in detail.
But there are some clues we found can share with you.

Do you apply EGL map to every camera frame?

cuResult = cuGraphicsResourceGetMappedEglFrame(&cudaEGLFrame, cudaResource, 0, 0);

Argus is a non-buffered producer and will present a new frame every time.
Please remember to call ELG map when you acquire a new image.

Thanks.

Yes, here is the pattern I use as shown in #1:

cuEGLStreamConsumerAcquireFrame
    cuGraphicsResourceGetMappedEglFrame
    cuSurfObjectCreate
    // use frame array / surface
    cuSurfObjectDestroy
    cuEGLStreamConsumerReleaseFrame

Hi,

Could you try to use CUDA_RESOURCE_DESC and surf2D[op] rather than cudaArray_t?
Thanks.

Hi,

I replaced the array copy with the following:

cudaResourceDesc rd;
            memset(&rd, 0, sizeof(rd));
            rd.resType = cudaResourceTypeArray;
            rd.res.array.array = array;

            cudaSurfaceObject_t surface;
            if (!CUDA_CHECK(cudaCreateSurfaceObject(&surface, &rd))) {
                return false;
            }

            cuda_surface_copy(intensity, surface, width, height);

            CUDA_CHECK(cudaDestroySurfaceObject(surface));
            return true;

the copy kernel being:

const size_t BDX = 32;
const size_t BDY = 4;

__global__
static void kernel(
    cudaSurfaceObject_t src,
    cudaSurfaceObject_t dst,
    const size_t width_bytes,
    const size_t height
) {
    for (
        size_t y = blockIdx.y * blockDim.y + threadIdx.y;
        y < height;
        y += blockDim.y * gridDim.y
    ) {
        for (
            size_t x = blockIdx.x * blockDim.x + threadIdx.x;
            x < width_bytes;
            x += blockDim.x * gridDim.x
        ) { // TODO: uint4 vectorisation
            uint8_t data;
            surf2Dread(&data, src, x * sizeof(uint8_t), y);
            surf2Dwrite(data, dst, x * sizeof(uint8_t), y);
        }
    }
}

void cuda_surface_copy(
    cudaSurfaceObject_t src,
    cudaSurfaceObject_t dst,
    const size_t width_bytes,
    const size_t height
) {
    dim3 block_dim(BDX, BDY);
    dim3 grid_dim(udiv_ceil((unsigned int)(width_bytes), block_dim.x),
                  udiv_ceil((unsigned int)(height), block_dim.y));

    kernel<<<grid_dim, block_dim>>>(src, dst, width_bytes, height);

    must(CUDA_CHECK(cudaGetLastError()));
}

It still crashes with the memory bloat error after a few frames.

Hi,

Suppose this is almost identical to the sample cudaHistogram.
Could you check if you can run cudaHistogram on your environment?

Thanks.

Hi,

The cudaHistogram sample runs fine. It also does not show the initial warnings.
The main differences with my code that I can see are:

  • It uses argus multi-process version while I use the single-process one
  • It submits and waits for a bunch of individual capture requests instead of using a repeat request to fill the pipeline in a running loop
  • It does not do any display, and in my case there is no crash without the CUDA to OpenGL copy.

Hi,

Could you help to do another experiment?

Please declare another GL buffer and copy image data into it instead.

...
cuda::GLTarget gl_target;
cuda::GLTarget gl_target_tmp;
...
        if (!cuda::map_gl_target(&gl_target_tmp, [&](auto array) {
...
graphics::render_texture(&render, gl_target.texture, &graphics);

This can help us figure out the issue is from slow Argus->OpenGL or slow OpenGL->rendering.
Thanks.

Hi,
Declaring another GL buffer as you described still crashes.
It does seem to take more time to crash and to only trigger the mutex error and not the memory bloat error.

Hi,

Could you remove the update of display and check it again?

--- graphics::render_texture(&render, gl_target.texture, &graphics);

Thanks

Hi,
With a single GL target and no texture rendering it still crashes with the mutex error.

Hi

In experiment in comment #27, could you try with only read op?

surf2Dread(&data, src, x * sizeof(uint8_t), y);
//            surf2Dwrite(data, dst, x * sizeof(uint8_t), y);

Thanks.

Hi,
I am a bit confused by our workflow here. I gave you a sample code to reproduce the issue, and you successfully did. But then you asked me some trivial questions that are quicker to check by glancing at the code than by asking me:

Now you ask me to apply trivial modifications (literaly commenting a single line) and to report the result when you could very quickly test it on you side by modifying my code sample. I don’t mean to be rude and I understand that you have other things to do but why are we proceeding like this ? It seems like a waste of time for both of us. Anyway, thank you for your help and I will keep reporting on trivial modifications if there is no alternative.

Hi,

We are sorry that our replies cannot meet your expectation.
We prefer to give user suggestions rather than debugging for them, especially for an customized issue.

This error occurs in a custom application and cannot be reproduced in our official sample.
As a result, it’s more likely something incorrect in user implementation.

We are sorry that we cannot figure out the error immediately due to the complicated pipeline.
But we try to give some suggestion to help you debug.

For your issue, we still think there is an illegal access in your application.
Maybe you can find some useful information in this slides.

If you are convinced that there is an issue in our camera or EGL driver.
Please reproduce this issue with our official sample.

Thanks.

Hi,
I believe that the issue is most likely coming from either some misdocumented behavior that would make my accesses illegal or the blackbox implementation. I understand this is a complicated issue and this is precisely why I am asking for your help, I was simply expecting you to be more implicated in the debugging and to experiment on your side as well.

I think my code sample is quite minimalist, even if it is obviously not perfect and made from scratch. If you insist and if it would implicate you in the debugging, I will try and reproduce the issue with your official sample if I find the time to do so. How exactly should I proceed for you to be satisfied with my sample ?

Hi,

Would you please check your program with cuda-memcheck or nvprof?

Memoy bloat may be caused by non-release image buffer.
Since Argus create new buffer each frame, not successful release may cause memory bloat.
(If yes, we still need to check why the buffer is not released.)

Could you help to confirm this?

Thanks.

Hi,

Running memcheck doesn’t display any additional information and ends with:

========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

If I close the window before it crashes:

========= ERROR SUMMARY: 0 errors

Running nvprof ends with:

==3540== Profiling application: ./build/bin/interop
==3540== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   57.51%  102.24us        94  1.0870us     576ns  22.688us  cuDeviceGetAttribute
                   28.19%  50.112us         1  50.112us  50.112us  50.112us  cudaSetDevice
                    6.80%  12.096us         1  12.096us  12.096us  12.096us  cuDeviceTotalMem
                    4.39%  7.8080us         3  2.6020us     992ns  4.9280us  cuDeviceGetCount
                    1.75%  3.1040us         2  1.5520us  1.3760us  1.7280us  cuDeviceGet
                    1.37%  2.4320us         1  2.4320us  2.4320us  2.4320us  cuDeviceGetName
======== Error: Application received signal 134

Hi,

As comment #24 said, please apply following procedure for each frame:

while(...){
  // Argus
  cuEGLStreamConsumerAcquireFrame -> Map EGL frame
  // Display
  Map EGL -> memory copy -> Unmap EGL
  // Argus
  cuEGLStreamConsumerReleaseFrame
}

Here is a sample for your reference:
https://github.com/dusty-nv/jetson-inference/blob/master/imagenet-camera/imagenet-camera.cpp#L183

Thanks.