Consuming an EGLStream from CUDA causes memory bloat

basta.t.k · June 11, 2018, 3:16pm

Hi,
I am trying to reply but my message gets blocked by the forum security rules (Incapsula) and my access is then denied for some time.
Will keep trying …

AastaLLL · June 12, 2018, 2:11am

Hi,

If your message is blocked because it contains too much log output, you can attach a file instead.
Thanks.

basta.t.k · June 12, 2018, 8:50am

1.
I used a blank array as input instead of the camera and it does not crash, even with a 1s sleep in the loop.

2.
I already did this in #15, there is no crash without the CUDA to OpenGL copy.

3.
This is what I use in acquire_stream_frame (src/cuda/core.cpp). I also already used the cudaHistogram sample as a reference.

(I removed all code samples from this reply; I think they were flagged as a code injection attack, even though they were not big.)

AastaLLL · June 15, 2018, 7:23am

Hi,

Sorry that we don’t have enough time to check your source code in detail.
But we found some clues that we can share with you.

Do you apply EGL map to every camera frame?

cuResult = cuGraphicsResourceGetMappedEglFrame(&cudaEGLFrame, cudaResource, 0, 0);

Argus is a non-buffered producer and will present a new frame every time.
Please remember to call EGL map when you acquire a new image.

Thanks.

basta.t.k · June 15, 2018, 12:44pm

Yes, here is the pattern I use as shown in #1:

cuEGLStreamConsumerAcquireFrame
    cuGraphicsResourceGetMappedEglFrame
    cuSurfObjectCreate
    // use frame array / surface
    cuSurfObjectDestroy
    cuEGLStreamConsumerReleaseFrame
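
One thing the pattern above depends on is that the release runs on every path, including early exits between acquire and release (for example, a failed surface creation). The following is a minimal host-only sketch of a scope guard that enforces that pairing. `FakeStream`, `process_frame` and `make_guard` are hypothetical names; `acquire()` and `release()` only stand in for `cuEGLStreamConsumerAcquireFrame` and `cuEGLStreamConsumerReleaseFrame` and do not call the CUDA driver. Nothing in this thread establishes that a skipped release is the cause of the crash.

```cpp
#include <cassert>
#include <utility>

// Generic scope guard: runs the stored callable when it goes out of scope.
template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F f) : f_(std::move(f)), active_(true) {}
    ScopeGuard(ScopeGuard&& other)
        : f_(std::move(other.f_)), active_(other.active_) {
        other.active_ = false;  // the moved-from guard must not fire
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
    ~ScopeGuard() { if (active_) f_(); }

private:
    F f_;
    bool active_;
};

template <typename F>
ScopeGuard<F> make_guard(F f) { return ScopeGuard<F>(std::move(f)); }

// Hypothetical stand-in for the EGLStream consumer connection.
struct FakeStream {
    int acquired = 0;
    int released = 0;
    bool acquire() { ++acquired; return true; }  // stands in for cuEGLStreamConsumerAcquireFrame
    void release() { ++released; }               // stands in for cuEGLStreamConsumerReleaseFrame
};

// Processes one frame. `surface_ok` simulates whether surface creation succeeds.
inline bool process_frame(FakeStream& s, bool surface_ok) {
    if (!s.acquire()) return false;
    auto release_guard = make_guard([&s] { s.release(); });
    (void)release_guard;            // only its destructor matters
    if (!surface_ok) return false;  // early exit: the guard still releases
    // ... map the frame, create the surface, use it, destroy the surface ...
    return true;
}
```

With this shape, calling `process_frame` once with a successful surface and once with a failed one leaves the acquire and release counts equal.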

AastaLLL · June 19, 2018, 9:02am

Hi,

Could you try to use CUDA_RESOURCE_DESC and surf2D[op] rather than cudaArray_t?
Thanks.

basta.t.k · June 21, 2018, 9:54am

Hi,

I replaced the array copy with the following:

cudaResourceDesc rd;
memset(&rd, 0, sizeof(rd));
rd.resType = cudaResourceTypeArray;
rd.res.array.array = array;

cudaSurfaceObject_t surface;
if (!CUDA_CHECK(cudaCreateSurfaceObject(&surface, &rd))) {
    return false;
}

cuda_surface_copy(intensity, surface, width, height);

CUDA_CHECK(cudaDestroySurfaceObject(surface));
return true;

the copy kernel being:

const size_t BDX = 32;
const size_t BDY = 4;

__global__
static void kernel(
    cudaSurfaceObject_t src,
    cudaSurfaceObject_t dst,
    const size_t width_bytes,
    const size_t height
) {
    for (
        size_t y = blockIdx.y * blockDim.y + threadIdx.y;
        y < height;
        y += blockDim.y * gridDim.y
    ) {
        for (
            size_t x = blockIdx.x * blockDim.x + threadIdx.x;
            x < width_bytes;
            x += blockDim.x * gridDim.x
        ) { // TODO: uint4 vectorisation
            uint8_t data;
            surf2Dread(&data, src, x * sizeof(uint8_t), y);
            surf2Dwrite(data, dst, x * sizeof(uint8_t), y);
        }
    }
}

void cuda_surface_copy(
    cudaSurfaceObject_t src,
    cudaSurfaceObject_t dst,
    const size_t width_bytes,
    const size_t height
) {
    dim3 block_dim(BDX, BDY);
    dim3 grid_dim(udiv_ceil((unsigned int)(width_bytes), block_dim.x),
                  udiv_ceil((unsigned int)(height), block_dim.y));

    kernel<<<grid_dim, block_dim>>>(src, dst, width_bytes, height);

    must(CUDA_CHECK(cudaGetLastError()));
}

It still crashes with the memory bloat error after a few frames.
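
The `udiv_ceil` helper called in `cuda_surface_copy` is not shown in the thread. Here is a minimal host-only sketch, assuming it is a plain unsigned ceiling division. `GridSize` and `grid_for` are hypothetical names that stand in for `dim3` so the sketch compiles without CUDA headers.

```cpp
#include <cassert>
#include <cstddef>

// Assumed behaviour of the thread's udiv_ceil: smallest q with q * d >= n.
// Written as quotient plus remainder check so n + d - 1 cannot overflow.
constexpr unsigned int udiv_ceil(unsigned int n, unsigned int d) {
    return n / d + (n % d != 0 ? 1u : 0u);
}

// Hypothetical stand-in for the x/y fields of dim3.
struct GridSize {
    unsigned int x;
    unsigned int y;
};

// Number of blocks needed so that a bdx-by-bdy block layout covers
// width_bytes columns and height rows with one thread per byte.
inline GridSize grid_for(std::size_t width_bytes, std::size_t height,
                         unsigned int bdx, unsigned int bdy) {
    assert(bdx != 0 && bdy != 0);
    return GridSize{udiv_ceil(static_cast<unsigned int>(width_bytes), bdx),
                    udiv_ceil(static_cast<unsigned int>(height), bdy)};
}
```

For a 1920-byte-wide, 1080-row image with the 32x4 block above, `grid_for(1920, 1080, 32, 4)` gives a 60x270 grid. Because the kernel uses grid-stride loops, a smaller grid would still cover the whole image; the ceiling division only sizes the grid for one thread per byte.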

AastaLLL · June 25, 2018, 8:10am

Hi,

I suppose this is almost identical to the cudaHistogram sample.
Could you check whether you can run cudaHistogram in your environment?

Thanks.

basta.t.k · June 25, 2018, 8:41am

Hi,

The cudaHistogram sample runs fine. It also does not show the initial warnings.
The main differences with my code that I can see are:

It uses the multi-process version of Argus while I use the single-process one
It submits and waits for a bunch of individual capture requests instead of using a repeat request to fill the pipeline in a running loop
It does not do any display, and in my case there is no crash without the CUDA to OpenGL copy.

AastaLLL · June 29, 2018, 6:16am

Hi,

Could you help with another experiment?

Please declare another GL buffer and copy image data into it instead.

...
cuda::GLTarget gl_target;
cuda::GLTarget gl_target_tmp;
...
        if (!cuda::map_gl_target(&gl_target_tmp, [&](auto array) {
...
graphics::render_texture(&render, gl_target.texture, &graphics);

This can help us figure out whether the issue comes from slow Argus->OpenGL or slow OpenGL->rendering.
Thanks.

basta.t.k · June 29, 2018, 8:57am

Hi,
Declaring another GL buffer as you described still crashes.
It does seem to take more time to crash and to only trigger the mutex error and not the memory bloat error.

AastaLLL · July 3, 2018, 6:11am

Hi,

Could you remove the update of display and check it again?

--- graphics::render_texture(&render, gl_target.texture, &graphics);

Thanks

basta.t.k · July 10, 2018, 9:28am

Hi,
With a single GL target and no texture rendering it still crashes with the mutex error.

AastaLLL · July 13, 2018, 7:21am

Hi

In the experiment from comment #27, could you try with only the read op?

surf2Dread(&data, src, x * sizeof(uint8_t), y);
//            surf2Dwrite(data, dst, x * sizeof(uint8_t), y);

Thanks.

basta.t.k · July 13, 2018, 9:00am

Hi,
I am a bit confused by our workflow here. I gave you a sample code to reproduce the issue, and you successfully did. But then you asked me some trivial questions that are quicker to check by glancing at the code than by asking me:

Now you ask me to apply trivial modifications (literally commenting out a single line) and to report the result when you could very quickly test it on your side by modifying my code sample. I don’t mean to be rude and I understand that you have other things to do, but why are we proceeding like this? It seems like a waste of time for both of us. Anyway, thank you for your help and I will keep reporting on trivial modifications if there is no alternative.

AastaLLL · July 17, 2018, 7:24am

Hi,

We are sorry that our replies did not meet your expectations.
We prefer to give users suggestions rather than debug for them, especially for a customized issue.

This error occurs in a custom application and cannot be reproduced in our official sample.
As a result, it is more likely that something is incorrect in the user implementation.

We are sorry that we cannot figure out the error immediately due to the complicated pipeline.
But we are trying to give some suggestions to help you debug.

For your issue, we still think there is an illegal access in your application.
Maybe you can find some useful information in these slides.

If you are convinced that there is an issue in our camera or EGL driver,
please reproduce this issue with our official sample.

Thanks.

basta.t.k · July 18, 2018, 3:45pm

Hi,
I believe that the issue is most likely coming from either some misdocumented behavior that would make my accesses illegal or the black-box implementation. I understand this is a complicated issue, and this is precisely why I am asking for your help. I was simply expecting you to be more involved in the debugging and to experiment on your side as well.

I think my code sample is quite minimal, even if it is obviously not perfect and was written from scratch. If you insist, and if it would get you involved in the debugging, I will try to reproduce the issue with your official sample if I find the time to do so. How exactly should I proceed for you to be satisfied with my sample?

AastaLLL · July 23, 2018, 10:28am

Hi,

Would you please check your program with cuda-memcheck or nvprof?

Memory bloat may be caused by image buffers that are not released.
Since Argus creates a new buffer for each frame, a failed release may cause memory bloat.
(If yes, we still need to check why the buffer is not released.)

Could you help to confirm this?

Thanks.

basta.t.k · July 23, 2018, 10:49am

Hi,

Running memcheck doesn’t display any additional information and ends with:

========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

If I close the window before it crashes:

========= ERROR SUMMARY: 0 errors

Running nvprof ends with:

==3540== Profiling application: ./build/bin/interop
==3540== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   57.51%  102.24us        94  1.0870us     576ns  22.688us  cuDeviceGetAttribute
                   28.19%  50.112us         1  50.112us  50.112us  50.112us  cudaSetDevice
                    6.80%  12.096us         1  12.096us  12.096us  12.096us  cuDeviceTotalMem
                    4.39%  7.8080us         3  2.6020us     992ns  4.9280us  cuDeviceGetCount
                    1.75%  3.1040us         2  1.5520us  1.3760us  1.7280us  cuDeviceGet
                    1.37%  2.4320us         1  2.4320us  2.4320us  2.4320us  cuDeviceGetName
======== Error: Application received signal 134
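
A note on reading the last line: the value 134 is larger than any Linux signal number, so it is most likely a shell-style status of 128 + N. Under that assumption N is 6, which is SIGABRT on Linux, and that would be consistent with the process calling abort() rather than hitting a segmentation fault. The following is a minimal sketch of that decoding; `signal_from_status` and `signal_name` are hypothetical helper names, not part of nvprof or CUDA.

```cpp
#include <cassert>
#include <csignal>
#include <string>

// Extracts N from a shell-style status of the form 128 + N.
// Returns 0 when the status is not in that range.
inline int signal_from_status(int status) {
    return (status > 128 && status <= 128 + 64) ? status - 128 : 0;
}

// Names a few standard C signals; anything else is reported as "unknown".
inline const char* signal_name(int signo) {
    switch (signo) {
        case SIGABRT: return "SIGABRT";
        case SIGSEGV: return "SIGSEGV";
        case SIGFPE:  return "SIGFPE";
        case SIGILL:  return "SIGILL";
        default:      return "unknown";
    }
}
```

On Linux, `signal_name(signal_from_status(134))` yields "SIGABRT", and the same decoding maps 139 to SIGSEGV.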

AastaLLL · August 2, 2018, 9:52am

Hi,

As comment #24 said, please apply the following procedure for each frame:

while(...){
  // Argus
  cuEGLStreamConsumerAcquireFrame -> Map EGL frame
  // Display
  Map EGL -> memory copy -> Unmap EGL
  // Argus
  cuEGLStreamConsumerReleaseFrame
}

Here is a sample for your reference:
https://github.com/dusty-nv/jetson-inference/blob/master/imagenet-camera/imagenet-camera.cpp#L183

Thanks.
