Hi,
I am trying to reply but my message gets blocked by the forum security rules (Incapsula) and my access is then denied for some time.
Will keep trying …
Hi,
If your message is blocked for too much log, you can attach a file instead.
Thanks.
1.
I used a blank array as input instead of the camera and it does not crash, even with a 1s sleep in the loop.
2.
I already did this in #15, there is no crash without the CUDA to OpenGL copy.
3.
This is what I use in acquire_stream_frame (src/cuda/core.cpp). I also already used the cudaHistogram sample as a reference.
(I removed all code samples from this reply, I think it is flagged as code injection attack, it was not big)
Hi,
Sorry that we don’t have enough time to check your source code in detail.
But there are some clues we found can share with you.
Do you apply EGL map to every camera frame?
cuResult = cuGraphicsResourceGetMappedEglFrame(&cudaEGLFrame, cudaResource, 0, 0);
Argus is a non-buffered producer and will present a new frame every time.
Please remember to call ELG map when you acquire a new image.
Thanks.
Yes, here is the pattern I use as shown in #1:
cuEGLStreamConsumerAcquireFrame
cuGraphicsResourceGetMappedEglFrame
cuSurfObjectCreate
// use frame array / surface
cuSurfObjectDestroy
cuEGLStreamConsumerReleaseFrame
Hi,
Could you try to use CUDA_RESOURCE_DESC and surf2D[op] rather than cudaArray_t?
Thanks.
Hi,
I replaced the array copy with the following:
cudaResourceDesc rd;
memset(&rd, 0, sizeof(rd));
rd.resType = cudaResourceTypeArray;
rd.res.array.array = array;
cudaSurfaceObject_t surface;
if (!CUDA_CHECK(cudaCreateSurfaceObject(&surface, &rd))) {
return false;
}
cuda_surface_copy(intensity, surface, width, height);
CUDA_CHECK(cudaDestroySurfaceObject(surface));
return true;
the copy kernel being:
const size_t BDX = 32;
const size_t BDY = 4;
__global__
static void kernel(
cudaSurfaceObject_t src,
cudaSurfaceObject_t dst,
const size_t width_bytes,
const size_t height
) {
for (
size_t y = blockIdx.y * blockDim.y + threadIdx.y;
y < height;
y += blockDim.y * gridDim.y
) {
for (
size_t x = blockIdx.x * blockDim.x + threadIdx.x;
x < width_bytes;
x += blockDim.x * gridDim.x
) { // TODO: uint4 vectorisation
uint8_t data;
surf2Dread(&data, src, x * sizeof(uint8_t), y);
surf2Dwrite(data, dst, x * sizeof(uint8_t), y);
}
}
}
void cuda_surface_copy(
cudaSurfaceObject_t src,
cudaSurfaceObject_t dst,
const size_t width_bytes,
const size_t height
) {
dim3 block_dim(BDX, BDY);
dim3 grid_dim(udiv_ceil((unsigned int)(width_bytes), block_dim.x),
udiv_ceil((unsigned int)(height), block_dim.y));
kernel<<<grid_dim, block_dim>>>(src, dst, width_bytes, height);
must(CUDA_CHECK(cudaGetLastError()));
}
It still crashes with the memory bloat error after a few frames.
Hi,
Suppose this is almost identical to the sample cudaHistogram.
Could you check if you can run cudaHistogram on your environment?
Thanks.
Hi,
The cudaHistogram sample runs fine. It also does not show the initial warnings.
The main differences with my code that I can see are:
- It uses argus multi-process version while I use the single-process one
- It submits and waits for a bunch of individual capture requests instead of using a repeat request to fill the pipeline in a running loop
- It does not do any display, and in my case there is no crash without the CUDA to OpenGL copy.
Hi,
Could you help to do another experiment?
Please declare another GL buffer and copy image data into it instead.
...
cuda::GLTarget gl_target;
cuda::GLTarget gl_target_tmp;
...
if (!cuda::map_gl_target(&gl_target_tmp, [&](auto array) {
...
graphics::render_texture(&render, gl_target.texture, &graphics);
This can help us figure out the issue is from slow Argus->OpenGL or slow OpenGL->rendering.
Thanks.
Hi,
Declaring another GL buffer as you described still crashes.
It does seem to take more time to crash and to only trigger the mutex error and not the memory bloat error.
Hi,
Could you remove the update of display and check it again?
--- graphics::render_texture(&render, gl_target.texture, &graphics);
Thanks
Hi,
With a single GL target and no texture rendering it still crashes with the mutex error.
Hi
In experiment in comment #27, could you try with only read op?
surf2Dread(&data, src, x * sizeof(uint8_t), y);
// surf2Dwrite(data, dst, x * sizeof(uint8_t), y);
Thanks.
Hi,
I am a bit confused by our workflow here. I gave you a sample code to reproduce the issue, and you successfully did. But then you asked me some trivial questions that are quicker to check by glancing at the code than by asking me:
Now you ask me to apply trivial modifications (literaly commenting a single line) and to report the result when you could very quickly test it on you side by modifying my code sample. I don’t mean to be rude and I understand that you have other things to do but why are we proceeding like this ? It seems like a waste of time for both of us. Anyway, thank you for your help and I will keep reporting on trivial modifications if there is no alternative.
Hi,
We are sorry that our replies cannot meet your expectation.
We prefer to give user suggestions rather than debugging for them, especially for an customized issue.
This error occurs in a custom application and cannot be reproduced in our official sample.
As a result, it’s more likely something incorrect in user implementation.
We are sorry that we cannot figure out the error immediately due to the complicated pipeline.
But we try to give some suggestion to help you debug.
For your issue, we still think there is an illegal access in your application.
Maybe you can find some useful information in this slides.
If you are convinced that there is an issue in our camera or EGL driver.
Please reproduce this issue with our official sample.
Thanks.
Hi,
I believe that the issue is most likely coming from either some misdocumented behavior that would make my accesses illegal or the blackbox implementation. I understand this is a complicated issue and this is precisely why I am asking for your help, I was simply expecting you to be more implicated in the debugging and to experiment on your side as well.
I think my code sample is quite minimalist, even if it is obviously not perfect and made from scratch. If you insist and if it would implicate you in the debugging, I will try and reproduce the issue with your official sample if I find the time to do so. How exactly should I proceed for you to be satisfied with my sample ?
Hi,
Would you please check your program with cuda-memcheck or nvprof?
Memoy bloat may be caused by non-release image buffer.
Since Argus create new buffer each frame, not successful release may cause memory bloat.
(If yes, we still need to check why the buffer is not released.)
Could you help to confirm this?
Thanks.
Hi,
Running memcheck doesn’t display any additional information and ends with:
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found
If I close the window before it crashes:
========= ERROR SUMMARY: 0 errors
Running nvprof ends with:
==3540== Profiling application: ./build/bin/interop
==3540== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 57.51% 102.24us 94 1.0870us 576ns 22.688us cuDeviceGetAttribute
28.19% 50.112us 1 50.112us 50.112us 50.112us cudaSetDevice
6.80% 12.096us 1 12.096us 12.096us 12.096us cuDeviceTotalMem
4.39% 7.8080us 3 2.6020us 992ns 4.9280us cuDeviceGetCount
1.75% 3.1040us 2 1.5520us 1.3760us 1.7280us cuDeviceGet
1.37% 2.4320us 1 2.4320us 2.4320us 2.4320us cuDeviceGetName
======== Error: Application received signal 134
Hi,
As comment #24 said, please apply following procedure for each frame:
while(...){
// Argus
cuEGLStreamConsumerAcquireFrame -> Map EGL frame
// Display
Map EGL -> memory copy -> Unmap EGL
// Argus
cuEGLStreamConsumerReleaseFrame
}
Here is a sample for your reference:
https://github.com/dusty-nv/jetson-inference/blob/master/imagenet-camera/imagenet-camera.cpp#L183
Thanks.