cudaToNumpy() defect?

Hi, I’ve noticed this in the past, but I wasn’t sure where it came from and didn’t have much time to dig into it, as I was focused on other areas of the project. Now that I’m back on this part, the problem still persists, and after some digging I believe it has something to do with the cudaToNumpy() function. Apparently cudaToNumpy() works on a buffer that gets corrupted (or rewritten) before we can finish working on the resulting numpy array, even if the first thing I do with that array is make a defensive copy (the b = a.copy() below).

You can reproduce this issue by modifying the detectnet-camera.py in the following manner:

# at the top of detectnet-camera.py, alongside the existing imports:
import time
import cv2

while True:
    img = input.Capture()
    a = jetson.utils.cudaToNumpy(img)  # note: pass the captured image
    b = a.copy()                       # defensive copy before any further processing
    cv2.imwrite('/tmp/{}.jpg'.format(time.monotonic()), b)
    detections = net.Detect(img, overlay=opt.overlay)

I would expect every frame that comes from the input (videoSource) to be saved to disk exactly as it was captured, but the saved images are corrupted/overwritten (you can even see fragments of the overlays drawn in the detection step show up in the numpy array just returned by cudaToNumpy()).

I can also confirm that the cudaImages returned by the videoSource component are themselves not corrupted: they display perfectly in the OpenGL window without any corruption/overwriting, but once converted to numpy they start to show signs of it.

I can also confirm that if I sleep for 1 second after the Capture, the problem no longer occurs. I’m sure a much smaller delay would also work, but I’d rather not go down that path; I think we all agree this should work with no delay at all.

Can anyone confirm/explain this behaviour?

Thank you,
Best regards,
Eduardo

Hi, I’ll leave the thread open, but I think I’ve found the reason/solution.
Apparently there’s a function that tells the code to wait until the GPU has finished whatever it was processing. In this case, the GPU was probably still rendering the overlay (bounding boxes/labels) for the previous frame by the time I tried to access the CUDA memory and copy it. Calling jetson.utils.cudaDeviceSynchronize() just before cudaToNumpy() waits until the ongoing operations on the CUDA memory finish, which does the trick. It achieves what my fixed delays did in the earlier experiments, but in a properly synchronized manner.
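The underlying race is a general asynchronous-work pattern, not something CUDA-specific. As a host-side analogy in plain Python (no CUDA involved; the worker function and names here are made up purely for illustration), a background task that is still writing into a shared buffer must be waited on before the buffer is copied, just as cudaDeviceSynchronize() waits on pending GPU work:

```python
import threading

def overlay_worker(buf, done):
    # Simulates the GPU still drawing overlays into the shared frame buffer.
    for i in range(len(buf)):
        buf[i] = 255
    done.set()

buf = bytearray(1024)          # shared "frame" buffer
done = threading.Event()
t = threading.Thread(target=overlay_worker, args=(buf, done))
t.start()

# Synchronize first (the analogue of jetson.utils.cudaDeviceSynchronize()),
# and only then copy the buffer.
done.wait()                    # wait until the asynchronous work has finished
safe_copy = bytes(buf)         # now every byte reflects the completed work
t.join()

assert all(b == 255 for b in safe_copy)
```

Taking the copy *before* done.wait() would snapshot a half-written buffer, which is exactly the partial-overlay corruption seen in the saved frames.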

Thank you and best regards.
Eduardo

Hi @drakorg, that is correct - after performing asynchronous GPU operations, you should use the cudaDeviceSynchronize() function before attempting to access the data on the CPU.

cudaToNumpy() maps the memory to a numpy array, so the GPU can still change it (and vice versa: changes made to the numpy array will be visible to the GPU).
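In other words, cudaToNumpy() returns a view of shared memory, not an independent copy. The same aliasing behaviour can be sketched with plain numpy (again, no CUDA here; this is just an analogy for the shared-memory semantics): a view tracks later changes to the underlying buffer, while .copy() takes a detached snapshot, and that snapshot is only trustworthy if it is taken after any pending work on the buffer has finished.

```python
import numpy as np

frame = np.zeros((4, 4), dtype=np.uint8)  # stands in for the CUDA image
mapped = frame.view()                     # like cudaToNumpy(): shares the same memory
snapshot = mapped.copy()                  # independent copy, detached from the buffer

frame[:] = 255                            # the buffer is overwritten afterwards
                                          # (like the GPU drawing overlays)

print(mapped[0, 0])    # 255 -> the mapped array sees the change
print(snapshot[0, 0])  # 0   -> the earlier copy is unaffected
```

This is why b = a.copy() alone didn’t help in the original loop: the copy ran while the GPU was still writing, so it snapshotted a buffer in mid-update.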