Creation and Cleanup of CUcontext

I’m developing a vision application with VPI on the Jetson platform. Utilisation is relatively low, so I should theoretically be able to handle multiple streams. However, when I multithread (with an additional VPI context per thread), the threads frequently end up waiting on each other for CUDA calls (malloc, copy, stream sync, etc.), resulting in approximately the same throughput. There’s nothing I can do to reduce these calls, as they all happen on VPI’s side (vpiImageCreateView or vpiImageLockData, even though everything is exclusively flagged as the CUDA backend).

Therefore, to prevent all this waiting, I’ve tried giving each thread its own CUDA context, which significantly increases throughput. I presume I should clean this up when the thread finishes running, but when I call cuCtxDestroy I either get a segfault or cudaErrorContextIsDestroyed/cudaErrorDeviceUninitialized. I ensure the vision engine is destroyed before the context is destroyed (it’s in an inner scope), so I’m 99% sure there’s no CUDA-allocated memory left dangling around, unless VPI has some leftovers (vpiContextDestroy is the last thing I call in my engine’s destructor).

I don’t get any errors when omitting cuCtxDestroy, but I assume I’ve just leaked a context, so things will go wrong if I create and destroy many of these engines. An example of the structure is shown below.

void thread()
{
    CUcontext ctx;
    CHECK_CU_STATUS(cuCtxCreate(&ctx, CU_CTX_MAP_HOST, 0));
    {
        Engine engine{};
        // ... engine does its work, destroyed at end of this inner scope ...
    }
    CHECK_CU_STATUS(cuCtxDestroy(ctx));  // segfault or error here
}

Any ideas?

Questions that relate to one of NVIDIA’s integrated embedded platforms usually receive faster and/or better answers in the sub-forums dedicated to those platforms. I would therefore suggest starting here: Jetson & Embedded Systems - NVIDIA Developer Forums

I have never used CUDA’s driver-level API, and I don’t know what VPI is, so I won’t even speculate on an answer.

Sorry, no ideas. I find that people who provide a short, complete test case are more likely to get useful feedback. The six or so lines of code you provided, which I cannot compile, don’t shed any light on it for me.

The profilers can provide API tracing functionality. If I thought I owned a context, meaning I had created it, and then got an error saying it was already destroyed when I went to destroy it, that would seem weird. At that point I might use profiler API tracing to count context-destroy calls and see in which parts of my code they occur.
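As a sketch of that approach, Nsight Systems can trace CUDA driver-API calls and summarize them per call name; something along these lines would show how many times cuCtxDestroy is actually invoked (the application name here is a placeholder, and the exact stats report name can vary between nsys versions):

```shell
# Trace CUDA runtime/driver API calls made by the application
nsys profile --trace=cuda -o ctxtrace ./my_app

# Summarize API calls by name; look at the cuCtxDestroy count
nsys stats --report cuda_api_sum ctxtrace.nsys-rep
```

If the destroy count is higher than the number of contexts you created yourself, something else (a library, or a double free) is destroying contexts behind your back.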

But I really have no idea what is happening in your case.

Good luck!

Thanks, I guess I could make a minimal working example; I just wanted to demonstrate the structure of the context creation and deletion I’m trying, in case there was anything obviously incorrect in the usage. I’ll make a minimal example and see if I get a reproducible error (rather than the bucketload of proprietary code I can’t share); maybe I’m missing a cudaFree somewhere for all I know (I’ve quadruple-checked resource creation and deletion, so maybe VPI is missing a cudaFree…).

Ideally there also wouldn’t be so many cudaMalloc/cudaFree calls inside vpiImageCreateView(); that’s somehow the bottleneck of the code, but that’s just me complaining haha (I guess it’s allocating GPU space to store the addresses and strides of the view?).

engine.hpp (1.2 KB)
engine.cpp (691 Bytes)
main.cpp (111 Bytes)
CMakeLists.txt (513 Bytes)

Minimal working example is attached, which has the following output for me:

Hello, world from main!
Starting Engine
Thread 0 starting
Stopping Engine
Thread 1 starting
Worker 0 Created
Worker 0 Destroyed
Worker 1 Created
Worker 1 Destroyed
Thread 0 ending
Thread 1 ending
Engine Stopped
[WARN ] 2023-01-03 15:50:45 (cudaErrorInvalidResourceHandle)
[ERROR] 2023-01-03 15:50:45 Error destroying cuda device: 0��ڪ�
[WARN ] 2023-01-03 15:50:45 (cudaErrorContextIsDestroyed)
[WARN ] 2023-01-03 15:50:45 (cudaErrorContextIsDestroyed)
[WARN ] 2023-01-03 15:50:45 (cudaErrorContextIsDestroyed)
[WARN ] 2023-01-03 15:50:45 (cudaErrorContextIsDestroyed)
..... and more

In my main code, if I use the CUDA context I get segfaults when trying to recreate a worker class for the second time in one thread; after removing it there are no issues. I tried to use compute-sanitizer, but it errors out before it even tries to run my code (and then reports no issues after running it). This is a barebones install of the latest JetPack on a Xavier NX, and I don’t know what else I’m meant to do to run compute-sanitizer correctly (other than compute-sanitizer --flags exe):

========= Internal Sanitizer Error: Failed to initialize mobile debugger interface. Please check that /dev NVIDIA nodes have the correct permissions
========= Internal Sanitizer Error: Device not supported. Please refer to the "Supported Devices" section of the sanitizer documentation

see here

I don’t have a Jetson device to run on, nor a system set up with VPI, so it may be a while before I look at this. You may get a better response by asking on the relevant Jetson forum. If I had to guess, VPI may have some default handling of CUDA contexts; for example, if a context is already created and current to the calling thread, VPI’s context create may just use that one rather than creating a new one (that is just a guess). If that were the case, it would explain the behavior. However, I didn’t find any documentation to that effect.
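One way to probe that guess on a Jetson with VPI installed would be to check, with cuCtxGetCurrent, whether the thread’s current CUcontext changes across the vpiContextCreate call. This is a compile-only diagnostic sketch (it needs CUDA hardware and the VPI SDK, and error checking is omitted for brevity):

```cpp
#include <cstdio>
#include <cuda.h>
#include <vpi/Context.h>

// Does vpiContextCreate() adopt the CUcontext that is current on the
// calling thread, or does it install a different one?
int main() {
    cuInit(0);

    CUcontext cuCtx = nullptr, before = nullptr, after = nullptr;
    cuCtxCreate(&cuCtx, CU_CTX_MAP_HOST, 0);  // becomes current on this thread
    cuCtxGetCurrent(&before);

    VPIContext vpiCtx = nullptr;
    vpiContextCreate(VPI_BACKEND_CUDA, &vpiCtx);
    cuCtxGetCurrent(&after);

    std::printf("CUcontext before VPI create: %p, after: %p (%s)\n",
                (void *)before, (void *)after,
                before == after ? "unchanged" : "changed");

    vpiContextDestroy(vpiCtx);
    cuCtxDestroy(cuCtx);
    return 0;
}
```

If the pointer is unchanged, VPI is at least leaving the thread’s current context alone at creation time, which narrows the problem down to what happens at destroy time.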

FWIW, after a quick perusal of VPI sample codes and examples, I didn’t find any that did this:

  • cuda context create
  • VPI context create
  • VPI context destroy
  • cuda context destroy
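The nesting listed above can be illustrated with plain RAII scope guards, with no CUDA or VPI calls at all; the point is that guards fire in reverse declaration order, so the inner (VPI) context is always torn down before the outer (CUDA) context that owns it. The ScopeGuard type and the event names here are purely illustrative:

```cpp
#include <functional>
#include <string>
#include <vector>

// Minimal scope guard: runs a cleanup action when it goes out of scope.
struct ScopeGuard {
    std::function<void()> onExit;
    ~ScopeGuard() { onExit(); }
};

// Records the order of (mock) create/destroy events for one thread body.
std::vector<std::string> threadBody() {
    std::vector<std::string> events;
    {
        events.push_back("cuCtxCreate");
        ScopeGuard cudaCtx{[&] { events.push_back("cuCtxDestroy"); }};

        events.push_back("vpiContextCreate");
        ScopeGuard vpiCtx{[&] { events.push_back("vpiContextDestroy"); }};

        events.push_back("work");
    }  // guards fire here, innermost first
    return events;
}
```

Declared this way, it is impossible to accidentally destroy the CUDA context while the VPI context (or anything it allocated) is still alive.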

I found that without the additional CUDA context, the threads would still wait on each other’s cudaFree/cudaMalloc calls when analysing with Nsight Systems. With a CUDA context per thread, they were no longer blocked by each other and benchmark time decreased by 30%.

When poking around to see what triggers these errors: vpiContextCreate and cudaMalloc/cudaFree in my own code run fine, but if I add any other API call such as vpiImageCreate/vpiStreamCreate/vpiPayloadCreate (and the respective destroy calls), I get these errors.

I wasn’t sure if I was using the driver API incorrectly or something; I haven’t done so before, and I couldn’t really find any examples online, which is why I posted here. If it’s a VPI problem, I can repost in the Jetson forums.

Solved here: an extra call to vpiContextSetCurrent is required to bind the VPI context to the thread. The overall structure is as follows:

CUcontext cuCtx;
cuCtxCreate(&cuCtx, CU_CTX_MAP_HOST, 0);

VPIContext vpiCtx;
vpiContextCreate(VPI_BACKEND_CUDA, &vpiCtx);
vpiContextSetCurrent(vpiCtx);   // bind the VPI context to this thread

... do work ...

vpiContextDestroy(vpiCtx);      // destroy VPI context first
cuCtxDestroy(cuCtx);            // then the CUDA context
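Assembled into a per-thread worker, the whole pattern might look like the compile-only sketch below (it needs CUDA hardware and the VPI SDK; status checking is elided, but every call here returns a status that real code should verify):

```cpp
#include <thread>
#include <vector>
#include <cuda.h>
#include <vpi/Context.h>

// One CUDA context and one VPI context per worker thread,
// torn down in reverse creation order.
static void worker() {
    CUcontext cuCtx;
    cuCtxCreate(&cuCtx, CU_CTX_MAP_HOST, 0);      // current on this thread

    VPIContext vpiCtx;
    vpiContextCreate(VPI_BACKEND_CUDA, &vpiCtx);
    vpiContextSetCurrent(vpiCtx);                 // bind VPI ctx to thread

    // ... per-thread VPI work (streams, images, payloads) ...

    vpiContextDestroy(vpiCtx);                    // before the CUDA ctx
    cuCtxDestroy(cuCtx);
}

int main() {
    cuInit(0);
    std::vector<std::thread> pool;
    for (int i = 0; i < 2; ++i)
        pool.emplace_back(worker);
    for (auto &t : pool)
        t.join();
    return 0;
}
```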


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.