VPI Problems demonstrated with CUDA Context

Continuation of this thread, where I was trying to use a separate CUDA context for each thread in order to prevent blocking during the CUDA mallocs/frees that occur inside calls such as vpiImageCreateView. Creating an extra CUDA context for each thread/engine significantly improves my performance. If I comment out all the parts of my own code that create/destroy VPI resources (but keep the memory I allocate myself with cudaMalloc/cudaHostAlloc), my engine starts and stops fine; as soon as I start adding some VPI resources back, it gets unhappy.

I’ve created a minimal application that just spins up a set of threads, each of which creates and destroys an engine object; the engine only creates a context and a stream and cleans them up on destruction, nothing else. This produces the kind of errors that I get in my actual engine and roughly follows its structure. I’m currently using the latest software stack (JetPack 5.0.2-b231, libnvvpi 2.1.6).

engine.cpp (691 Bytes)
engine.hpp (1.2 KB)
main.cpp (111 Bytes)
CMakeLists.txt (513 Bytes)
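
In case the attachments aren't accessible, here is a rough, condensed sketch of what they contain, reconstructed from the snippets quoted later in this thread (the CHECK_CU_STATUS macro, thread count, and error handling are approximations, not the exact attached files):

#include <cstdio>
#include <thread>
#include <vector>

#include <cuda.h>
#include <fmt/core.h>
#include <vpi/Context.h>
#include <vpi/Stream.h>

// Approximation of the status-check macro used in the attachments.
#define CHECK_CU_STATUS(call)                                          \
    do {                                                               \
        CUresult err_ = (call);                                        \
        if (err_ != CUDA_SUCCESS)                                      \
            fmt::print(stderr, "CUDA driver error {}\n", (int)err_);   \
    } while (0)

// Each worker only creates a VPI context and stream, then tears them down.
class Worker
{
    int id;
    VPIContext ctx;
    VPIStream stream;

public:
    explicit Worker(int _id) noexcept : id(_id)
    {
        vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
        vpiStreamCreate(VPI_BACKEND_CUDA, &stream);
        fmt::print("Worker {} Created\n", id);
    }

    ~Worker()
    {
        vpiStreamDestroy(stream);
        vpiContextDestroy(ctx);
        fmt::print("Worker {} Destroyed\n", id);
    }
};

class Engine
{
public:
    // One CUDA driver context per worker thread; the Worker lives inside it.
    void workerThread(int id)
    {
        CUcontext cuCtx;
        fmt::print("Thread {} starting\n", id);
        CHECK_CU_STATUS(cuCtxCreate(&cuCtx, CU_CTX_MAP_HOST, 0));
        {
            Worker w{id}; // create and immediately destroy the VPI resources
        }
        CHECK_CU_STATUS(cuCtxDestroy(cuCtx));
        fmt::print("Thread {} ending\n", id);
    }

    void run(int numWorkers)
    {
        fmt::print("Starting Engine\n");
        std::vector<std::thread> threads;
        for (int i = 0; i < numWorkers; ++i)
            threads.emplace_back(&Engine::workerThread, this, i);
        fmt::print("Stopping Engine\n");
        for (auto &t : threads)
            t.join();
        fmt::print("Engine Stopped\n");
    }
};

int main()
{
    fmt::print("Hello, world from main!\n");
    cuInit(0);
    Engine{}.run(2);
}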

Hi,

It’s expected that CUDA tasks can run in parallel within a single process in a multi-threaded scenario.
Do you observe something different?

On Jetson, one process owns one CUDA context.
GPU resources across different CUDA contexts (i.e. processes) are time-sliced and cannot run in parallel.

Thanks.

When running two engines without the extra CUDA context, the processing time is ~1 s, whereas wrapping each worker in its own context brings it down to ~0.8 s. The disparity grows with more workers. I haven’t looked into MPS before (let alone on Jetson, if that’s even available there).

As mentioned before, most of the time is spent creating views for processing rather than in the processing itself: each view creation in VPI performs a malloc/free pair in the backend, which I can see in Nsight Systems, and without the extra context the workers block each other during these malloc/free calls.
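
To illustrate the allocation pattern I mean, here is a minimal, VPI-free sketch (only stock CUDA driver/runtime calls; the per-thread context is the workaround in question, and timings will obviously vary):

#include <thread>
#include <vector>

#include <cuda.h>
#include <cuda_runtime.h>

// Each thread hammers cudaMalloc/cudaFree, mimicking the malloc/free pair
// that VPI does in the backend during view creation. With ownContext=true
// every thread pushes its own driver context (the workaround); with false,
// all threads share the process's primary context and serialize on it.
static void allocLoop(bool ownContext)
{
    CUcontext ctx = nullptr;
    if (ownContext)
        cuCtxCreate(&ctx, CU_CTX_MAP_HOST, 0); // becomes current for this thread

    for (int i = 0; i < 1000; ++i)
    {
        void *p = nullptr;
        cudaMalloc(&p, 1 << 20);
        cudaFree(p);
    }

    if (ownContext)
        cuCtxDestroy(ctx);
}

int main()
{
    cuInit(0);
    cudaFree(nullptr); // force the primary context to exist up front

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(allocLoop, /*ownContext=*/true); // flip to false to compare
    for (auto &t : workers)
        t.join();
}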

Either way, I presume the CUDA driver complaining on shutdown indicates that there must be some issue with VPI.

Hi,

We are going to check your source further.

In case you don’t know, VPI does provide some APIs that can control the CUDA context (push/pop).
Please check the document below for details:

https://docs.nvidia.com/vpi/Context_8h.html
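
For example, a rough sketch of creating a stream under a specific context using the push/pop calls (based on the Context.h documentation above; please verify the exact semantics there):

#include <vpi/Context.h>
#include <vpi/Stream.h>

// Make ctx current for this thread while creating resources, then pop it
// so whatever was current before becomes current again.
VPIStream createStreamInContext(VPIContext ctx)
{
    VPIStream stream = nullptr;

    vpiContextPush(ctx);                        // ctx is now current for this thread
    vpiStreamCreate(VPI_BACKEND_CUDA, &stream); // created inside ctx

    VPIContext popped = nullptr;
    vpiContextPop(&popped);                     // pops ctx, restoring the previous context

    return stream;
}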

Thanks.

I did notice that (as opposed to the vanilla context creation), but vpiContextCreateWrapperCUDA has a note saying it doesn’t do anything special, it’s just a standard call to vpiContextCreate. I should probably use it anyway for potential forward compatibility, I guess.

https://docs.nvidia.com/vpi/group__VPI__Context.html#ga59972aa88b37e3f14f01b5ff3c89fd94

Hi,

We have checked your source code and want to confirm our understanding with you.

Are you saying that when the VPI context is destroyed with vpiContextDestroy(ctx), VPI mistakenly destroys the CUDA context that was created by CHECK_CU_STATUS(cuCtxCreate(&ctx, CU_CTX_MAP_HOST, 0))?

Thanks.

I tried a variation that does not create an extra VPI context object (just the stream), and it shows the same behaviour.

#include <vpi/Context.h>
#include <vpi/Stream.h>
#include <fmt/core.h>

// Variation: only the VPI stream is created per worker; the extra VPI
// context is commented out, and the shutdown errors still occur.
class Worker
{
private:
    int id;
    // VPIContext ctx;
    VPIStream stream;

public:
    explicit Worker(int _id) noexcept : id(_id)
    {
        // vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
        vpiStreamCreate(VPI_BACKEND_CUDA, &stream);
        fmt::print("Worker {} Created\n", id);
    }

    ~Worker()
    {
        vpiStreamDestroy(stream);
        // vpiContextDestroy(ctx);
        fmt::print("Worker {} Destroyed\n", id);
    }
};

Hello, world from main!
Starting Engine
Thread 0 starting
Stopping Engine
Thread 1 starting
Worker 0 Created
Worker 0 Destroyed
Worker 1 Created
Worker 1 Destroyed
Thread 0 ending
Thread 1 ending
Engine Stopped
[WARN ] 2023-01-09 17:13:45 (cudaErrorDeviceUninitialized)
[ERROR] 2023-01-09 17:13:45 Error destroying cuda device: `.��
[WARN ] 2023-01-09 17:13:45 (cudaErrorContextIsDestroyed)
[WARN ] 2023-01-09 17:13:45 (cudaErrorContextIsDestroyed)
... (output shortened)

I see speed benefits from having the VPI context even without the CUDA context, which is why I have it in the class. I could potentially remove the extra VPI context if I were able to get the CUDA context to clean up reliably.

Thanks. We are checking this with our internal team.
Will share more information with you later.

Thanks.

Hi,

If you only need the VPI context, have you tried the following, which works in our environment?

void Engine::workerThread(int id)
{
    // CUcontext ctx;
    fmt::print("Thread {} starting\n", id);
    //CHECK_CU_STATUS(cuCtxCreate(&ctx, CU_CTX_MAP_HOST, 0));
    {
        Worker w{id};
    }
    // CHECK_CU_STATUS(cuCtxDestroy(ctx));
    fmt::print("Thread {} ending\n", id);
}

The rest of the code remains unchanged.

Thanks.

A VPIContext doesn’t prevent blocking between threads on cudaMalloc/cudaFree calls; it only prevents blocking between some higher-level VPI functions. I can send a screenshot from Nsight Systems tomorrow to show the performance difference with and without the CUDA context (on my way home at the moment).

I’ve attached screenshots of Nsight Systems with and without the extra CUDA context to show the performance benefit of using it (both runs use a VPI context). Note the extra locks in the trace without the extra context, which cause blocking between the workers and increase the higher-level function time from ~3.5 s to ~4.5 s. The NVTX3 labels are obfuscated, obviously, but you can correlate them by their roughly matching times and call order.

With CUDA Context

Without CUDA Context

Hi,

Thanks for providing the details.

This does not look like a VPI issue, since the error can be reproduced without VPI context creation and destruction.
It is more related to the usage of CUDA contexts with the driver API across threads.

We are discussing this internally. Will share more information with you soon.

Thanks.

Hi,

Please try the following patch.
The sample can run successfully after adding the vpiContextSetCurrent call.

diff --git a/engine.hpp b/engine.hpp
index 18112e9..2128e4a 100644
--- a/engine.hpp
+++ b/engine.hpp
@@ -41,6 +41,7 @@ public:
     explicit Worker(int _id) noexcept : id(_id)
     {
         vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
+        vpiContextSetCurrent(ctx);
         vpiStreamCreate(VPI_BACKEND_CUDA, &stream);
         fmt::print("Worker {} Created\n", id);
     }

Thanks.


Awesome, thanks! This works on the “unit test” for me, and my main project now works perfectly fine too. As mentioned in the original thread, there’s no example of CUDA/VPI context usage; maybe one should be added to the docs. It makes sense that context creation doesn’t implicitly bind the thread to the context, and that only calling vpiContextSetCurrent does so.
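
For anyone who finds this later, the working pattern is roughly this (a condensed sketch of the Worker with the fix applied; error checking omitted):

#include <vpi/Context.h>
#include <vpi/Stream.h>

// The VPI context only affects a thread once vpiContextSetCurrent is called
// on that thread; creation alone does not bind it.
class Worker
{
    VPIContext ctx    = nullptr;
    VPIStream  stream = nullptr;

public:
    Worker()
    {
        vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
        vpiContextSetCurrent(ctx);                  // bind this thread to ctx
        vpiStreamCreate(VPI_BACKEND_CUDA, &stream); // now created inside ctx
    }

    ~Worker()
    {
        vpiStreamDestroy(stream);
        vpiContextDestroy(ctx);
    }
};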
