VPI Problems demonstrated with CUDA Context

Continuation of this thread, where I was trying to use a separate CUDA context for each thread in order to prevent blocking during the CUDA mallocs/frees that occur inside calls such as vpiImageCreateView. Creating an extra CUDA context for each thread/engine significantly improves my performance. If I comment out all the parts of my own code that create/destroy VPI resources (but keep the memory I allocate myself with cudaMalloc/cudaHostAlloc), my engine starts and stops fine; as soon as I start adding some VPI resources back, it gets unhappy.

I’ve created a minimal application that just spins up a set of threads, each of which creates and destroys an engine object; the engine only creates a context and a stream and cleans them up on destruction, nothing else. This produces the kind of errors that I get in my actual engine and roughly follows its structure. I’m currently using the latest software stack (JetPack 5.0.2-b231, libnvvpi 2.1.6).

engine.cpp (691 Bytes)
engine.hpp (1.2 KB)
main.cpp (111 Bytes)
CMakeLists.txt (513 Bytes)
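
In case the attachments aren't accessible, here is a rough, condensed sketch of what they contain, reconstructed from the snippets quoted later in this thread (the CHECK_CU_STATUS macro, thread count, and error handling are approximations, not the exact attached files):

#include <cstdio>
#include <thread>
#include <vector>

#include <cuda.h>
#include <fmt/core.h>
#include <vpi/Context.h>
#include <vpi/Stream.h>

// Approximation of the status-check macro used in the attachments.
#define CHECK_CU_STATUS(call)                                          \
    do {                                                               \
        CUresult err_ = (call);                                        \
        if (err_ != CUDA_SUCCESS)                                      \
            fmt::print(stderr, "CUDA driver error {}\n", (int)err_);   \
    } while (0)

// Each worker only creates a VPI context and stream, then tears them down.
class Worker
{
    int id;
    VPIContext ctx;
    VPIStream stream;

public:
    explicit Worker(int _id) noexcept : id(_id)
    {
        vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
        vpiStreamCreate(VPI_BACKEND_CUDA, &stream);
        fmt::print("Worker {} Created\n", id);
    }

    ~Worker()
    {
        vpiStreamDestroy(stream);
        vpiContextDestroy(ctx);
        fmt::print("Worker {} Destroyed\n", id);
    }
};

class Engine
{
public:
    // One CUDA driver context per worker thread; the Worker lives inside it.
    void workerThread(int id)
    {
        CUcontext cuCtx;
        fmt::print("Thread {} starting\n", id);
        CHECK_CU_STATUS(cuCtxCreate(&cuCtx, CU_CTX_MAP_HOST, 0));
        {
            Worker w{id}; // create and immediately destroy the VPI resources
        }
        CHECK_CU_STATUS(cuCtxDestroy(cuCtx));
        fmt::print("Thread {} ending\n", id);
    }

    void run(int numWorkers)
    {
        fmt::print("Starting Engine\n");
        std::vector<std::thread> threads;
        for (int i = 0; i < numWorkers; ++i)
            threads.emplace_back(&Engine::workerThread, this, i);
        fmt::print("Stopping Engine\n");
        for (auto &t : threads)
            t.join();
        fmt::print("Engine Stopped\n");
    }
};

int main()
{
    fmt::print("Hello, world from main!\n");
    cuInit(0);
    Engine{}.run(2);
}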

Hi,

It’s expected that CUDA tasks can run in parallel within a single process in a multi-threaded scenario.
Do you observe something different?

On Jetson, one process owns one CUDA context.
GPU resources across different CUDA contexts (i.e. processes) are time-sliced and cannot run in parallel.

Thanks.

When running two engines without the extra CUDA context, the processing time is ~1 s, whereas wrapping each worker in its own context brings it down to ~0.8 s. The disparity grows with more workers. I haven’t looked into MPS before (let alone on Jetson, if that’s even available there).

As mentioned before, most of the time is spent creating views for processing rather than in the processing itself: each view creation in VPI performs a malloc/free pair in the backend, which I can see in Nsight Systems, and without the extra context the workers block each other during these malloc/free calls.
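
To illustrate the allocation pattern I mean, here is a minimal, VPI-free sketch (only stock CUDA driver/runtime calls; the per-thread context is the workaround in question, and timings will obviously vary):

#include <thread>
#include <vector>

#include <cuda.h>
#include <cuda_runtime.h>

// Each thread hammers cudaMalloc/cudaFree, mimicking the malloc/free pair
// that VPI does in the backend during view creation. With ownContext=true
// every thread pushes its own driver context (the workaround); with false,
// all threads share the process's primary context and serialize on it.
static void allocLoop(bool ownContext)
{
    CUcontext ctx = nullptr;
    if (ownContext)
        cuCtxCreate(&ctx, CU_CTX_MAP_HOST, 0); // becomes current for this thread

    for (int i = 0; i < 1000; ++i)
    {
        void *p = nullptr;
        cudaMalloc(&p, 1 << 20);
        cudaFree(p);
    }

    if (ownContext)
        cuCtxDestroy(ctx);
}

int main()
{
    cuInit(0);
    cudaFree(nullptr); // force the primary context to exist up front

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(allocLoop, /*ownContext=*/true); // flip to false to compare
    for (auto &t : workers)
        t.join();
}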

Either way, I presume the CUDA driver complaining on shutdown indicates that there must be some issue with VPI.

Hi,

We are going to check your source further.

In case you don’t know, VPI does provide some APIs that can control the CUDA context (push/pop).
Please check the document below for details:

https://docs.nvidia.com/vpi/Context_8h.html
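
For example, a rough sketch of creating a stream under a specific context using the push/pop calls (based on the Context.h documentation above; please verify the exact semantics there):

#include <vpi/Context.h>
#include <vpi/Stream.h>

// Make ctx current for this thread while creating resources, then pop it
// so whatever was current before becomes current again.
VPIStream createStreamInContext(VPIContext ctx)
{
    VPIStream stream = nullptr;

    vpiContextPush(ctx);                        // ctx is now current for this thread
    vpiStreamCreate(VPI_BACKEND_CUDA, &stream); // created inside ctx

    VPIContext popped = nullptr;
    vpiContextPop(&popped);                     // pops ctx, restoring the previous context

    return stream;
}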

Thanks.

I did notice that (as opposed to the vanilla context creation), but vpiContextCreateWrapperCUDA has a note saying it doesn’t do anything special, it’s just a standard call to vpiContextCreate. I should probably use it anyway for potential forward compatibility, I guess.

https://docs.nvidia.com/vpi/group__VPI__Context.html#ga59972aa88b37e3f14f01b5ff3c89fd94

Hi,

We have checked your source code and want to confirm our understanding with you.

Are you saying that when the VPI context is destroyed with vpiContextDestroy(ctx), VPI mistakenly destroys the CUDA context that was created by CHECK_CU_STATUS(cuCtxCreate(&ctx, CU_CTX_MAP_HOST, 0))?

Thanks.

I tried a variation that does not create an extra VPI context object (just the stream), and it shows the same behaviour.

#include <vpi/Context.h>
#include <vpi/Stream.h>
#include <fmt/core.h>

// Variation: only the VPI stream is created per worker; the extra VPI
// context is commented out, and the shutdown errors still occur.
class Worker
{
private:
    int id;
    // VPIContext ctx;
    VPIStream stream;

public:
    explicit Worker(int _id) noexcept : id(_id)
    {
        // vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
        vpiStreamCreate(VPI_BACKEND_CUDA, &stream);
        fmt::print("Worker {} Created\n", id);
    }

    ~Worker()
    {
        vpiStreamDestroy(stream);
        // vpiContextDestroy(ctx);
        fmt::print("Worker {} Destroyed\n", id);
    }
};

Hello, world from main!
Starting Engine
Thread 0 starting
Stopping Engine
Thread 1 starting
Worker 0 Created
Worker 0 Destroyed
Worker 1 Created
Worker 1 Destroyed
Thread 0 ending
Thread 1 ending
Engine Stopped
[WARN ] 2023-01-09 17:13:45 (cudaErrorDeviceUninitialized)
[ERROR] 2023-01-09 17:13:45 Error destroying cuda device: `.��
[WARN ] 2023-01-09 17:13:45 (cudaErrorContextIsDestroyed)
[WARN ] 2023-01-09 17:13:45 (cudaErrorContextIsDestroyed)
... (output shortened)

I see speed benefits from having the VPI context even without the CUDA context, which is why I have it in the class. I could potentially remove the extra VPI context if I were able to get the CUDA context to clean up reliably.

Thanks. We are checking this with our internal team.
Will share more information with you later.

Thanks.

Hi,

If you only need the VPI context, have you tried the following, which works in our environment?

void Engine::workerThread(int id)
{
    // CUcontext ctx;
    fmt::print("Thread {} starting\n", id);
    //CHECK_CU_STATUS(cuCtxCreate(&ctx, CU_CTX_MAP_HOST, 0));
    {
        Worker w{id};
    }
    // CHECK_CU_STATUS(cuCtxDestroy(ctx));
    fmt::print("Thread {} ending\n", id);
}

The rest of the code remains unchanged.

Thanks.

A VPIContext doesn’t prevent blocking between threads on cudaMalloc/cudaFree calls; it only prevents blocking between some higher-level VPI functions. I can send a screenshot from Nsight Systems tomorrow to show the performance difference with and without the CUDA context (on my way home at the moment).

I’ve attached screenshots of Nsight Systems with and without the extra CUDA context to show the performance benefit of using it (both runs use a VPI context). Note the extra locks in the trace without the extra context, which cause blocking between the workers and increase the higher-level function time from ~3.5 s to ~4.5 s. The NVTX3 labels are obfuscated, obviously, but you can correlate them by their roughly matching times and call order.

With CUDA Context

Without CUDA Context

Hi,

Thanks for providing the details.

This does not look like a VPI issue, since the error can be reproduced without VPI context creation and destruction.
It is more related to the usage of CUDA contexts with the driver API across threads.

We are discussing this internally. Will share more information with you soon.

Thanks.

Hi,

Please try the following patch.
The sample can run successfully after adding the vpiContextSetCurrent call.

diff --git a/engine.hpp b/engine.hpp
index 18112e9..2128e4a 100644
--- a/engine.hpp
+++ b/engine.hpp
@@ -41,6 +41,7 @@ public:
     explicit Worker(int _id) noexcept : id(_id)
     {
         vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
+        vpiContextSetCurrent(ctx);
         vpiStreamCreate(VPI_BACKEND_CUDA, &stream);
         fmt::print("Worker {} Created\n", id);
     }

Thanks.


Awesome, thanks! This works on the “unit test” for me, and my main project now works perfectly fine too. As mentioned in the original thread, there’s no example of CUDA/VPI context usage; maybe one should be added to the docs. It makes sense that context creation doesn’t implicitly bind the thread to the context, and that only calling vpiContextSetCurrent does so.
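
For anyone who finds this later, the working pattern is roughly this (a condensed sketch of the Worker with the fix applied; error checking omitted):

#include <vpi/Context.h>
#include <vpi/Stream.h>

// The VPI context only affects a thread once vpiContextSetCurrent is called
// on that thread; creation alone does not bind it.
class Worker
{
    VPIContext ctx    = nullptr;
    VPIStream  stream = nullptr;

public:
    Worker()
    {
        vpiContextCreate(VPI_BACKEND_CUDA, &ctx);
        vpiContextSetCurrent(ctx);                  // bind this thread to ctx
        vpiStreamCreate(VPI_BACKEND_CUDA, &stream); // now created inside ctx
    }

    ~Worker()
    {
        vpiStreamDestroy(stream);
        vpiContextDestroy(ctx);
    }
};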
