CUDA driver API - multiple threads with the same CUcontext

The part of the CUDA driver API that compiles and links PTX to a CUDA binary, for example, needs a CUcontext (cuCtxCreate is used to create one). I have a situation where multiple such code instances are being compiled/linked in parallel, and using a separate context for each runs out of memory (errors below). Is it possible to use a single context with multiple linker invocations happening in parallel?

error: cuCtxCreate(&context, 0, device) failed with error code out of memory
error: cuCtxCreate(&context, 0, device) failed with error code out of memory
error: cuCtxCreate(&context, 0, device) failed with error code out of memory

The parallelism is desired here (threads on multiple cores) since that really speeds up code generation/binary generation. However, beyond a few tens of invocations in parallel, it’s running out of memory, and I can confirm that running them sequentially works fine (but that’d be slow).

I couldn't immediately tell from the documentation whether things would be thread-safe with a single CUcontext.

The CUDA driver API supports multiple threads using the same context; it is thread-safe, with certain exceptions around graph usage. You may still hit a memory limit at some point.
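For what it's worth, here's a minimal sketch of what that could look like, assuming the PTX strings arrive from an upstream compile step (`ptxModules` and `check` are illustrative names, not part of any existing code): one context is created up front, each worker thread binds it with cuCtxSetCurrent, and the linker invocations then run concurrently.

```cpp
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

static void check(CUresult r, const char *what) {
    if (r != CUDA_SUCCESS) {
        const char *msg = nullptr;
        cuGetErrorString(r, &msg);
        std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown");
        std::exit(1);
    }
}

int main() {
    check(cuInit(0), "cuInit");
    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");

    // One context for the whole process instead of one per thread.
    CUcontext ctx;
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    std::vector<std::string> ptxModules; // filled by the upstream compile step

    std::vector<std::thread> workers;
    for (const std::string &ptx : ptxModules) {
        workers.emplace_back([ctx, &ptx] {
            // Bind the shared context to this thread; the linker calls
            // below may then run concurrently across threads.
            check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

            CUlinkState link;
            check(cuLinkCreate(0, nullptr, nullptr, &link), "cuLinkCreate");
            check(cuLinkAddData(link, CU_JIT_INPUT_PTX,
                                const_cast<char *>(ptx.c_str()), ptx.size() + 1,
                                "module", 0, nullptr, nullptr),
                  "cuLinkAddData");
            void *cubin = nullptr;
            size_t cubinSize = 0;
            check(cuLinkComplete(link, &cubin, &cubinSize), "cuLinkComplete");
            // `cubin` is owned by `link`; copy it out before destroying.
            check(cuLinkDestroy(link), "cuLinkDestroy");
        });
    }
    for (std::thread &t : workers) t.join();
    check(cuCtxDestroy(ctx), "cuCtxDestroy");
    return 0;
}
```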


Thanks - this is great to know!

Thanks. Is the memory referred to in the “out of memory” error memory on the GPU or on the host? If it's GPU memory, it isn't immediately clear why device memory would be used for this task.

Does the error above indicate that the memory limit is being hit simply because more than a certain number of CUDA contexts are created, or because of memory usage after context creation? In each thread, once the context is created, the work consists of cuLinkCreate, cuLinkAddData, and cuLinkComplete calls, after which the context is destroyed. (And there are really 32x2 = 64 threads here.)
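For concreteness, a condensed sketch of that per-thread flow (hypothetical names; error handling and the cuInit/cuDeviceGet setup are omitted):

```cpp
#include <cuda.h>
#include <cstring>

// Roughly what each of the 64 threads does; error handling omitted.
void compilePtxToCubin(CUdevice device, const char *ptx) {
    // Each thread creates its own context, which by itself reserves
    // device memory for the whole lifetime of the context.
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, device);

    CUlinkState link;
    cuLinkCreate(0, nullptr, nullptr, &link);
    cuLinkAddData(link, CU_JIT_INPUT_PTX, const_cast<char *>(ptx),
                  std::strlen(ptx) + 1, "module", 0, nullptr, nullptr);

    void *cubin = nullptr;
    size_t cubinSize = 0;
    cuLinkComplete(link, &cubin, &cubinSize);
    // ... copy the cubin out (it is owned by `link`) ...
    cuLinkDestroy(link);

    // The context's footprint is released only here, so 64 concurrent
    // threads hold 64 contexts' worth of device memory at once.
    cuCtxDestroy(ctx);
}
```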

The device in question is a GeForce RTX 3090 (24 GB of GPU DRAM).

Typically I would expect that sort of error report to refer to device memory. Without an example, it's tough to be certain.

A context uses device memory.

My guess is that it is the context creation, or perhaps both. A context may typically consume on the order of 300-400 MB or even more. Rather than discuss this in a forum post, it should be easy for you to write a test case to find out how much memory each context creation uses. It's not specified, so I have to speak in generalities, but 300-400 MB is, I think, the right ballpark. Now multiply that by 64 contexts (32x2 threads). Does that use up a sizeable portion of 24 GB?
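For example, something along these lines (a sketch; cuMemGetInfo reports device-wide free memory but needs a current context, so a first context is created purely to take the readings):

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // A first context so that cuMemGetInfo can be called at all;
    // it becomes current on this thread.
    CUcontext measureCtx;
    cuCtxCreate(&measureCtx, 0, dev);

    size_t freeBefore = 0, total = 0;
    cuMemGetInfo(&freeBefore, &total);

    // The context being measured.
    CUcontext extraCtx;
    cuCtxCreate(&extraCtx, 0, dev);

    size_t freeAfter = 0;
    cuMemGetInfo(&freeAfter, &total);

    std::printf("extra context cost: ~%zu MiB of device memory\n",
                (freeBefore - freeAfter) >> 20);

    cuCtxDestroy(extraCtx);
    cuCtxDestroy(measureCtx);
    return 0;
}
```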

If this were my code, I would immediately refactor to avoid using contexts this way. It strikes me as useless overhead.

I just ran it while keeping an eye on the device memory in use (using nvidia-smi) and on CPU utilization as well. I can confirm that it's device memory utilization that's reaching tens of GB.

Thanks. It just looked odd that a CUDA context and device memory were being used for tasks like compilation and linking from PTX → GPU assembly, something that appears to be entirely host-side work with no use of the GPU runtime or any GPU-side computation. The end result of these CUDA driver API calls here is a cubin.

I didn't determine exactly how much memory each context was using, but your guess is almost certainly correct. I did see memory utilization momentarily go over 10 GB before the out-of-memory errors appeared. ~400 MB × 64 ≈ 25.6 GB, which would exceed the available 24 GB.

Using a separate context for each thread instance of the pass that compiles code down for GPUs was an oversight here (https://github.com/llvm/llvm-project/blob/00b9bed1f05a72962643f8a41d30256c55bd19f5/mlir/lib/Dialect/GPU/Transforms/SerializeToCubin.cpp#L94) - this should be fixed. It hadn't been an issue for the typical use case of a handful of kernels, and so it remained hidden.

I’m not suggesting that is the culprit. The culprit, as is evidenced in your original posting, is the cuCtxCreate() call. That is using device memory, for sure. Creating a context, regardless of what you do with it or whether you do anything at all with it, will use some device memory.

