CUDA driver API - multiple threads with the same CUcontext

The part of the CUDA driver API that compiles and links PTX to a CUDA binary, for example, needs a CUcontext (cuCtxCreate is used to create one). I have a situation where multiple such code instances are being compiled/linked in parallel, and using a separate context for each runs out of memory (errors below). Is it possible to use a single context with multiple linker invocations happening in parallel?

error: cuCtxCreate(&context, 0, device) failed with error code out of memory
error: cuCtxCreate(&context, 0, device) failed with error code out of memory
error: cuCtxCreate(&context, 0, device) failed with error code out of memory

The parallelism is desired here (threads on multiple cores) since that really speeds up code generation/binary generation. However, beyond a few tens of invocations in parallel, it’s running out of memory, and I can confirm that running them sequentially works fine (but that’d be slow).

I couldn't immediately tell from the documentation whether things would be thread-safe with a single CUcontext.

The CUDA driver API supports multiple threads using the same context; it is thread-safe, with certain exceptions around graph usage. You may still hit a memory limit at some point.
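For what it's worth, here's a minimal sketch of what that could look like, assuming the PTX strings arrive from an upstream compile step (`ptxModules` and `check` are illustrative names, not part of any existing code): one context is created up front, each worker thread binds it with cuCtxSetCurrent, and the linker invocations then run concurrently.

```cpp
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

static void check(CUresult r, const char *what) {
    if (r != CUDA_SUCCESS) {
        const char *msg = nullptr;
        cuGetErrorString(r, &msg);
        std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown");
        std::exit(1);
    }
}

int main() {
    check(cuInit(0), "cuInit");
    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");

    // One context for the whole process instead of one per thread.
    CUcontext ctx;
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    std::vector<std::string> ptxModules; // filled by the upstream compile step

    std::vector<std::thread> workers;
    for (const std::string &ptx : ptxModules) {
        workers.emplace_back([ctx, &ptx] {
            // Bind the shared context to this thread; the linker calls
            // below may then run concurrently across threads.
            check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

            CUlinkState link;
            check(cuLinkCreate(0, nullptr, nullptr, &link), "cuLinkCreate");
            check(cuLinkAddData(link, CU_JIT_INPUT_PTX,
                                const_cast<char *>(ptx.c_str()), ptx.size() + 1,
                                "module", 0, nullptr, nullptr),
                  "cuLinkAddData");
            void *cubin = nullptr;
            size_t cubinSize = 0;
            check(cuLinkComplete(link, &cubin, &cubinSize), "cuLinkComplete");
            // `cubin` is owned by `link`; copy it out before destroying.
            check(cuLinkDestroy(link), "cuLinkDestroy");
        });
    }
    for (std::thread &t : workers) t.join();
    check(cuCtxDestroy(ctx), "cuCtxDestroy");
    return 0;
}
```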


Thanks - this is great to know!

Thanks. Is the memory referred to in the “out of memory” error memory on the GPU or on the host? If it's GPU memory, it isn't immediately clear why device memory would be used for this task.

Does the error above indicate that the memory limit is being hit simply because more than a certain number of CUDA contexts are created, or because of memory usage after context creation? In each thread, once the context is created, the work consists of cuLinkCreate, cuLinkAddData, and cuLinkComplete calls, after which the context is destroyed. (And there are really 32x2 = 64 threads here.)
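For concreteness, a condensed sketch of that per-thread flow (hypothetical names; error handling and the cuInit/cuDeviceGet setup are omitted):

```cpp
#include <cuda.h>
#include <cstring>

// Roughly what each of the 64 threads does; error handling omitted.
void compilePtxToCubin(CUdevice device, const char *ptx) {
    // Each thread creates its own context, which by itself reserves
    // device memory for the whole lifetime of the context.
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, device);

    CUlinkState link;
    cuLinkCreate(0, nullptr, nullptr, &link);
    cuLinkAddData(link, CU_JIT_INPUT_PTX, const_cast<char *>(ptx),
                  std::strlen(ptx) + 1, "module", 0, nullptr, nullptr);

    void *cubin = nullptr;
    size_t cubinSize = 0;
    cuLinkComplete(link, &cubin, &cubinSize);
    // ... copy the cubin out (it is owned by `link`) ...
    cuLinkDestroy(link);

    // The context's footprint is released only here, so 64 concurrent
    // threads hold 64 contexts' worth of device memory at once.
    cuCtxDestroy(ctx);
}
```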

The device in question is a GeForce RTX 3090 (24 GB of GPU DRAM).

Typically I would expect that sort of error report to refer to device memory. Without an example, it's tough to be certain.

A context uses device memory.

My guess is that it is the context creation, or perhaps both. A context may typically consume on the order of 300-400 MB or even more. Rather than discuss this in a forum post, it should be easy for you to write a test case to find out how much memory each context creation uses. It's not specified, so I have to speak in generalities, but 300-400 MB is, I think, the right ballpark. Now multiply that by 64 contexts (32x2 threads). Does that use up a sizeable portion of 24 GB?
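For example, something along these lines (a sketch; cuMemGetInfo reports device-wide free memory but needs a current context, so a first context is created purely to take the readings):

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // A first context so that cuMemGetInfo can be called at all;
    // it becomes current on this thread.
    CUcontext measureCtx;
    cuCtxCreate(&measureCtx, 0, dev);

    size_t freeBefore = 0, total = 0;
    cuMemGetInfo(&freeBefore, &total);

    // The context being measured.
    CUcontext extraCtx;
    cuCtxCreate(&extraCtx, 0, dev);

    size_t freeAfter = 0;
    cuMemGetInfo(&freeAfter, &total);

    std::printf("extra context cost: ~%zu MiB of device memory\n",
                (freeBefore - freeAfter) >> 20);

    cuCtxDestroy(extraCtx);
    cuCtxDestroy(measureCtx);
    return 0;
}
```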

If this were my code, I would immediately refactor to avoid using contexts this way. It strikes me as useless overhead.

I just ran it while keeping an eye on the device memory in use (using nvidia-smi) and on CPU utilization as well. I can confirm that it's device memory utilization that's reaching tens of GB.

Thanks. It just looked odd that a CUDA context and device memory were being used for tasks like compilation and linking from PTX → GPU assembly, something that appears to be entirely host-side work with no use of the GPU runtime or any GPU-side computation. The end result of these CUDA driver API calls here is a cubin.

I didn't determine exactly how much memory each context was using, but your guess is almost certainly correct. I did see memory utilization momentarily go over 10 GB before the out-of-memory errors appeared. ~400 MB × 64 ≈ 25.6 GB, which would exceed the available 24 GB.

Using a separate context for each thread instance of the pass that compiles code down for GPUs was an oversight here (https://github.com/llvm/llvm-project/blob/00b9bed1f05a72962643f8a41d30256c55bd19f5/mlir/lib/Dialect/GPU/Transforms/SerializeToCubin.cpp#L94) - this should be fixed. It hadn't been an issue for the typical use case of a handful of kernels, and so it remained hidden.

I’m not suggesting that is the culprit. The culprit, as is evidenced in your original posting, is the cuCtxCreate() call. That is using device memory, for sure. Creating a context, regardless of what you do with it or whether you do anything at all with it, will use some device memory.

