CUFileThreadPoolWorker::run(): Assertion `0' failed

anurag.sharma2 · June 6, 2024, 1:25am

Hi,

I’m trying to run some code in a multi-threaded setting, but it fails with the following error:
python: cufile_worker_thread.h:57: virtual void CUFileThreadPoolWorker::run(): Assertion `0' failed.

The code is running inside the latest MONAI container. Interestingly, it works when run with only a single thread. I’m running it on an HPC cluster, using Load Sharing Facility (LSF) scripts for job submission. The job is submitted with the following parameters:
3 nodes, 3 cores per node, 1 GPU per node, 100 GB RAM, H100 GPU
The cuFile log looks like this:

Could you please advise on how I can eliminate this error? Thank you!

prashantp · June 7, 2024, 6:42pm

Hi Anurag,

Could you provide the entire cuFile log? The surrounding context may help us debug this.
From the snippet pasted here, it looks like the CUDA device is not available at this time, and as a result the cuDevicePrimaryCtxRetain CUDA call fails.

CUDA_ERROR_DEVICE_UNAVAILABLE = 46
This indicates that requested CUDA device is unavailable at the current time. Devices are often unavailable due to use of CU_COMPUTEMODE_EXCLUSIVE_PROCESS or CU_COMPUTEMODE_PROHIBITED.

My suspicion is that one of these compute mode settings is in effect in this environment, preventing the thread from retaining the context on the same device.

With only a single thread, it is the sole entity using the context, so it works smoothly.
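
One way to check this suspicion on a given node is to read the compute mode that nvidia-smi reports for each GPU. The sketch below is only an illustration: it assumes nvidia-smi is on the PATH inside the job, and the helper names (parse_compute_modes, restrictive_gpus, query_compute_modes) are made up for this example.

```python
import subprocess

# Compute modes that the CUDA documentation associates with
# CUDA_ERROR_DEVICE_UNAVAILABLE (46), compared case-insensitively.
RESTRICTIVE_MODES = {"exclusive_process", "prohibited"}


def parse_compute_modes(nvidia_smi_output):
    """Turn 'index, mode' CSV lines into a {gpu_index: mode} dict."""
    modes = {}
    for line in nvidia_smi_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def restrictive_gpus(modes):
    """Return GPU indices whose mode is Exclusive_Process or Prohibited."""
    return sorted(i for i, m in modes.items() if m.lower() in RESTRICTIVE_MODES)


def query_compute_modes():
    """Ask nvidia-smi for the compute mode of each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_modes(result.stdout)
```

Calling restrictive_gpus(query_compute_modes()) from inside the job on a failing node and on a working node should show whether the two differ in compute mode.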

anurag.sharma2 · June 7, 2024, 7:32pm

Hi PrashantP,

Thanks a lot for helping me out on this one!

Sure, I’m attaching the cuFile log here:
cufile.log (124.3 KB)

I also noticed that the code runs on some GPU nodes but fails on specific other nodes. Do you think it could be driver related? I also gather from your answer that it could be related to context switching between different processes on the GPU. Is there any way I could bind the GPU to a specific LSF job, using affinity or something similar? Thanks!!

prashantp · June 7, 2024, 7:55pm

Hi Anurag, I will analyze the logs and add anything else I find.

I am not very conversant with the LSF settings needed to affinitize a GPU to a particular LSF job, but I would imagine LSF has a way to achieve that.

My suspicion is that on the failing nodes some CUDA setting is in effect that does not allow CUDA context sharing for a GPU across multiple running entities. In other words, a given process (across its various threads) is able to use just one CUDA context on a given GPU device at a time. The default, CU_COMPUTEMODE_DEFAULT, allows multiple contexts on the GPU device at the same time, so the threads would be expected to succeed with the CUDA call around the primary context.

anurag.sharma2 · June 7, 2024, 8:41pm

Got it! I’ll also check whether I can affinitize the GPU and see if that helps!

anurag.sharma2 · June 11, 2024, 3:59pm

Hi PrashantP,

I hope you are doing well!

I checked specifically for CU_COMPUTEMODE_EXCLUSIVE_PROCESS and CU_COMPUTEMODE_PROHIBITED and found that the GPU job was running with mode=exclusive_process (I’m also attaching the screenshot below). I’m trying some options to see if I can change the mode settings. Do you think it might work if I somehow disable that? Thanks!!

prashantp · June 11, 2024, 5:59pm

Hi Anurag,

I believe so. Per the cuFile log, the cuDevicePrimaryCtxRetain failure is due to this flag.
With this set to CU_COMPUTEMODE_DEFAULT, I anticipate the above failure will go away. If not, we will need to debug this further.
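
For reference, the compute mode can be changed with nvidia-smi's -c option. This normally needs administrator rights, and on a scheduler-managed cluster the scheduler may set it again for each job. The sketch below only builds and runs that command. The function names are hypothetical, and it assumes nvidia-smi is on the PATH.

```python
import subprocess

# Mode names accepted by `nvidia-smi -c` (the deprecated
# EXCLUSIVE_THREAD mode is left out on purpose).
ALLOWED_MODES = {"DEFAULT", "EXCLUSIVE_PROCESS", "PROHIBITED"}


def build_set_compute_mode_cmd(gpu_index, mode="DEFAULT"):
    """Build the nvidia-smi command that sets one GPU's compute mode."""
    mode = mode.upper()
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported compute mode: {mode}")
    return ["nvidia-smi", "-i", str(gpu_index), "-c", mode]


def set_compute_mode(gpu_index, mode="DEFAULT"):
    """Run the command; raises CalledProcessError if nvidia-smi refuses."""
    subprocess.run(build_set_compute_mode_cmd(gpu_index, mode), check=True)
```

If the command is rejected for lack of permissions, the change has to come from the cluster administrators or from the scheduler's GPU options.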

anurag.sharma2 · June 11, 2024, 9:34pm

Got it! I also found this IBM Documentation and will check if I can set it to CU_COMPUTEMODE_DEFAULT. Thanks again for your continued support!
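
As a rough sketch of what that could look like: LSF's bsub accepts a -gpu requirement string with a mode key, which matches the mode=exclusive_process value seen earlier, and shared is the other documented value. The helper names and the job script path below are hypothetical. Whether a site allows users to override the mode depends on the cluster configuration, so treat this as an assumption to verify with the HPC team.

```python
# GPU modes taken from the LSF `bsub -gpu` syntax; "shared" is the
# LSF-side counterpart of the default CUDA compute mode.
LSF_GPU_MODES = ("shared", "exclusive_process")


def lsf_gpu_requirement(num_gpus=1, mode="shared"):
    """Build the string passed to `bsub -gpu`, e.g. 'num=1:mode=shared'."""
    if mode not in LSF_GPU_MODES:
        raise ValueError(f"unsupported LSF GPU mode: {mode}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return f"num={num_gpus}:mode={mode}"


def bsub_command(job_script, num_gpus=1, mode="shared"):
    """Return the bsub argument list; job_script is a placeholder path."""
    return ["bsub", "-gpu", lsf_gpu_requirement(num_gpus, mode), job_script]
```

For example, bsub_command("./run_job.sh") yields a submission that requests one GPU in shared mode, where ./run_job.sh stands in for the real job script.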

anurag.sharma2 · June 12, 2024, 9:37pm

Hi PrashantP,

I hope you are doing well!

I am still trying to run some jobs with different mode settings, but they are still pending because the cluster is busy. In the meantime, I’ve discovered that the jobs fail on specific nodes that use an older version of the NVIDIA drivers on the H100 GPUs. These nodes run driver version 535.129.03, as described here: Version 535.129.03(Linux)/537.70(Windows) :: NVIDIA Data Center GPU Driver Documentation. Those notes mention that CUDA will not work in Multi-Instance GPU (MIG) mode on CUDA forward compatibility setups with display driver major version 470 or earlier.

I’ve also checked the nodes where the code runs smoothly, and they have driver version 550.54.15: Version 550.54.15(Linux)/551.78(Windows) :: NVIDIA Data Center GPU Driver Documentation. This version specifically mentions a fix for potential corruption when launching kernels on H100 GPUs, which is more likely to occur when the GPU is shared between multiple processes.

Do you think the driver version is the main reason and updating the drivers would resolve the issue, or would it still be necessary to change the compute mode settings? I wanted to share these updates and will also inquire with our HPC Team about the possibility of updating the drivers. Thank you for your continued assistance!
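
To make the node comparison repeatable, the driver version can be read from nvidia-smi and compared against the version on the nodes where the job runs. This is only a sketch: the function names are hypothetical, and 550.54.15 is simply the version observed on the working nodes in this thread, not a confirmed fix threshold.

```python
import subprocess

# Driver version seen on the nodes where the job ran (from this thread).
WORKING_NODE_DRIVER = "550.54.15"


def parse_version(text):
    """Convert '535.129.03' into a comparable tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_older_than(version, reference=WORKING_NODE_DRIVER):
    """True if `version` sorts before `reference`."""
    return parse_version(version) < parse_version(reference)


def query_driver_versions():
    """Return the driver version string reported for each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Logging query_driver_versions() together with the compute mode at the start of each job would show which of the two differs between failing and working nodes.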
