CUFileThreadPoolWorker::run(): Assertion `0' failed


I’m trying to run some code in a multi-threaded setting, but it fails with the following error:

python: cufile_worker_thread.h:57: virtual void CUFileThreadPoolWorker::run(): Assertion `0' failed.

The code is running inside the latest MONAI container. Interestingly, when the code is run with only a single thread, it works. I’m trying to run it on an HPC cluster using Load Sharing Facility (LSF) scripts for job submission. The job is submitted with the following parameters:
3 nodes, 3 cores per node, 1 gpu per node, 100 GB RAM, H100 GPU
The cuFile log looks like this:

Could you please advise on how I can eliminate this error? Thank you!

Hi Anurag,

Could you provide the entire cuFile log? The neighboring context may help us debug better.
From the snippet pasted here, it looks like the CUDA device is not available at this time, and as a result the cuDevicePrimaryCtxRetain CUDA call fails.

This indicates that the requested CUDA device is unavailable at the current time. Devices are often unavailable due to use of CU_COMPUTEMODE_EXCLUSIVE_PROCESS or CU_COMPUTEMODE_PROHIBITED.

My suspicion is that one of the compute mode settings is in effect in this environment, preventing the thread from retaining the context on the same device.

With only a single thread, it is the sole entity using the context, so it works smoothly.
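To confirm this, you can query the compute mode of the GPUs on a failing node. This is a standard nvidia-smi query; the output values shown in the comment are illustrative, not taken from your node:

```shell
# Query the compute mode of every visible GPU on the node.
# An output of "Exclusive_Process" (rather than "Default") means only one
# CUDA context per device is allowed, which would explain why the
# multi-threaded cuDevicePrimaryCtxRetain call fails.
nvidia-smi --query-gpu=index,name,compute_mode --format=csv
```

Running this on both a working node and a failing node and comparing the results should quickly tell you whether the compute mode differs between them.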

Hi PrashantP,

Thanks a lot for helping me out on this one!

Sure, I am attaching the cuFile log here:
cufile.log (124.3 KB)

Also, I noticed the code runs on some GPU nodes but fails on other specific nodes. Do you think it could be driver-related? I also get a hint from your answer that it could be related to context switching between different processes on the GPU. Is there any way I could bind the GPU to a specific LSF job using affinity or something similar? Thanks!!

Hi Anurag, I will analyze the logs and add more here if I find anything else.

I am not very conversant with the settings one would need in LSF to affinitize a GPU to a particular LSF job, but I would imagine LSF has a way to achieve that.

My suspicion is that on the failing nodes there is some CUDA setting which does not allow a CUDA context for a GPU to be shared across multiple running entities. In other words, a given process (across its various threads) is able to use just one CUDA context on a given GPU device at a time. The default CU_COMPUTEMODE_DEFAULT would have allowed multiple contexts on the GPU device at the same time, hence the threads are expected to succeed with the CUDA call around the primary context.
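On the LSF side, bsub can request GPUs with an explicit mode. A minimal job-script sketch follows; the job name, core count, and script name are placeholders, and whether the requested mode is honored depends on how the cluster administrators have configured GPU scheduling:

```shell
#!/bin/bash
#BSUB -J monai_job                              # job name (placeholder)
#BSUB -n 3                                      # cores per node
#BSUB -gpu "num=1:mode=shared:j_exclusive=yes"  # GPU resource request
# mode=shared asks LSF to leave the GPU in the default compute mode
# (CU_COMPUTEMODE_DEFAULT), while j_exclusive=yes still reserves the
# whole GPU for this job. That gives you GPU affinity without the
# exclusive_process compute mode that blocks extra contexts.
python train.py
```

If the cluster sets mode=exclusive_process by default, an explicit mode=shared in the job script may be enough, but on some sites only the administrators can change the effective mode.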

Got it! I’ll also try to check if I can affinitize the GPU and see if that helps!

Hi PrashantP,

I hope you are doing well!

I found out that the GPU job was running in mode=exclusive_process (I’m also attaching the screenshot below). I’m trying some options to see if I can change the mode settings. Do you think it might work if I somehow disable that? Thanks!!

Hi Anurag,

I believe so. The cuDevicePrimaryCtxRetain failure is due to this flag, as per the cuFile.log.
Setting this to CU_COMPUTEMODE_DEFAULT, I anticipate the above failure will go away. If not, we will need to debug this further.
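If you (or the cluster admins) have root on the node, the compute mode can be switched with nvidia-smi. A sketch, assuming the affected device is GPU index 0:

```shell
# Requires administrator privileges on the node.
# -c DEFAULT (equivalently -c 0) sets CU_COMPUTEMODE_DEFAULT, allowing
# multiple host threads and processes to create contexts on the device.
sudo nvidia-smi -i 0 -c DEFAULT
```

Note the setting applies per node and may be reset by the scheduler or a reboot, so on a managed cluster it is usually better to have this changed in the node or job configuration rather than by hand.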

Got it! I also found this IBM Documentation and will check if I can set it to CU_COMPUTEMODE_DEFAULT. Thanks again for your continued support!