cuDSS custom memory handler fails on non-zero ordinal devices

I have a small example where cudssExecute(…, CUDSS_PHASE_ANALYSIS, …) works normally on CUDA device 0, but fails on CUDA device 1 if I register a custom cudssDeviceMemHandler_t.

What is happening

  • device 0, custom handler: works

  • device 1, no custom handler: works

  • device 1, custom handler: fails with CUDSS_STATUS_EXECUTION_FAILED

Possibly useful details:

  • I call cudaSetDevice() appropriately before cudssCreate()

  • failure happens during analysis phase

  • I see the same behavior without multithreading

  • I also tried using an explicit cudaMemPool_t and cudaMallocFromPoolAsync() in the handler instead of plain cudaMallocAsync(), with no change

So this seems to be specifically tied to using a custom cuDSS device memory handler on a non-zero device ordinal. Is this a known issue, or am I doing something wrong?

Hi @bigsauce!

Sorry about the delay. Currently (cuDSS 0.7.1 and older), it is not possible to set non-default streams when you use multiple devices in the MG mode of cuDSS: cudssSetStream only changes the “main” stream on the default device. I suspect that, for the device memory handler to work on device 1, you need a separate, second stream associated with that device. Is that the case?

As you noticed, the issue is very likely unrelated to multithreading or to the choice between plain cudaMallocAsync() and cudaMallocFromPoolAsync().

We have recently looked into extending the cuDSS MG mode to support setting streams, and I hope this is the only thing that prevents your use case from working. Could you share your reproducer so that we can check the behavior internally?

Thanks,
Kirill