I have a small example where cudssExecute(…, CUDSS_PHASE_ANALYSIS, …) works normally on CUDA device 0, but fails on CUDA device 1 if I register a custom cudssDeviceMemHandler_t.
What is happening:
- device 0, custom handler: works
- device 1, no custom handler: works
- device 1, custom handler: fails with CUDSS_STATUS_EXECUTION_FAILED
Possibly useful details:
- I call cudaSetDevice() appropriately before cudssCreate()
- failure happens during analysis phase
- I see the same behavior without multithreading
- I also tried using an explicit cudaMemPool_t and cudaMallocFromPoolAsync() in the handler instead of plain cudaMallocAsync(), with no change
So this seems to be specifically tied to using a custom cuDSS device memory handler on a nonzero device ordinal. Is this a known issue, or am I doing something wrong?