Possible bugs in the cuDSS library

Hello,

I have been using Matlab (latest) together with CUDA (latest) in its mexfunctions to utilize my multiple GPUs for a long time. It basically means writing CUDA-C code, compiling it, and calling it directly from Matlab.

I installed the cuDSS (0.3) library and managed to make it work perfectly with the setup described above on a single GPU. The results and speed are very satisfactory.

I also tried to factorize a large matrix that could not fit into GPU memory, so I had to switch to the CPU-GPU hybrid mode. I’ve realized that if the factors cannot fit into GPU memory, the hybrid mode returns a factorization error. I’ve tried many things, including limiting the amount of memory the GPU can utilize, but I couldn’t find a fix. It could be a bug in the library or in the environment I’m using it in. Either way, I’d like to report this.

The second issue is with the single-thread multi-GPU setup on the same computer. I managed to find the NCCL1 library for Windows and was able to run basic collective operations successfully. After that, I tried to use multiple GPUs to solve a single system of equations. If the factors can fit into one GPU’s memory, the equations are solved successfully, but the other GPUs stand idle. If the factors cannot fit into the first GPU’s memory, the library fails again. This could be due to multiple reasons, the old NCCL library for example; however, I’d like to report this issue too.

These problems could also be related to the mixing of multiple languages, the old NCCL library, or the Windows environment. On the other hand, I have had no issues with other CUDA libraries to this day.

Regards

Deniz

Hi Deniz!

I’m glad you have had some success using cuDSS (at least on a single GPU). I’d like to know more details about both of the issues you have encountered.

As a first step in trying to understand what is going on, could you share the output from runs for both issues with the environment variable CUDSS_LOG_LEVEL=5?
Also, please check the values returned by the calls to the cuDSS APIs. If any of the returned values is not CUDSS_STATUS_SUCCESS, this might point us to the root cause.
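For example, a minimal check around each call could look roughly like this (just a sketch, assuming the handle, config, data, and matrix objects from your existing code; use printf or mexPrintf, whatever you log with):

    cudssStatus_t status = cudssExecute(handle, CUDSS_PHASE_FACTORIZATION,
                                        solverConfig, solverData, A, x, b);
    if (status != CUDSS_STATUS_SUCCESS) {
        printf("cudssExecute (factorization) returned status %d\n", (int)status);
    }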

  1. Now, for the hybrid memory mode, there is a minimal amount of device memory which cuDSS needs to fit into the GPU memory (see cuDSS Advanced Features — NVIDIA cuDSS documentation), but this is usually much lower than the total size of the factors. Notice that if you run the example code for the hybrid mode, there is code there which queries this value via cudssDataGet() with CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN.

Based on what you report, I suspect that the hybrid mode may not have been activated in your run at all. This can happen if the hybrid memory mode has not been turned on before the analysis phase, so please check that this is not the case (a rough sketch of the intended call order is at the end of this message). The extra logging information should help.

  2. For the MGMN mode, could you point us to where you got the NCCL library for Windows? Could you also try some simple MPI ping-pong code with device buffers with this NCCL for Windows?

From what you report, I would also suspect that the MGMN mode has not been activated. The extra logging information should help in this case.
If such a smoke test works, then it is worth investigating the cuDSS behavior. But if it does not, there is no chance =)
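
To make the ordering from point 1 concrete, here is a rough sketch of the intended sequence (only an outline of what the hybrid memory mode example shipped with cuDSS does; I am writing the parameter names and call signatures from memory, so please double-check them against the 0.3.0 headers, and note that return-code checks are omitted):

    // Enable the hybrid memory mode BEFORE the analysis phase
    int hybridMode = 1;
    cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
                   &hybridMode, sizeof(hybridMode));

    cudssExecute(handle, CUDSS_PHASE_ANALYSIS, solverConfig, solverData, A, x, b);

    // After analysis, query the minimal device memory the hybrid mode needs
    int64_t minDeviceMem = 0;
    size_t sizeWritten = 0;
    cudssDataGet(handle, solverData, CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                 &minDeviceMem, sizeof(minDeviceMem), &sizeWritten);

    // Optionally cap the device memory cuDSS may use (in bytes), then factorize
    int64_t deviceMemLimit = 3221225472LL; // e.g. 3 GB
    cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                   &deviceMemLimit, sizeof(deviceMemLimit));

    cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, solverConfig, solverData, A, x, b);

The important part is that the hybrid mode flag is set before the analysis call.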

I hope some of this helps.

Thanks,
Kirill

Hi again,

I’d like to report more on these issues.

I’d like to start with the MGMN mode. I found a version (1.3.4) of NCCL on a GitHub page; it had Visual Studio project files in it, and I compiled it with CUDA 12.3. When I re-examined the documentation for multi-GPU support, I realized I also needed the libcudss_commlayer_nccl library. I found the cudss_commlayer_nccl.cu file in the Linux version of cuDSS, but when I tried to compile it, it required the ncclSend and ncclRecv functions, which are not available in the NCCL 1.x version. So I believe this is a dead end. In any case, the MGMN mode is not so important for me. There also seems to be very little information about NCCL 1.x on the web.

Let me continue with the hybrid mode. I have a system with 128 GB of main memory and multiple RTX 3090s (24 GB of VRAM each). I am also watching the VRAM and main memory usage through various software in real time, so I’ll give you the results of two experiments and some numbers. (I’m working with complex double-precision numbers, matrices, and vectors.)

I started with a smaller matrix and put a 3 GB limit on VRAM allocation. CUDSS_DATA_LU_NNZ told me there would be 596,213,531 non-zero entries in the factors. Multiplying this number by 16 bytes (since they should be double-complex entries) gives about 8.9 GB of RAM usage. (I don’t know whether cuDSS uses the CSR or COO format for the factors L and U.) VRAM usage increased by 3 GB and main memory usage increased by 9 GB, and I was able to solve the system correctly. However, I had also set CUDSS_LOG_LEVEL=5, and I received this info written to a file:

[CUDSS][7344][Info][cudssExecute] With a user-defined hybrid device memory limit 3221225472 hybrid device nnz computed as 167444758 (83722379 = 0.07797254156321287 GB for L and 83722379 = 0.07797254156321287 GB for U)

This info entry seems to be wrong. “hybrid device nnz computed as 167444758” should mean the number of nnz entries kept on the device; I figured this out while increasing the allowed VRAM. However, 83,722,379 entries should not be equal to 0.07797254156321287 GB.

When I increased the allowed VRAM (to 24 GB), it still allocated 9 GB of data on the host side. When I turned off the hybrid mode, the factorization time dropped from 35 s to 25 s. So I think it still tries to use the host side even when it is unnecessary.
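
For reference, this is roughly how I read the nnz count and turn it into the memory estimate in my mex code (simplified from my actual code, so treat the exact call signature as approximate; error checks removed):

    // After the analysis phase, ask cuDSS how many nonzeros the factors will have
    int64_t luNnz = 0;
    size_t sizeWritten = 0;
    cudssDataGet(handle, solverData, CUDSS_DATA_LU_NNZ,
                 &luNnz, sizeof(luNnz), &sizeWritten);

    // 16 bytes per double-complex entry -> rough size of the factor values
    double factorsGB = (double)luNnz * 16.0 / (1024.0 * 1024.0 * 1024.0);
    mexPrintf("LU nnz = %lld, ~%.2f GB for the factor values\n",
              (long long)luNnz, factorsGB);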

For the second experiment, I used a larger matrix whose factors would use 15-16 GB of VRAM; it is perfectly solvable with the hybrid mode off. When I switched to hybrid mode and set the VRAM allocation limit to 24 GB, I received this info:

[Info][cudssExecute] With a user-defined hybrid device memory limit 25769803776 hybrid device nnz computed as 1561967081 (479363720 = 0.44644225388765335 GB for L and 475169509 = 0.4425360905006528 GB for U)

CUDSS_DATA_LU_NNZ reports that there would be 954,533,229 entries, which is consistent with L_nnz = 479,363,720 plus U_nnz = 475,169,509. As in the previous experiment, the GB values do not make sense, and it again tried to allocate 15-16 GB on the host side. However, unlike the previous experiment, this time the factorization phase exits with the CUDSS_STATUS_EXECUTION_FAILED error.
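
As a quick sanity check on these counts:

    479,363,720 + 475,169,509 = 954,533,229 entries
    954,533,229 entries x 16 bytes = roughly 15.3e9 bytes (about 14.2 GB counting 1024^3 bytes per GB)

which is roughly the 15-16 GB I see being allocated on the host side.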

These results suggest there is a bug in the hybrid mode.

Cheers
Deniz

Hi Deniz!

Let me address the issues:

  1. For the MGMN mode: AFAIK, NCCL does not support Windows, and the fork you have found is heavily outdated, so unless its authors extend support to newer releases of NCCL, it will not work with cuDSS.
    In general, cuDSS would work with any GPU-aware communication backend (one able to take in device buffers), e.g. GPU-aware MPI. I am not sure which of them work on Windows, though.
  2. For the internal format of the factors: cuDSS uses a custom data structure (as is typical for direct sparse solvers), but a first-order estimate of nnz * sizeof(data_type) is good enough.
  3. For the reported misinformation:
    “[CUDSS][7344][Info][cudssExecute] With a user-defined hybrid device memory limit 3221225472 hybrid device nnz computed as 167444758 (83722379 = 0.07797254156321287 GB for L and 83722379 = 0.07797254156321287 GB for U)”
    I believe we did not multiply the #nnz by the sizeof of the data type (0.077 = 83722379 / (1024^3)); see the corrected numbers right after this list. We will fix this in the next release.
  4. “So I think it still tries to use the host side when it is unnecessary”
    Yes, unfortunately, this is how the hybrid memory mode works in cuDSS 0.3.0. This will definitely be fixed, as there is no good reason to have extra overheads and memory allocations when everything fits onto the device.
  5. About “[Info][cudssExecute] With a user-defined hybrid device memory limit 25769803776 hybrid device nnz computed as 1561967081 (479363720 = 0.44644225388765335 GB for L and 475169509 = 0.4425360905006528 GB for U)”:
    I believe this is also a bug in the log information (I guess 1561967081 should have been 954533229).
  6. Last, but maybe one of the most important issues: the CUDSS_STATUS_EXECUTION_FAILED.
    Do I understand correctly that the default mode solved the problem while the hybrid memory mode reported CUDSS_STATUS_EXECUTION_FAILED?
    If so, could you share your matrix with us (maybe in the email thread with cuDSS-EXTERNAL-Group@nvidia.com)? It would be very helpful for understanding what exactly happens.
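
To make point 3 concrete, with the sizeof of the data type included, the first log line should have read approximately:

    83,722,379 nnz x 16 bytes = roughly 1.34e9 bytes (about 1.25 GB) for L, and the same for U

instead of the 0.077 GB that was printed.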

Thanks,
Kirill

Sure, I’ll prepare an mtx file and send it to that email address. I’m already in contact with them.

  1. I only realized that the nnz number, 1561967081, was wrong there when you pointed it out. I repeated the experiments on two systems (one with an RTX 4090 and another with an RTX 3090). They both report the same number.

  2. Yes, when the hybrid mode is off it works, but when it is on it fails on my RTX 3090. Today, while preparing to upload the matrix, I made another attempt with the same matrix on a very similar system, but with an RTX 4090. Interestingly, the hybrid option worked there (it still fails on the system with the RTX 3090). It is a different system, but it has the same amount of memory and the same software and versions installed. I’m kind of lost.

I uploaded the matrix so maybe it can help you.

Deniz

Hi again,

Further testing on different systems suggests to me that there might be no bug in the cuDSS library’s hybrid feature.

Both the RTX 3090 and the RTX 4090 can use the hybrid mode on a different system (computer) without a problem.

On the other hand, the system that I use for high-performance computing, which has multiple GPUs and on which I regularly run my CUDA codes, causes this problem. So, at this point, I can say for other readers that using a regular PC with a single GPU on it probably won’t cause any issues.

Deniz