Help with cublasXT - sgemm does not scale

Dear,

I’m trying to use cublasXtSgemm on 4 GPUs.

According to the documentation, the matrices can be either on the host or on the device.

I’m placing all three matrices A, B, and C on device 0.

That is, I cudaMalloc A, B, and C on device 0 and also transfer the values of A and B there.

Then I call cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K, &alpha, d_b, N, d_a, K, &beta, d_c, N);
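For context, the full setup looks roughly like the sketch below (a minimal, hedged example: the matrix size N, the four-GPU device list, and the omitted error checking are assumptions, not taken from the original post). The key cublasXt-specific steps are cublasXtCreate and cublasXtDeviceSelect, which tell the library which GPUs to use:

```cuda
// Minimal sketch of the described setup. Assumes 4 visible GPUs,
// square N x N column-major matrices, and omits error checking.
#include <cublasXt.h>
#include <cuda_runtime.h>

int main() {
    const size_t N = 4096;  // hypothetical size
    float *d_a, *d_b, *d_c;

    // Allocate all three matrices on device 0, as in the post.
    cudaSetDevice(0);
    cudaMalloc(&d_a, sizeof(float) * N * N);
    cudaMalloc(&d_b, sizeof(float) * N * N);
    cudaMalloc(&d_c, sizeof(float) * N * N);
    // ... copy A and B to d_a and d_b with cudaMemcpy ...

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    // Spread the GEMM across all four GPUs.
    int devices[4] = {0, 1, 2, 3};
    cublasXtDeviceSelect(handle, 4, devices);

    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                  &alpha, d_b, N, d_a, N, &beta, d_c, N);

    cublasXtDestroy(handle);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```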

The problem is that cublasXtSgemm does not scale, even for very large matrices. I only see some scaling for square matrices of size 32K×32K.

When I profile the application I see several HtoD and DtoH transfers on all GPUs, which is strange as the matrices are on device 0.

So, I believe I’m using the multi-GPU version of cuBLAS incorrectly.

Can someone help me out with this issue?

Thank you,

Tiago Carneiro

Why is that strange? Depending on your system’s specifications, which you have not disclosed, data exchanged between GPUs may need to be routed through the host.

It is strange because NVIDIA has much more efficient ways to do this than routing these transfers through the host.

So, let me rephrase my question: is there a way to avoid these HtoD and DtoH transfers?

Thanks!

Generally speaking, peer-to-peer communication between GPUs without host involvement is a thing, but it requires satisfying certain system specifications. It is unknown whether your system does so, but what minimal information has been provided suggests that it does not.

In my understanding, the main point of cublasXt is to work around GPU memory capacity limits. It can distribute large GEMMs across multiple GPUs plus the host system, utilizing essentially all of the memory in the entire system (the so-called “out of core” scenario).

Obviously this requires data to be parceled out and shipped around to available computational resources, and some data to be shipped back to assemble a final result. Where peer-to-peer communication is not available, this communication overhead can limit performance, but that does not detract from the intended main benefit of cublasXt, i.e. handling matrix operations that will not fit into a single GPU.
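The out-of-core scenario described above can be sketched as follows (a hedged illustration, not taken from the thread: the matrix size and device list are assumptions). Because cublasXt accepts host pointers, the matrices can exceed any single GPU’s memory, and the library streams tiles out to the GPUs and ships partial results back:

```cuda
// Hedged sketch of the "out of core" usage: matrices live in host
// memory, and cublasXt tiles them across the selected GPUs.
// Assumes 4 GPUs; error checking omitted for brevity.
#include <cublasXt.h>
#include <cstdlib>

int main() {
    const size_t N = 50000;  // hypothetical: larger than one GPU's memory
    float *A = (float*)malloc(sizeof(float) * N * N);
    float *B = (float*)malloc(sizeof(float) * N * N);
    float *C = (float*)malloc(sizeof(float) * N * N);
    // ... initialize A and B on the host ...

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[4] = {0, 1, 2, 3};
    cublasXtDeviceSelect(handle, 4, devices);

    const float alpha = 1.0f, beta = 0.0f;
    // Host pointers are passed directly; the library handles the
    // host<->device tiling and the final assembly of C.
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                  &alpha, A, N, B, N, &beta, C, N);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```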

I am not a cublasXt expert. If you can find language in NVIDIA’s documentation that states that the goal of cublasXt is to accelerate GEMMs by executing them across multiple GPUs, by all means point it out here.

Yes, it does: 4 A100 GPUs in the SXM form factor.

So my question would be: if there is such a feature (GPU-to-GPU transfer), how do I enable it?

cudaDeviceEnablePeerAccess

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html#group__CUDART__PEER
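A minimal sketch of enabling peer access among all visible GPUs before the cublasXt call (an illustration, not code from the thread): cudaDeviceEnablePeerAccess is per-direction and per-device-pair, so it is typically called in a double loop, guarded by cudaDeviceCanAccessPeer:

```cuda
// Hedged sketch: enable P2P access between every pair of visible GPUs.
// On systems where P2P is supported (e.g. A100 SXM with NVLink),
// this lets data move GPU-to-GPU without staging through the host.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);  // peer access is enabled from the current device
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            if (can) {
                // Allows device i to access device j's memory directly.
                cudaDeviceEnablePeerAccess(j, 0);
            } else {
                printf("P2P not available between GPU %d and GPU %d\n", i, j);
            }
        }
    }
    return 0;
}
```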


Thanks, striker! This solves the case. It now scales even for small float32 matrices, and when I profile there are no HtoD transfers on the GPUs.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.