Help with cublasXT - sgemm does not scale

Dear,

I’m trying to use cublasXtSgemm on 4 GPUs.

According to the documentation, the matrices can be either on the host or on the device.

I’m placing all three matrices A, B, and C on device 0.

That is, I cudaMalloc A, B, and C on device 0 and also transfer the values of A and B there.

Then I call cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K, &alpha, d_b, N, d_a, K, &beta, d_c, N);
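For context, the full setup looks roughly like the sketch below (a minimal, hedged example: the matrix size N, the four-GPU device list, and the omitted error checking are assumptions, not taken from the original post). The key cublasXt-specific steps are cublasXtCreate and cublasXtDeviceSelect, which tell the library which GPUs to use:

```cuda
// Minimal sketch of the described setup. Assumes 4 visible GPUs,
// square N x N column-major matrices, and omits error checking.
#include <cublasXt.h>
#include <cuda_runtime.h>

int main() {
    const size_t N = 4096;  // hypothetical size
    float *d_a, *d_b, *d_c;

    // Allocate all three matrices on device 0, as in the post.
    cudaSetDevice(0);
    cudaMalloc(&d_a, sizeof(float) * N * N);
    cudaMalloc(&d_b, sizeof(float) * N * N);
    cudaMalloc(&d_c, sizeof(float) * N * N);
    // ... copy A and B to d_a and d_b with cudaMemcpy ...

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    // Spread the GEMM across all four GPUs.
    int devices[4] = {0, 1, 2, 3};
    cublasXtDeviceSelect(handle, 4, devices);

    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                  &alpha, d_b, N, d_a, N, &beta, d_c, N);

    cublasXtDestroy(handle);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```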

The problem is that cublasXtSgemm does not scale, even for very large matrices. I only see some scaling for square matrices of size 32K×32K.

When I profile the application I see several HtoD and DtoH transfers on all GPUs, which is strange as the matrices are on device 0.

So, I believe I’m using the multi-GPU version of cuBLAS incorrectly.

Can someone help me out with this issue?

Thank you,

Tiago Carneiro

Why is that strange? Depending on your system’s specifications, which you have not disclosed, data exchanged between GPUs may need to be routed through the host.

It is strange because NVIDIA has much more efficient ways to do this than routing these transfers through the host.

So, let me rephrase my question: is there a way to avoid these HtoD and DtoH transfers?

Thanks!

Generally speaking, peer-to-peer communication between GPUs without host involvement is a thing, but it requires satisfying certain system specifications. It is unknown whether your system does so, but what minimal information has been provided suggests that it does not.

In my understanding, the main point of cublasXt is to work around GPU memory capacity limits. It can distribute large GEMMs across multiple GPUs plus the host system, utilizing essentially all of the memory in the entire system (the so-called “out of core” scenario).

Obviously this requires data to be parceled out and shipped around to available computational resources, and some data to be shipped back to assemble a final result. Where peer-to-peer communication is not available, this communication overhead can limit performance, but that does not detract from the intended main benefit of cublasXt, i.e. handling matrix operations that will not fit into a single GPU.
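The out-of-core scenario described above can be sketched as follows (a hedged illustration, not taken from the thread: the matrix size and device list are assumptions). Because cublasXt accepts host pointers, the matrices can exceed any single GPU’s memory, and the library streams tiles out to the GPUs and ships partial results back:

```cuda
// Hedged sketch of the "out of core" usage: matrices live in host
// memory, and cublasXt tiles them across the selected GPUs.
// Assumes 4 GPUs; error checking omitted for brevity.
#include <cublasXt.h>
#include <cstdlib>

int main() {
    const size_t N = 50000;  // hypothetical: larger than one GPU's memory
    float *A = (float*)malloc(sizeof(float) * N * N);
    float *B = (float*)malloc(sizeof(float) * N * N);
    float *C = (float*)malloc(sizeof(float) * N * N);
    // ... initialize A and B on the host ...

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[4] = {0, 1, 2, 3};
    cublasXtDeviceSelect(handle, 4, devices);

    const float alpha = 1.0f, beta = 0.0f;
    // Host pointers are passed directly; the library handles the
    // host<->device tiling and the final assembly of C.
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                  &alpha, A, N, B, N, &beta, C, N);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```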

I am not a cublasXt expert. If you can find language in NVIDIA’s documentation that states that the goal of cublasXt is to accelerate GEMMs by executing them across multiple GPUs, by all means point it out here.

Yes, it does: 4 A100 GPUs in the SXM form factor.

So my question would be: if there is such a feature (GPU-to-GPU transfer), how do I enable it?

cudaDeviceEnablePeerAccess

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html#group__CUDART__PEER
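A minimal sketch of enabling peer access among all visible GPUs before the cublasXt call (an illustration, not code from the thread): cudaDeviceEnablePeerAccess is per-direction and per-device-pair, so it is typically called in a double loop, guarded by cudaDeviceCanAccessPeer:

```cuda
// Hedged sketch: enable P2P access between every pair of visible GPUs.
// On systems where P2P is supported (e.g. A100 SXM with NVLink),
// this lets data move GPU-to-GPU without staging through the host.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);  // peer access is enabled from the current device
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            if (can) {
                // Allows device i to access device j's memory directly.
                cudaDeviceEnablePeerAccess(j, 0);
            } else {
                printf("P2P not available between GPU %d and GPU %d\n", i, j);
            }
        }
    }
    return 0;
}
```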


Thanks, striker! This solves the case. It now scales even for small float32 matrices, and when I profile there are no HtoD transfers on the GPUs.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.