cuSPARSE matrix-vector multiplication in half precision runs slower than in single precision

Hi,

I tried to accelerate my code by reducing the numerical precision from float (32 bits) to __half (16 bits).

I’m using cuSPARSE to do matrix-vector multiplications, via the cusparseSpMV generic API.
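
Roughly, my half-precision path follows the usual generic-API pattern, sketched below (CSR storage assumed; the device buffers are placeholders allocated and filled elsewhere, and newer toolkits may require CUDA_R_32F as the compute type):

```
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cuda_fp16.h>

// Sketch of the half-precision SpMV call pattern (CSR assumed).
// dRowPtr, dColInd, dVals, dX, dY are device buffers the caller fills.
void spmv_half(cusparseHandle_t handle, int rows, int cols, int64_t nnz,
               int* dRowPtr, int* dColInd, __half* dVals,
               __half* dX, __half* dY)
{
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    __half alpha = __float2half(1.0f);
    __half beta  = __float2half(0.0f);

    cusparseCreateCsr(&matA, rows, cols, nnz, dRowPtr, dColInd, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_16F);
    cusparseCreateDnVec(&vecX, cols, dX, CUDA_R_16F);
    cusparseCreateDnVec(&vecY, rows, dY, CUDA_R_16F);

    // alpha/beta must match the compute type passed below.
    // CUSPARSE_MV_ALG_DEFAULT is the CUDA 11.0 name; newer toolkits
    // call it CUSPARSE_SPMV_ALG_DEFAULT.
    size_t bufSize = 0;
    void*  dBuf    = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY,
                            CUDA_R_16F, CUSPARSE_MV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY,
                 CUDA_R_16F, CUSPARSE_MV_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
}
```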

I expected roughly a 2x speedup, but to my surprise I got roughly a 2x slowdown!

In my example the sparse matrix dimensions are COLS = 768800, ROWS = 80000, NNZ = 61504000 (80 non-zero elements per column).

Typical run times I’m getting on a GTX 1080 Ti, for single and half precision, are:

FLOAT Ax run time: 3.841 msec.
FLOAT A^T y run time: 2.09 msec.
HALF Ax run time: 9.885 msec.
HALF A^T y run time: 4.666 msec.

I’m using CUDA 11.0 on Fedora 31. The cuSPARSE version is 11.1.1.245, and the driver version is 450.66.

Any comments?
Do I have to configure cuSPARSE specially to get a speedup in half precision?

Thanks in advance,

Ron.

The GTX 1080 Ti has low-rate half-precision throughput, so it generally won’t give interesting performance for half-precision calculations. See here. It’s difficult to say whether there will be a net improvement (due to reduced data storage and bandwidth) or a net reduction (due to the limit on calculation throughput) without studying a specific example.
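
To make the tradeoff concrete, here is a rough traffic estimate for a matrix of your shape (an illustration under assumptions, not a measurement: CSR storage with 4-byte indices, each array counted once):

```
#include <cstdio>

// Back-of-envelope traffic estimate for the matrix in this thread.
int main()
{
    const double nnz   = 61504000.0;
    const double rows  = 80000.0;
    const double cols  = 768800.0;
    const double flops = 2.0 * nnz;       // one multiply + one add per non-zero

    // Bytes moved, counting each array once (a lower bound that
    // ignores cache misses on the x vector).
    auto traffic = [&](double valBytes) {
        return valBytes * nnz              // matrix values
             + 4.0 * nnz                   // column indices (4 bytes either way)
             + 4.0 * (rows + 1.0)          // row offsets
             + valBytes * cols             // x read
             + valBytes * rows;            // y write
    };

    printf("float traffic: %.0f MB, flop/byte = %.2f\n",
           traffic(4.0) / 1e6, flops / traffic(4.0));
    printf("half  traffic: %.0f MB, flop/byte = %.2f\n",
           traffic(2.0) / 1e6, flops / traffic(2.0));
    // float: ~496 MB; half: ~371 MB. Only ~25% less traffic, so even a
    // perfectly bandwidth-bound kernel gains at most ~1.3x, never 2x.
}
```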

Thanks for your quick response!
I’ll try the same example on another GPU with a higher compute capability.

Hi Robert,

I tried the same example on the Xavier GPU, which has compute capability 7.2.

According to the Maximum Instruction Throughput table, this compute capability provides 128 half-precision results per clock cycle per multiprocessor, versus 64 for single precision. As far as I understand, this should translate to roughly half the run time when using half-precision arithmetic.
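
As far as I can tell, that doubled rate comes from instructions that operate on packed half2 pairs, so a kernel written with scalar __half math would not see it. A toy sketch (illustration only, not from my code):

```
#include <cuda_fp16.h>

// Toy illustration: the doubled FP16 rate comes from half2 instructions.
__global__ void axpy_half(const __half* x, __half* y, __half a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma(a, x[i], y[i]);    // one FP16 result per instruction
}

__global__ void axpy_half2(const half2* x, half2* y, half2 a, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)                          // n2 = n / 2 packed pairs
        y[i] = __hfma2(a, x[i], y[i]);   // two FP16 results per instruction
}
```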

The times I’ve got are:

FLOAT Ax run time: 21.228 msec.
FLOAT A^T y run time: 7.862 msec.
HALF Ax run time: 35.775 msec.
HALF A^T y run time: 9.624 msec.

You can see that here, too, there is no improvement, even though the ratios are not as bad as on the GTX 1080 Ti.

What I take from this experiment is that in my example the limiting factor is memory bandwidth, which prevents the much faster arithmetic from paying off.

Am I right?

Just for the record, on the Xavier I’m using CUDA 10.2, and the cuSPARSE library version is 10.3.1.89.

Regards,

Ron.

I would have to study a specific example. Even then, I might not have much insight.

Hi Robert,

Is there a way to configure cuSPARSE to use half2? Its generic interface takes a cudaDataType, which offers values like CUDA_R_16F, but there is no half2 selection there…
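
In my own kernels I can get the packed instructions by reinterpreting the __half buffer as half2, e.g. the toy sketch below (even element count assumed), but I don’t see a way to request the same through the generic API:

```
#include <cuda_fp16.h>

// Toy sketch: reinterpret a __half array as half2 to use packed math
// (n assumed even). Nothing like this can be requested via cudaDataType.
__global__ void scale_as_half2(__half* v, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    half2* v2 = reinterpret_cast<half2*>(v);  // two __half per half2
    half2 s2  = __float2half2_rn(s);          // broadcast s into both lanes
    if (2 * i + 1 < n)
        v2[i] = __hmul2(s2, v2[i]);           // two FP16 multiplies at once
}
```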

Regards,

Ron.

Correct. I don’t have any further suggestions/comments.