cuSPARSE matrix-vector multiplication in half precision runs slower than in single precision

Hi,

I tried to accelerate my code by reducing the numerical precision from float (32 bits) to __half (16 bits).

I’m using cuSPARSE to do matrix-vector multiplications, via the cusparseSpMV generic API.
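
Roughly, my half-precision path follows the usual generic-API pattern, sketched below (CSR storage assumed; the device buffers are placeholders allocated and filled elsewhere, and newer toolkits may require CUDA_R_32F as the compute type):

```
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cuda_fp16.h>

// Sketch of the half-precision SpMV call pattern (CSR assumed).
// dRowPtr, dColInd, dVals, dX, dY are device buffers the caller fills.
void spmv_half(cusparseHandle_t handle, int rows, int cols, int64_t nnz,
               int* dRowPtr, int* dColInd, __half* dVals,
               __half* dX, __half* dY)
{
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    __half alpha = __float2half(1.0f);
    __half beta  = __float2half(0.0f);

    cusparseCreateCsr(&matA, rows, cols, nnz, dRowPtr, dColInd, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_16F);
    cusparseCreateDnVec(&vecX, cols, dX, CUDA_R_16F);
    cusparseCreateDnVec(&vecY, rows, dY, CUDA_R_16F);

    // alpha/beta must match the compute type passed below.
    // CUSPARSE_MV_ALG_DEFAULT is the CUDA 11.0 name; newer toolkits
    // call it CUSPARSE_SPMV_ALG_DEFAULT.
    size_t bufSize = 0;
    void*  dBuf    = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY,
                            CUDA_R_16F, CUSPARSE_MV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY,
                 CUDA_R_16F, CUSPARSE_MV_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
}
```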

I expected roughly a 2x speedup, but to my surprise I got roughly a 2x slowdown!

In my example the sparse matrix dimensions are COLS = 768800, ROWS = 80000, NNZ = 61504000 (80 non-zero elements per column).

Typical run times I’m getting on a GTX 1080 Ti, for single and half precision, are:

FLOAT Ax run time: 3.841 msec.
FLOAT A^T y run time: 2.09 msec.
HALF Ax run time: 9.885 msec.
HALF A^T y run time: 4.666 msec.

I’m using CUDA 11.0 on Fedora 31. The cuSPARSE version is 11.1.1.245, and the driver version is 450.66.

Any comments?
Do I have to configure cuSPARSE specially to get a speedup in half precision?

Thanks in advance,

Ron.

The GTX 1080 Ti has low-rate half-precision throughput, so it generally won’t give interesting performance for half-precision calculations. See here. It’s difficult to say whether there will be a net improvement (due to reduced data storage and bandwidth) or a net reduction (due to the limit on calculation throughput) without studying a specific example.
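
To make the tradeoff concrete, here is a rough traffic estimate for a matrix of your shape (an illustration under assumptions, not a measurement: CSR storage with 4-byte indices, each array counted once):

```
#include <cstdio>

// Back-of-envelope traffic estimate for the matrix in this thread.
int main()
{
    const double nnz   = 61504000.0;
    const double rows  = 80000.0;
    const double cols  = 768800.0;
    const double flops = 2.0 * nnz;       // one multiply + one add per non-zero

    // Bytes moved, counting each array once (a lower bound that
    // ignores cache misses on the x vector).
    auto traffic = [&](double valBytes) {
        return valBytes * nnz              // matrix values
             + 4.0 * nnz                   // column indices (4 bytes either way)
             + 4.0 * (rows + 1.0)          // row offsets
             + valBytes * cols             // x read
             + valBytes * rows;            // y write
    };

    printf("float traffic: %.0f MB, flop/byte = %.2f\n",
           traffic(4.0) / 1e6, flops / traffic(4.0));
    printf("half  traffic: %.0f MB, flop/byte = %.2f\n",
           traffic(2.0) / 1e6, flops / traffic(2.0));
    // float: ~496 MB; half: ~371 MB. Only ~25% less traffic, so even a
    // perfectly bandwidth-bound kernel gains at most ~1.3x, never 2x.
}
```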

Thanks for your quick response!
I’ll try the same example on another GPU with a higher compute capability.

Hi Robert,

I tried the same example on the Xavier GPU, which has compute capability 7.2.

According to the Maximum Instruction Throughput table, this compute capability provides 128 half-precision results per clock cycle per multiprocessor, versus 64 for single precision. As far as I understand, this should translate to roughly half the run time when using half-precision arithmetic.
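
As far as I can tell, that doubled rate comes from instructions that operate on packed half2 pairs, so a kernel written with scalar __half math would not see it. A toy sketch (illustration only, not from my code):

```
#include <cuda_fp16.h>

// Toy illustration: the doubled FP16 rate comes from half2 instructions.
__global__ void axpy_half(const __half* x, __half* y, __half a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma(a, x[i], y[i]);    // one FP16 result per instruction
}

__global__ void axpy_half2(const half2* x, half2* y, half2 a, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)                          // n2 = n / 2 packed pairs
        y[i] = __hfma2(a, x[i], y[i]);   // two FP16 results per instruction
}
```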

The times I’ve got are:

FLOAT Ax run time: 21.228 msec.
FLOAT A^T y run time: 7.862 msec.
HALF Ax run time: 35.775 msec.
HALF A^T y run time: 9.624 msec.

You can see that here, too, there is no improvement, even though the ratios are not as bad as on the GTX 1080 Ti.

What I take from this experiment is that in my example the limiting factor is memory bandwidth, which prevents the much faster arithmetic from paying off.

Am I right?

Just for the record, on the Xavier I’m using CUDA 10.2, and the cuSPARSE library version is 10.3.1.89.

Regards,

Ron.

I would have to study a specific example. Even then, I might not have much insight.

Hi Robert,

Is there a way to configure cuSPARSE to use half2? Its generic interface takes a cudaDataType, which offers values like CUDA_R_16F, but there is no half2 selection there…
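
In my own kernels I can get the packed instructions by reinterpreting the __half buffer as half2, e.g. the toy sketch below (even element count assumed), but I don’t see a way to request the same through the generic API:

```
#include <cuda_fp16.h>

// Toy sketch: reinterpret a __half array as half2 to use packed math
// (n assumed even). Nothing like this can be requested via cudaDataType.
__global__ void scale_as_half2(__half* v, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    half2* v2 = reinterpret_cast<half2*>(v);  // two __half per half2
    half2 s2  = __float2half2_rn(s);          // broadcast s into both lanes
    if (2 * i + 1 < n)
        v2[i] = __hmul2(s2, v2[i]);           // two FP16 multiplies at once
}
```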

Regards,

Ron.

Correct. I don’t have any further suggestions/comments.