Why is cublasHgemm slower than cublasSgemm when I use it?

I use CUDA Toolkit 8.0.27; my GPU is a GTX 1070, compute capability 6.1.
The compile flags are -gencode arch=compute_61,code=sm_61 -L/usr/local/cuda-8.0/lib64 -lcublas -lculibos -lcudart_static -lpthread -lm -lstdc++ -ldl.

In my code I use the float type with cublasSgemm, and it works very well. My program takes about 2-3 seconds in total.
The nvprof result is:

==13450== Profiling result:
Time(%) Time Calls Avg Min Max Name
15.98% 110.96ms 44991 2.4660us 1.9520us 3.8090us some_function
15.90% 110.36ms 29994 3.6790us 2.9760us 8.1930us void gemv2T_kernel_val<float, float, float,
14.78% 102.62ms 29994 3.4210us 2.0480us 5.1200us _some_function
9.04% 62.789ms 44991 1.3950us 896ns 2.7520us _some_function
8.14% 56.518ms 1252 45.142us 5.1200us 91.138us void gemmSN_TN_kernel<float, float, float,
7.65% 53.125ms 44991 1.1800us 704ns 2.0490us some_function
7.46% 51.773ms 44991 1.1500us 704ns 2.0490us _some_function
6.84% 47.476ms 44991 1.0550us 512ns 2.0480us some_function
6.60% 45.803ms 44991 1.0180us 416ns 2.0490us _some_function
2.52% 17.486ms 36270 482ns 0ns 65.089us [CUDA memcpy HtoD]
2.40% 16.681ms 19109 872ns 32ns 2.0490us _some_function
2.06% 14.272ms 16159 883ns 32ns 2.0480us [CUDA memcpy DtoD]
0.51% 3.5717ms 4999 714ns 352ns 1.0880us [CUDA memcpy DtoH]
0.09% 635.47us 313 2.0300us 1.6960us 2.7840us some_function
0.02% 115.71us 38 3.0450us 1.0240us 8.1920us [CUDA memset]

==13450== API calls:
Time(%) Time Calls Avg Min Max Name
34.31% 1.36219s 351504 3.8750us 3.1590us 1.1969ms cudaLaunch
18.22% 723.23ms 93773 7.7120us 2.6380us 277.74ms cudaMalloc
14.15% 561.69ms 124986 4.4940us 145ns 270.75us cudaFree
8.05% 319.55ms 2135289 149ns 123ns 392.52us cudaSetupArgument
7.43% 294.76ms 58173 5.0660us 3.1720us 274.24us cudaMemcpy
6.06% 240.65ms 499936 481ns 392ns 275.37us cudaEventCreateWithFlags
4.91% 194.97ms 499936 389ns 331ns 284.66us cudaEventDestroy
2.43% 96.501ms 343706 280ns 241ns 270.58us cudaDeviceGetAttribute
2.37% 94.236ms 62529 1.5070us 1.2170us 271.65us cudaThreadSynchronize
1.57% 62.422ms 351504 177ns 134ns 271.44us cudaConfigureCall
0.30% 12.005ms 31246 384ns 305ns 256.16us cudaGetDevice
0.15% 5.8590ms 31246 187ns 142ns 256.76us cudaGetLastError
0.02% 814.86us 38 21.443us 4.9440us 571.18us cudaMemset
0.01% 463.48us 2 231.74us 171.41us 292.07us cuDeviceTotalMem
0.01% 386.39us 178 2.1700us 117ns 82.661us cuDeviceGetAttribute
0.00% 61.830us 2 30.915us 22.529us 39.301us cuDeviceGetName
0.00% 1.9480us 4 487ns 193ns 1.2020us cuDeviceGetCount
0.00% 1.2780us 4 319ns 190ns 575ns cuDeviceGet
0.00% 602ns 1 602ns 602ns 602ns cuInit
0.00% 330ns 1 330ns 330ns 330ns cuDriverGetVersion

The average cublasSgemm kernel time (gemv2T_kernel_val) is 3.6790us.

But when I use half types, it also works correctly (the input and output data are right), yet it is a lot slower compared with float and cublasSgemm.
Here is the nvprof result:

==14951== Profiling result:
Time(%) Time Calls Avg Min Max Name
84.59% 52.7013s 15310 3.4423ms 1.1148ms 4.3508ms maxwell_hgemm_256x128_raggedMn_tn
14.48% 9.02057s 15936 566.05us 180.22us 727.02us maxwell_hgemm_128x128_raggedMn_tn
0.23% 142.23ms 29994 4.7420us 4.0950us 6.8480us some_function
0.22% 139.07ms 44991 3.0900us 2.4950us 4.7360us some_function
0.09% 55.334ms 44991 1.2290us 640ns 2.0480us some_function
0.09% 55.020ms 44991 1.2220us 576ns 2.0480us some_function
0.08% 49.725ms 44991 1.1050us 480ns 2.0480us some_function
0.07% 43.692ms 44991 971ns 415ns 2.0480us some_function
0.07% 43.219ms 44991 960ns 416ns 2.0480us some_function
0.02% 15.080ms 36270 415ns 0ns 64.734us some_function
0.02% 13.946ms 17687 788ns 288ns 1.7920us some_function
0.02% 9.8220ms 13134 747ns 320ns 2.0160us some_function
0.01% 4.2402ms 4658 910ns 544ns 1.6640us some_function
0.00% 2.9544ms 3928 752ns 320ns 5.1200us some_function
0.00% 1.5972ms 4999 319ns 287ns 640ns some_function
0.00% 536.24us 313 1.7130us 1.0230us 2.0480us some_function
0.00% 85.022us 35 2.4290us 1.0240us 5.8240us some_function
0.00% 5.1200us 3 1.7060us 1.0240us 2.0480us some_function

==14951== API calls:
Time(%) Time Calls Avg Min Max Name
94.38% 62.1692s 135009 460.48us 140ns 4.3655ms cudaFree
2.31% 1.52488s 361530 4.2170us 3.4280us 1.1711ms cudaLaunch
1.15% 759.98ms 103796 7.3210us 2.7450us 261.96ms cudaMalloc
0.56% 367.93ms 2502810 147ns 121ns 307.96us cudaSetupArgument
0.48% 317.28ms 58173 5.4540us 4.2140us 287.09us cudaMemcpy
0.36% 240.14ms 499936 480ns 383ns 281.02us cudaEventCreateWithFlags
0.30% 195.75ms 499936 391ns 331ns 271.03us cudaEventDestroy
0.15% 98.841ms 343706 287ns 227ns 270.51us cudaDeviceGetAttribute
0.15% 98.441ms 62529 1.5740us 1.2370us 272.37us cudaThreadSynchronize
0.12% 77.958ms 361530 215ns 165ns 299.14us cudaConfigureCall
0.02% 14.119ms 31246 451ns 333ns 271.05us cudaGetDevice
0.01% 7.4253ms 31246 237ns 163ns 257.63us cudaGetLastError
0.00% 903.15us 35 25.804us 5.3980us 653.02us cudaMemset
0.00% 428.04us 2 214.02us 196.55us 231.50us cuDeviceTotalMem
0.00% 346.29us 178 1.9450us 128ns 71.541us cuDeviceGetAttribute
0.00% 51.773us 2 25.886us 20.593us 31.180us cuDeviceGetName
0.00% 13.388us 4 3.3470us 195ns 12.473us cuDeviceGet
0.00% 1.6050us 4 401ns 114ns 1.0470us cuDeviceGetCount
0.00% 664ns 1 664ns 664ns 664ns cuDriverGetVersion
0.00% 577ns 1 577ns 577ns 577ns cuInit

With cublasHgemm, maxwell_hgemm_256x128_raggedMn_tn and maxwell_hgemm_128x128_raggedMn_tn take a lot of time, and cudaFree becomes very slow too.

Here is a sample snippet from my program (the leading arguments of the cublasHgemm call were cut off in the original post):

cublasHandle_t handle;
cublasCreate(&handle);
cublasHgemm(handle, /* transpose modes, m/n/k, alpha and the B arguments were cut off here */,
            (__half*)a, A->stride, (__half*)&_beta, (__half*)c, C->stride);

a, b and c are allocated on the device with cublasAlloc, and the data is correctly converted to half.

It really confuses me: both the float and half versions produce correct results, but cublasHgemm is so slow.

Is there any additional configuration that should be done before using cublasHgemm?

I think maybe I forgot to do something, but I really can't figure out what.

I see in the profile result that cublasHgemm calls maxwell_hgemm_256x128_raggedMn_tn, but my GPU is a GTX 1070, which is the Pascal architecture. Could that be the cause?

Because the GTX 1070 has very low throughput for FP16.

The only Pascal GPU with “fast” FP16 currently is the P100.

You can get a sense of the relative throughputs of the various modes by looking at the instruction throughput table in the programming guide.


The net effect is that FP32 arithmetic is much, much faster than FP16 arithmetic on an sm_61 device.

Thanks txbob,
I will run some benchmarks. That really helps me a lot.

Hi gu_xiangtao,

I just tested on a P100 with a simple benchmark code here for your comparison:

It shows around a 2x speedup for cublasHgemm over cublasSgemm once the square matrix size exceeds 2048, i.e. 2048*2048 elements per matrix.
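For anyone who wants to try this themselves, a minimal timing sketch along these lines could look like the following (not the exact benchmark above; the matrix size, event-based timing, and lack of result verification are my own choices, and a real benchmark should run each GEMM once as a warm-up before timing):

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Time one n x n cublasSgemm vs cublasHgemm call with CUDA events.
int main() {
    const int n = 2048;
    cublasHandle_t handle;
    cublasCreate(&handle);

    float  *sa, *sb, *sc;
    __half *ha, *hb, *hc;
    cudaMalloc(&sa, n * n * sizeof(float));
    cudaMalloc(&sb, n * n * sizeof(float));
    cudaMalloc(&sc, n * n * sizeof(float));
    cudaMalloc(&ha, n * n * sizeof(__half));
    cudaMalloc(&hb, n * n * sizeof(__half));
    cudaMalloc(&hc, n * n * sizeof(__half));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float  alpha  = 1.0f, beta = 0.0f;
    __half halpha = __float2half(1.0f), hbeta = __float2half(0.0f);
    float  ms;

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, sa, n, sb, n, &beta, sc, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Sgemm: %.3f ms\n", ms);

    cudaEventRecord(start);
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &halpha, ha, n, hb, n, &hbeta, hc, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("Hgemm: %.3f ms\n", ms);

    cublasDestroy(handle);
    return 0;
}
```

On a P100 the Hgemm time should come out clearly lower; on an sm_61 card like the GTX 1070 you should see it come out much higher, matching the profiles in this thread.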

Why is cublasHgemm slower than cublasSgemm when I use it on a Titan V?

Is there any setting for how the data is transferred from storage into the compute units?
If the data type is FP16, how is it transferred on the bus?