sgemm: C := alpha*op( A )op( B ) + betaC
the netlib sgemm doesn’t work when B is the same address with C, but cublasSgemm seems not to have this problem, well, almost.
float fzero = 0.0;
float pone = 2;
sgemm_ ("T", "N", &m, &n, &m, &pone, h_A, &m, h_b, &m, &fzero, h_c, &m);
//----- do cublas out of place-----//
cublasSgemm ('T', 'N', m, n, m, pone, d_A, m, d_b1, m, 0, d_c, m);
cudaThreadSynchronize();
//----- do cublas in place-----//
cublasSgemm ('T', 'N', m, n, m, pone, d_A, m, d_b1, m, 0, d_b1, m);
cudaThreadSynchronize();
where d_A is a copy of h_A on GPU (GTX-280), and d_b1 is a copy of h_b.
So here the netlib sgemm call is producing the correct result, which is compared with in-place and out-of-place calls to cublasSgemm.
Now here is the result:
M N diff
10 256 32 7.732946e-08
11 256 32 7.548205e-08
12 256 32 7.748339e-08
13 256 32 7.362908e-08
14 256 32 7.567394e-08
15 256 32 7.969494e-08
16 256 32 8.329340e-08
17 256 32 7.053869e-01
18 256 32 1.362628e+00
19 256 32 1.715566e+00
20 256 32 2.332837e+00
21 256 32 8.026725e-08
22 256 32 1.315942e+00
23 256 32 7.964381e-08
24 256 32 1.191590e+00
25 256 32 7.964050e-08
26 256 32 7.814442e-08
27 256 32 7.694383e-08
28 256 32 7.790614e-08
29 256 32 8.144824e-08
30 256 32 8.381258e-08
Please ignore the 32 which is a parameter for other part of my computation. The diff is the F normal of element-wise difference computed by slange(…). xxxx-08 means both in-place and out-of-place calls to cublasSgemm give the correct result agreeing that of the netlib sgemm.
Now clearly for some cases cublasSgemm doesn’t work with in-place, like when M=17, and I tried cuda-1.1, 2.0, 2.2 and 2.3 and all come off the same. Also note that this problem doesn’t exist when transa=transb=‘N’, meaning no transpose.
So any idea about this or any chance this could be fixed in cublas?
Thanks!
Peng Du
University of Tennessee, Knoxville