in-place cublasSgemm

sgemm: C := alpha*op( A )op( B ) + betaC

the netlib sgemm doesn’t work when B is the same address with C, but cublasSgemm seems not to have this problem, well, almost.

float fzero = 0.0;

float pone = 2;

sgemm_ ("T", "N", &m, &n, &m, &pone, h_A, &m, h_b, &m, &fzero, h_c, &m);

//----- do cublas out of place-----//

cublasSgemm ('T', 'N', m, n, m, pone, d_A, m, d_b1, m, 0, d_c, m);


//----- do cublas in place-----//

cublasSgemm ('T', 'N', m, n, m, pone, d_A, m, d_b1, m, 0, d_b1, m);


where d_A is a copy of h_A on GPU (GTX-280), and d_b1 is a copy of h_b.

So here the netlib sgemm call is producing the correct result, which is compared with in-place and out-of-place calls to cublasSgemm.

Now here is the result:

M	 N			  diff

10	256	32	7.732946e-08

11	256	32	7.548205e-08

12	256	32	7.748339e-08

13	256	32	7.362908e-08

14	256	32	7.567394e-08

15	256	32	7.969494e-08

16	256	32	8.329340e-08

17	256	32	7.053869e-01

18	256	32	1.362628e+00

19	256	32	1.715566e+00

20	256	32	2.332837e+00

21	256	32	8.026725e-08

22	256	32	1.315942e+00

23	256	32	7.964381e-08

24	256	32	1.191590e+00

25	256	32	7.964050e-08

26	256	32	7.814442e-08

27	256	32	7.694383e-08

28	256	32	7.790614e-08

29	256	32	8.144824e-08

30	256	32	8.381258e-08

Please ignore the 32 which is a parameter for other part of my computation. The diff is the F normal of element-wise difference computed by slange(…). xxxx-08 means both in-place and out-of-place calls to cublasSgemm give the correct result agreeing that of the netlib sgemm.

Now clearly for some cases cublasSgemm doesn’t work with in-place, like when M=17, and I tried cuda-1.1, 2.0, 2.2 and 2.3 and all come off the same. Also note that this problem doesn’t exist when transa=transb=‘N’, meaning no transpose.

So any idea about this or any chance this could be fixed in cublas?


Peng Du

University of Tennessee, Knoxville

your in-place method

//----- do cublas in place-----//

cublasSgemm ('T', 'N', m, n, m, pone, d_A, m, d_b1, m, 0, d_b1, m);

is “B = pone * transpose(A) * B + 0 * B”.

This has race condition since not all thread blocks execute simultaneously when m, n are large,

for example, m = n = k = 1024 or higher.

that might be it. But what confused me is that this doesn’t happen with the non-transpose case or I haven’t met one. Guess just need to write my own in place kernel.

I haven’t seen the code for SGEMM in cublas, but it is quite possible that “in-place” really isn’t for the transpose case. For non-powers of two/multiples of 16 (where memory coalescing is difficult/impossible to achieve), it could be that the code binds the matrix to transpose to a texture and the multiply kernel reads from the texture, rather directly from the matrix storage.