CUBLAS large matrix multiplication: large matrices won't compute


This question is for anyone familiar with cublas. I am currently running in emulation mode and I’m trying to perform the following matrix computation.

A is a 1 x 15,000,000 vector (yeah yeah, 15 million…)

B is a 15,000,000 by 16 matrix

I’m trying to compute the product A times B, which should be a 1 x 16 vector, using sgemm.

I have the following call:

cublasSgemm( 't','n', m, n, k, alpha, vec, k, mat, k, beta, c, m);

with m = 1, n = 16; k = 15,000,000;
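For checking results on smaller sizes, a plain CPU reference of the same product can help. This is a minimal sketch, assuming the column-major layout implied by the call above (vec holds k floats, mat is k x n with leading dimension k). The name `ref_vec_times_mat` is hypothetical and not part of CUBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Reference for c = alpha * vec^T * mat + beta * c with m = 1.
 * vec: k floats (the k x 1 column that sgemm transposes via 't').
 * mat: k x n, column-major, leading dimension k.
 * c:   n floats (the 1 x n result). */
void ref_vec_times_mat(size_t k, size_t n, float alpha,
                       const float *vec, const float *mat,
                       float beta, float *c)
{
    assert(vec != NULL && mat != NULL && c != NULL);
    for (size_t j = 0; j < n; ++j) {
        const float *col = mat + j * k;  /* column j starts at offset j * k */
        double acc = 0.0;                /* double accumulator limits rounding error over long sums */
        for (size_t i = 0; i < k; ++i)
            acc += (double)vec[i] * (double)col[i];
        /* Follow the BLAS convention: when beta is zero, the old c is not read. */
        c[j] = (float)(alpha * acc + (beta == 0.0f ? 0.0 : beta * c[j]));
    }
}
```

Comparing its output against the sgemm result for, say, k = 1,000 is a quick way to confirm the arguments and leading dimensions before scaling up.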

The program goes into the matrix multiply with no problem (no errors reported) but just sits there (I’m guessing forever… it’ll easily sit and “compute” for 30 minutes).

Now I think the problem could be that I’m trying to allocate well over a gig of memory on the CPU (remember, I’m running in emulation mode), and since I don’t have nearly that much RAM, the paging is causing the computation to run very, very slowly.
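For a rough sense of scale, the operand sizes can be worked out directly. A minimal sketch, assuming 4-byte single-precision floats; `sgemm_operand_bytes` is a hypothetical helper, not a CUBLAS function.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for the three single-precision operands of an
 * (m x k) times (k x n) product: A, B and the m x n result. */
uint64_t sgemm_operand_bytes(uint64_t m, uint64_t n, uint64_t k)
{
    assert(m > 0 && n > 0 && k > 0);
    uint64_t elems = m * k + k * n + m * n;  /* element counts of A, B, C */
    return elems * sizeof(float);
}
```

With m = 1, n = 16, k = 15,000,000 this comes to 1,020,000,064 bytes, about 1.02 GB, nearly all of it the 15,000,000 by 16 matrix. If the host copies and the emulated device copies both live in system RAM (an assumption about how emulation mode allocates), the resident total would be roughly twice that.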

If that’s not the problem, does anyone have any suggestions?

Try sweeping the size of your vector. Plot performance vs. vector length. Shape of curve and location of kinks/drop-offs tells you a lot about what’s going on.
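Such a sweep could be set up along these lines. This is only a sketch: `make_sweep`, `time_run` and the `run_fn` callback type are hypothetical names, and the callback is assumed to wrap whatever performs the multiply for a given vector length.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: performs the product once for vector length k. */
typedef void (*run_fn)(size_t k, void *ctx);

/* Fill sizes[] with k_min, 2*k_min, 4*k_min, ... not exceeding k_max.
 * Returns how many lengths were written (at most cap). */
size_t make_sweep(size_t k_min, size_t k_max, size_t *sizes, size_t cap)
{
    assert(k_min > 0 && k_min <= k_max && sizes != NULL);
    size_t count = 0;
    size_t k = k_min;
    while (count < cap) {
        sizes[count++] = k;
        if (k > k_max / 2)  /* next doubling would exceed k_max (also avoids overflow) */
            break;
        k *= 2;
    }
    return count;
}

/* Processor time in seconds consumed by one run at length k.
 * clock() counts CPU time, so time spent blocked on paging may not show up;
 * a wall-clock timer is the better choice if paging is the suspect. */
double time_run(run_fn run, size_t k, void *ctx)
{
    clock_t start = clock();
    run(k, ctx);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Printing k and the measured time for each length, then plotting on log-log axes, makes a sudden change in slope (for example where the data stops fitting in RAM) easy to spot.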

Also take into account that cublasSgemm may pad input matrices to sizes that are multiples of 32. That increases the total flop count substantially in your case.
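To put a number on that, here is a small sketch of the arithmetic. It assumes every dimension is rounded up to a multiple of 32 and that the flop count is 2*m*n*k; whether cublasSgemm actually pads this way is not confirmed here, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of tile. */
uint64_t round_up(uint64_t x, uint64_t tile)
{
    assert(tile > 0);
    return (x + tile - 1) / tile * tile;
}

/* Flop count 2*m*n*k (one multiply and one add per term) for an
 * (m x k) times (k x n) product. */
uint64_t gemm_flops(uint64_t m, uint64_t n, uint64_t k)
{
    return 2 * m * n * k;
}

/* Ratio of padded to unpadded flops when every dimension is padded to tile. */
double padding_overhead(uint64_t m, uint64_t n, uint64_t k, uint64_t tile)
{
    uint64_t padded = gemm_flops(round_up(m, tile), round_up(n, tile),
                                 round_up(k, tile));
    return (double)padded / (double)gemm_flops(m, n, k);
}
```

For m = 1, n = 16, k = 15,000,000 and a tile of 32, m grows to 32 and n to 32 while k is already a multiple of 32, so the padded problem does 64 times the arithmetic of the unpadded one.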

Another factor is that cublasSgemm is multithreaded code that hits a barrier frequently (2 * 15,000,000 / 32 = 937,500, so roughly a million times in your case).

Thanks for the replies.

I shrank the data set by a few orders of magnitude and things are definitely moving a bit faster. It is interesting to note that the multiplication with the full 15,000,000-element data set did complete (though it took about an hour).

I assume this will all go much faster when I stop running in emulation mode and actually run on the card.

Well, running in emulation mode is of course MUCH slower…