The operation is memory bound. You want to do the minimum number of memory accesses theoretically possible in order to populate your matrix (or even better, like @njuffa said, don’t do the copying at all, but I don’t have any great suggestions there ATM). That theoretical minimum is 784 reads and 60000*784 writes (or possibly 59999*784 writes).

If you study the Sger operation carefully (read the arithmetic in the CUBLAS manual, please), and count the number of memory accesses (per element) that would be needed to your matrix, I come up with a minimum of 3: one to write an initial value of 0, one to read, one to write the rank-1 update + sum. I don’t see how you get to optimality with that particular arithmetic.

A thrust device vector allocation automatically also initializes the data to zero. Plus you need at least 1 for the write (copy) operation. Yes, you can work around this with a flat allocation and a thrust device pointer instead of a device vector. So you might be able to get to “optimality” with thrust.

I consider myself a reasonably competent CUDA programmer. One of the implications is that I can write a pretty simple CUDA code and immediately figure out the number of accesses needed or will be generated. With thrust, I would have to write the code, then either wade through a templated library counting accesses, or analyze the result with a profiler, to see if it is really “optimal”. With CUBLAS, it’s a little better, I read the docs. Then I write the code.

With CUDA, I just write the code, and I have a high level of confidence of being pretty near optimal. I’m darn sure in CUDA I can write a kernel that will read exactly 784 elements in an optimized (coalesced) fashion, so exactly 784 float reads, and do exactly 60000*784 coalesced float writes, with no extra reads or writes anywhere.

I’m not anywhere near as confident of getting to that definition of optimality with CUBLAS (I’m quite sure not) and/or with thrust (maybe).

I would most certainly not say the same thing about a sort, or a matrix multiply, or a prefix sum. I think I can write a reasonably good reduction, however, I would always use a library call for sort, matrix multiply, or prefix sum (except for silly cases like a 32-element sort, or prefix sum).

For an algorithm that is memory bound, your goal is to make the minimum possible number of memory accesses, in an efficient way.