How to improve cublasDgemv performance?

The heart of my simulation is the construction of a 2d tensor and then multiplication of that tensor by a vector. Until now, I was certain that execution time would be dominated by construction of the tensor, which is O(n^2) and fairly involved.

MUCH to my surprise, I discovered after using cudaprof that my program spends slightly more time in dgemv_main! I wonder if I’m missing something, as I don’t think the tensor calculation is that efficient, and it is certainly more involved than a matrix-vector multiply. Both the tensor calculation (my code) and cublasDgemv are called the same number of times and use the same block of memory for the 2d matrix.

I was wondering if it could be an alignment problem, but I’m a bit confused there. I thought it was relatively important for matrix alignment that rows start on multiples of 32 bytes (not sure where I got that), but a typical sim size is 600x600 doubles, so each row is 4800 bytes, which is a multiple of 64 bytes, so that should be good, I’d think.

Any guidelines, or am I actually probably getting decent performance?

try this:…st&p=487032…Fstate%3Dclosed