Fastest matrix-vector multiplication?

Well, one major difference is that you are using shared memory instead of constant memory. What performance results do you achieve using the code I posted? And why would the Tesla cards perform any worse?

Using constant memory is worse. The L1/L2 cache on the Tesla may add latency in this case…

We may also be timing differently - I’m calling the kernel 10 times with a sync after each, and measuring the time (gettimeofday) taken to execute all 10 launches.
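
Something like this minimal harness, assuming a kernel named matvec_kernel and device buffers set up elsewhere (the names are placeholders, not the actual code from this thread):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void matvec_kernel(const float *A, const float *x, float *y); // defined elsewhere

static double wall_time_s(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

void time_matvec(const float *d_A, const float *d_x, float *d_y, dim3 grid, dim3 block) {
    const int runs = 10;
    double t0 = wall_time_s();
    for (int i = 0; i < runs; ++i) {
        matvec_kernel<<<grid, block>>>(d_A, d_x, d_y);
        cudaDeviceSynchronize(); // sync after each launch, as described above
    }
    double t1 = wall_time_s();
    printf("average kernel time: %.1f us\n", (t1 - t0) / runs * 1e6);
}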

I ran your example code with:

#define COLS 10864
#define M 5672
#define NUMTHREADS 416

And according to the Visual Profiler this yielded 7851 us, which gives (10864*5672 + 5672 + 10864) * 4 bytes / (7851 * 10^-6 s) = 31.4 GB/s, i.e. 31.4 / 60 => ~52.3% of the card’s 60 GB/s peak bandwidth.
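
In code form the effective-bandwidth arithmetic is just total bytes moved over kernel time (a sketch using the defines above, with 1 GB taken as 10^9 bytes):

double effective_bandwidth_gbs(double time_us) {
    // Bytes moved: the M x COLS matrix, plus the input vector (COLS
    // elements) and the output vector (M elements), 4-byte floats each.
    double bytes = ((double)M * COLS + COLS + M) * sizeof(float);
    return bytes / (time_us * 1e-6) / 1e9;
}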

From what I’ve seen the Visual Profiler is quite trustworthy; it was built for this!

So I’m not convinced that using constant memory is bad here :-)

Try a different number of threads if you’re on a different device.

When I use my profiler, GPU time is 2354.08 us.

Also, note I am using 1 kB == 1024 bytes. If I convert to your formula with this timing (using 1 GB == 10^9 bytes), I get 104.73 GB/s.
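
Spelled out with the same byte count as before:

(10864*5672 + 5672 + 10864) * 4 / (2354.08 * 10^-6) ≈ 104.73 * 10^9 B/s = 104.73 GB/s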

Constant memory doesn’t help - the shared memory bus is wider than the constant memory bus.

Thanks, I will try that and see what helps. Generally, huge occupancy is not needed for latency hiding.

10^9 bytes equals 1 GB, or 0.93 GiB (per the SI and binary prefix definitions on Wikipedia).

So bytes * 10^-9 is correct if you are reporting GB; if you count in 1024-based units you should label the result GiB.

According to the programming guide, accessing constant memory is as fast as accessing registers.

This is not quite true for shared memory, though. The programming guide used to say it is as fast as reading from registers, which the community showed to be simply untrue (see publications by V. Volkov and others); the guide was subsequently changed to say that shared memory is fast “as long as there are no bank conflicts”.
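
As a toy illustration of what a bank conflict is (not code from this thread): on Fermi, shared memory has 32 banks of 4-byte words, so consecutive threads reading consecutive floats hit 32 different banks, while threads reading floats 32 apart all hit the same bank and serialize.

__global__ void bank_conflict_demo(float *out) {
    // Launch with <<<1, 32>>>: one warp, so the access patterns below
    // are exactly what the hardware sees.
    __shared__ float tile[32][32];
    int t = threadIdx.x;
    for (int i = 0; i < 32; ++i)
        tile[t][i] = (float)(t + i);  // fill the tile
    __syncthreads();

    float a = tile[0][t];  // stride 1: 32 different banks, conflict-free
    float b = tile[t][0];  // stride 32: every thread hits bank 0, 32-way conflict

    out[t] = a + b;        // keep the loads from being optimized away
}

The usual fix for the strided case is to pad the array to tile[32][33] so consecutive rows start in different banks.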

Furthermore, in your implementation you are reading the same vector data from global memory into shared memory in several different blocks (if you are lucky the L2 cache will help out on Fermi). In the constant memory solution the data is definitely cached only once.

Oh, and by the way: I optimized my solution further by assuming that registers are slightly faster than constant memory (contrary to the documentation), staging the constant memory reads into registers to exploit this additional level of data locality in the load-store sequence.

This brought performance up to 78% of peak bandwidth.
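
I don’t have the poster’s actual kernel, but a minimal sketch of the constant-memory-plus-register idea might look like this (assuming the vector fits in the 64 kB constant space, which it does for COLS = 10864 floats; the matrix is stored column-major so the one-row-per-thread loads coalesce; all names are placeholders):

#define COLS 10864
#define M 5672
#define NUMTHREADS 416

__constant__ float c_x[COLS];  // input vector, uploaded once to constant memory

__global__ void matvec_const(const float *A, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;

    float sum = 0.0f;
    int col = 0;
    // Stage four vector elements from constant memory into registers and
    // reuse them across four multiply-adds - the extra level of data
    // locality described above. All threads of a warp read the same
    // c_x element, which is the broadcast pattern the constant cache likes.
    for (; col + 4 <= COLS; col += 4) {
        float x0 = c_x[col],     x1 = c_x[col + 1];
        float x2 = c_x[col + 2], x3 = c_x[col + 3];
        sum += A[(col    ) * M + row] * x0;
        sum += A[(col + 1) * M + row] * x1;
        sum += A[(col + 2) * M + row] * x2;
        sum += A[(col + 3) * M + row] * x3;
    }
    for (; col < COLS; ++col)  // remainder columns
        sum += A[col * M + row] * c_x[col];
    y[row] = sum;
}

The vector would be uploaded once with cudaMemcpyToSymbol(c_x, h_x, COLS * sizeof(float)), and the kernel launched with (M + NUMTHREADS - 1) / NUMTHREADS blocks of NUMTHREADS threads. Whether the register staging actually beats reading c_x directly inside the multiply-add is exactly the question debated above.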