I’m trying out CUBLAS and comparing GPU speed against a CPU implementation of the STRSM function. I have a GeForce GTX 560 Ti GPU and a second-generation Core i7 CPU clocked at 4.4 GHz. A couple of TRSM calls on 7k x 7k A and B matrices take 1.6 seconds on my GPU and 4.3 seconds on the CPU (STRSM on the CPU is parallelized across all 4 cores). The speedup is just under 3x (about 2.7x), which is not as spectacular as I was expecting.
I wonder whether this is what I should be seeing, or whether I am doing something wrong?
Edit: this is using CUDA 4.1, by the way.
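For anyone wanting to sanity-check numbers like these, here is a rough Python sketch of the arithmetic. It assumes a left-sided solve (A is m x m, B is m x n), for which the usual flop estimate is about m*m*n, and it assumes "7k" means 7000. The function names are made up for illustration, and the number of calls is left as a parameter because the post only says "a couple".

```python
def trsm_flops(m: int, n: int) -> int:
    """Approximate flop count for one left-sided triangular solve.

    Assumption: A is m x m triangular, B is m x n, and multiplies plus
    adds together come to roughly m * m * n operations.
    """
    return m * m * n


def speedup(t_cpu: float, t_gpu: float) -> float:
    """How many times faster the GPU run was than the CPU run."""
    return t_cpu / t_gpu


def gflops(flops: float, seconds: float, calls: int = 1) -> float:
    """Effective throughput in GFLOP/s for `calls` identical solves."""
    return calls * flops / seconds / 1e9


if __name__ == "__main__":
    per_call = trsm_flops(7000, 7000)  # assumed size: 7k taken as 7000
    print("flops per call:", per_call)
    print("speedup: %.2fx" % speedup(4.3, 1.6))
    # `calls=2` is a guess at what "a couple" means, not a measured value.
    print("GPU GFLOP/s (if 2 calls): %.0f" % gflops(per_call, 1.6, calls=2))
    print("CPU GFLOP/s (if 2 calls): %.0f" % gflops(per_call, 4.3, calls=2))
```

With the timings above, the speedup works out to about 2.7x. The throughput figures depend entirely on the assumed call count, so treat them as a way to compare against published benchmarks rather than as measurements.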
Some CUBLAS performance numbers (including STRSM) can be found here:
I note that you are not using a top-of-the-line GPU and are massively overclocking the CPU (unless there’s a typo in “4.4ghz”), so your numbers will differ from the data found there.
Thanks for the link; it seems these are the results I should be getting. And no, it is not a typo: I overclocked my CPU a little bit. I also ran the same test on my Core i7 laptop with a GT 555M GPU and got an even worse result: 6.1 seconds on the GPU against 6.3 seconds on the CPU… I guess I should start using both of them to reach a 3-second goal on the laptop.
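As a back-of-the-envelope check on that 3-second idea, here is a small Python sketch. It assumes a left-sided solve, so the columns of B are independent and can be divided between the two devices in proportion to their measured speed. It ignores PCIe transfer time and assumes the CPU does not slow down while it also feeds the GPU, so the result is a best case. The function names are hypothetical and 7000 columns is an assumed reading of "7k".

```python
def split_columns(n_cols: int, t_gpu: float, t_cpu: float) -> tuple[int, int]:
    """Split the columns of B so both devices finish at about the same time.

    Each device's share is proportional to its speed (1 / time), which
    gives the GPU the fraction t_cpu / (t_gpu + t_cpu) of the columns.
    """
    gpu_share = t_cpu / (t_gpu + t_cpu)
    gpu_cols = round(n_cols * gpu_share)
    return gpu_cols, n_cols - gpu_cols


def ideal_combined_time(t_gpu: float, t_cpu: float) -> float:
    """Best-case wall time when both devices work in parallel, no overhead."""
    return t_gpu * t_cpu / (t_gpu + t_cpu)


if __name__ == "__main__":
    gpu_cols, cpu_cols = split_columns(7000, t_gpu=6.1, t_cpu=6.3)
    print("GPU columns:", gpu_cols, "CPU columns:", cpu_cols)
    print("ideal combined time: %.2f s" % ideal_combined_time(6.1, 6.3))
```

With the laptop timings, the ideal combined time comes out at roughly 3.1 seconds, so a 3-second target is about the limit of what an even split could give before any overhead is counted.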