Odd timing results: Intel MKL vs. my GPU implementation

In short, I am timing my GPU version of a simple Lanczos implementation (using mostly CUBLAS functions) against the Intel MKL (Math Kernel Library) BLAS functions on a single thread. What is odd are the results for 700x700 vs. 800x800 matrices: for some reason, the 700x700 case is consistently 40 ms slower than the 800x800 case.

I am timing in the following manner (the timing procedure is exactly the same for each implementation): pull in a random matrix and set up a loop that runs 6 times. Time iterations 1, 2, 3, 4 and 5, leaving out iteration 0. Do this 3 times, each on a different random data set.

This gives me a total of 15 runs of each implementation.
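
For readers who want to reproduce the protocol, here is a minimal sketch of the loop described above. It is not the original poster's code: `time_runs`, `make_matrix` and `run_once` are hypothetical names, and the workload in the usage lines is only a stand-in. It assumes `run_once` returns only after the work is really finished; GPU kernel launches are asynchronous, so a GPU version would have to wait for the device before the clock is read.

```python
import time

def time_runs(make_matrix, run_once, n_datasets=3, n_iters=6, warmup=1):
    """Return wall-clock times in ms for every non-warm-up iteration.

    make_matrix(d) builds the d-th random data set (hypothetical callback).
    run_once(a) runs one full solve on it and must block until it is done.
    """
    times_ms = []
    for d in range(n_datasets):
        a = make_matrix(d)
        for it in range(n_iters):
            start = time.perf_counter()
            run_once(a)
            elapsed_ms = (time.perf_counter() - start) * 1e3
            if it >= warmup:  # iteration 0 is run but not recorded
                times_ms.append(elapsed_ms)
    return times_ms

# Stand-in workload, only to show the call shape.
samples = time_runs(lambda d: [float(d)] * 1000, lambda a: sum(a))
print(len(samples))  # 3 data sets x 5 timed iterations = 15
```

Keeping the individual samples, rather than only a mean, makes it easier to tell a consistent 40 ms offset from a single outlier.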

One other weird result is that the time for 800x800 is approximately half the time for 900x900. I see no reason why it should double there, as the difference between 900x900 and 1000x1000 is only about 40 ms.

Could these things be a result of how CUBLAS is handling the matrices? Or could it be something memory related? If you would like to look at my Excel sheet once I have finished the timings, I can share it. I am going up to 4000x4000 for now, but haven’t finished yet.



Edit: The odd results are only for the GPU side. The MKL side seems to be working as expected.

It depends a bit on which functions you are calling. A lot of functions work much faster when the input matrix size is a multiple of 16. 800 is a multiple of 16 (16 x 50); 700 is not.
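
To make the multiple-of-16 point concrete for the sizes mentioned in this thread, here is a small sketch that checks each dimension and computes the next multiple of 16. `round_up` is a hypothetical helper, not a CUBLAS function, and whether padding a matrix up to that size actually recovers the speed for a given routine is an assumption that would need to be measured.

```python
def round_up(n, multiple=16):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

# Sizes taken from the thread; only 800 and 4000 divide evenly by 16.
for n in (700, 800, 900, 1000, 4000):
    print(n, n % 16 == 0, round_up(n))
# 700 False 704
# 800 True 800
# 900 False 912
# 1000 False 1008
# 4000 True 4000
```

This matches the pattern reported above: 800x800 is the only one of the four small sizes that is a multiple of 16, and it is the one that runs unexpectedly fast.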

OK, I thought it was something like this. I assume my sgemv call is what is slowing it down. When I get time I’ll run a profiler to see what changes so drastically between these sizes. On the plus side, the GPU version is still faster than the CPU version :)

Thanks for the reply

I have another quick question that I hope you can answer. Is this in documentation somewhere? I feel like I have seen it, but have no idea where, and it’s nice to provide some documentation to back up the timing results (people get really confused when larger matrices are running faster).

Well, the CUBLAS sources are where I got the info from when looking at adapting an algorithm for my specific need.