In short, I am timing my GPU version of simple Lanczos (using mostly cuBLAS functions) against the Intel MKL (Math Kernel Library) BLAS functions on a single thread. What is odd are the results for 700x700 vs. 800x800 matrices. For some reason, the 700x700 case is consistently 40 ms slower than the 800x800 case.
I am timing in the following manner (assume the timing is exactly the same for each implementation): pull in a random matrix and set up a loop that runs 6 times. Time iterations 1 through 5, leaving out iteration 0. Do this 3 times, each on a different random data set.
This gives me a total of 15 runs of each implementation.
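To make the procedure concrete, here is a minimal sketch of that timing loop in plain C++. It is not my actual code: the name time_runs and the work callback are placeholders for illustration, and the callback stands in for one full Lanczos pass on data set d. One assumption worth flagging is in the comment: GPU kernel launches generally return before the work finishes, so a host-side timer should only be read after the device has been synchronized.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs `work` `iterations` times on each of `data_sets`
// data sets, discards iteration 0 of every set as warm-up, and returns one
// duration in milliseconds for each remaining iteration.
std::vector<double> time_runs(const std::function<void(std::size_t)>& work,
                              std::size_t data_sets = 3,
                              std::size_t iterations = 6) {
    using clock = std::chrono::steady_clock;
    std::vector<double> ms;
    for (std::size_t d = 0; d < data_sets; ++d) {
        for (std::size_t it = 0; it < iterations; ++it) {
            const auto start = clock::now();
            work(d);
            // For GPU code, wait for the device here (for example with
            // cudaDeviceSynchronize()) before reading the clock.
            const auto stop = clock::now();
            if (it == 0) continue;  // iteration 0 is warm-up, not recorded
            ms.push_back(
                std::chrono::duration<double, std::milli>(stop - start).count());
        }
    }
    return ms;
}
```

With the defaults (3 data sets, 6 iterations, first one dropped) the returned vector holds the 15 timings per implementation.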
One other weird result is that the time for 800x800 is approximately half of that for 900x900. I see no reason why it should double there, as the difference between 900x900 and 1000x1000 is only about 40 ms.
Could these things be a result of how cuBLAS is handling the matrices? Or could it be something memory related? If you would like to look at my Excel sheet after I am done with the timings, I can share it. I am going up to 4000x4000 for now, but haven't finished yet.
Edit: The odd results are only for the GPU side. The MKL side seems to be working as expected.