Odd timing results Intel MKL vs. My GPU implementation

senorbum · July 22, 2008, 4:24pm

In short, I am timing my GPU version of simple Lanczos (using mostly CUBLAS functions) against the Intel MKL (Math Kernel Library) BLAS functions running on a single thread. What is odd are the results for 700x700 vs. 800x800 matrices: for some reason, the 700x700 case is consistently 40 ms slower than the 800x800 case.

I am timing in the following manner (the timing is exactly the same for each implementation): pull in a random matrix and set up a loop that runs 6 times. Time iterations 1, 2, 3, 4, and 5, leaving out iteration 0. Do this 3 times, each on a different random data set.

This gives me a total of 15 runs of each implementation.
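
For readers who want to reproduce the scheme described above, here is a minimal sketch of that kind of harness. It is not the original poster's code: `time_runs`, `step`, and the placeholder workload are hypothetical names made up for illustration. When the timed step launches GPU work, it would also need to block until the device has finished before the clock is read, since kernel launches are asynchronous.

```python
import time

def time_runs(step, datasets, iterations=6, warmup=1):
    """Time `step(data)` on each data set, discarding the first `warmup` iterations.

    `step` is a hypothetical stand-in for one Lanczos run (GPU or MKL version).
    Returns a list of per-iteration wall-clock times in milliseconds.
    """
    samples = []
    for data in datasets:
        for i in range(iterations):
            start = time.perf_counter()
            step(data)  # for GPU code, this must wait for the device to finish
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if i >= warmup:  # iteration 0 is left out, as in the post
                samples.append(elapsed_ms)
    return samples

# Placeholder workload (assumption): summing a list instead of running Lanczos.
datasets = [list(range(1000)) for _ in range(3)]  # 3 different data sets
samples = time_runs(sum, datasets)
print(len(samples))  # 3 data sets x 5 timed iterations = 15 samples
```

Dropping iteration 0 keeps one-time costs (library initialization, first-touch allocation) out of the averages, which is presumably why the post excludes it.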

One other weird result is that the time for 800x800 is approximately half that of 900x900, but I see no reason why it should be doubling, as 900x900 vs. 1000x1000 is only about a 40 ms difference.

Could these things be a result of how CUBLAS is handling the matrices? Or could it be something memory related? If you would like to look at my Excel sheet after I am done with the timings, I can share it. I am going up to 4000x4000 for now, but haven’t finished yet.

Thanks,

Joe

Edit: The odd results are only for the GPU side. The MKL side seems to be working as expected.

E.D_Riedijk · July 22, 2008, 5:18pm

It depends a bit on which functions you are calling. A lot of functions work much faster when the input matrix size is a multiple of 16. 800 is a multiple of 16; 700 is not.
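
The multiple-of-16 point can be checked with a little arithmetic for the sizes mentioned in this thread. The sketch below is only an illustration: `is_aligned` and `padded_size` are hypothetical helper names, and whether padding a matrix up to the next multiple of 16 actually helps depends on the CUBLAS routine and version, which nobody measured here.

```python
ALIGNMENT = 16  # the multiple mentioned in the reply above

def is_aligned(n, alignment=ALIGNMENT):
    """True when the dimension n is an exact multiple of `alignment`."""
    return n % alignment == 0

def padded_size(n, alignment=ALIGNMENT):
    """Smallest multiple of `alignment` that is >= n."""
    return ((n + alignment - 1) // alignment) * alignment

for n in (700, 800, 900, 1000):
    print(n, is_aligned(n), padded_size(n))
# 700 False 704
# 800 True 800
# 900 False 912
# 1000 False 1008
```

Of the four sizes, only 800 is a multiple of 16, which is consistent with it being the outlier in the timings.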

senorbum · July 22, 2008, 6:04pm

Ok, I thought it was something like this. I assume my sgemv is what is slowing it down. Maybe when I get time I’ll run a profile to see what is drastically changing during this part. On the plus side, the GPU version is still faster than the CPU version :)

Thanks for the reply

senorbum · July 24, 2008, 4:05pm

I have another quick question that I hope you can answer. Is the multiple-of-16 behavior documented somewhere? I feel like I have seen it, but have no idea where, and it’s nice to provide some documentation to back up the timing results (people get really confused when larger matrices run faster).

E.D_Riedijk · July 24, 2008, 6:10pm

Well, the CUBLAS sources are where I got the info from when looking at adapting an algorithm for my specific need.

senorbum · July 24, 2008, 6:36pm

Okie.
