Linux vs. Windows XP performance
Ran an arbitrary benchmark

I was running an arbitrary benchmark to see if having 2 elements per thread was going to be faster. I wanted to test the 400-600 cycle global memory access penalty and whether the CUDA code's performance would change with latency masking. I got similar results with the following loop doing silly math.

for( x = 0; x < ARR_SIZE; ++x )
{
    d_arr[gridThreadId+1] += d_arr[gridThreadId+1] % 20;
    d_arr[gridThreadId]   += d_arr[gridThreadId+1] % 20;
    ++x;   // x is also incremented here, so each iteration does two operations
}

I know, x+=2 is better, but that’s not the point. d_arr is just an array of ints, and gridThreadId is a position in the array in global memory that we’re accessing.
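
For context, a minimal sketch of how such a kernel might look; the kernel name, the ARR_SIZE value, and the way gridThreadId is derived are assumptions for illustration, not the original code:

// hypothetical reconstruction of the benchmark kernel, not the actual posted source
#define ARR_SIZE 1024                        // assumed; the real value wasn't posted

__global__ void sillyMathKernel( int *d_arr )
{
    // one global index per thread, as described above
    int gridThreadId = blockIdx.x * blockDim.x + threadIdx.x;

    for( int x = 0; x < ARR_SIZE; ++x )
    {
        d_arr[gridThreadId+1] += d_arr[gridThreadId+1] % 20;
        d_arr[gridThreadId]   += d_arr[gridThreadId+1] % 20;
        ++x;                                 // two operations per iteration
    }
}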

Doing this, I get an average of 19 million clock cycles when calling the kernel 1000 times, both for this version and for an otherwise identical kernel that does only one operation per iteration of the loop.

Running the same code on an identical machine under Windows XP using Visual Studio yields only 9 million clock cycles! VS uses compiler optimization level 2, while we used the default and level 3 optimization on the Linux box. I thought the results would be similar, since both runs spend almost all their time on the GPU (aside from passing memory between the host and device).
Any ideas?!

Linux rulez :thumbup: Anyway, the multicore scheduler of XP sucks; with a 2-core CPU that may be the reason you get double the performance under Linux.

@Denia: clock cycles = time spent, so more is worse, not better.

@perj: the device code gets compiled by nvcc, which ignores all host C compiler settings unless you set some ptxas options explicitly, so the compiler is probably not the source of the difference.
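
For illustration (exact flags depend on the toolkit version), device-side optimization would have to be requested through nvcc/ptxas rather than gcc or cl:

# gcc/cl optimization levels only affect host code; device code goes through ptxas
nvcc -O3 -Xptxas -O3 kernel.cu -o bench     # pass an optimization level to ptxas
nvcc --ptxas-options=-v kernel.cu           # print ptxas register/shared-mem usage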

I assume you are talking about clock cycles measured on the CPU, right? So make sure you are comparing identical clock frequencies. Also, you did call cudaThreadSynchronize after the last kernel call?

Peter

Right, it is worse, just twice as bad.
It seems that your division is performed twice on Linux, but it should be constant for the second line of the iteration. I don’t know if nvcc differs under these OSes, but it looks like a compiler optimization problem.

Linux:
for( x = 0; x < ARR_SIZE; ++x )
{
    d_arr[gridThreadId+1] += d_arr[gridThreadId+1] % 20;
    d_arr[gridThreadId]   += d_arr[gridThreadId+1] % 20;
    ++x;
}

Windows:
for( x = 0; x < ARR_SIZE; ++x )
{
    d_arr[gridThreadId+1] += d_arr[gridThreadId+1] % 20;
    d_arr[gridThreadId]   += d_arr[gridThreadId+1];
    ++x;
}

This shouldn’t be optimized that way, since d_arr[gridThreadId+1] may be overwritten during “d_arr[gridThreadId+1] += d_arr[gridThreadId+1] % 20;”, so the % 20 result is not necessarily the same for the second line. CSE shouldn’t be done here.

You can try the CUDA profiler (set CUDA_PROFILE=1) to see the GPU time and driver time. Also, if you’re measuring such small kernels, make sure you don’t count the 1st call. The first launch may take millions of cycles for run-time setup. Make sure to call cudaThreadSynchronize after the 1st and after the last kernel call.
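
A rough sketch of a timing harness along those lines; the kernel name, launch configuration, and the plain clock()-based timer are placeholders, not anyone’s actual code:

// hypothetical timing harness: warm-up launch excluded, sync before stopping the timer
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void sillyMathKernel( int *d_arr );   // the benchmark kernel from above

void timeKernel( int *d_arr, dim3 grid, dim3 block )
{
    // the first launch pays a one-time setup cost, so keep it outside the timed region
    sillyMathKernel<<< grid, block >>>( d_arr );
    cudaThreadSynchronize();

    clock_t start = clock();
    for( int i = 0; i < 1000; ++i )
        sillyMathKernel<<< grid, block >>>( d_arr );
    cudaThreadSynchronize();   // launches are asynchronous; wait for the last one
    clock_t end = clock();

    printf( "avg time per launch: %f ms\n",
            1000.0 * ( end - start ) / CLOCKS_PER_SEC / 1000 );
}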

Sorta glad to know XP performs better, since we’re not allowed to install Linux…

My CUDA kernels show only a ~2% performance difference between 32-bit XP and 64-bit Linux, and that is probably within the noise. On the other hand, all the CPU-side code in the project executes 20% faster on Linux, and a few pieces of code are even twice as fast. Linux rulez.

By “calling the kernel 1000 times” do you mean that you actually call the kernel 1000 times with

Func<<< Dg, Db, Ns >>>(parameter);

If yes, I could imagine that the calls don’t perform similarly on Windows and Linux due to the different drivers. Maybe the launches are the reason for the performance variance. It would be helpful to see how you call the kernel and how you time it.

However different the drivers are, they perform very similarly on both Windows and Linux for me. I have kernels that only take tens of microseconds to execute. In order to build up a long enough running time to actually measure and get a good average, I keep calling the kernel over and over again until 5 seconds have elapsed. As I stated, the performance differences between Windows and Linux are only a few percent.

Well… I think it’s actually more meaningful to test an EMPTY kernel. With the same hardware, it’s hard to get different performance on the GPU itself. However, the operating system may play a role in the driver overhead. A benchmark of that part would be more useful for choosing an operating system.
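
Something like the following would isolate the per-launch driver overhead; the kernel name and iteration count are arbitrary placeholders:

// hypothetical launch-overhead micro-benchmark: only the driver/launch path is exercised
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}        // does no work on the GPU

int main()
{
    emptyKernel<<< 1, 1 >>>();          // warm-up, pays the one-time setup cost
    cudaThreadSynchronize();

    const int N = 100000;
    clock_t start = clock();
    for( int i = 0; i < N; ++i )
        emptyKernel<<< 1, 1 >>>();
    cudaThreadSynchronize();
    clock_t end = clock();

    printf( "avg launch overhead: %f us\n",
            1e6 * ( end - start ) / CLOCKS_PER_SEC / N );
    return 0;
}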

Benchmarks for real-world performance comparisons should always be performed with the intended application. A synthetic benchmark of the driver overhead (which takes up much, much less than 1% of the total computation time) is meaningless.

Or you can remove the need to choose an OS entirely and program in a platform-independent way, as I have. It took me less than 1% extra work to keep my code running on both Windows and Linux (with the help of CMake), and it probably SAVED me a significant amount of debugging time. Some bugs in my code managed to pass unit tests on one OS but not the other, so I found them much earlier than I otherwise would have.

Well… driver overhead takes 10%~100% of my time. That’s really significant for me. And if a performance comparison with the intended application is performed, the choice of OS won’t be important anymore, since I would already have written a version for each.

Personally, I’m totally against platform independence, mostly because I hate GLUT and Linux GUIs. My application is graphics related, and it uses a hell of a lot of both GL and GUI.

Finally, the card I’m using is owned by Microsoft -_-b. I’ll need a benchmark to persuade them to install Linux in the first place. And I can’t afford a G80 myself.

A platform-independent GUI does not mean GLUT. Take a look at Qt (Trolltech). You can use the open source version and produce multiplatform GUIs with a native look-and-feel for each OS.