Linux vs. Windows XP performance
Ran an arbitrary benchmark

perj · June 14, 2007, 9:34pm

I was running an arbitrary benchmark test to see if having 2 elements per thread was going to be faster. I was trying to test the 400-600 cycle global memory access penalty and whether CUDA code would perform differently with latency masking. I got similar results with the following loop doing silly math.

for( x = 0; x < ARR_SIZE; ++x )
{
    d_arr[gridThreadId+1] += d_arr%20;
    d_arr[gridThreadId] += d_arr%20;
    ++x;
}

I know, x+=2 is better, but that’s not the point. d_arr is just an array of ints, and gridThreadId is a position in the array in global memory that we’re accessing.

Doing this, we get an average of 19 million clock cycles over 1000 kernel calls, both for this kernel and for an otherwise identical kernel that performs only one operation per loop iteration.

Running the same code on an identical machine in Windows XP using Visual Studio yields only 9 million clock cycles! VS uses compiler optimization level 2, while we tried both the default and level 3 optimization on the Linux box. I thought the results would be similar, since both run almost exclusively on the GPU (besides passing memory between the host and device).
Any ideas?!

Denia · September 10, 2007, 11:37pm

Linux rulez. Anyway, the multicore scheduler of XP sucks; with a 2-core CPU it may be the reason you get double performance under Linux.

prkipfer · September 11, 2007, 10:03am

@Denia: clock cycles = time spent - so more is worse not better

@perj: the device code gets compiled by nvcc which ignores all C compiler settings unless you set some ptxas options explicitly, so it is probably not the source of the difference.

I assume you are talking about clock cycles measured on the CPU, right? So make sure you are comparing identical clock frequencies. Also, you did call cudaThreadSynchronize after the last kernel call?

Peter

Denia · September 11, 2007, 11:24pm

Right, it is worse, twice as bad in fact.
It seems that the modulo (a division) is performed twice on Linux, although its result could be treated as constant for the second line of the iteration. I don’t know whether nvcc differs between these OSes, but it looks like a compiler optimization problem.

Linux:
for( x = 0; x < ARR_SIZE; ++x )
{
    d_arr[gridThreadId+1] += d_arr%20;
    d_arr[gridThreadId] += d_arr%20;
    ++x;
}

Windows:
for( x = 0; x < ARR_SIZE; ++x )
{
    d_arr[gridThreadId+1] += d_arr%20;
    d_arr[gridThreadId] += d_arr[gridThreadId+1];
    ++x;
}

asadafag · September 12, 2007, 6:30am

This shouldn’t be optimized that way, since d_arr may be overwritten during “d_arr[gridThreadId+1]+= d_arr%20;”. CSE (common subexpression elimination) shouldn’t be done here.
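A minimal sketch of why the second read can't simply be reused. The index expression in the posted loop did not survive, so the `src` parameter below is a hypothetical stand-in for whatever element feeds the `% 20`, and `alias_demo` is a made-up name. If `src` happens to equal `i + 1`, the first statement changes the value the second statement reads:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: `src` stands in for the index missing from the posted loop.
__global__ void alias_demo(int *d_arr, int i, int src)
{
    d_arr[i + 1] += d_arr[src] % 20;  // modifies d_arr[src] when src == i + 1
    d_arr[i]     += d_arr[src] % 20;  // so this read has to go back to memory
}

int main()
{
    int h_arr[2] = {0, 7};
    int *d_arr = nullptr;
    cudaMalloc(&d_arr, sizeof(h_arr));
    cudaMemcpy(d_arr, h_arr, sizeof(h_arr), cudaMemcpyHostToDevice);

    alias_demo<<<1, 1>>>(d_arr, 0, 1);  // src aliases the element written first

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_arr, d_arr, sizeof(h_arr), cudaMemcpyDeviceToHost);

    // With the re-read:     d_arr[1] = 7 + 7 = 14, then d_arr[0] = 0 + 14 = 14.
    // With a reused result: d_arr[0] would be 0 + 7 = 7.
    printf("d_arr[0] = %d, d_arr[1] = %d\n", h_arr[0], h_arr[1]);

    cudaFree(d_arr);
    return 0;
}
```

Since both accesses go through the same pointer, the compiler can only reuse the first result if it can prove the two indices never coincide.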

You can try cuda_profile to see the GPU time and driver time. Also, if you’re measuring such small kernels, make sure you don’t count the 1st call. The 1st launch may take millions of cycles to do some run-time optimization. Make sure to call cudaThreadSynchronize after the 1st and the last kernel call.
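A rough sketch of that timing pattern, under a few assumptions: `bump` is a hypothetical stand-in for the kernel under test, the problem size and launch configuration are arbitrary, and CUDA events are swapped in for a CPU cycle counter. It also uses `cudaDeviceSynchronize`, which replaced `cudaThreadSynchronize` in later CUDA releases.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being benchmarked.
__global__ void bump(int *d_arr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_arr[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const int runs = 1000;

    int *d_arr = nullptr;
    cudaMalloc(&d_arr, n * sizeof(int));
    cudaMemset(d_arr, 0, n * sizeof(int));

    // Warm-up launch: absorbs any one-time start-up cost and is not timed.
    bump<<<blocks, threads>>>(d_arr, n);
    cudaDeviceSynchronize();      // wait until the warm-up has finished

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r)
        bump<<<blocks, threads>>>(d_arr, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // launches are asynchronous: wait before reading the time

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per launch: %.3f us\n", 1000.0f * ms / runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_arr);
    return 0;
}
```

Without the final synchronization the host timer would stop while kernels are still queued, which can make one platform look faster than another for reasons unrelated to the GPU.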

Sorta glad to know XP performs better since we’re not allowed to install Linux…

MisterAnderson42 · September 12, 2007, 11:57am

My CUDA kernels show only ~2% performance differences between 32-bit XP and 64-bit Linux, and that is probably within the noise. On the other hand, all the code in the project that runs on the CPU executes 20% faster on Linux, and a few pieces of code are even twice as fast. Linux rulez.

seb · September 13, 2007, 4:44pm

By “calling the kernel 1000 times” do you mean that you actually call the kernel 1000 times with

Func<<< Dg, Db, Ns >>>(parameter);

If yes I could imagine that the calls don’t perform similarly on Windows and Linux due to the different drivers. Maybe the calls are the reason for the performance variance. It would be helpful to see how you call the kernel and how you time it.

MisterAnderson42 · September 13, 2007, 5:26pm

However different the drivers are, they perform very similarly on both Windows and Linux. I have kernels that only take tens of microseconds to execute. In order to build up a long enough running time to actually measure and get a good average, I keep calling the kernel over and over again until 5 seconds have elapsed. As I stated, performance differences between Windows and Linux are only a few percent.
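For illustration only, a sketch of that "repeat until a time budget is used up" approach. The kernel `tiny`, the batch size of 100 launches between clock checks, and the use of `std::chrono` on the host are all assumptions, not the actual benchmark code.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical short-running kernel.
__global__ void tiny(int *d_arr)
{
    d_arr[threadIdx.x] += 1;
}

int main()
{
    using steady = std::chrono::steady_clock;
    const double budget_s = 5.0;  // the 5 seconds mentioned above
    const int batch = 100;        // assumption: launches between clock checks

    int *d_arr = nullptr;
    cudaMalloc(&d_arr, 64 * sizeof(int));
    cudaMemset(d_arr, 0, 64 * sizeof(int));

    tiny<<<1, 64>>>(d_arr);       // untimed warm-up launch
    cudaDeviceSynchronize();

    long long launches = 0;
    double elapsed_s = 0.0;
    const auto t0 = steady::now();
    do {
        for (int b = 0; b < batch; ++b)
            tiny<<<1, 64>>>(d_arr);
        cudaDeviceSynchronize();  // the batch must really be finished before reading the clock
        launches += batch;
        elapsed_s = std::chrono::duration<double>(steady::now() - t0).count();
    } while (elapsed_s < budget_s);

    printf("%lld launches, %.2f us each on average\n",
           launches, 1e6 * elapsed_s / launches);

    cudaFree(d_arr);
    return 0;
}
```

Checking the clock once per batch rather than once per launch keeps the timer calls from dominating when each kernel only runs for a few microseconds.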

asadafag · September 14, 2007, 5:37am

Well… I think it’s actually more meaningful to test an EMPTY kernel. Given the same hardware, it’s hard to get different performance on the GPU. However, the operating system may play a role in the driver overhead. A benchmark of that part would be more useful for choosing an operating system.

MisterAnderson42 · September 14, 2007, 1:54pm

Benchmarks for real-world performance comparisons should always be performed with the intended application. A synthetic benchmark of the driver overhead (which takes up much, much less than 1% of the total computation time) is meaningless.

Or you can remove the need to choose an OS entirely and program in a platform-independent way, as I have. It took me less than 1% extra work to keep my code running on both Windows and Linux (thanks to the help of CMake) and it probably SAVED me a significant amount of debugging time. Some bugs in my code managed to pass unit tests on one OS but not the other, so I found them much earlier than I otherwise would have.

asadafag · September 15, 2007, 3:03am

Well… driver overhead takes 10%~100% of my time. That’s really significant for me. And if a performance comparison with the intended application is performed, the choice of OS won’t be important any more since I would then have written a version for each.

Personally, I’m totally against platform independence, mostly because I hate GLUT and Linux GUI. My application is graphics related, and it uses a hell of a lot of both GL and GUI.

Finally, the card I’m using is owned by Microsoft -_-b. I’ll need a benchmark to persuade them to install Linux in the first place. And I can’t afford a G80 myself.

eirik · February 18, 2008, 12:01pm

Platform-independent GUI does not mean GLUT. Take a look at Qt (Trolltech). You can get the open source version and produce multiplatform GUIs with native look-and-feel for each OS.
