Convincing skeptical bigwigs on the future of CUDA

Saving money and/or time is always a good argument. :)

You do realize that the 8800GT in the best of worlds maxes out at 300 Gflops? For your claim of a 175x speedup to hold, that implies you are getting at most about 2 Gflops (300/175 ≈ 1.7) out of a 4-way Xeon, which for single precision is otherwise potentially capable of around 100 Gflops.

This 98% (or more) underutilization of the Xeon is what stands out the most here. Are you sure that your measuring methodology is correct, that you really measured the time taken for your kernel to run to completion rather than, say, the time to launch it?

/skeptical_bean_counter

This was simply a demonstration project of the GPU's potential, so I used only a single core on the Xeon vs the GPU. My timing method might have been a bit rough (even though both the CPU and GPU kernels ran for a couple of minutes), but the GPU time was taken after the final memory transfer to the host (which presumably has to wait until all the kernel executions are completed), and it scaled linearly with the number of iterations of the loop inside the GPU kernel (iNMAXRepeat in the code below). The results were then compared with the CPU calculation (identical code, just with a loop over every point). This code was heavy on arithmetic ops (sqrt, etc.).

[codebox]
for(int iRepeat = 0; iRepeat < iNMAXRepeat; iRepeat++)
{
	fTotARe = 0, fTotAIm = 0;
	float fCosTheta, fCosPhi, fP, cAmplRe, cAmplIm, fDenRe, fDenIm;

	// j = 0
	fP        = sqrt(SQ(pf4Mom[12*i+0*4+3]) - SQ(glMPi));
	fCosTheta = pf4Mom[12*i+0*4+2] / (fP + 0.00001);                      // pz/p
	fCosPhi   = pf4Mom[12*i+0*4] / (fP*sqrt(1-SQ(fCosTheta)) + 0.00001);  // px/(p*sin(theta))
	cAmplRe   = fCosPhi*sqrt(1.-SQ(fCosTheta));
	cAmplIm   = sqrt(1-fCosPhi)*sqrt(1.-SQ(fCosTheta));
	fDenRe    = SQ(glob_fMRho)
	          + SQ(pf4Mom[12*i+1*4]   + pf4Mom[12*i+2*4])
	          - SQ(pf4Mom[12*i+1*4+1] + pf4Mom[12*i+2*4+1])
	          - SQ(pf4Mom[12*i+1*4+2] + pf4Mom[12*i+2*4+2])
	          - SQ(pf4Mom[12*i+1*4+3] + pf4Mom[12*i+2*4+3]);
	fDenIm    = glWRho;
	fTotARe  += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
	fTotAIm  += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);

	// j = 1
	fP        = sqrt(SQ(pf4Mom[12*i+1*4+3]) - SQ(glMPi));
	fCosTheta = pf4Mom[12*i+1*4+2] / (fP + 0.00001);                      // pz/p
	fCosPhi   = pf4Mom[12*i+1*4] / (fP*sqrt(1-SQ(fCosTheta)) + 0.00001);  // px/(p*sin(theta))
	cAmplRe   = fCosPhi*sqrt(1.-SQ(fCosTheta));
	cAmplIm   = sqrt(1-fCosPhi)*sqrt(1.-SQ(fCosTheta));
	fDenRe    = SQ(glob_fMRho)
	          + SQ(pf4Mom[12*i+0*4]   + pf4Mom[12*i+2*4])
	          - SQ(pf4Mom[12*i+0*4+1] + pf4Mom[12*i+2*4+1])
	          - SQ(pf4Mom[12*i+0*4+2] + pf4Mom[12*i+2*4+2])
	          - SQ(pf4Mom[12*i+0*4+3] + pf4Mom[12*i+2*4+3]);
	fDenIm    = glWRho;
	fTotARe  += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
	fTotAIm  += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);

	// j = 2
	fP        = sqrt(SQ(pf4Mom[12*i+2*4+3]) - SQ(glMPi));
	fCosTheta = pf4Mom[12*i+2*4+2] / (fP + 0.00001);                      // pz/p
	fCosPhi   = pf4Mom[12*i+2*4] / (fP*sqrt(1-SQ(fCosTheta)) + 0.00001);  // px/(p*sin(theta))
	cAmplRe   = fCosPhi*sqrt(1.-SQ(fCosTheta));
	cAmplIm   = sqrt(1-fCosPhi)*sqrt(1.-SQ(fCosTheta));
	fDenRe    = SQ(glob_fMRho)
	          + SQ(pf4Mom[12*i+1*4]   + pf4Mom[12*i+0*4])
	          - SQ(pf4Mom[12*i+1*4+1] + pf4Mom[12*i+0*4+1])
	          - SQ(pf4Mom[12*i+1*4+2] + pf4Mom[12*i+0*4+2])
	          - SQ(pf4Mom[12*i+1*4+3] + pf4Mom[12*i+0*4+3]);
	fDenIm    = glWRho;
	fTotARe  += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
	fTotAIm  += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
}
[/codebox]

I realize that this sort of test heavily favors the GPU, but it seems that arithmetical functions take a lot of time on a CPU. Maybe I am missing some CPU optimization, but I use gcc 4.1.2 at the -O3 optimization level on 64-bit systems.

To make a convincing argument I actually also implemented a part of the real code from the project that had only a couple of operations per memory transfer, and got 20x to 30x speedups depending on some parameters. Even these relatively small numbers were quite convincing, as the intention is to increase the arithmetical sophistication of that part of the code.

Also, could you cite the source for the 100 Gflops figure for single-precision calculations?

Cheers.

This graphic may actually be for double-precision LINPACK, but:

[attached chart: LINPACK Gflops vs problem size]

Thanks. I have a naive question: why does the performance increase with the size of the problem? Do smaller sizes simply run too fast to occupy the entire system?

  1. Upgrade to gcc 4.3.2 to get some of the advantages of auto-vectorization. ICC is sometimes - although not always - an even better choice.

  2. Remember that constants like 0.0001 in C default to double precision unless you deliberately say otherwise (as in 0.0001f). This kind of error is going to bite you on Nvidia gear as well, should you upgrade to something from the 200 series.

  3. Functions like sqrt, sin and friends come in different flavors. You may want to use the faster sqrtf, sinf etc. for single precision.
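
A minimal sketch of what points 2 and 3 mean in practice (my illustration, not code from this thread):

[codebox]
#include <math.h>

/* The double literal and double sqrt() force the whole expression
   through double precision, even though x is a float. */
float slow(float x)
{
	return sqrt(x + 0.0001);   /* 0.0001 is a double; sqrt() is the double flavor */
}

/* Keeping everything single precision: float literal, float sqrtf(). */
float fast(float x)
{
	return sqrtf(x + 0.0001f);
}
[/codebox]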

With these optimizations, it is very likely that you could see a humble Intel Atom beat the crap out of your original run on the top-of-the-line Intel Xeon - which of course is ridiculous.

It is also not clear from your description whether you are taking the time after the result has been copied back to the host or after you have said that you would like this to happen at some future point in time. That would be an easy fix though:

NAME
	cudaThreadSynchronize - wait for compute-device to finish

SYNOPSIS
	cudaError_t cudaThreadSynchronize(void)

DESCRIPTION
	Blocks until the device has completed all preceding requested tasks.
	cudaThreadSynchronize() returns an error if one of the preceding tasks
	failed.

RETURN VALUE
	Relevant return values:
	cudaSuccess
	Note that this function may also return error codes from previous,
	asynchronous launches.

SEE ALSO
	cudaThreadExit
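
To make that concrete, here is a rough sketch of the timing pattern being suggested (the kernel, launch configuration, and timer are placeholders of mine, not anyone's actual benchmark code):

[codebox]
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

/* stand-in kernel, just so the sketch is complete */
__global__ void myKernel(float *d)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	d[i] *= 2.0f;
}

void timedLaunch(float *d_data, float *h_data, size_t bytes)
{
	clock_t t0 = clock();
	myKernel<<<256, 256>>>(d_data);  /* returns immediately: launches are asynchronous */
	cudaThreadSynchronize();         /* block until the kernel has actually finished */
	clock_t t1 = clock();
	printf("kernel time: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

	/* a device-to-host cudaMemcpy also waits for prior work to finish, which
	   is why timing after the final copy back captures the full kernel run */
	cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
}
[/codebox]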


I see this chart is from Intel's LINPACK page. They usually quote single-precision benchmarks, as one can see from, say, their FFT benchmark vs FFTW: http://www.intel.com/cd/software/products/…/eng/266852.htm . I remember this trick they used to show their library's superior speed, but FFTW wins in double precision :).

One can see that the top performance there is no more than ~30 Gflops with 8 threads on a 3.4 GHz quad-core! While the theoretical max can be a large number, I don't think it is possible to achieve it easily or practically on a typical problem.

Cheers

Thanks for pointing that out; I had missed sqrtf, but I just corrected it and ran again, with all the numeric constants ending in "f" to ensure the compiler takes them as floats. At the moment I have access to a machine with a 3.16 GHz 45nm Core2Duo and an 8400GS. With the improvements the speedup went down to 8.3x for the 8400GS. If I scale it up to the 8800GT (14 multiprocessors vs 1 in the 8400) and from 3.16 GHz to 3.2 GHz, you still get ~110x. If you are interested, I should soon get a quad-core i7 920 with a GTX285, so I can post the benchmark on that one as well.

Regarding time measurement, I was measuring the total time it takes to launch the kernel AND copy the data back to the host. I also timed the kernel alone with cudaStart/StopTimer, calling cudaThreadSynchronize before the memory transfer, with results very close to the full time.

Regarding upgrading the compiler, I agree it might help, but I don't think it will speed up the CPU code by 10x. ICC, on the other hand, seems quite capricious and rarely gives improvements over gcc; sometimes it even slows the code down.

I think the point with #1 is that getting full float performance out of an Intel core requires the use of SSE registers, which can operate on 4 single precision floats at a time. This either means hand-coding SSE operations (using assembly, compiler primitives, or higher level functions like the AMD Vector Math Library) or using a compiler that can analyze your for loops and generate appropriate SSE instructions automatically. Those sorts of changes can win you a factor of 4 on data parallel algorithms. (Well, unless you are looping over enough data that the system RAM becomes the bottleneck.)
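
For illustration, a minimal hand-coded SSE sketch (my example; the function and names are made up) that does four float multiplies per instruction:

[codebox]
#include <xmmintrin.h>

/* Scale n floats by k, four at a time; n is assumed to be a multiple of 4. */
void scale4(float *out, const float *in, float k, int n)
{
	__m128 vk = _mm_set1_ps(k);                 /* broadcast k into all 4 lanes */
	for (int i = 0; i < n; i += 4)
	{
		__m128 v = _mm_loadu_ps(in + i);            /* load 4 packed floats */
		_mm_storeu_ps(out + i, _mm_mul_ps(v, vk));  /* 4 multiplies in one op */
	}
}
[/codebox]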

Uhmm … No, not really. I would rather like to see compilable versions of your benchmarks. You are still at less than 50% of the expected performance of a C2D - even in single-threaded mode and excluding SSE.

Why do you think so? What factors other than SSE affect your performance that much?

Coding style??

I cannot believe that nobody has mentioned memory bandwidth yet. CPU FLOPS can be severely underutilized because of the memory access pattern and dog-slow memory (well, compared to the GPU's humongous pipe). For instance, in my line of work the most finely tuned compute kernel only pushes ~1.5-2 GFLOPS on AMD Barcelona chips because of its random memory access pattern.

That C2D has - with a reasonably cheap motherboard - twice the bandwidth of an 8400GS as well as twice the theoretical peak performance, plus "free" load/store and iterator increment/evaluation. Landing at 1/8th rather than 2x indicates that something did not work out so well.

But the proof is in the pudding. Let's talk again when/if the benchmark becomes available.

Demq :

Maybe a bit off-topic… but when you post code, some of us immediately go into "analysis mode", so I have a few GPU comments for you.

It's no wonder the GPU loves your compute; it's just powering through tons of straightforward FP math… a rare case where the GPU just stomps through at full potential!

But… even though you're so math-heavy, it's possible you're still at least partially memory limited. In CUDA, your memory access pattern isn't particularly efficient… you have a bunch of input values concatenated into one array and you're eating them up 12 at a time. But when you access them like pf4Mom[12*i+1*4]+pf4Mom[12*i+2*4], you're doing really inefficient memory reads… which are wasting 15/16 of the bandwidth.
If you transpose your memory storage into 12 arrays of N values instead of one array of N structures of 12 elements each, you'll be accessing them at full speed. Also, it's not obvious, but that 12*i computation is relatively expensive (!)… 1/4 the speed of an FP op. It would be eliminated entirely by the 12-array method.

Such data rearrangements are common on the GPU, and on the CPU too; you'll often see "structure of arrays" versus "array of structures", especially when doing SSE.
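
A sketch of the two layouts (illustrative kernels of mine, not your code):

[codebox]
/* Array of structures: thread i's read of pf4Mom[12*i+3] sits 48 bytes away
   from thread i+1's, so a half-warp's reads cannot coalesce on G80-class HW. */
__global__ void aosKernel(const float *pf4Mom, float *out)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out[i] = pf4Mom[12 * i + 3];   /* strided read: most of each fetch is wasted */
}

/* Structure of arrays: each component lives in its own array of N floats, so
   adjacent threads read adjacent addresses and the access fully coalesces. */
__global__ void soaKernel(const float *comp3, float *out)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out[i] = comp3[i];             /* unit-stride read: full memory bandwidth */
}
[/codebox]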

Rearranging your data will likely give you a decent speedup. But that's not guaranteed; you're doing so much juicy math that it's possible memory isn't a bottleneck anyway. (Sometimes it's handy to use RivaTuner to change your GPU's clock speeds: if your compute speed is sensitive to the device memory clock, you have memory dependence.)

As another "is this rearrangement worthwhile" quick test, replace every array access in your calculation with a multiplication by some register float value, as sketched below. The output will be wrong, of course, but you can see whether there's any impact on computation speed, since the simplification does not change the amount of computation done. If there isn't any speed change, then you're likely not memory limited. (For this quick test, make sure the optimizer can't collapse the computed value. Use something like float placeholder=threadIdx.x; as your multiplier so it can't be simplified away at compile time.)
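
Applied to the first line of your j=0 block, the test would look something like this (my sketch, to be pasted inside your kernel; SQ and glMPi are your own names):

[codebox]
/* Quick memory-dependence test: swap a global-memory read for a register
   value the compiler cannot fold away, keeping the arithmetic unchanged. */
float placeholder = threadIdx.x;               /* unknown at compile time */

/* was: fP = sqrt(SQ(pf4Mom[12*i+0*4+3]) - SQ(glMPi)); */
fP = sqrt(SQ(placeholder) - SQ(glMPi));        /* same math, no memory traffic */
[/codebox]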

I am not a C2D specialist… I don't understand what "free" load/store means. (Like you store 2 variables and the CPU does the third store by itself, just like Christmas sales in retail shops?) :-) Just joking… I don't understand much of those statements… And the iterator thing reminds me of C++.

Here you have most of what I know in one place:

http://arstechnica.com/hardware/news/2006/04/core.ars

Pudding is served, Sir ;)

I tested for a somewhat large n ~ 1,024,000 and CPU/GPU iteration counts of 100. I assume you should choose these parameters on your system so that the program runs for about a minute, to get more or less reliable timing info.

Cheers

Thanks for the tips, I am always glad to hear something useful. As for the code, I left those cumbersome index manipulations in place to make the code computationally heavier just to see how they compare on CPU and GPU.

Well … that isn’t working very well! I am moving them back to something like:

[codebox]
#define W8  256 /* 8 warps */

inline float sq(float x) { return x * x; }

#define glMRho 0.70f
#define glMPi  0.14f
#define glWRho 0.10f

float fTotARe[W8], fTotAIm[W8];
float pf4Mom[3][4][W8];

cpu(/* under construction */)
{
	int i;
	for (i = 0; i < W8; i++)
	{
		float fP, fCosTheta, fCosPhi, cAmplRe, cAmplIm, fDenIm, fDenRe;
		fTotARe[i] = 0, fTotAIm[i] = 0;
		fP        = sqrtf(sq(pf4Mom[0][3][i]) - sq(glMPi));
		fCosTheta = pf4Mom[0][2][i] / (fP + 0.00001f);
		fCosPhi   = pf4Mom[0][0][i] / (fP * sqrtf(1.f - sq(fCosTheta)) + 0.00001f);
		cAmplRe   = fCosPhi * sqrtf(1.f - sq(fCosTheta));
		cAmplIm   = sqrtf(1.f - fCosPhi) * sqrtf(1.f - sq(fCosTheta));
		fDenRe    = sq(glMRho)
		          + sq(pf4Mom[1][0][i] + pf4Mom[2][0][i])
		          - sq(pf4Mom[1][1][i] + pf4Mom[2][1][i])
		          - sq(pf4Mom[1][2][i] + pf4Mom[2][2][i])
		          - sq(pf4Mom[1][3][i] + pf4Mom[2][3][i]);
		fDenIm    = glWRho;
		fTotARe[i] += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (sq(fDenRe) - sq(fDenIm) + 0.00001f);
		fTotAIm[i] += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (sq(fDenRe) - sq(fDenIm) + 0.00001f);
...
[/codebox]

which with icc will both autovectorize and autothread … B)
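
(For reference, the icc switches I'd expect to do that are -O3 for the auto-vectorization and -parallel for the auto-threading; the exact invocation wasn't posted, so treat those flags as my assumption.)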

Edit: Demq’s demo runs with this one shot invocation:

$ ./partial -g 1 -c 1 -n 4194304

… it runs its timed loop in 3 and 0.75 seconds respectively for gcc (single core on an E1200) vs CUDA (8 cores on an 8400GS). The CUDA optimizer appears to do away with most of the innermost loop - possibly because it has no effect?

Rearranging the data for icc and 4 SSE vector elements on one core brings the timed loop down to 1 second on the E1200 (which in turn is a C2D variant running at 1.6 GHz).

… and further down to 0.8 seconds using both cores.