Convincing skeptical bigwigs on the future of CUDA

Saving money and/or time is always a good argument. :)

You do realize that the 8800GT in the best of worlds maxes out at 300 Gflops? For your claim of a 175x speedup to hold, that implies you are getting at most about 2 Gflops (300/175 ≈ 1.7) out of a 4-way Xeon, which for single precision is otherwise potentially capable of around 100 Gflops.

This 98% (or more) underutilization of the Xeon is what stands out the most here. Are you sure that your measuring methodology is correct, that you really measured the time taken for your kernel to run to completion rather than, say, the time to launch it?

/skeptical_bean_counter

This was simply a demonstration project of the GPU's potential, so I used only a single core on the Xeon vs the GPU. My timing method might have been a bit rough (even though both the CPU and GPU kernels ran for a couple of minutes), but the GPU time was taken after the final memory transfer to the host (which presumably has to wait until all the kernel executions are completed), and it scaled linearly with the number of iterations of the loop inside the GPU kernel (iNMAXRepeat in the code below). The results were then compared with the CPU calculation (identical code, just with a loop over every point). This code was heavy on arithmetic ops (sqrt, etc.).

[codebox]
for(int iRepeat = 0; iRepeat < iNMAXRepeat; iRepeat++)
{
	fTotARe = 0, fTotAIm = 0;
	float fCosTheta, fCosPhi, fP, cAmplRe, cAmplIm, fDenRe, fDenIm;

	// j = 0
	fP        = sqrt(SQ(pf4Mom[12*i+0*4+3]) - SQ(glMPi));
	fCosTheta = pf4Mom[12*i+0*4+2] / (fP + 0.00001);                      // pz/p
	fCosPhi   = pf4Mom[12*i+0*4] / (fP*sqrt(1-SQ(fCosTheta)) + 0.00001);  // px/(p*sin(theta))
	cAmplRe   = fCosPhi*sqrt(1.-SQ(fCosTheta));
	cAmplIm   = sqrt(1-fCosPhi)*sqrt(1.-SQ(fCosTheta));
	fDenRe    = SQ(glob_fMRho)
	          + SQ(pf4Mom[12*i+1*4]   + pf4Mom[12*i+2*4])
	          - SQ(pf4Mom[12*i+1*4+1] + pf4Mom[12*i+2*4+1])
	          - SQ(pf4Mom[12*i+1*4+2] + pf4Mom[12*i+2*4+2])
	          - SQ(pf4Mom[12*i+1*4+3] + pf4Mom[12*i+2*4+3]);
	fDenIm    = glWRho;
	fTotARe  += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
	fTotAIm  += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);

	// j = 1
	fP        = sqrt(SQ(pf4Mom[12*i+1*4+3]) - SQ(glMPi));
	fCosTheta = pf4Mom[12*i+1*4+2] / (fP + 0.00001);                      // pz/p
	fCosPhi   = pf4Mom[12*i+1*4] / (fP*sqrt(1-SQ(fCosTheta)) + 0.00001);  // px/(p*sin(theta))
	cAmplRe   = fCosPhi*sqrt(1.-SQ(fCosTheta));
	cAmplIm   = sqrt(1-fCosPhi)*sqrt(1.-SQ(fCosTheta));
	fDenRe    = SQ(glob_fMRho)
	          + SQ(pf4Mom[12*i+0*4]   + pf4Mom[12*i+2*4])
	          - SQ(pf4Mom[12*i+0*4+1] + pf4Mom[12*i+2*4+1])
	          - SQ(pf4Mom[12*i+0*4+2] + pf4Mom[12*i+2*4+2])
	          - SQ(pf4Mom[12*i+0*4+3] + pf4Mom[12*i+2*4+3]);
	fDenIm    = glWRho;
	fTotARe  += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
	fTotAIm  += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);

	// j = 2
	fP        = sqrt(SQ(pf4Mom[12*i+2*4+3]) - SQ(glMPi));
	fCosTheta = pf4Mom[12*i+2*4+2] / (fP + 0.00001);                      // pz/p
	fCosPhi   = pf4Mom[12*i+2*4] / (fP*sqrt(1-SQ(fCosTheta)) + 0.00001);  // px/(p*sin(theta))
	cAmplRe   = fCosPhi*sqrt(1.-SQ(fCosTheta));
	cAmplIm   = sqrt(1-fCosPhi)*sqrt(1.-SQ(fCosTheta));
	fDenRe    = SQ(glob_fMRho)
	          + SQ(pf4Mom[12*i+1*4]   + pf4Mom[12*i+0*4])
	          - SQ(pf4Mom[12*i+1*4+1] + pf4Mom[12*i+0*4+1])
	          - SQ(pf4Mom[12*i+1*4+2] + pf4Mom[12*i+0*4+2])
	          - SQ(pf4Mom[12*i+1*4+3] + pf4Mom[12*i+0*4+3]);
	fDenIm    = glWRho;
	fTotARe  += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
	fTotAIm  += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (SQ(fDenRe) - SQ(fDenIm) + 0.00001);
}
[/codebox]

I realize that this sort of test heavily favors the GPU, but it seems that arithmetical functions take a lot of time on a CPU. Maybe I am missing some CPU optimization, but I use gcc 4.1.2 at the -O3 optimization level on 64-bit systems.

To make a convincing argument I actually also implemented a part of the real code from the project that had only a couple of operations per memory transfer, and got 20x to 30x speedups depending on some parameters. Even these relatively small numbers were quite convincing, as the intention is to increase the arithmetical sophistication of that part of the code.

Also, could you cite the source for the 100 Gflops figure for single-precision calculations?

Cheers.

This graphic may actually be for double-precision LINPACK, but:

[attached chart: LINPACK Gflops vs problem size]

Thanks. I have a naive question: why does the performance increase with the size of the problem? Do smaller sizes simply run too fast to occupy the entire system?

  1. Upgrade to gcc 4.3.2 to get some of the advantages of auto-vectorization. ICC is sometimes - although not always - an even better choice.

  2. Remember that constants like 0.0001 in C default to double precision unless you deliberately say otherwise (as in 0.0001f). This kind of error is going to bite you on Nvidia gear as well, should you upgrade to something from the 200 series.

  3. Functions like sqrt, sin and friends come in different flavors. You may want to use the faster sqrtf, sinf etc. for single precision.
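
A minimal sketch of what points 2 and 3 mean in practice (my illustration, not code from this thread):

[codebox]
#include <math.h>

/* The double literal and double sqrt() force the whole expression
   through double precision, even though x is a float. */
float slow(float x)
{
	return sqrt(x + 0.0001);   /* 0.0001 is a double; sqrt() is the double flavor */
}

/* Keeping everything single precision: float literal, float sqrtf(). */
float fast(float x)
{
	return sqrtf(x + 0.0001f);
}
[/codebox]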

With these optimizations, it is very likely that you could see a humble Intel Atom beat the crap out of your original run on the top-of-the-line Intel Xeon - which of course is ridiculous.

It is also not clear from your description whether you are taking the time after the result has been copied back to the host or after you have said that you would like this to happen at some future point in time. That would be an easy fix though:

NAME
	cudaThreadSynchronize - wait for compute-device to finish

SYNOPSIS
	cudaError_t cudaThreadSynchronize(void)

DESCRIPTION
	Blocks until the device has completed all preceding requested tasks.
	cudaThreadSynchronize() returns an error if one of the preceding tasks
	failed.

RETURN VALUE
	Relevant return values:
	cudaSuccess
	Note that this function may also return error codes from previous,
	asynchronous launches.

SEE ALSO
	cudaThreadExit
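
To make that concrete, here is a rough sketch of the timing pattern being suggested (the kernel, launch configuration, and timer are placeholders of mine, not anyone's actual benchmark code):

[codebox]
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

/* stand-in kernel, just so the sketch is complete */
__global__ void myKernel(float *d)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	d[i] *= 2.0f;
}

void timedLaunch(float *d_data, float *h_data, size_t bytes)
{
	clock_t t0 = clock();
	myKernel<<<256, 256>>>(d_data);  /* returns immediately: launches are asynchronous */
	cudaThreadSynchronize();         /* block until the kernel has actually finished */
	clock_t t1 = clock();
	printf("kernel time: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

	/* a device-to-host cudaMemcpy also waits for prior work to finish, which
	   is why timing after the final copy back captures the full kernel run */
	cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
}
[/codebox]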


I see this chart is from Intel's LINPACK page. They usually quote single-precision benchmarks, as one can see from, say, their FFT benchmark vs FFTW: http://www.intel.com/cd/software/products/…/eng/266852.htm . I remember this trick they used to show their library's superior speed, but FFTW wins in double precision :).

One can see that the top performance there is no more than ~30 Gflops with 8 threads on a 3.4 GHz quad-core! While the theoretical max can be a large number, I don't think it is possible to achieve it easily or practically on a typical problem.

Cheers

Thanks for pointing that out; I had missed sqrtf, but I just corrected it and ran again, with all the numeric constants ending in "f" to ensure the compiler takes them as floats. At the moment I have access to a machine with a 3.16 GHz 45nm Core2Duo and an 8400GS. With the improvements the speedup went down to 8.3x for the 8400GS. If I scale it up to the 8800GT (14 multiprocessors vs 1 in the 8400) and from 3.16 GHz to 3.2 GHz, you still get ~110x. If you are interested, I should soon get a quad-core i7 920 with a GTX285, so I can post the benchmark on that one as well.

Regarding time measurement, I was measuring the total time it takes to launch the kernel AND copy the data back to the host. I also timed the kernel alone with cudaStart/StopTimer, calling cudaThreadSynchronize before the memory transfer, with results very close to the full time.

Regarding upgrading the compiler, I agree it might help, but I don't think it will speed up the CPU code by 10x. ICC, on the other hand, seems quite capricious and rarely gives improvements over gcc; sometimes it even slows the code down.

I think the point with #1 is that getting full float performance out of an Intel core requires the use of SSE registers, which can operate on 4 single precision floats at a time. This either means hand-coding SSE operations (using assembly, compiler primitives, or higher level functions like the AMD Vector Math Library) or using a compiler that can analyze your for loops and generate appropriate SSE instructions automatically. Those sorts of changes can win you a factor of 4 on data parallel algorithms. (Well, unless you are looping over enough data that the system RAM becomes the bottleneck.)
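
For illustration, a minimal hand-coded SSE sketch (my example; the function and names are made up) that does four float multiplies per instruction:

[codebox]
#include <xmmintrin.h>

/* Scale n floats by k, four at a time; n is assumed to be a multiple of 4. */
void scale4(float *out, const float *in, float k, int n)
{
	__m128 vk = _mm_set1_ps(k);                 /* broadcast k into all 4 lanes */
	for (int i = 0; i < n; i += 4)
	{
		__m128 v = _mm_loadu_ps(in + i);            /* load 4 packed floats */
		_mm_storeu_ps(out + i, _mm_mul_ps(v, vk));  /* 4 multiplies in one op */
	}
}
[/codebox]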

Uhmm … No, not really. I would rather like to see compilable versions of your benchmarks. You are still at less than 50% of the expected performance of a C2D - even in single-threaded mode and excluding SSE.

Why do you think so? What factors other than SSE affect your performance that much?

Coding style??

I cannot believe that nobody has mentioned memory bandwidth yet. CPU FLOPS can be severely underutilized because of the memory access pattern and dog-slow memory (well, compared to the GPU's humongous pipe). For instance, in my line of work the most finely tuned compute kernel only pushes ~1.5-2 GFLOPS on AMD Barcelona chips because of its random memory access pattern.

That C2D has - with a reasonably cheap motherboard - twice the bandwidth of an 8400GS as well as twice the theoretical peak performance, plus "free" load/store and iterator increment/evaluation. Landing at 1/8th rather than 2x indicates that something did not work out so well.

But the proof is in the pudding. Let's talk again when/if the benchmark becomes available.

Demq :

Maybe a bit off-topic… but when you post code, some of us immediately go into "analysis mode", so I have a few GPU comments for you.

It's no wonder the GPU loves your compute; it's just powering through tons of straightforward FP math… a rare case where the GPU just stomps through at full potential!

But… even though you're so math-heavy, it's possible you're still at least partially memory limited. In CUDA, your memory access pattern isn't particularly efficient… you have a bunch of input values concatenated into one array and you're eating them up 12 at a time. But when you access them like pf4Mom[12*i+1*4]+pf4Mom[12*i+2*4], you're doing really inefficient memory reads… which are wasting 15/16 of the bandwidth.
If you transpose your memory storage into 12 arrays of N values instead of one array of N structures of 12 elements each, you'll be accessing them at full speed. Also, it's not obvious, but that 12*i computation is relatively expensive (!)… 1/4 the speed of an FP op. It would be eliminated entirely by the 12-array method.

Such data rearrangements are common on the GPU, and on the CPU too; you'll often see "structure of arrays" versus "array of structures", especially when doing SSE.
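
A sketch of the two layouts (illustrative kernels of mine, not your code):

[codebox]
/* Array of structures: thread i's read of pf4Mom[12*i+3] sits 48 bytes away
   from thread i+1's, so a half-warp's reads cannot coalesce on G80-class HW. */
__global__ void aosKernel(const float *pf4Mom, float *out)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out[i] = pf4Mom[12 * i + 3];   /* strided read: most of each fetch is wasted */
}

/* Structure of arrays: each component lives in its own array of N floats, so
   adjacent threads read adjacent addresses and the access fully coalesces. */
__global__ void soaKernel(const float *comp3, float *out)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	out[i] = comp3[i];             /* unit-stride read: full memory bandwidth */
}
[/codebox]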

Rearranging your data will likely give you a decent speedup. But that's not guaranteed; you're doing so much juicy math that it's possible memory isn't a bottleneck anyway. (Sometimes it's handy to use RivaTuner to change your GPU's clock speeds: if your compute speed is sensitive to the device memory clock, you have memory dependence.)

As another "is this rearrangement worthwhile" quick test, replace every array access in your calculation with a multiplication by some register float value, as sketched below. The output will be wrong, of course, but you can see whether there's any impact on computation speed, since the simplification does not change the amount of computation done. If there isn't any speed change, then you're likely not memory limited. (For this quick test, make sure the optimizer can't collapse the computed value. Use something like float placeholder=threadIdx.x; as your multiplier so it can't be simplified away at compile time.)
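
Applied to the first line of your j=0 block, the test would look something like this (my sketch, to be pasted inside your kernel; SQ and glMPi are your own names):

[codebox]
/* Quick memory-dependence test: swap a global-memory read for a register
   value the compiler cannot fold away, keeping the arithmetic unchanged. */
float placeholder = threadIdx.x;               /* unknown at compile time */

/* was: fP = sqrt(SQ(pf4Mom[12*i+0*4+3]) - SQ(glMPi)); */
fP = sqrt(SQ(placeholder) - SQ(glMPi));        /* same math, no memory traffic */
[/codebox]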

I am not a C2D specialist… I don't understand what "free" load/store means. (Like you store 2 variables and the CPU does the third store by itself, just like Christmas sales in retail shops?) :-) Just joking… I don't understand much of those statements… And the iterator thing reminds me of C++.

Here you have most of what I know in one place:

http://arstechnica.com/hardware/news/2006/04/core.ars

Pudding is served, Sir ;)

I tested for a somewhat large n ~ 1,024,000 and CPU/GPU iteration counts of 100. I assume you should choose these parameters on your system so that the program runs for about a minute, to get more or less reliable timing info.

Cheers

Thanks for the tips, I am always glad to hear something useful. As for the code, I left those cumbersome index manipulations in place to make the code computationally heavier just to see how they compare on CPU and GPU.

Well … that isn’t working very well! I am moving them back to something like:

[codebox]
#define W8  256 /* 8 warps */

inline float sq(float x) { return x * x; }

#define glMRho 0.70f
#define glMPi  0.14f
#define glWRho 0.10f

float fTotARe[W8], fTotAIm[W8];
float pf4Mom[3][4][W8];

cpu(/* under construction */)
{
	int i;
	for (i = 0; i < W8; i++)
	{
		float fP, fCosTheta, fCosPhi, cAmplRe, cAmplIm, fDenIm, fDenRe;
		fTotARe[i] = 0, fTotAIm[i] = 0;
		fP        = sqrtf(sq(pf4Mom[0][3][i]) - sq(glMPi));
		fCosTheta = pf4Mom[0][2][i] / (fP + 0.00001f);
		fCosPhi   = pf4Mom[0][0][i] / (fP * sqrtf(1.f - sq(fCosTheta)) + 0.00001f);
		cAmplRe   = fCosPhi * sqrtf(1.f - sq(fCosTheta));
		cAmplIm   = sqrtf(1.f - fCosPhi) * sqrtf(1.f - sq(fCosTheta));
		fDenRe    = sq(glMRho)
		          + sq(pf4Mom[1][0][i] + pf4Mom[2][0][i])
		          - sq(pf4Mom[1][1][i] + pf4Mom[2][1][i])
		          - sq(pf4Mom[1][2][i] + pf4Mom[2][2][i])
		          - sq(pf4Mom[1][3][i] + pf4Mom[2][3][i]);
		fDenIm    = glWRho;
		fTotARe[i] += (cAmplRe*fDenRe + cAmplIm*fDenIm) / (sq(fDenRe) - sq(fDenIm) + 0.00001f);
		fTotAIm[i] += (cAmplIm*fDenRe - cAmplRe*fDenIm) / (sq(fDenRe) - sq(fDenIm) + 0.00001f);
...
[/codebox]

which with icc will both autovectorize and autothread … B)
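
(For reference, the icc switches I'd expect to do that are -O3 for the auto-vectorization and -parallel for the auto-threading; the exact invocation wasn't posted, so treat those flags as my assumption.)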

Edit: Demq’s demo runs with this one shot invocation:

$ ./partial -g 1 -c 1 -n 4194304

… it runs its timed loop in 3 and 0.75 seconds respectively for gcc (single core on an E1200) vs CUDA (8 cores on an 8400GS). The CUDA optimizer appears to do away with most of the innermost loop - possibly because it has no effect?

Rearranging the data for icc and 4 SSE vector elements on one core brings the timed loop down to 1 second on the E1200 (which in turn is a C2D variant running at 1.6 GHz).

… and further down to 0.8 seconds using both cores.