240 versus 32 cores

Hi,

I’m developing a CSG evaluation module on Cuda. One part involves calculating distances from a large number of arbitrary

points to a given sphere as well as the local normal vector in these points relative to the sphere. An “embarrassingly parallel” problem.

I have an input array with points (float4 for alignment reasons) and an output array of float4’s containing the

(normalized) normal vector (x,y,z) and the distance to the sphere (w).

On a MacBookPro (GeForce 8600M GT, 32 cores@0,93Ghz) it takes about 0.09 (s) to process five million points.

On a (GTX285, 240 cores@1,48Ghz )-machine, it takes about 0.06 (s) for the same setup. Unnecessary to say I’m

a bit disappointed about the speedup.

Different blocksizes don’t do that much, if I stick to multiples of 64 (they seldom do, are they overrated or is it just me?),

occupancy is 1 in all cases,

no divergent branches,

no non-coalescing reads or writes,

No memcpy overhead because i’m using paged-memory (cudaMallocHost()),

With five million points and a blocksize of (say) 128 i would say i have enough data throughput to keep the GPU busy…

I’m running out of ideas…

Any thoughts?

Kind Regards,

Daniel Dekkers

PS, this is the kernel, i need a pitch and labels because p_CudaResultArray is actually a 2D array. One row is processed at a time by the GPU.

[codebox]global void CudaCSGSphereKernel_Normal

(const float4* p_CudaPointArray,

float4*		p_CudaResultArray,

const int p_Size,

const size_t p_Pitch,

const unsigned char p_Label,

const float p_Radius,

const float3 p_Center)

{

int l_PointIndex = (blockIdx.x * blockDim.x + threadIdx.x);

if (l_PointIndex<p_Size)

{

	float4 l_Result = p_CudaPointArray[l_PointIndex];

	float4* l_Row;

	l_Result.x -= p_Center.x;

	l_Result.y -= p_Center.y;

	l_Result.z -= p_Center.z;

	

	float l_Length = max(sqrt(l_Result.x*l_Result.x + l_Result.y*l_Result.y + l_Result.z*l_Result.z), 0.00001f);

	// Normalize...

	l_Result.x = l_Result.x/l_Length;

	l_Result.y = l_Result.y/l_Length;

	l_Result.z = l_Result.z/l_Length;

	l_Result.w = p_Radius - l_Length;

			

	l_Row = (float4*)((char*)p_CudaResultArray + p_Label*p_Pitch);

			

	l_Row[l_PointIndex] = l_Result;

}

}[/codebox]

For such a simple kernel, it is my experience that changing the block size accomplishes little. It does make a difference (I’ve seen up to 20-30% differences) in more complicated kernels.

Hmmm… You are not getting the performance that you should be. This kernel should be memory bandwidth bound. With 5 million threads, and 32 bytes read/written per thread I work that out to be 1.2ms of execution time assuming the hardware gives you 120 GiB/s of bandwidth

How are you benchmarking it? Does the CUDA profiler confirm a gputime of 0.06 seconds?

Are you compiling with -arch=sm_13? I’m not sure, but that might make the compiler turn the sqrt into a double precision calculation.

With so simple a code, there are so few points where the problem could be. Start removing code until you have just the memory read and the memory write. That kernel should definitely execute in around a millisecond.

Hi,

Thanks for the reply,

I removed the “calculation” so it just does a read and write (and a row calculation):

[codebox]global void CudaCSGSphereKernel_Normal

(const float4* p_CudaPointArray,

float4*		p_CudaResultArray,

const int p_Size,

const size_t p_Pitch,

const unsigned char p_Label,

const float p_Radius,

const float3 p_Center)

{

int l_PointIndex = (blockIdx.x * blockDim.x + threadIdx.x);

if (l_PointIndex<p_Size)

{

	float4 l_Result = p_CudaPointArray[l_PointIndex];

	float4* l_Row;

	// Calculations removed...

	l_Row = (float4*)((char*)p_CudaResultArray + p_Label*p_Pitch);				

	l_Row[l_PointIndex] = l_Result;

}

}[/codebox]

This gives the same time results on the cpu timer as with the calculation (i use a high frequency timer surrounding all the cuda stuff).

The time results on the GPU (from the Cuda profiler) show that the GTX285 is actually a lot faster (inside the kernel).

[font=“Courier New”]

MacBookPro (8600M GT):

  • With calculation

– cpu timer 0.09 (s)

– gpu timer 14400

– instructions 1836033

  • Without calculation

– cpu timer 0.09 (s)

– gpu timer 14500

– instructions 664112

GTX285:

  • With calculation

– cpu timer 0.06 (s)

– gpu timer 1491 <- !

– instructions 158650

  • Without calculation

– cpu timer 0.06 (s)

– gpu timer 1450 <- !

– instructions 41500

[/font]

But anyway, i added some extra (cpu) timers…

[font=“Courier”]Starting calculation…

cudaMallocPitch Stopwatch 0.0036927

cudaMalloc Stopwatch 0.0007839

cudaMemcpy Stopwatch 0.0113300

Kernel invocation Stopwatch 0.0000995

cudaMemcpy Stopwatch 0.0140232

cudaFree Stopwatch 0.0007843

cudaFree Stopwatch 0.0035836

BlockSize: 256

Time result (2000000) Intel 1.088953 Cuda 0.034351 Factor 31.7

No errors in results found…[/font]

So it really is bandwidth, the two cudaMemcpy’s (although the kernel invocation is asynchronous with respect to the host i think (?), so that timer is not correct).

That is somewhat bad news. I have to make the kernels bigger and/or get rid of the Memcpy’s somehow but don’t know how to do that yet…

Kind Regards,

Daniel Dekkers

… continued …

I do find it strange though that the Cuda Profiler doesn’t show these cudaMemcpy()'s. Is it because I use cudaMallocHost() to allocate memory on the host instead of a normal malloc()? I remember they “disappeared” after i changed that.

… changing them back…

And they re-appear in the Cuda Profiler.
So… why is this?

Ahh, I see now. You are including the host <-> device memcpy times. Not that there is anything wrong with that, it is measuring what you need to do for the full application, I just misunderstood what you were timing.

The reason for the performance is this:

GTX 285 memory bandwidth: 160 GiB/s

PCI express memcpy bandwidth: 3-6 GiB/s

That is two orders of magnitude difference!

I avoid host to device mem copies by putting everything in the application that I can on the GPU. That way, data produced in one kernel is left on the GPU to be used in the next kernel call. You can also get faster host <-> device memory bandwidth by allocating your memory with cudaMallocHost instead of malloc/new.

Oh, and one additional thing. To get accurate timings of a kernel execution, you need to:

cudaThreadSynchronize()

measure time1

kernel<<<...>>>(...)

cudaThreadSynchronize()

measure time2

as kernel launches are asynchronous.

The CUDA profiler has known issues that it misses some mem copies. The feature sheet of the upcoming CUDA 2.2 says that fewer are missing, but that it still doesn’t catch all of them. Consider it a bug that NVIDIA is aware of :)

that’s a weird case, could you print the grid dimensions just in case? Also, have you checked that the output is what you want? Sometimes that’s a good way to avoid making trivial mistakes.

For the future, you probably want to coalesce the float4 loading with some shared memory, but that should make the gt200 even faster by comparison. You should consider the fastmath functions unless you need the precision.

btw., consider using the [ code ] tags in phpbb instead of codebox to avoid double scrollbars.

Well, actually, it’s not that weird. If you compare the two…

Starting calculation, 2000000 evaluations on GeForce 8600M GT (32 cores)…
cudaMallocPitch Stopwatch 0.0186630
cudaMalloc Stopwatch 0.0001439
cudaMemcpy Stopwatch 0.0180274 (H->D 32000000)
Kernel invocation Stopwatch 0.0392187
cudaMemcpy Stopwatch 0.0292054 (D->H 32000000)
cudaFree Stopwatch 0.0002919
cudaFree Stopwatch 0.0003160
BlockSize: 256
Time result, Intel 1.270977 Cuda 0.105943 Factor 12.0
No errors in results found…

Starting calculation, 2000000 evaluations on GeForce GTX 285 (240 cores)…
cudaMallocPitch Stopwatch 0.0243068
cudaMalloc Stopwatch 0.0008246
cudaMemcpy Stopwatch 0.0244863 (H->D 32000000)
Kernel invocation Stopwatch 0.0042787
cudaMemcpy Stopwatch 0.0307422 (D->H 32000000)
cudaFree Stopwatch 0.0008616
cudaFree Stopwatch 0.0033959
BlockSize: 256
Time result, Intel 1.033028 Cuda 0.088956 Factor 11.6
No errors in results found…

It just shows the kernel invocation is indeed a factor 9 higher on the GTX 285 (I added the cudaThreadSynchronize()'s so the timers are now accurate).
And it shows that the MacBookPro has high bandwidth and fast allocation :-) (i don’t use page-locked memory in this case)

Actually, i’m going to use it for animations. The first Mallocs and Memcpy’s are not really a problem, because they are just setup. My problem is the second D->H Memcpy because these are the results (per frame) for the animation. So my next step is now to see to what degree OpenGL can use the data on the GPU directly. And (indeed) i could look at trying to store the data in shared memory after setup, so during the animation i can use that…

Kind Regards,

Daniel Dekkers

… with page-locked memory using cudaMallocHost(), helps:

Starting calculation, 2000000 evaluations on GeForce GTX 285 (240 cores)…

cudaMallocPitch Stopwatch 0.0038843

cudaMalloc Stopwatch 0.0008222

cudaMemcpy Stopwatch 0.0113152 (H->D 32000000) <-!

Kernel invocation Stopwatch 0.0042112

cudaMemcpy Stopwatch 0.0097950 (D->H 32000000) <-!

cudaFree Stopwatch 0.0007738

cudaFree Stopwatch 0.0033927

BlockSize: 256

Time result, Intel 1.032240 Cuda 0.034231 Factor 30.2

No errors in results found…