Optimization help

Hi, I decided to go back to basics and write a simple program to find the mean of a vector of numbers. I figured this would be a good way to learn. I'm finding that my GPU version of the code runs ~3x slower than the host version, primarily due to the cost of copying data from the host to the device. I'm not sure if there is any way to get around this, or whether I am actually using cudaMallocPitch() correctly.

My algorithm divides the data into DATA_PER_THREAD-sized chunks and finds the average of those chunks in parallel across a set of THREADS_PER_BLOCK threads running in a block. Each thread's average is written to shared memory. Then each block writes the average of this temporary shared-memory result to a global block_avg array. The host program then averages the block_avg array to get the average of the entire data set.

Any pointers as to how I go about optimizing the following code would be appreciated.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <cutil.h>
#include <time.h>

#define DATA_PER_THREAD   10000
#define THREADS_PER_BLOCK 64

__device__ float *d_data;
__device__ float *block_avg;

__global__ void cudaProcess(float *data, float *block_avg, size_t pitch)
{
    extern __shared__ float sdata[];
    int   i;
    float *dp;
    float avg;

    // Each thread walks its own pitched row of DATA_PER_THREAD floats
    dp = (float *) ((char *) data + (blockIdx.x * THREADS_PER_BLOCK + threadIdx.x) * pitch);

    avg = 0.0f;
    for (i = 0; i < DATA_PER_THREAD; i++)
    {
        avg += *dp;
        dp++;
    }
    avg /= (float) DATA_PER_THREAD;

    sdata[threadIdx.x] = avg;
    __syncthreads();

    if (threadIdx.x == 0)
    {
        // The block average now needs to be stored globally.
        // Only one thread needs to do this (idx = 0)
        avg = 0.0f;
        for (i = 0; i < THREADS_PER_BLOCK; i++) avg += sdata[i];
        avg /= (float) THREADS_PER_BLOCK;
        block_avg[blockIdx.x] = avg;
    }
}

float process(float *data, unsigned int N)
{
    int    blocks;
    dim3   block(THREADS_PER_BLOCK, 1, 1);
    dim3   grid;
    int    sbytes;
    float  *host_block_avg;
    unsigned int i, j;
    float  avg;
    size_t pitch;

    blocks = N / (THREADS_PER_BLOCK * DATA_PER_THREAD);  // Fails if N is not a multiple!
    grid.x = blocks;
    grid.y = 1;
    grid.z = 1;

    // In shared memory we store each thread's average
    sbytes = sizeof(float) * THREADS_PER_BLOCK;

    // In global memory we store each block's average
    cudaMalloc((void **) &block_avg, sizeof(float) * blocks);
    host_block_avg = (float *) malloc(sizeof(float) * blocks);

    // Allocate & copy the data over to the device.
    // Using MallocPitch to facilitate coalesced memory access
    cudaMallocPitch((void **) &d_data, &pitch, sizeof(float) * DATA_PER_THREAD, blocks * THREADS_PER_BLOCK);

    for (j = 0; j < blocks; j++)
        for (i = 0; i < THREADS_PER_BLOCK; i++)
        {
            void *d_vp;

            d_vp = (void *) ((char *) d_data + (j * THREADS_PER_BLOCK + i) * pitch);
            cudaMemcpy(d_vp,
                       &data[(j * THREADS_PER_BLOCK * DATA_PER_THREAD) + (i * DATA_PER_THREAD)],
                       sizeof(float) * DATA_PER_THREAD, cudaMemcpyHostToDevice);
        }

    cudaProcess<<<grid, block, sbytes>>>(d_data, block_avg, pitch);

    // Copy the results back to the host
    cudaMemcpy(host_block_avg, block_avg, sizeof(float) * blocks, cudaMemcpyDeviceToHost);

    // Summarize the results
    avg = 0.0f;
    for (i = 0; i < blocks; i++) avg += host_block_avg[i];
    avg /= (float) blocks;

    cudaFree(block_avg);
    cudaFree(d_data);
    free(host_block_avg);

    return (avg);
}

int main(int argc, char **argv)
{
    unsigned int N = 6400000;
    unsigned int i, j;
    int          T = 100;
    time_t       ts, te;
    float        *h_data;
    float        avg;

    h_data = (float *) malloc(sizeof(float) * N);
    for (i = 0; i < N; i++) h_data[i] = (float) i;

    // Do it on the host T times
    ts = time(NULL);
    for (j = 0; j < T; j++)
    {
        avg = 0.0f;
        for (i = 0; i < N; i++) avg += h_data[i];
        avg /= (float) N;
    }
    te = time(NULL);

    printf("%u items, average = %6.4f\n", N, avg);
    printf("Host time : %d sec\n", (int) (te - ts));

    // Do it on the device T times
    ts = time(NULL);
    for (j = 0; j < T; j++)
    {
        avg = process(h_data, N);
    }
    te = time(NULL);

    printf("CUDA result = %6.4f\n", avg);
    printf("GPU time  : %d sec\n", (int) (te - ts));

    return (0);
}

I'm having trouble understanding exactly what you are doing, but I have a very simple recipe for how to do this fast:

Take the reduction example from the SDK.

Change the following line (this is reduce6_kernel, the fastest version of reduction from the example):

      sdata[tid] += g_idata[i] + g_idata[i+blockSize];  

into

      sdata[tid] += (g_idata[i] + g_idata[i+blockSize])/((float) n);  

And you will have the average instead of the sum. It will run at full memory bandwidth, just like the reduction example.
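In case the shape of that kernel is unclear, here is a stripped-down sketch of the idea (not the actual reduce6 code from the SDK, just a simplified tree reduction with the division by n folded into the loads; it assumes blockDim.x is a power of two and that each thread loads exactly one pair of elements):

// Simplified reduction kernel that produces a partial mean per block.
// Each block handles 2*blockDim.x elements; the per-block partials are
// summed on the host (or in a second kernel pass) to get the full mean.
__global__ void meanReduce(const float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // Load two elements per thread, pre-scaled by 1/n
    float x = 0.0f;
    if (i < n)              x += g_idata[i] / (float) n;
    if (i + blockDim.x < n) x += g_idata[i + blockDim.x] / (float) n;
    sdata[tid] = x;
    __syncthreads();

    // Tree reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}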

OK - I've read through the reduction code & docs, and I see some useful tips. However, if you profile the example, you will find the exact same problem that I'm finding, namely that the host-to-device memcpy takes the bulk of the time. In the example, they start the timer after that memcpy, but if you include that time the speed of my code is comparable (I'm 2x slower, but running the CPU version is still 4x faster than the reduction demo)…

So I'm guessing that if the biggest bottleneck is the memcpy, then CUDA is really only worthwhile for work that is computationally heavier, relative to the host/device transfer cost, than simple O(N) operations.

My concern is that I was really looking forward to doing some operations on stuff in GL framebuffers, and my playing around there seems to indicate that copying from a framebuffer to global memory is about as slow as device -> host.

How many values are you taking the average of? If it is not a lot, overhead dominates, and if you are timing the first call, you are also timing initialization overhead. But I cannot imagine that a CPU version can come close to the reduction example for large numbers of values, given that the GPU version will run at something like 70 GB/s, i.e. more than 10 Gfloats/s.

Also, you might check the bandwidthTest output; maybe there is some trouble with your throughput to/from the device.

The framebuffer -> global slowness is a known 'bug'. I do not remember anyone from NVIDIA ever stating it would be fixed (or recognising it as a bug), but I think it is certainly known to NVIDIA.

Only if you actually need to copy the data back to the host. If you can put all steps of a long computation onto the GPU and avoid the device->host copy until the end you can still realize speedups for cheap O(N) calculations.

Additionally, you should look into using pinned memory (cudaMallocHost). Using it you can copy host<->device significantly faster.
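A minimal sketch of the idea (the function and variable names are made up for illustration; cudaMallocHost and cudaFreeHost are the relevant calls):

// Allocate the host buffer as pinned (page-locked) memory instead of malloc().
// The GPU can DMA directly from pinned memory, so host<->device copies run
// significantly faster than from ordinary pageable memory.
float *allocAndUploadPinned(float **d_data, size_t N)
{
    float *h_data;

    cudaMallocHost((void **) &h_data, sizeof(float) * N);   // instead of malloc()
    for (size_t i = 0; i < N; i++) h_data[i] = (float) i;   // fill with test data

    cudaMalloc((void **) d_data, sizeof(float) * N);
    cudaMemcpy(*d_data, h_data, sizeof(float) * N, cudaMemcpyHostToDevice);

    return h_data;   // free later with cudaFreeHost(), not free()
}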

Try it out with the new CUDA 2.0 beta when it is posted. It has been said that OpenGL interop performance has been improved. If, in CUDA 2.0, framebuffer updates are limited only by device bandwidth, you should be able to update the framebuffer at ~70 GiB/s.

For both tests I was taking the average of 25.6M floats.

I re-wrote my test to be closer to my actual use case, which is computing the average color of a framebuffer. Here I use RGB packed into the low 24 bits of an int. In this case, the GPU is marginally faster (even including the memcpy time).

I haven’t looked into pinned malloc - but I’ll check it out - thanks.

I found that the biggest problem with my memcpy was that I was using cudaMallocPitch to facilitate coalesced reads from global memory. However, this meant that instead of a single cudaMemcpy, I had to do one per pitched row. Given that my layout already ensured coalesced reads (I think), I got rid of that code and just used a regular cudaMalloc and a single cudaMemcpy. This is still slow, but substantially faster than calling cudaMemcpy many times.
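So the transfer now boils down to something like this (reusing d_data, data, and N from the code above):

// One linear allocation and one bulk copy, instead of a pitched allocation
// plus one cudaMemcpy per row. The kernel then indexes the buffer as a flat
// 1D array so that consecutive threads read consecutive elements.
cudaMalloc((void **) &d_data, sizeof(float) * N);
cudaMemcpy(d_data, data, sizeof(float) * N, cudaMemcpyHostToDevice);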

Use cudaMemcpy2D; it takes the pitch as a parameter.
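For example, the whole per-row copy loop in your original process() could collapse into one call (a sketch reusing d_data, data, pitch, and blocks from that code; the host rows are packed contiguously, so the source pitch is just the row width):

// Copy all pitched rows in a single call
cudaMemcpy2D(d_data, pitch,                           // destination and its pitch (bytes)
             data, sizeof(float) * DATA_PER_THREAD,   // source and its pitch (bytes)
             sizeof(float) * DATA_PER_THREAD,         // width of each row (bytes)
             blocks * THREADS_PER_BLOCK,              // number of rows
             cudaMemcpyHostToDevice);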

If you aren’t using cudaMallocPitch and your pitch is not a multiple of 16 floats, you are hurting your kernel’s performance by a factor of ~10-20 by not coalescing reads. This of course assumes that you are indexing the memory as if it were a 2D matrix. If you are treating the whole block of memory as one big 1D dataset and averaging everything, then coalescing is easy and you don’t need cudaMallocPitch or cudaMemcpy2D.
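To make the 1D case concrete, a coalesced pattern looks like the sketch below (illustrative names, not your code): on every iteration the whole block strides forward together, so consecutive threads always touch consecutive addresses. Compare this with the first kernel you posted, where each thread walks its own contiguous run of DATA_PER_THREAD elements, so the addresses touched by a half-warp are a full row apart and the reads cannot coalesce.

// Coalesced 1D access: thread t reads element (base + i*blockDim.x + t),
// so each half-warp reads one contiguous, aligned segment per iteration.
__global__ void partialSums(const float *data, float *thread_sum, int perThread)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = blockIdx.x * blockDim.x * perThread;

    float acc = 0.0f;
    for (int i = 0; i < perThread; i++)
        acc += data[base + i * blockDim.x + threadIdx.x];

    thread_sum[tid] = acc;   // one partial sum per thread
}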

Yes, I'm treating everything as a 1D array, and every access lands on a multiple of 16 bytes, so I believe my memory accesses are coalesced. Is there an easy way to check?

Also, I took advantage of being able to read 4 ints in a single instruction, which gave me a boost.

My biggest boost, as you suggested, came from using page-locked host memory (cudaMallocHost).

My CUDA code is now significantly faster than the host version (3x faster, even when including the memory overhead).

My source code is available at average.cu

The visual profiler (http://forums.nvidia.com/index.php?showtopic=58283) can be used to easily determine whether your memory reads are coalesced.

A little too easy: High incoherent load rate of 66.4534 for method ‘cudaProcess’ :(

I guess my understanding of coalescing is incorrect. But now I think I understand.

PS: Thanks for all your help so far, this has been great!

So you'll have fun once you have only coalesced reads :D
I find the visual profiler one of the most useful tools around CUDA, now I just need a debugger ;)

Yeah, it fixed my coalesced read problem immediately. Well, I fixed it, but the profiler made it glaring.

Of course, given that the bulk of my code time was being spent in HostToDevice memcpy the total run time of my kernel only reduced slightly. From 90/10% memcpy/process, my code went to 95/5% :)

Given that my actual use case is processing data read from a GL framebuffer, I'll take another stab at PBOs tomorrow. The problem I had last time was that working with PBOs seemed slow, but hopefully it is still faster than dealing with a host-to-device memcpy.
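For reference, the PBO path I have in mind looks roughly like this (a sketch using the cudaGL* interop calls from cuda_gl_interop.h; the buffer object, dimensions, pixel format, and kernel launch are placeholders, and in real code the buffer would be registered once at startup rather than every frame):

#include <GL/glew.h>            // or your platform's GL extension header
#include <cuda_gl_interop.h>

// Sketch: read the framebuffer into a GL pixel buffer object, then map the
// PBO into CUDA's address space so the kernel can consume it without a
// round trip through host memory.
void processFramebuffer(GLuint pbo, int width, int height)
{
    int *d_pixels;

    // Framebuffer -> PBO (the pixels stay on the GPU)
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // Map the PBO for CUDA and run the averaging kernel directly on it
    cudaGLRegisterBufferObject(pbo);
    cudaGLMapBufferObject((void **) &d_pixels, pbo);

    // ... launch the averaging kernel on d_pixels here ...

    cudaGLUnmapBufferObject(pbo);
    cudaGLUnregisterBufferObject(pbo);
}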

And that would be the smartest approach anyway, given that the performance trouble with PBO -> global memory copies should be fixed in 2.0.

For posterity, here is my final* version. It processes 8 billion RGB ints per second (not including memcpy time). I think that's pretty close to bandwidth.

/*
** Code for finding the average RGB value of an array. Each RGB value is stored
** in the lower 3 bytes of each integer.
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <cutil.h>
#include <time.h>

// Chosen by guesswork - probably should use the occupancy spreadsheet
#define DATA_PER_THREAD   512
#define THREADS_PER_BLOCK 64

__device__ int *d_data;     // The device's copy of the data to be averaged
__device__ int *block_avg;  // Temp space for storing the average of the data
                            // processed by each block

__device__ int rgbToInt(float r, float g, float b)
{
    return (int(r) << 16) | (int(g) << 8) | int(b);
}

int rgbToIntH(float r, float g, float b)
{
    return (int(r) << 16) | (int(g) << 8) | int(b);
}

__global__ void cudaProcessNew(int *data, int *block_avg)
{
    /*
    ** Reads from data and writes to block_avg.
    ** Optimized to do R&B together, and G separate. 33% faster than R, G, B.
    */
    extern __shared__ int sdata[];  // Shared memory used for storing thread averages
    int   i;
    int4  *blockBase;
    int4  D;
    int   rb, g, c;
    float R, G, B;

    // Take advantage of single-instruction 128-bit reads by reading an int4 at a time.
    // NB: Coalesced reads. The block offset is in bytes, so it must be scaled by sizeof(int).
    blockBase = (int4 *) ((char *) data + blockIdx.x * THREADS_PER_BLOCK * DATA_PER_THREAD * sizeof(int)) + threadIdx.x;

    R = B = G = 0.0f;
    for (i = 0; i < (DATA_PER_THREAD >> 2); i++)
    {
        D = *(int4 *) (blockBase + i * THREADS_PER_BLOCK);

        rb = g = 0;
        c = D.x; rb += c & 0xff00ff; g += c & 0x00ff00;
        c = D.y; rb += c & 0xff00ff; g += c & 0x00ff00;
        c = D.z; rb += c & 0xff00ff; g += c & 0x00ff00;
        c = D.w; rb += c & 0xff00ff; g += c & 0x00ff00;

        // The shifts also divide each 4-pixel channel sum by 4
        R += (rb >> 18) & 0xff;
        G += (g  >> 10) & 0xff;
        B += (rb >>  2) & 0xff;
    }
    R /= (float) (DATA_PER_THREAD >> 2);
    G /= (float) (DATA_PER_THREAD >> 2);
    B /= (float) (DATA_PER_THREAD >> 2);

    c = rgbToInt(R, G, B);
    sdata[threadIdx.x] = c;
    __syncthreads();

    if (threadIdx.x == 0)
    {
        // The block average now needs to be stored globally.
        // Only one thread needs to do this (idx = 0)
        int r, b;

        R = G = B = 0.0f;
        for (i = 0; i < THREADS_PER_BLOCK; i++)
        {
            c = sdata[i];
            r = (c >> 16) & 0xff;
            g = (c >>  8) & 0xff;
            b = c & 0xff;
            R += r;
            G += g;
            B += b;
        }
        R /= (float) THREADS_PER_BLOCK;
        G /= (float) THREADS_PER_BLOCK;
        B /= (float) THREADS_PER_BLOCK;

        c = rgbToInt(R, G, B);
        block_avg[blockIdx.x] = c;
    }
}

int process(int *data, unsigned int N)
{
    int    blocks;
    dim3   block(THREADS_PER_BLOCK, 1, 1);
    dim3   grid;
    int    sbytes;
    int    *host_block_avg;
    size_t i;
    int    r, g, b, c;
    float  R, G, B;

    blocks = N / (THREADS_PER_BLOCK * DATA_PER_THREAD);  // Fails if N is not a multiple!
    grid.x = blocks;
    grid.y = 1;
    grid.z = 1;

    // In shared memory we store each thread's average
    sbytes = sizeof(int) * THREADS_PER_BLOCK;

    // In global memory we store each block's average
    cudaMalloc((void **) &block_avg, sizeof(int) * blocks);
    host_block_avg = (int *) malloc(sizeof(int) * blocks);

    // Allocate & copy the data over to the device
    /*
        HANDLED IN main() FOR TIMING.
        cudaMalloc((void **) &d_data, sizeof(int) * N);
        cudaMemcpy(d_data, data, sizeof(int) * N, cudaMemcpyHostToDevice);
    */

    cudaProcessNew<<<grid, block, sbytes>>>(d_data, block_avg);

    // Copy the results back to the host
    cudaMemcpy(host_block_avg, block_avg, sizeof(int) * blocks, cudaMemcpyDeviceToHost);

    // Summarize the results
    R = G = B = 0.0f;
    for (i = 0; i < blocks; i++)
    {
        c = host_block_avg[i];
        r = (c >> 16) & 0xff;
        g = (c >>  8) & 0xff;
        b = c & 0xff;
        R += r;
        G += g;
        B += b;
    }
    R /= (float) blocks;
    G /= (float) blocks;
    B /= (float) blocks;

    c = rgbToIntH(R, G, B);

    cudaFree(block_avg);
    free(host_block_avg);
    // cudaFree(d_data);    HANDLED IN main()

    return (c);
}

int processGold(int *data, size_t N)
{
    size_t i;
    int    r, g, b, c;
    float  R, G, B;

    R = G = B = 0.0f;
    for (i = 0; i < N; i++)
    {
        c = data[i];
        r = (c >> 16) & 0xff;
        g = (c >>  8) & 0xff;
        b = c & 0xff;
        R += r;
        G += g;
        B += b;
    }
    R /= (float) N;
    G /= (float) N;
    B /= (float) N;

    c = rgbToIntH(R, G, B);

    return (c);
}

int main(int argc, char **argv)
{
    size_t       N = 2560000;
    unsigned int i, j;
    int          T = 20000;
    time_t       ts, te;
    int          *h_data;
    int          result;

    printf("Items: %u [%6.2fMB]\n", (unsigned int) N, (sizeof(int) * N) / (1024.0 * 1024.0));

    // Use page-locked ('pinned') host memory for faster access from the device
    cudaMallocHost((void **) &h_data, sizeof(int) * N);

    // Fill with test data - should produce a result of 0x00ABCD7F
    for (i = 0; i < N; i++) h_data[i] = 0xABCD00 + ((i % 2) ? 0x00 : 0xFF);

    /*
    // Do it on the host T times
    ts = time(NULL);
    for (j = 0; j < T; j++)
    {
        result = processGold(h_data, N);
    }
    te = time(NULL);
    printf("CPU  result = %06X\n", result);
    printf("Host time : %d sec\n", (int) (te - ts));
    */

    // Copy the data once, outside the timed loop
    cudaMalloc((void **) &d_data, sizeof(int) * N);
    cudaMemcpy(d_data, h_data, sizeof(int) * N, cudaMemcpyHostToDevice);

    ts = time(NULL);
    for (j = 0; j < T; j++)
    {
        result = process(h_data, N);
    }
    te = time(NULL);

    printf("CUDA result = %06X\n", result);
    printf("GPU time  : %d sec\n", (int) (te - ts));

    cudaFree(d_data);
    cudaFreeHost(h_data);

    return (0);
}
* Final, until someone finds another obvious optimization :)