Zeroes and Large Arrays Strange GPU behaviour

techshadow · August 7, 2009, 3:29pm

Hi everyone. I’m new to the boards. I’ve been looking through old threads for similar problems to my own but I haven’t found anything that seems to directly relate to the problems I’m having. Sorry if this has been covered before and I missed it.

I’m currently working on a small program that finds the distance between vectors in 100-dimensional space using a simple Euclidian distance algorithm (i.e. the square-root of the sum of squared differences). This is being done with a very large array of values (1,000,000 * 100 floats at this point). The strange behaviour I’m getting is that when I try to run my program with 1,000,000 vectors, when I use cudaMemcpy to move the results from the device to the host and output the results to the scren, everything is 0. When I try it with 100,000 vectors the results are correct when compared with the CPU code for about 30,000 entries but then begin to repeat the same value over and over (i.e. if the results were correct for 30,000 values the 30,001+ values are the same as the 30,000th value).

The GPU has 500 MB of memory and the above configuration would result in a minimum of 400MB of memory being used so I’m thinking that I may be trying to load too much into the GPU’s memory.

My questions are three fold:

Has anyone else experienced the problem of having all zeroes or a repeating value being output when using very large amounts of memory on the GPU?
Is it possible that using too much memory could be responsible for the behaviour I’m seeing?
I plan to scale this up to 1,000,000,000 vectors in the future which will definitely be too large for the GPU’s memory. Am I right to think that I will need to make multiple kernel calls to load the data in and out?

Thanks in advance!

seibert · August 7, 2009, 7:31pm

Are you checking for error conditions in the kernel launch? It sounds like for the larger case, the kernel never launches, or is aborted for some reason. What grid and block size are you using for the kernel?

techshadow · August 10, 2009, 10:56pm

Hi Seibert. Sorry for the late response. I was away over the weekend. Anyway, I added more error checking to my code and you were exactly right. The kernel was not launching due to an improperly defined grid / block size. My code was actually in such a bad state that I decided to start over with something similar but a little easier.

I was wondering if anyone can have a look at the code I’m currently working on and give me some tips. What I’m doing is adding up every value in a row of values in parallel using a parallel reduction algorithm. There are 1000 vectors each with 100 values to be added (for convenience I’ve simply made all of the values the same so that every vector, when added, should provide the same result). I get the correct values but the GPU implementation is about 20 times slower than the CPU version and I’m not really sure why. Originally I had loaded each vector into shared memory and each block would compute the result from that, but this quickly used up the shared memory and I received an error message about “too much shared data”. I gave up on that and just started using a device array called tuples to store all of the vectors but, as I said, the result of this technique is 20 times slower than with the CPU implementation. :-(

I’ve read everything I can from the documentation and listened to the lectures provided on the website, but I just can’t seem to get any CUDA programs to outperform their CPU counterparts. I would really appreciate any help that anyone can offer.

include <stdio.h>

include <stdlib.h>

include <time.h>

include <cutil_inline.h>

include <cutil.h>

include <math.h>

define MAX_BLOCK_SIZE 512 // Max threads per block

define NUM_VECTORS 10000

define UNPADDED_VECTOR_LENGTH 100 // No padding

define VECTOR_LENGTH 128 // With padding to ensure it is a multiple of 16

define BLOCK_SIZE (VECTOR_LENGTH)/2 // Make the block size half the length of a vector to facilitate reduction

// stores vectors to be reduced

device float tuples[VECTOR_LENGTH*NUM_VECTORS];

// Kernel

global void performCalculation(float *results)

{

int tx = threadIdx.x;

int ty = threadIdx.y;

int bx = blockIdx.x;

int by = blockIdx.y;

for(int stride = BLOCK_SIZE; stride > 0; stride >>= 1)

{

	__syncthreads();

	if(tx < stride)

	{

		tuples[by*VECTOR_LENGTH+tx] += tuples[by*VECTOR_LENGTH+tx+stride];

	}

}

__syncthreads();

results[by*VECTOR_LENGTH] = tuples[by*VECTOR_LENGTH];

}

float *golden_performCalculation(float *vectors)

{

float *result = (float*)calloc(VECTOR_LENGTH*NUM_VECTORS, sizeof(float));

for(int i = 0; i < NUM_VECTORS; i++)

{

	for(int j = 0; j < VECTOR_LENGTH; j++)

	{

		result[i] += vectors[i*VECTOR_LENGTH + j];

	}

}

return result;

}

int main(void)

{

unsigned int timer;

cutCreateTimer(&timer);

float *result = (float*)malloc(sizeof(float)*VECTOR_LENGTH*NUM_VECTORS);

float *devResult;

cudaMalloc((void**)&devResult, sizeof(float)*VECTOR_LENGTH*NUM_VECTORS);

float *comparison = (float*)malloc(sizeof(float)*VECTOR_LENGTH*NUM_VECTORS);

memset(comparison, 0, VECTOR_LENGTH*NUM_VECTORS);

for(int i = 0; i < NUM_VECTORS; i++)

{

	for(int j = 0; j < UNPADDED_VECTOR_LENGTH; j++)

	{

		comparison[i*VECTOR_LENGTH + j] = j;

	}

}

cutilSafeCall(cudaMemcpyToSymbol(tuples, comparison, sizeof(float)*VECTOR_LENGTH*NUM_VECTORS));

cutStartTimer(timer);

dim3 dimGrid(1, NUM_VECTORS);

dim3 dimBlock(BLOCK_SIZE, 1);

performCalculation<<<dimGrid, dimBlock>>>(devResult);

cudaMemcpy(result, devResult, sizeof(float)*VECTOR_LENGTH*NUM_VECTORS, cudaMemcpyDeviceToHost);

cutStopTimer(timer);

printf("\nGPU elapsed time was %f ms.\n", cutGetTimerValue(timer));

/* for(int i = 0; i < NUM_VECTORS; i++)

{

	printf("\ngpuResult[%d] = %f", i, result[i*VECTOR_LENGTH]);

}*/

cutResetTimer(timer);

cutStartTimer(timer);

float* cpuResult;

cpuResult = golden_performCalculation(comparison);

cutStopTimer(timer);

printf("\nCPU elapsed time was %f ms.\n", cutGetTimerValue(timer));

// Cleanup

free(result);

free(comparison);

cudaFree(devResult);

}

Edit: I forgot to mention, it seems that the majority of the slowness seems to be from the cudaMemcpy from the device to the host after the calculations have been performed. If I place the cutStartTimer and cutStopTimer statements around that single line of code it shows that it is taking almost a full 20ms by itself. :-o Am I doing something wrong or is it normal for it to be that slow?