Slow performance with long arrays

Hi forum

I'm working with a kernel that writes its results into an array, but when the array is too long the app runs slowly (again).

When this array has more than 340,000 elements, the GPU takes a long time to perform any operation.

At the moment I have this code in the kernel:

__global__ void cudaEvaluate(int size, int nb_msgs, int* dev_Results)
{
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;
	int grid_width = gridDim.x * blockDim.x;
	int tid = y * grid_width + x;

	if (tid >= size)   // guard: discard threads past the end of the array
		return;

	dev_Results[tid] = 0;

	for (int i = 0; i < nb_msgs; i++)
		dev_Results[tid] = tid; // this is only to make a test
}
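A grid only needs enough threads to cover `size`; the usual sizing scheme is ceiling division, falling back to a second grid dimension when one dimension would exceed the hardware limit. A minimal host-side sketch (my own helper names, not your actual launch code; 65535 is the x-dimension limit on older GPUs):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: choose a 2D grid of `threads`-sized blocks that
// covers at least `size` elements. If the block count fits in one grid
// dimension, use a 1D grid; otherwise split it into a square 2D grid.
struct Grid2D { long long x, y, threads; };

Grid2D make_grid(long long size, long long threads, long long max_grid_x)
{
    long long blocks = (size + threads - 1) / threads;  // ceiling division
    if (blocks <= max_grid_x)
        return { blocks, 1, threads };
    long long side = (long long)std::ceil(std::sqrt((double)blocks));
    return { side, side, threads };  // side*side >= blocks, so size is covered
}
```

With these dimensions the kernel above would be launched as `cudaEvaluate<<<dim3(g.x, g.y), g.threads>>>(size, nb_msgs, dev_Results);`, and the `tid >= size` guard discards the spare threads in the last block(s).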

Now, as the vector grows, the time to process the kernel increases.

What can I do to get better performance as this array keeps growing?

It is slow even without the ‘for’ loop.

How are the threads and blocks organized in this solution? Properly defining the execution grid is very important for achieving the best performance.


Where do you have ‘dev_Success_Msg’ defined? I can’t see the definition in the code sample.
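To put a number on the grid-organization point: one branch of the sizing code below launches blocks of a single thread, and since a warp always executes 32 lanes in lockstep, that leaves 31 of 32 lanes idle. A small illustration (my own helper, not from the thread):

```cpp
#include <cassert>

// A block smaller than a warp still occupies a whole warp of 32 lanes.
// This returns the fraction of hardware lanes doing useful work for a
// given block size.
double lane_utilization(int threads_per_block)
{
    int warps = (threads_per_block + 31) / 32;  // warps allocated per block
    return (double)threads_per_block / (warps * 32.0);
}
```

So `lane_utilization(1)` is 1/32, while a multiple-of-32 block size such as 256 or 512 gives full utilization; block sizes of 1 should generally be avoided.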

Hi cmaster.matso

  1. I have changed the variable dev_Success_Msg in the code above.
  2. For the different solutions I use this function:
int array_length = 550000000;
Nb_Errors_To_Simulate = 8;

void Get_GridSize(....)
{
	.... // here I define some vars
	cudaDeviceProp props;
	HANDLE_ERROR(cudaGetDeviceProperties(&props, 0));
	// data is a struct

	globalmem_usable = props.totalGlobalMem;
	globalmem_usable -= sizeof(my_objects) * count_objects; // later I subtract the size of the objects used on the GPU
	Nb_Supported_Blocks = Round2Int(globalmem_usable / (sizeof(unsigned int) * Nb_Errors_To_Simulate) / props.maxThreadsPerBlock, &int_out_of_range);
	Nb_Required_Blocks = array_length / props.maxThreadsPerBlock;

	if (array_length <= props.maxGridSize[0])
	{
		data[i].Grid_X = array_length;
		data[i].Grid_Y = 1;
		data[i].Threads_per_Block = 1;
	}
	if (Nb_Required_Blocks > props.maxGridSize[0] && Nb_Required_Blocks <= Nb_Supported_Blocks)
	{
		data[i].Grid_X = sqrt(Nb_Required_Blocks);
		data[i].Grid_Y = sqrt(Nb_Required_Blocks);
		data[i].Threads_per_Block = props.maxThreadsPerBlock;
		for (;;)
		{
			if (double(data[i].Grid_X * data[i].Grid_Y * data[i].Threads_per_Block * Nb_Errors_To_Simulate * sizeof(unsigned int)) >= (globalmem_usable - data[i].Grid_X * data[i].Grid_Y * data[i].Threads_per_Block * Nb_Errors_To_Simulate * sizeof(unsigned int)))
			{
				if (data[i].Grid_X < props.maxGridSize[0])
				{
					tmp_nb_threads = array_length / Nb_Required_Blocks;
					if (tmp_nb_threads > props.maxThreadsPerBlock)
						data[i].Threads_per_Block = props.maxThreadsPerBlock;
					else
					{
						data[i].Threads_per_Block = tmp_nb_threads;
						data[i].Grid_X = Nb_Required_Blocks;
						data[i].Grid_Y = 1;
					}
				}
			}
		}
	}
}
I don't know why a grid of [65535 * 1] blocks with 512 threads per block works better than a grid of [255 * 255] blocks with 512 threads.

With the first, a kernel operation can take 3 hours, but with the second it can take up to 10 hours.
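Worth noting: those two configurations launch almost the same total number of threads, so the thread count alone cannot explain a 3x runtime gap; the difference likely lies elsewhere (for example in how tid maps to memory, though that is only a guess). Checking the totals:

```cpp
#include <cassert>

// Total threads launched for a given grid/block configuration.
long long total_threads(long long grid_x, long long grid_y, long long threads_per_block)
{
    return grid_x * grid_y * threads_per_block;
}
// 65535 * 1 * 512  = 33,553,920 threads
// 255 * 255 * 512  = 33,292,800 threads  (less than 1% difference)
```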



Can someone tell me which is the best configuration for a big grid?