Slow performance with long arrays

Hi forum

I'm working with an array in a kernel to collect the results, but when the array is too long the app runs too slowly (again).

When this array has a length greater than 340,000 elements, the GPU takes a long time to perform any operation.

At the moment I have this code in the kernel:

__global__ void cudaEvaluate(int size, int nb_msgs, int* dev_Results)
{
	int tid = 0, x = 0, y = 0, grid_width = 0;
	x = threadIdx.x + blockIdx.x * blockDim.x;
	y = threadIdx.y + blockIdx.y * blockDim.y;

	// Flatten the 2-D thread index into a linear element index.
	grid_width = gridDim.x * blockDim.x;
	tid = y * grid_width + x;

	if (tid >= size)
		return;

	dev_Results[tid] = 0;

	for (int i = 0; i < nb_msgs; i++)
	{
		dev_Results[tid] = tid; // This is only to make a test
	}
}

As the vector grows, the time to process the kernel increases.

How can I get better performance as this array keeps getting larger?

It is slow even without the ‘for’ loop.
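For context, the kernel can be timed in isolation with CUDA events (a minimal sketch; Grid and Block here just stand in for whatever launch configuration is used, which is not shown above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaEvaluate<<<Grid, Block>>>(size, nb_msgs, dev_Results); // the kernel above
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // wait until the kernel has finished

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop); // elapsed time in milliseconds
printf("kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);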
Thanks

How are the threads and blocks organized for this solution? It is very important to define the execution grid properly to achieve the best performance.
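For a 1-D result array, a common way to define the grid is to fix the block size and derive the block count from the array length, for example (a sketch only; the 256 threads per block is an assumption, not a tuned value for your GPU):

int threadsPerBlock = 256;                                           // assumed block size
int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division

// 1-D grid of 1-D blocks; in the kernel above y is 0 and tid reduces to x.
cudaEvaluate<<<blocksPerGrid, threadsPerBlock>>>(size, nb_msgs, dev_Results);

If blocksPerGrid exceeds props.maxGridSize[0], the grid has to be split into a second dimension or the kernel has to process several elements per thread.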

MK

P.S.
Where do you have ‘dev_Success_Msg’ defined? I can’t see the definition in the code sample.

Hi cmaster.matso

  1. I have changed the variable dev_Success_Msg in the code above.
  2. For the different solutions I use this function:
int array_length = 550000000;
Nb_Errors_To_Simulate = 8;

void Get_GridSize(....)
{
	.... //here I define some vars

	cudaDeviceProp props;
	HANDLE_ERROR(cudaGetDeviceProperties(&props, 0));

	//data is a struct

	globalmem_usable = props.totalGlobalMem;

	globalmem_usable -= sizeof(my_objects) * count_objects; //later I subtract the size of the objects used in the GPU

	Nb_Supported_Blocks = Round2Int(globalmem_usable / (sizeof(unsigned int) * Nb_Errors_To_Simulate) / props.maxThreadsPerBlock, &int_out_of_range);

	Nb_Required_Blocks = array_length / props.maxThreadsPerBlock;

	if (array_length <= props.maxGridSize[0])
	{
		data[i].Grid_X = array_length;
		data[i].Grid_Y = 1;
		data[i].Threads_per_Block = 1;
	}
	else if (Nb_Required_Blocks > props.maxGridSize[0] && Nb_Required_Blocks <= Nb_Supported_Blocks)
	{
		data[i].Grid_X = sqrt(Nb_Required_Blocks);
		data[i].Grid_Y = sqrt(Nb_Required_Blocks);

		data[i].Threads_per_Block = props.maxThreadsPerBlock;

		for (;;)
		{
			if (double(data[i].Grid_X * data[i].Grid_Y * data[i].Threads_per_Block * Nb_Errors_To_Simulate * sizeof(unsigned int)) >= (globalmem_usable - data[i].Grid_X * data[i].Grid_Y * data[i].Threads_per_Block * Nb_Errors_To_Simulate * sizeof(unsigned int)))
			{
				data[i].Grid_Y--;

				if (data[i].Grid_X < props.maxGridSize[0])
					data[i].Grid_X++;
			}
			else
				break;
		}
	}
	else
	{
		tmp_nb_threads = array_length / Nb_Required_Blocks;

		if (tmp_nb_threads > props.maxThreadsPerBlock)
			data[i].Threads_per_Block = props.maxThreadsPerBlock;
		else
			data[i].Threads_per_Block = tmp_nb_threads;

		data[i].Grid_X = Nb_Required_Blocks;
		data[i].Grid_Y = 1;
	}
....
}

I don't know why a grid size of [65535 * 1] with 512 threads works better than a grid size of [255 * 255] with 512 threads.

With the first, a kernel operation can take 3 hours, but with the second it can take up to 10 hours.
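One common alternative that decouples the grid size from the array length is a grid-stride loop, so a fixed configuration such as [65535 * 1] x 512 can cover any array size (a sketch of the test kernel only; cudaEvaluateStride is just a name for this sketch, and this is not necessarily the cause of the 3-hour vs 10-hour difference):

__global__ void cudaEvaluateStride(int size, int nb_msgs, int* dev_Results)
{
	int stride = gridDim.x * blockDim.x; // total number of threads in the grid
	for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < size; tid += stride)
	{
		dev_Results[tid] = tid; // same test write as in the original kernel
	}
}

// e.g. cudaEvaluateStride<<<65535, 512>>>(array_length, nb_msgs, dev_Results);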

Thanks

Hi

Can someone tell me what the best configuration is for a big grid?

Thanks