Problem with matrix size


I have implemented a simple variety computation. It takes an N x N matrix and, for each element, checks the nearest neighbours; whenever a neighbour differs from the current element, it adds one to that element's count.
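For reference, here is a minimal CPU sketch of the computation as I understand it from the kernel further down (count the *distinct* neighbour values that differ from the element; border elements get 0). The function name `variete_cpu` is mine, not from the original code:

```c
#include <stdio.h>

/* CPU reference for the variety computation: for each interior element,
 * count how many distinct values among its 8 neighbours differ from the
 * element itself. Border elements are set to 0. */
void variete_cpu(const int *tab, int *var, int N)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (i == 0 || j == 0 || i == N - 1 || j == N - 1) {
                var[N * i + j] = 0;
                continue;
            }
            int pix = tab[N * i + j];
            int seen[8];               /* distinct differing values found so far */
            int compt = 0;
            const int di[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
            const int dj[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
            for (int k = 0; k < 8; k++) {
                int v = tab[(i + di[k]) * N + (j + dj[k])];
                if (v == pix)
                    continue;          /* same as the element: not counted */
                int dup = 0;
                for (int m = 0; m < compt; m++)
                    if (seen[m] == v)
                        dup = 1;       /* value already counted once */
                if (!dup)
                    seen[compt++] = v;
            }
            var[N * i + j] = compt;
        }
    }
}
```

This is useful as a golden reference to check the GPU results against.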

My algorithm works correctly, but I noticed that if the size is a multiple of 128 x 128 or 256 x 256, the execution time is very high: there is a spike. I don't understand it, because I have used two different algorithms, one naive and one optimised, and both show the same problem.

Here is a graph of the different algorithms:

Can somebody help me find the problem?


Maybe some code would be helpful… then we can see if something strange is happening in your code.

__global__ void naive_variete_gpu(int *var, int *tab, int N)
{
	unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
	unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;

	if (i != 0 && j != 0 && i < N - 1 && j < N - 1)
	{
		int pix, compt;
		int haut, bas, gauche, droite;
		int h_gauche, h_droite, b_gauche, b_droite;

		compt = 0;
		pix      = tab[N * i + j];
		h_gauche = tab[(i-1) * N + (j-1)];
		haut     = tab[i     * N + (j-1)];
		h_droite = tab[(i+1) * N + (j-1)];
		gauche   = tab[(i-1) * N + j];
		droite   = tab[(i+1) * N + j];
		b_gauche = tab[(i-1) * N + (j+1)];
		bas      = tab[i     * N + (j+1)];
		b_droite = tab[(i+1) * N + (j+1)];

		// count the distinct neighbour values that differ from pix
		if (pix != h_gauche)
			compt++;
		if (pix != haut && haut != h_gauche)
			compt++;
		if (pix != h_droite && h_droite != h_gauche && h_droite != haut)
			compt++;
		if (pix != gauche && gauche != h_gauche && gauche != haut && gauche != h_droite)
			compt++;
		if (pix != droite && droite != h_gauche && droite != haut && droite != h_droite && droite != gauche)
			compt++;
		if (pix != b_gauche && b_gauche != h_gauche && b_gauche != haut && b_gauche != h_droite && b_gauche != gauche && b_gauche != droite)
			compt++;
		if (pix != bas && bas != h_gauche && bas != haut && bas != h_droite && bas != gauche && bas != droite && bas != b_gauche)
			compt++;
		if (pix != b_droite && b_droite != h_gauche && b_droite != haut && b_droite != h_droite && b_droite != gauche && b_droite != droite && b_droite != b_gauche && b_droite != bas)
			compt++;

		var[N * i + j] = compt;
	}
	else if (i < N && j < N)	// guard: threads outside the matrix must not write
	{
		var[N * i + j] = 0;
	}
}


That is my code for the naive algorithm.

Can you also post your kernel invocation, please?

Here is my kernel invocation:

// naive variete
for (cpt = 1; cpt < 2; cpt++)
{
	int N_BLOCK = 16;	// threads per block edge (e.g. 16)
	int par_grid = ceil((double)N / N_BLOCK);

	dim3 grid(par_grid, par_grid, 1);
	dim3 threads(N_BLOCK, N_BLOCK, 1);

	deb = clock();

	// allocate the arrays on the graphics card
	int *d_idata;
	int *d_odata;
	CUDA_SAFE_CALL( cudaMalloc((void**) &d_idata, mem_size) );
	CUDA_SAFE_CALL( cudaMalloc((void**) &d_odata, mem_size) );

	// copy host memory to device
	CUDA_SAFE_CALL( cudaMemcpy(d_idata, tab, mem_size,
			cudaMemcpyHostToDevice) );

	// run the kernel on the GPU
	dcalc = clock();
	naive_variete_gpu <<< grid, threads >>> (d_odata, d_idata, N);
	cudaThreadSynchronize();	// launches are asynchronous; without this, fcalc - dcalc only measures the launch
	fcalc = clock();

	// copy the results back from the graphics card to the host
	CUDA_SAFE_CALL( cudaMemcpy(varg, d_odata, mem_size,
			cudaMemcpyDeviceToHost) );

	fin = clock();
}

Hmm, my best guess is that it has something to do with coalesced memory access or something like that; I don't see any obvious problems.

There are two things you can do (and I would do them in this order):

  • Run a fast config and a slow config using the (visual) profiler, and log all information possible. Then you can compare the number of uncoalesced accesses and also divergent branches.
  • Access tab by means of a 2D texture (which is very likely a speedup anyway for you, so I would highly recommend it)
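To give an idea of the second suggestion: reading `tab` through a 2D texture could look roughly like this. This is an untested sketch using the (legacy) texture reference API; `variete_tex_gpu` and `d_array` are my own names:

```cuda
// file-scope 2D texture reference bound to the input matrix
texture<int, 2, cudaReadModeElementType> texTab;

__global__ void variete_tex_gpu(int *var, int N)
{
	unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
	unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;

	if (i != 0 && j != 0 && i < N - 1 && j < N - 1)
	{
		// tex2D takes (x, y) = (column, row); reads go through the 2D texture cache
		int pix  = tex2D(texTab, j,     i);
		int haut = tex2D(texTab, j - 1, i);
		// ... read the other neighbours the same way, then count as before ...
	}
}

// host side: copy tab into a cudaArray and bind the texture before the launch
cudaArray *d_array;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<int>();
cudaMallocArray(&d_array, &desc, N, N);
cudaMemcpyToArray(d_array, 0, 0, tab, mem_size, cudaMemcpyHostToDevice);
cudaBindTextureToArray(texTab, d_array, desc);
```

The texture cache has 2D locality, so each neighbour read of one element is likely to hit a line already fetched by a neighbouring thread, and the coalescing rules no longer apply to those reads.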

In your kernel you have too many if statements. Your kernel is very unpleasant for a SIMD architecture, because you get many threads with different instruction paths in one thread block.

I wouldn’t worry about the if statements. With all the memory accesses in your kernel, your performance is most certainly memory bound.

DenisR is correct in that your problem is memory coalescing. Given that all of your memory accesses are using global memory pointers and the way you access them, not all of them will be coalesced. However, at certain dimensions, some of the memory accesses will be coalesced leading to the performance spikes you see.

You can use the CUDA visual profiler (download from the forum sticky post) to count the number of uncoalesced memory accesses vs coalesced ones. If you are unaware, the difference in performance can be an order of magnitude. See the programming guide for all the gory details on how to coalesce.

Things you can do to significantly boost the performance of your kernel:

  1. Your access pattern for reading “tab” is perfect for the 2D texture cache.
  2. Make sure that the final write to “var” is coalesced. To do this, you will need to use cudaMallocPitch to allocate your 2D memory with some padding at the end of each row to ensure coalescing.
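A sketch of what point 2 could look like (untested; note that the pitch returned by cudaMallocPitch is in *bytes*, so the kernel must index rows through the pitch instead of N, and the pitch has to be passed to the kernel as an extra argument):

```cuda
// host side: allocate the output with row padding (pitch is in BYTES)
size_t pitch;
int *d_odata;
cudaMallocPitch((void**)&d_odata, &pitch, N * sizeof(int), N);

// in the kernel, locate row i through the pitch instead of N:
//	int *row = (int*)((char*)var + i * pitch);
//	row[j] = compt;

// copy back with cudaMemcpy2D, which understands the row padding
cudaMemcpy2D(varg, N * sizeof(int),	// destination and its row size in bytes
             d_odata, pitch,		// source and its pitch in bytes
             N * sizeof(int), N,	// width in bytes, height in rows
             cudaMemcpyDeviceToHost);
```

The padding makes every row start at an aligned address, which is what the coalescing rules require for the half-warp's writes.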

Edit: I forgot to add that if you want to see how optimally you are using the device, count the number of memory bytes you read/write. Then divide by the running time of the kernel to calculate an effective GiB/s bandwidth usage. Something near ~70GiB/s should be attainable in your case.
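The bandwidth arithmetic is simple enough to put in a helper. A sketch, assuming 9 int reads (8 neighbours + the centre) and 1 int write per element, which slightly overcounts the border:

```c
#include <stdio.h>

/* Effective bandwidth in GiB/s: total bytes moved by the kernel divided
 * by its running time. 9 reads + 1 write of 4-byte ints per element. */
double effective_gib_per_s(int N, double kernel_seconds)
{
    double bytes = (double)N * N * (9 + 1) * sizeof(int);
    return bytes / kernel_seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Compare the result against your card's theoretical peak to see how far off you are.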

Well, I don't know if there are too many if statements (they are not helping performance, of course), but all threads evaluate all the if conditions, so there is no really big serialization problem (only a compt++ is performed; there is no else).

And I would find it highly strange if they were the cause of the strange drops in performance.

If in one half-warp only a single thread's if condition succeeds, you get 1/16 of the possible performance (for 2 clock cycles you have just 1 useful instruction for the half-warp instead of 16); if 4 succeed, 1/4, and so on. I suppose the drops in performance are not due to this, but if statements are still very unpleasant for performance, especially the first 3-4; in the later ones, the 2 clocks of the compt++ are hidden by the many equality checks.

In your code it looks as if the 32 threads within a warp have the same value of j and consecutive values of i – see pages 7, 8 in the Programming Guide for how the 2D thread indices map to the 1D index which is then cut up into warps.
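To make that mapping concrete, here is a small sketch of the linearisation from the Programming Guide (linear index = threadIdx.x + threadIdx.y * blockDim.x, with warps cut from runs of 32 consecutive linear indices; `warp_of` is my own helper name):

```c
#include <stdio.h>

/* Warp number of the thread at (tx, ty) in a block that is blockDimX
 * threads wide, following the Programming Guide's linearisation. */
int warp_of(int tx, int ty, int blockDimX)
{
    int linear = tx + ty * blockDimX;  /* threadIdx.x varies fastest */
    return linear / 32;               /* 32 consecutive linear indices per warp */
}
```

For a 16 x 16 block, threads (0..15, 0) and (0..15, 1) land in warp 0: within each half-warp, j is constant while i runs over consecutive values, which is exactly the stride-N pattern described above.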

Looking at the way you are referencing the elements of tab, not only are they not coalesced, but consecutive i values correspond to a stride of N in accessing the device memory. If N=128 or N=256 it wouldn’t surprise me if this means it’s hammering a single bank within the device memory and that is leading to the worst possible access pattern.

To test this hypothesis, you can change the access N*i + j to i + N*j, so that consecutive i values correspond to consecutive memory references.
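The difference between the two layouts is easy to see from the element offsets (the two helper names below are mine):

```c
#include <stdio.h>

/* Offset of element (i, j) under the two layouts discussed above. */
int offset_original(int N, int i, int j) { return N * i + j; } /* stride N between consecutive i */
int offset_swapped (int N, int i, int j) { return i + N * j; } /* stride 1 between consecutive i */
```

With the original layout, two threads with consecutive i touch addresses N elements apart; after the swap they touch adjacent addresses, which spreads the accesses across the memory banks instead of hammering one of them.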

This doesn’t fix the coalescence problem – that will require more work.