Matrix manipulation, performance problem

My problem is the following:

I have a group of 2D matrices and I want to evaluate, for each point, the maximum value across all the matrices.

All the matrices have the same dimensions.

To try to optimise performance, I use a linear representation of the data, like this:

If I have matrices with dimensions i, j and k matrices, the data for each point is stored as:

[Data_i_j_k1, Data_i_j_k2 … ] …
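
In other words, all the values of one point are contiguous in memory: the value of point p in matrix k sits at offset p * nbMAT + k (this is the indexing my kernel uses below). A minimal sketch, with p, k and valeur just as illustrative names:

  // all nbMAT values of one point are contiguous:
  // value of point p in matrix k  ->  domaine[p * nbMAT + k]
  int valeur = domaine[p * nbMAT + k];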

I use 256 threads per block and I have tested with 1 to 1000 blocks (for the moment I don't understand how to work out the best number of blocks to use …).
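
For reference, this is roughly how I set up the launch configuration; nbTh and nbBloc are just illustrative names here (nbTh = 256, and nbBloc is what I varied from 1 to 1000):

  dim3 threads(nbTh, 1);   // 256 threads per block, only the x dimension is used
  dim3 grid(nbBloc, 1);    // tested with nbBloc from 1 to 1000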

My kernel function is the following:

__global__ void calculmaxLineaireGPU(int * domaine, int * maxRes2, int dimX, int nbMAT, int nbPoint)
{
    const unsigned int tidX = threadIdx.x;
    const unsigned int tidY = threadIdx.y; // always 0, only one dimension is used
    const unsigned int bx   = blockIdx.x;

    int indice = tidX + bx * dimX;

    int tmp;
    int tab[100]; // at most 100 matrices

    __syncthreads();

    if (indice < nbPoint)
    {
        // copy all the values of this point into a local array
        for (int i = 0; i < nbMAT; i++) {
            tab[i] = domaine[indice * nbMAT + i];
        }

        __syncthreads();

        // reduce to the maximum
        tmp = tab[0];
        for (int j = 1; j < nbMAT; j++) {
            tmp = fmaxf(tmp, tab[j]);
        }

        maxRes2[indice] = tmp;
    }
}

And I call this kernel with this loop:

CUDA_SAFE_CALL( cudaMalloc( (void**) &gpuData, mem_gpu_ref) );
CUDA_SAFE_CALL( cudaMalloc( (void**) &gpuRes,  mem_gpu_R_ref) );

while (colNonCalc > 0)
{
    if (colNonCalc < nbCol_it) {
        // last (partial) chunk
        colCalc   = colNonCalc;
        mem_gpu   = colCalc * nbMAT * sizeof(int);
        mem_gpu_R = colCalc * sizeof(int);
        threads.x = colCalc;
    } else {
        colCalc   = nbCol_it;
        mem_gpu   = mem_gpu_ref;
        mem_gpu_R = mem_gpu_R_ref;
    }

    CUDA_SAFE_CALL( cudaMemcpy( gpuData, data, mem_gpu, cudaMemcpyHostToDevice) );

    // execute the kernel
    calculmaxLineaireGPU<<<grid, threads>>>(gpuData, gpuRes, nbTh, nbMAT, colCalc);
    // CUT_CHECK_ERROR("Kernel execution failed");

    CUDA_SAFE_CALL( cudaMemcpy( res, gpuRes, mem_gpu_R, cudaMemcpyDeviceToHost) );

    // move the cursor forward
    colNonCalc -= colCalc;
    data += colCalc * nbMAT;
    res  += colCalc;
}

CUDA_SAFE_CALL( cudaFree(gpuData) );
CUDA_SAFE_CALL( cudaFree(gpuRes) );

This code is functional … but its performance is bad … about the same as the CPU.

What is wrong in my code (bad memory usage, …)?

I don't understand why my performance is so bad :(

Thanks a lot

Beleys