 # Matrix manipulation, performance problem

My problem is the following:

I have a set of 2D matrices and I want to compute, for each point (i, j), the maximum value over all the matrices.

All the matrices have the same dimensions.

To try to optimise performance, I store the data in a linear (flattened) layout like this:

If I have k matrices, each of dimension i × j, the k values of a given point (i, j) are stored next to each other:

[Data_i_j_k1, Data_i_j_k2 … ] …
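To make this layout concrete, here is a minimal sketch of the index computation. It assumes row-major order inside each matrix, which is not stated above; `linear_index` and `nbCols` are hypothetical names, while `nbMAT` matches the kernel parameter.

```c
#include <stddef.h>

/* Hypothetical helper: offset of point (row, col) of matrix k in the
   interleaved buffer, assuming row-major order within each matrix. */
static size_t linear_index(size_t row, size_t col, size_t k,
                           size_t nbCols, size_t nbMAT)
{
    size_t point = row * nbCols + col; /* index of the point, like `indice` in the kernel */
    return point * nbMAT + k;          /* the nbMAT values of one point are adjacent */
}
```

With 3 matrices of 4 columns each, point (0, 1) starts at offset 3, right after the three values of point (0, 0).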

I use 256 threads per block, and I have tested with 1 to 1000 blocks (for the moment I don’t understand how to determine the best number of blocks to use …).
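A common starting point for the block count is one thread per point, so that the grid size is the number of points divided by the block size, rounded up. This is only a sketch under that assumption, and `blocks_for` is a hypothetical helper name; it says nothing about which count is actually fastest on a given GPU.

```c
/* Hypothetical helper: smallest number of blocks such that
   blocks * threadsPerBlock >= nbPoint (ceiling division). */
static unsigned int blocks_for(unsigned int nbPoint, unsigned int threadsPerBlock)
{
    return (nbPoint + threadsPerBlock - 1) / threadsPerBlock;
}
```

With 256 threads per block, 1000 points need 4 blocks; the 24 surplus threads are filtered out by the `if (indice < nbPoint)` test in the kernel.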

My kernel function is the following:

```
__global__ void calculmaxLineaireGPU(int * domaine, int * maxRes2, int dimX, int nbMAT, int nbPoint)
{
    const unsigned int tidX = threadIdx.x;
    const unsigned int tidY = threadIdx.y; // always 0, only one dimension is used
    const unsigned int bx   = blockIdx.x;

    // One thread per point; dimX is the number of threads per block
    int indice = tidX + bx * dimX;
    int tmp;
    int tab[100]; // 100 matrices max

    if (indice < nbPoint)
    {
        // Copy the nbMAT values of this point into a local array
        for (int i = 0; i < nbMAT; i++) {
            tab[i] = domaine[indice*nbMAT + i];
        }

        // Reduce the local array to its maximum
        tmp = tab[0];
        for (int j = 1; j < nbMAT; j++) {
            tmp = max(tmp, tab[j]); // integer max; fmaxf would convert the ints to float
        }

        maxRes2[indice] = tmp;
    }
}
```

And I call this kernel from this loop:

```
// Allocate device buffers once, sized for a full chunk of nbCol_it points
CUDA_SAFE_CALL( cudaMalloc( (void**) &gpuData, mem_gpu_ref));
CUDA_SAFE_CALL( cudaMalloc( (void**) &gpuRes , mem_gpu_R_ref));

while (colNonCalc > 0)
{
    if (colNonCalc < nbCol_it) {
        // Last, partial chunk
        colCalc = colNonCalc;
        mem_gpu = colCalc * nbMAT * sizeof(int);
        mem_gpu_R = colCalc * sizeof(int);
    } else {
        // Full chunk
        colCalc = nbCol_it;
        mem_gpu = mem_gpu_ref;
        mem_gpu_R = mem_gpu_R_ref;
    }

    CUDA_SAFE_CALL( cudaMemcpy( gpuData, data, mem_gpu, cudaMemcpyHostToDevice) );

    // execute the kernel
    calculmaxLineaireGPU<<<grid, threads>>> (gpuData, gpuRes, nbTh, nbMAT, colCalc);
    // CUT_CHECK_ERROR("Kernel execution failed");

    CUDA_SAFE_CALL( cudaMemcpy( res, gpuRes, mem_gpu_R, cudaMemcpyDeviceToHost) );

    // Advance the cursors to the next chunk
    colNonCalc -= colCalc;
    data += colCalc*nbMAT;
    res += colCalc;
}

CUDA_SAFE_CALL(cudaFree(gpuData));
CUDA_SAFE_CALL(cudaFree(gpuRes));
```

This code is functional … but its performance is bad … about the same as on the CPU.
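For comparison, a CPU baseline over the same interleaved layout can be sketched as follows. This is an assumed reference implementation, not the CPU code that was actually timed; `max_per_point_cpu` is a hypothetical name, and it assumes `nbMAT` is at least 1.

```c
#include <stddef.h>

/* Hypothetical CPU reference: for each point, take the maximum of its
   nbMAT adjacent values in the interleaved buffer `domaine`. */
static void max_per_point_cpu(const int *domaine, int *maxRes,
                              size_t nbPoint, size_t nbMAT)
{
    for (size_t p = 0; p < nbPoint; p++) {
        const int *vals = domaine + p * nbMAT; /* values of point p */
        int best = vals[0];
        for (size_t k = 1; k < nbMAT; k++) {
            if (vals[k] > best) {
                best = vals[k];
            }
        }
        maxRes[p] = best;
    }
}
```

Such a baseline is also handy for checking that the GPU results are correct before comparing timings.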

What is wrong with my code (bad memory usage, …)?

I don’t understand why my performance is so bad :(

Thanks a lot

Beleys