2D array & memory space: mostly about cudaMallocPitch & cudaMemcpy2D

Hello everyone!

I’ve read lots of topics on this forum about cudaMallocPitch, but none of them deals with a program as simple as the one I’m trying to write!

What I want to do is simply fill a 2D array on the device, copy it into an array on the host, and print the latter.

I used the Programming Guide to write the device code and mostly ideas from the forum for the memory part!

Here is my code:

__global__ void test(float A[256][24])
{
	int i = threadIdx.x;
	int j = threadIdx.y;
	A[i][j] = i + j;
}

float** compute2D()
{
	int i = 0, j = 0;

	float** h_result;
	h_result = (float**) malloc(256 * sizeof(float*));
	for (i = 0; i < 256; i++) h_result[i] = (float*) malloc(24 * sizeof(float));

	float d_result[256][24];
	size_t d_pitch = 0, h_pitch = 0;

	// Allocation of the memory on the device
	cudaMallocPitch((void**)&d_result, &d_pitch, 256 * sizeof(float), 24);

	// We call the kernel
	dim3 dimBlock(256, 24);
	test<<<1, dimBlock>>>(d_result);

	// We copy the array from the device to the host
	cudaMemcpy2D(h_result, h_pitch, d_result, d_pitch, 256 * sizeof(float), 24, cudaMemcpyDeviceToHost);

	// We display the host array
	for (i = 0; i < 256; i++)
		for (j = 0; j < 24; j++)
			printf("h_result[%d][%d] = %f\n", i, j, h_result[i][j]);

	return h_result;
}

The compilation works, but when I call my function, I get:

The Programming Guide says that, to loop over the array elements in device code, I should do the following:

__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{
	for (int r = 0; r < height; ++r) {
		float* row = (float*)((char*)devPtr + r * pitch);
		for (int c = 0; c < width; ++c) {
			float element = row[c];
		}
	}
}

But if I do that, won't every thread on the GPU execute the whole loop? What I want is for each thread to handle one element!
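What I have in mind is more like the following sketch (just a guess, and fillKernel is only a name I made up): each thread takes one (row, column) pair from its thread indices and writes only that element, using the pitch to find its row.

// Sketch: one element per thread, assuming the array was allocated with cudaMallocPitch
__global__ void fillKernel(float* devPtr, size_t pitch, int width, int height)
{
	int c = threadIdx.x + blockIdx.x * blockDim.x;  // column index of this thread
	int r = threadIdx.y + blockIdx.y * blockDim.y;  // row index of this thread
	if (c < width && r < height) {
		// pitch is in bytes, hence the cast to char* before stepping down r rows
		float* row = (float*)((char*)devPtr + r * pitch);
		row[c] = (float)(r + c);
	}
}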

The other possible source of the problem is the pitch. I wrote the memory part that way because that’s what I saw in most of the topics, but I don’t think I really understand what the pitch is used for.
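From what I gathered, the pitch returned by cudaMallocPitch is the actual width of each allocated row in bytes (it can be larger than width*sizeof(float) because rows are padded for alignment), and cudaMemcpy2D wants the pitch of both the destination and the source. So I think the host side should look roughly like this sketch (with a single contiguous host buffer instead of my array of pointers; h_buf and d_ptr are just names I picked for the sketch):

// Sketch of the host side for a 256-row by 24-column array
int width = 24, height = 256;

float* d_ptr = NULL;
size_t d_pitch = 0;
// the requested row width is given in bytes; d_pitch receives the padded row size in bytes
cudaMallocPitch((void**)&d_ptr, &d_pitch, width * sizeof(float), height);

// one contiguous host buffer, so its pitch is simply width * sizeof(float)
float* h_buf = (float*)malloc(width * height * sizeof(float));
size_t h_pitch = width * sizeof(float);

// copy height rows of width*sizeof(float) bytes each, honouring both pitches
cudaMemcpy2D(h_buf, h_pitch, d_ptr, d_pitch,
             width * sizeof(float), height, cudaMemcpyDeviceToHost);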

EDIT:

Actually, if I do:

d_result[2][2] = 3;
test<<<1, dimBlock>>>(d_result);
printf("%f\n", d_result[2][2]);
cudaMemcpy2D(h_result, h_pitch, d_result, d_pitch, 256*sizeof(float), 24, cudaMemcpyDeviceToHost);
printf("%f\n", h_result[2][2]);

I get 3 for d_result[2][2], which means the value is not changed when the kernel is called, and 0 for h_result[2][2], which means the Memcpy does not work either!
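Maybe I should also check the return codes of the CUDA calls to see which one actually fails; something like this sketch wrapped around my existing calls:

// Sketch: print the status of the copy and of the last kernel launch
cudaError_t err = cudaMemcpy2D(h_result, h_pitch, d_result, d_pitch,
                               256 * sizeof(float), 24, cudaMemcpyDeviceToHost);
printf("cudaMemcpy2D: %s\n", cudaGetErrorString(err));

err = cudaGetLastError();  // reports kernel launch / configuration errors
printf("last CUDA error: %s\n", cudaGetErrorString(err));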

Thanks for your time!

OK, my device program was totally wrong!

For those who are interested, the right way to do it is:

__global__ void test(float *p, size_t pitch, int H, int W)
{
	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	int idy = threadIdx.y + blockIdx.y * blockDim.y;
	// the pitch is in bytes: find the start of row idy, then index the column
	float* row = (float*)((char*)p + idy * pitch);
	if (idx < W && idy < H) row[idx] = 1;
}

And for the rest of the problem: to compute a 256*24 array, I’m probably not supposed to launch it that way, am I right?!

It works for a 30*24 array but not for anything bigger, so I will try to find out where the problem comes from! If somebody has any clue…
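One thing I will check: a single block of 256*24 = 6144 threads is far above the maximum number of threads per block (at most 1024 on recent cards, 512 on older ones), while 30*24 = 720 still fits on a 1024-thread card, which would explain what I am seeing. Splitting the work into a grid of smaller blocks should look roughly like this sketch (d_ptr and d_pitch are the names from the allocation sketch above):

// Sketch: cover a width x height array with 16x16 blocks instead of one huge block
int width = 24, height = 256;
dim3 block(16, 16);
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);

test<<<grid, block>>>(d_ptr, d_pitch, height, width);  // (H, W) order as in the kernel above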