Hello everyone!
I've read many topics on this forum about cudaMallocPitch, but none of them dealt with a program as simple as the one I'm trying to write.
All I want to do is fill a 2D array on the device, copy it to an array on the host, and print the host array.
I used the Programming Guide to work out how to write the device code, and mostly ideas from the forum for the memory part.
Here is my code:
__global__ void test(float A[256][24])
{
    // One thread per element: x is used as the row index, y as the column index
    int i = threadIdx.x;
    int j = threadIdx.y;
    A[i][j] = i + j;
}

float** compute2D()
{
    int i = 0, j = 0;
    // Host array: 256 separately allocated rows of 24 floats
    float** h_result;
    h_result = (float**) malloc(256 * sizeof(float*));
    for (i = 0; i < 256; i++) h_result[i] = (float*) malloc(24 * sizeof(float));
    float d_result[256][24];
    size_t d_pitch = 0, h_pitch = 0;
    // Allocation of the memory on the device
    cudaMallocPitch((void**)&d_result, &d_pitch, 256 * sizeof(float), 24);
    // We call the kernel
    dim3 dimBlock(256, 24);
    test<<<1, dimBlock>>>(d_result);
    // We copy the array from the device to the host
    cudaMemcpy2D(h_result, h_pitch, d_result, d_pitch, 256 * sizeof(float), 24, cudaMemcpyDeviceToHost);
    // We display the host array
    for (i = 0; i < 256; i++)
        for (j = 0; j < 24; j++)
            printf("h_result[%d][%d] = %f\n", i, j, h_result[i][j]);
    return h_result;
}
The code compiles, but when I call my function, I get:
The Programming Guide says that, to loop over the array elements in device code, I should do the following:
__global__ void myKernel(float* devPtr, int pitch)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
But if I do that, won't every thread on the GPU run the whole loop? What I want is for each thread to handle a single element.
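For comparison, here is a minimal sketch of a kernel in which each thread writes exactly one element of a pitched allocation instead of looping over all of them. The names `fillPitched`, `fillOnDevice`, `WIDTH` and `HEIGHT` are hypothetical, and the 8 x 32 block shape is only one possible choice: a single 256 x 24 block would be 6144 threads, which is more than one block can hold, so the sketch spreads the work over a grid of smaller blocks. The host buffer here is one contiguous allocation rather than an array of row pointers, which is an assumption made so that cudaMemcpy2D can be given a meaningful host pitch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Dimensions taken from the post: 256 rows of 24 floats.
const int HEIGHT = 256;
const int WIDTH  = 24;

// Each thread derives its own (row, col) and writes exactly one element.
__global__ void fillPitched(float* devPtr, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        // pitch is a byte count, so rows are stepped through a char pointer.
        float* rowPtr = (float*)((char*)devPtr + row * pitch);
        rowPtr[col] = (float)(row + col);
    }
}

// Hypothetical helper: fills a contiguous HEIGHT x WIDTH host buffer.
// Returns 0 on success, 1 on any CUDA error.
int fillOnDevice(float* h_out)
{
    float* d_A = NULL;
    size_t d_pitch = 0;
    if (cudaMallocPitch((void**)&d_A, &d_pitch, WIDTH * sizeof(float), HEIGHT) != cudaSuccess)
        return 1;

    dim3 block(8, 32);  // 256 threads per block, within the per-block limit
    dim3 grid((WIDTH + block.x - 1) / block.x, (HEIGHT + block.y - 1) / block.y);
    fillPitched<<<grid, block>>>(d_A, d_pitch, WIDTH, HEIGHT);
    cudaError_t err = cudaGetLastError();

    if (err == cudaSuccess) {
        // The host buffer is contiguous, so its pitch is one row in bytes.
        err = cudaMemcpy2D(h_out, WIDTH * sizeof(float), d_A, d_pitch,
                           WIDTH * sizeof(float), HEIGHT, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_A);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    float* h_A = (float*)malloc(HEIGHT * WIDTH * sizeof(float));
    if (h_A == NULL || fillOnDevice(h_A) != 0) {
        fprintf(stderr, "CUDA sketch failed\n");
        free(h_A);
        return 1;
    }
    printf("h_A[2][2] = %f\n", h_A[2 * WIDTH + 2]);
    free(h_A);
    return 0;
}
```

The bounds check in the kernel matters because the grid is rounded up, so some threads fall outside the 256 x 24 array and must do nothing.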
The pitch may be another source of the problem. I wrote the memory part this way because that's what I saw in most of the topics, but I don't think I really understand what the pitch is for.
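As a point of reference, the pitch returned by cudaMallocPitch is the distance in bytes between the start of one row and the start of the next. The runtime may pad each row for alignment, so the pitch can be larger than the row size that was requested. The sketch below prints both values and the byte offset of one element; `byteOffset` is a hypothetical helper, and the exact pitch depends on the device, so no output is shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Byte offset of element (row, col) in a pitched array of floats.
size_t byteOffset(size_t pitch, int row, int col)
{
    return (size_t)row * pitch + (size_t)col * sizeof(float);
}

int main()
{
    const int width = 24, height = 256;  // sizes from the post
    float* d_A = NULL;
    size_t pitch = 0;
    cudaError_t err = cudaMallocPitch((void**)&d_A, &pitch, width * sizeof(float), height);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocPitch: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("requested row size: %zu bytes\n", width * sizeof(float));
    printf("pitch returned:     %zu bytes\n", pitch);
    printf("offset of [2][3]:   %zu bytes\n", byteOffset(pitch, 2, 3));
    cudaFree(d_A);
    return 0;
}
```

The same stride has to be passed to the kernel and to cudaMemcpy2D as the device-side pitch; the host-side pitch is whatever row stride the host buffer actually uses.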
EDIT:
Actually, if I do:
d_result[2][2]=3;
test<<<1, dimBlock>>>(d_result);
printf("%f\n",d_result[2][2]);
cudaMemcpy2D(h_result,h_pitch,d_result,d_pitch,256*sizeof(float),24,cudaMemcpyDeviceToHost);
printf (" %f\n",h_result[2][2]);
I get 3 for d_result[2][2], which means the value is not changed by the kernel call, and 0 for h_result[2][2], which means the cudaMemcpy2D does not work either!
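None of the CUDA calls in the code above check their return values, so a failed launch or copy goes unnoticed. Note also that d_result is declared as an ordinary host array, so the assignment and the printf in the EDIT operate on host memory. One way to narrow this down is to query the error state after each call. The sketch below is an assumption-laden illustration rather than a diagnosis: `emptyKernel` and `launchWithBlock` are hypothetical names, and it only compares a 256 x 24 block (6144 threads, above the per-block limit) with a 16 x 16 block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Hypothetical helper: launches one block of the given shape and
// returns the first error reported by the launch or by its execution.
cudaError_t launchWithBlock(dim3 block)
{
    emptyKernel<<<1, block>>>();
    cudaError_t err = cudaGetLastError();   // launch-time errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // errors raised while running
}

int main()
{
    cudaError_t bad  = launchWithBlock(dim3(256, 24));  // 6144 threads
    cudaError_t good = launchWithBlock(dim3(16, 16));   // 256 threads
    printf("256 x 24 block: %s\n", cudaGetErrorString(bad));
    printf("16 x 16 block:  %s\n", cudaGetErrorString(good));
    return 0;
}
```

The same pattern (store the cudaError_t, print cudaGetErrorString on failure) can be applied to cudaMallocPitch and cudaMemcpy2D.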
Thanks for your time!