problem with cudaMallocPitch and cudaMemcpy2D

cirus · July 29, 2009, 6:48am

I wrote a simple script that places a 2D array of numbers in device and then retrieves them.

I pass an input grid of dimensions H * W.

For H*W elements i.e when H = 3 , W = 4 , the input is

a_h[0][0] = 0;

a_h[0][1] = 1;

.

.

.

a_h[2][3] = 11;

A strange behavior is observed.

If H = 4 and W = 3 , there is a crash when I am about to display the retrieved data. The crash occurs at b_h[3][0]. Incomplete data is displayed correctly till b_h[2][2]
Where as if H = 3 and W = 4 , the program executes successfully and correct result is displayed.
When I allocate device memory, for H by W matrix, I am getting pitch value as 64.
For square matrices viz: H = W , the program works fine.
for cases like H=4, W =3 ad H = 6 , W = 7 the program prints incomplete results and crashes. I tried to change parameters of cudaMemcpy2D(), where

//Error = cudaMemcpy2D(b_h, W * sizeof(float) , r_d ,  pitch_r , W * sizeof(float) , H , cudaMemcpyDeviceToHost );

  Error = cudaMemcpy2D(b_h, pitch_r , r_d ,  pitch_r , pitch_r , H , cudaMemcpyDeviceToHost );

if I un-comment the first one and comment the second , I get incomplete but correct result, resulting in crash at b_h[3][0].

If I use second statement instead of first , my program does not crash but gives correct result till b_h[2][2] and then all zeros till b_h[3][2].

Q1. Can you help me where is the flaw?

For simplicity I am using 1D grid of blocks. Earlier I used 2D grid of blocks but was not getting correct results.

Kernel Code

__global__ void Sum(float *d_a, float *d_c, float* result, float* sum, int blockSize, size_t pitch, int N, int H, int W)

{

int idx = threadIdx.x + blockIdx.x * blockDim.x;

  int idy = threadIdx.y + blockIdx.y * blockDim.y;

//int byt = pitch / sizeof(float);

if(idx < N)

  result[idy * pitch + idx] = d_a[idy * pitch + idx];

}//end of Sum()

Main function

int main(void)

{

  float** a_h = NULL, **b_h = NULL;

  float *a_d, *r_d, *b_d , *c_d, *sum_d;

int *h_summation = NULL;

//Error 

  cudaError_t Error; 

//counters

  int i = 0 , j = 0;  

//Let us have a H by W matrix 

  int H = 3 , W = 2;

  int blockSize = 4;

  int totalElements = H * W;

  int numBlocks = (! (totalElements % blockSize) )? (totalElements / blockSize): 1 + (totalElements / blockSize);

  size_t pitch_a, pitch_c , pitch_r , pitch_sum;

dim3 dimGrid(numBlocks, 1);

  dim3 dimBlock(blockSize,1);

// int shared_mem = W * sizeof(float);

//allocate memory

  a_h = (float**)malloc(H * sizeof(float*) );

  b_h = (float**)malloc(H * sizeof(float*) );

  for( i = 0; i < H; i++)

  {

	  a_h[i] = (float*)malloc(sizeof(float) * W);

	  memset(a_h[i] , 0 , W * sizeof(float) );

	  b_h[i] = (float*)malloc(sizeof(float) * W);

	  memset(b_h[i] , 0 , W * sizeof(float) );

  }

h_summation = (int*)malloc(sizeof(float) * W);

  memset(h_summation , 0 , sizeof(float) * W);

Error = cudaMallocPitch( (void**)& a_d , &pitch_a, W * sizeof(float) , H);

  Error = cudaMallocPitch( (void**)& r_d , &pitch_r, W * sizeof(float) , H);  

  Error = cudaMallocPitch( (void**)& c_d , &pitch_c, W * sizeof(float) , H);

Error = cudaMalloc( (void**)& sum_d, W * sizeof(int) );

//generate data

  for( i = 0; i < H; i++)

  {

	for( j = 0; j < W; j++)

		a_h[i][j] = a_h[i][j] = (float)(i*W + j);

  }

//copy source data to destination at device 

  Error = cudaMemcpy2D(a_d , pitch_a, a_h, W * sizeof(float) , W * sizeof(float) , H , cudaMemcpyHostToDevice );

	//call kernel to retrieve data 

  Sum<<<dimGrid, dimBlock>>>(a_d, c_d, r_d, sum_d, blockSize, pitch_a, totalElements, H , W);

//retrieve data from device

  //Error = cudaMemcpy2D(b_h, W * sizeof(float) , r_d ,  pitch_r , W * sizeof(float) , H , cudaMemcpyDeviceToHost );

  Error = cudaMemcpy2D(b_h, pitch_r , r_d ,  pitch_r , pitch_r , H , cudaMemcpyDeviceToHost );

//display data -- Original data

  for( i = 0; i < H; i++)

  {

	for( j = 0; j < W; j++)

		printf("a_h[%d][%d] = %f\n", i , j , a_h[i][j]);

  }

  printf("\n\n");

//retrieved data 

  for( i = 0; i < H; i++)

  {

	for( j = 0; j < W; j++)

		printf("b_h[%d][%d] = %f\n", i , j , b_h[i][j]);

  }

printf("\n\n");

//print sum per block

  for( j = 0; j < W; j++)

		printf("h_summation[%d] = %d\n", j , h_summation[j]);

//free memory

  cudaFree(a_d);

  cudaFree(r_d);

  cudaFree(sum_d);

for( i = 0; i < H; i++)

  {

	 free(a_h[i]);

	 a_h[i] = NULL;

		

	 free(b_h[i]);

	 b_h[i] = NULL;

  }

}//end main()

cirus · July 29, 2009, 4:47pm

Strange.

I replaced following

//for sending in data to device

Error = cudaMemcpy2D(a_d , pitch_a, a_h, W * sizeof(float) , W * sizeof(float) , H , cudaMemcpyHostToDevice );

//for retrieval from device

Error = cudaMemcpy2D(b_h, W * sizeof(float) , r_d ,  pitch_r , W * sizeof(float) , H , cudaMemcpyDeviceToHost );

where pitch_a = 64 = pitch_r

by following

//for sending in data to device

Error = cudaMemcpy2D(a_d , W * sizeof(float), a_h, W * sizeof(float) , W * sizeof(float) , H , cudaMemcpyHostToDevice ); 

and 

//for retrieval from device

Error = cudaMemcpy2D(b_h, W * sizeof(float) , r_d ,  W * sizeof(float) , W * sizeof(float) , H , cudaMemcpyDeviceToHost );

Now the code seems to be working correctly as far as results are concerned.

But in my kernel function I am using pitch value to access elements.

Q1. So why was it not working when I was using pitch value in cudaMemcoy2D ???

cirus · July 30, 2009, 3:53am

Update: With reference to above post, the program gives bizarre results when matrix size is increased say 10 * 9 etc .

But I found a workout where I prepare data as 1D array , then use cudamaalocPitch() to place the data in 2D format, do processing and then retrieve data back as 1D array.
This is working for all sizes.

Q1. Why does the program give bizarre results when data on host is in 2D format , copied to device using cudaMallocPitch and then retrieved??

foontala · August 12, 2010, 5:03am

Did you ever figure this out? I think I’m running into exactly the same problem!

Or, did you just keep on preparing the data as 1D array and then cudamallocpitch()? How exactly did you do that if you don’t mind me bringing up old news. This problem has been driving me crazy and I think you have the same issue! Thanks for any help!

foontala · August 12, 2010, 5:03am

Did you ever figure this out? I think I’m running into exactly the same problem!

Or, did you just keep on preparing the data as 1D array and then cudamallocpitch()? How exactly did you do that if you don’t mind me bringing up old news. This problem has been driving me crazy and I think you have the same issue! Thanks for any help!

Topic		Replies	Views
test on 'cudaMallocPitch' and 'cudaMemcpy2D' CUDA Programming and Performance	1	612	November 16, 2010
problem with cudaMallocPitch and cudaMemcpy2D CUDA Programming and Performance	5	6418	April 22, 2009
need help for cudaMemcpy2D() CUDA Programming and Performance	5	4632	December 8, 2009
Copying 2D array from host to device CUDA Programming and Performance	7	7356	July 27, 2010
2D array & Memory space Mostly about cudaMallocPitch & cudaMemcpy2D CUDA Programming and Performance	1	1518	October 15, 2009
cudaMemcpy2D help CUDA Programming and Performance	4	10660	July 28, 2009
CUDA 2D Array Problem Need help to manipulate 2D arrays in CUDA CUDA Programming and Performance	4	26519	March 17, 2011
Using cudaMemcpy2D very strange CUDA Programming and Performance	2	1405	March 10, 2009
trouble with cudaMemcpy2D I cant get a matrix to copy into 2D pitched memory CUDA Programming and Performance	1	956	July 13, 2009
Can't get copyDeviceToHost to work with cudaMemcpy2D CUDA Programming and Performance	0	3648	November 13, 2009

problem with cudaMallocPitch and cudaMemcpy2D

Related topics