Pointers array


How can I create an array (int** mas) on the device and copy data into it?

I tried to allocate memory:

size_t pitch=0;

int** devPtr = NULL;








But afterwards pitch contains the value 10. Why?

Please give me a simple example. :'(

I second the request. Please give us a simple example.

Pitch is set by the cudaMallocPitch call; it isn’t something you set yourself. CUDA wants each row of the allocation aligned on certain boundaries, in this case probably 4 bytes, although the driver picks the alignment and it is often larger. So cudaMallocPitch pads each row and returns the padded row width in bytes (the pitch), which can be larger than the width you asked for. I also believe you want a plain int* here rather than an int**.

Here is an example:

size_t pitch;

int* devptr = NULL;

cudaMallocPitch((void**)&devptr, &pitch, 10, 2);

You can then use cudaMemcpy2D or cudaMemcpy to copy data into devptr. Use 11 instead of 10 and see what you get back for pitch.
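To make the padding idea concrete, here is a minimal host-only C sketch of how a row width gets rounded up to an aligned pitch. The helper names padded_pitch and pitched_size and the alignment values are assumptions for illustration only; cudaMallocPitch picks the real alignment itself, so in actual code always use the pitch it returns rather than computing one.

```c
#include <assert.h>
#include <stddef.h>

/* Round a requested row width (in bytes) up to the next multiple of
   `alignment`. `alignment` is a hypothetical value for illustration;
   the real one is chosen by the CUDA driver and must be read back
   from cudaMallocPitch, never computed. */
size_t padded_pitch(size_t width_bytes, size_t alignment)
{
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

/* Total bytes occupied by `height` rows of `pitch` bytes each. */
size_t pitched_size(size_t pitch, size_t height)
{
    return pitch * height;
}
```

With a hypothetical alignment of 4, requested widths of 10 and 11 bytes would both be padded to 12. With an alignment of 64, a 16-byte row would be padded to 64.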

Hope that helps.

How would it be a 2D array if you use int* instead of int**?

And how would you copy the array onto the GPU?

Do you just use

(continuing from the code above)

cudaMemcpy2D(devptr, 0 ,hostptr, 0 , 10 ,2 , cudaMemcpyHostToDevice);

and when you call it in the kernel

do you just use


if kernel is



Well, all the malloc calls just allocate linear memory. There is probably a reasonable way to build a real C-style 2D array on top of them, but it requires some extra setup on your part. In practice you can treat a 1D array like a 2D array by using the appropriate addressing. Here is an example:

int A[20];
int B[10][2];

B[2][1] = A[2*2 + 1];

In this case A[2*2 + 1] and B[2][1] sit at the same offset from the start of their arrays, because both layouts are row-major with a row width of 2. If you want real [i][j] syntax, allocate an array of pointers first, then allocate one large array and store its address in the first pointer, then go through all the other pointers and set each one to the start of its row inside that large array. But generally it’s better to handle the 2D array as a 1D array and do the address calculation yourself. I am also not sure CUDA lets you set up such a pointer table on the device side easily, so a true dynamically allocated 2D array in device memory may not be practical.
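The pointer-table setup described above can be sketched in plain host C as follows. This is only an illustration under my own assumptions: alloc_2d and free_2d are hypothetical helper names, and the code runs on the host, not on the device.

```c
#include <assert.h>
#include <stdlib.h>

/* Build a rows x cols int matrix as one contiguous block plus a table
   of row pointers into it, so m[i][j] and m[0][i * cols + j] name the
   same element. Returns NULL on allocation failure. */
int **alloc_2d(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    int **m = malloc(rows * sizeof *m);
    if (m == NULL)
        return NULL;
    m[0] = calloc(rows * cols, sizeof **m);   /* the one large array */
    if (m[0] == NULL) {
        free(m);
        return NULL;
    }
    for (size_t i = 1; i < rows; i++)
        m[i] = m[0] + i * cols;               /* row i starts i*cols ints in */
    return m;
}

/* Release both the data block and the pointer table. */
void free_2d(int **m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

For a 10 by 2 matrix, m[2][1] and m[0][2*2 + 1] refer to the same int, so the contiguous block m[0] can still be handed to any routine that expects a flat array.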

As for how to copy data into the array, you will want to use cudaMemcpy or cudaMemcpy2D. Both take void* arguments, so a plain int* works with either of them.
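To show what a pitched 2D copy does with its two pitch arguments, here is a host-only C stand-in. The function name copy_2d is hypothetical; its argument order mirrors cudaMemcpy2D (dst, dpitch, src, spitch, width, height), but this is an illustrative sketch built on memcpy, not the CUDA implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only stand-in for a pitched 2D copy. The argument order mirrors
   cudaMemcpy2D (dst, dpitch, src, spitch, width, height), but this is
   an illustrative sketch, not the CUDA implementation. All sizes except
   `height` are in bytes. */
void copy_2d(void *dst, size_t dpitch,
             const void *src, size_t spitch,
             size_t width_bytes, size_t height)
{
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    for (size_t row = 0; row < height; row++) {
        /* Each row starts at its own buffer's pitch, not the other's. */
        memcpy((char *)dst + row * dpitch,
               (const char *)src + row * spitch,
               width_bytes);
    }
}
```

Each pitch describes the row stride of its own buffer. For a tightly packed host array the source pitch is simply width * sizeof(element), while the destination pitch is whatever cudaMallocPitch returned.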

Hope that clears things up. I am no expert here so maybe there is a way to do what you want but I don’t know how you would do it.


I am having a little trouble accessing the elements of my 2D array on the GPU.


I am working from the following code from the official guide, with my own additions:

//Host Code
float *d_src , *d2_src;
size_t pitch;
dim3 dimGrid(2,2);  //2D array of blocks.
dim3 dimBlock(4,1);  //Every block has 4 threads, so I have a 2 by 2 grid in which each element is a vector of 4 floats.
N  = 16;
width = blockSize = 4;
height = N/blockSize;   //= 4 blocks
float **h2_src = NULL, **h_src = NULL;

//Host pointer mem alloc
h_src = (float**)malloc(sizeof(float*) * height);
h2_src = (float**)malloc(sizeof(float*) * height);
for(int i = 0; i < height; i++)
{
   h_src[i] = (float*)malloc(sizeof(float) * width);
   memset( h_src[i] , 0 , width * sizeof(float) );
   h2_src[i] = (float*)malloc(sizeof(float) * width);
   memset( h2_src[i] , 0 , width * sizeof(float) );
}

//Generate data. input is a 2D array of form h_src[4][4];
for(int i = 0; i < height; i++)
{
   for(int j = 0; j < width; j++)
   {
      h_src[i*width + j] = i * width + j;
   }
}
//input[][] = { 0 , 1, 2, .... 15 }

//Memory Alloc at device
cudaMallocPitch( (void**) &d_src , &pitch , width * sizeof(float) , height);   //At debugging , the value of pitch is 64.
cudaMallocPitch( (void**) &d2_src , &pitch , width * sizeof(float) , height);   //At debugging , the value of pitch is 64.
cudaMemcpy2D( d_src, pitch , h_src , pitch , width * sizeof(float) , height ,  cudaMemcpyHostToDevice );

//GPU call
myKernel<<<dimGrid , dimBlock>>>(input , output, pitch, numBlocks, N , blockSize);
cudaMemcpy2D( h2_src, pitch , d2_src , pitch , width * sizeof(float) , height ,  cudaMemcpyHostToDevice );

//display all elements copied back from d2_src to h2_src

My kernel function

__global__ void myKernel(float* d_src ,  float* d2_src , size_t pitch , int N , int blockSize)
{
   float *row;
   for(int j = 0; j < 2; j++)   // Here 2 is my grid height that is 2 by 2.
   {
      row = (float*)( (char*)d_src + j * pitch );
      for(int i = 0; i < 2; i++)
      {
         d2_src[j*pitch + i] = row[i];
      }
   }
}

The result I am getting in h2_src is 0,1,2,3, 0…0 [trailing zeros].

I expected to get the original sequence viz <0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15>

Since the pitch is 64, my guess is that each row holds 64/4 = 16 elements.

Q1. Does this mean that CUDA is treating my 2 by 2 grid as a 1 by 4 grid, i.e. are elements 0…15 of d_src being treated as a linear array?

Q2. Where else could the problem be? Is it in copying my host-allocated source array to the device-allocated source array?

I am still learning CUDA, so I need a little help. Thanks.
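For reference, the pitch arithmetic behind that guess can be checked with a small host-only C sketch. The helper names floats_per_row and pitched_elem are hypothetical, the pitch of 64 is the value reported above, and the code assumes a 4-byte float.

```c
#include <assert.h>
#include <stddef.h>

/* Number of float slots per row of a pitched allocation.
   `pitch` is in bytes, as cudaMallocPitch reports it. */
size_t floats_per_row(size_t pitch)
{
    assert(pitch % sizeof(float) == 0);
    return pitch / sizeof(float);
}

/* Address of element (row, col): step rows in bytes, columns in floats. */
float *pitched_elem(float *base, size_t pitch, size_t row, size_t col)
{
    assert(col < floats_per_row(pitch));
    return (float *)((char *)base + row * pitch) + col;
}
```

With pitch = 64, row 1 starts at float index 16. An index computed as 1 * pitch would instead be 64, which is one past the end of a 4-row allocation holding 64 floats, because the pitch is a byte count and not an element count.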