2d array testing in very simple code using CUDA

in the very, very simple code that i am testing for 2d arrays using CUDA, i am getting strange numbers…
i know 1d array structure is better but please let me know if any part is not correct…
many thanks in advance.

====================================

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 3

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 1;

}

int main(int argc, char argv[])
{
int i, j;
int
* crray2d = (int**)malloc(X * sizeof(int*));
int** c_d = (int**)malloc(X * sizeof(int*));

  for(i=0; i<X; i++)  
  {
     crray2d[i] = (int*)malloc(Y * sizeof(int));
     cudaMemcpy(c_d, crray2d, Y * sizeof(int*), cudaMemcpyHostToDevice); 
  } 

  dim3 dimBlock(3,3);
  MatAdd<<<1,dimBlock>>>(c_d); 

  for(i=0;i < X;i++){
     cudaMemcpy(crray2d, c_d, Y * sizeof(int*), cudaMemcpyDeviceToHost) ; 
  }
  
  for(i=0; i<X; i++)
    for(j=0; j<Y; j++)
      printf("%d ", crray2d[i][j]);

  return 0;

}

in the very, very simple code that i am testing for 2d arrays using CUDA, i am getting strange numbers…
i know 1d array structure is better but please let me know if any part is not correct…
many thanks in advance.

====================================

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 3

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 1;

}

int main(int argc, char argv[])
{
int i, j;
int
* crray2d = (int**)malloc(X * sizeof(int*));
int** c_d = (int**)malloc(X * sizeof(int*));

  for(i=0; i<X; i++)  
  {
     crray2d[i] = (int*)malloc(Y * sizeof(int));
     cudaMemcpy(c_d, crray2d, Y * sizeof(int*), cudaMemcpyHostToDevice); 
  } 

  dim3 dimBlock(3,3);
  MatAdd<<<1,dimBlock>>>(c_d); 

  for(i=0;i < X;i++){
     cudaMemcpy(crray2d, c_d, Y * sizeof(int*), cudaMemcpyDeviceToHost) ; 
  }
  
  for(i=0; i<X; i++)
    for(j=0; j<Y; j++)
      printf("%d ", crray2d[i][j]);

  return 0;

}

I see the following issues with the code:

  • global function is supposed to operated on a device-based, rather than host-based memory array. The latter should be allocated using cudaMalloc(…), not malloc(…).

  • Within the loop prior to invocation of the kernel you do allocate crray2d[i] on the host (which looks reasonable), but don’t bother allocating c_d[i] (on the device, of course, using cudaMalloc(…)).

  • The cudaMemcpy(…) within that same loop prior to kernel allocation should probably copy a chunk of memory from crray2d[i] into c_d[i].

  • The same as the item above applies to the loop that follows kernel invocation.

  • dimBlock should be defined in terms of X and Y, not hard-coded constants.

  • Must check for errors after kernel invocation and other cuda…(…) calls.

  • I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…).

If I were you, I would try to get this code to work on the CPU with no CUDA involved first. Then I would replace the appropriate malloc(…)s with cudaMalloc(…)s and appropriate memcpy(…)s with cudaMemcpy(…)s. There seem too many errors in your code to get them fixed within fairly complex GPU environment.

Good luck.

I see the following issues with the code:

  • global function is supposed to operated on a device-based, rather than host-based memory array. The latter should be allocated using cudaMalloc(…), not malloc(…).

  • Within the loop prior to invocation of the kernel you do allocate crray2d[i] on the host (which looks reasonable), but don’t bother allocating c_d[i] (on the device, of course, using cudaMalloc(…)).

  • The cudaMemcpy(…) within that same loop prior to kernel allocation should probably copy a chunk of memory from crray2d[i] into c_d[i].

  • The same as the item above applies to the loop that follows kernel invocation.

  • dimBlock should be defined in terms of X and Y, not hard-coded constants.

  • Must check for errors after kernel invocation and other cuda…(…) calls.

  • I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…).

If I were you, I would try to get this code to work on the CPU with no CUDA involved first. Then I would replace the appropriate malloc(…)s with cudaMalloc(…)s and appropriate memcpy(…)s with cudaMemcpy(…)s. There seem too many errors in your code to get them fixed within fairly complex GPU environment.

Good luck.

well, i revised the code and in the following, when imethod == 1 no problem but it causes a problem when imethod != 1.

i am testing whether i can get c_d from CUDA.

the way i’m preparing arrays is correct?

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 1;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else if (method == 0)
  {
       cudaMalloc((void **) &c_d, X*sizeof(int*));
       for(i=0; i<X; i++) 
         cudaMalloc((void **) &c_d[i], Y*sizeof(int));
  
       dim3 dimBlock(X,Y);
       MatAdd<<<1,dimBlock>>>(c_d);

       for(i=0;i < X;i++){
         cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       }
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
        
  return 0;

}

well, i revised the code and in the following, when imethod == 1 no problem but it causes a problem when imethod != 1.

i am testing whether i can get c_d from CUDA.

the way i’m preparing arrays is correct?

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 1;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else if (method == 0)
  {
       cudaMalloc((void **) &c_d, X*sizeof(int*));
       for(i=0; i<X; i++) 
         cudaMalloc((void **) &c_d[i], Y*sizeof(int));
  
       dim3 dimBlock(X,Y);
       MatAdd<<<1,dimBlock>>>(c_d);

       for(i=0;i < X;i++){
         cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       }
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
        
  return 0;

}

I cannot see the following items from my original implemented in the new code:

  • The same as the item above applies to the loop that follows kernel invocation (you copy pointers to data in the device memory, but do not copy the data itself). Note, that this copy would leak host memory, unless you free whatever is pointed at by crray2d[i].

  • Must check for errors after kernel invocation and other cuda…(…) calls.

  • I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…) (device memory is not being freed).

In addition to the above, I don’t see the need for if (method == 0). Simple else should suffice in that place.

I cannot see the following items from my original implemented in the new code:

  • The same as the item above applies to the loop that follows kernel invocation (you copy pointers to data in the device memory, but do not copy the data itself). Note, that this copy would leak host memory, unless you free whatever is pointed at by crray2d[i].

  • Must check for errors after kernel invocation and other cuda…(…) calls.

  • I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…) (device memory is not being freed).

In addition to the above, I don’t see the need for if (method == 0). Simple else should suffice in that place.

well… thanks for your reply again…

here i’d like to check to see if i am preparing array of pointers in device correctly.
so i think i don’t need to copy any data from host to device. (am i wrong on this?)

please see a note in the middle of the code.

i wonder why the following part is not working…

for(i=0; i<X; i++)
cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );

can i use cudaMallocArray for the memory allocation purpose for my case?

many thanks again in well advance.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <cutil_inline.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 0;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else
  {    
       cutilSafeCall( cudaMalloc((void **) &c_d, X*sizeof(int*)) );
       //for(i=0; i<X; i++) 
       //    cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
  
       /* i noticed that i can pass the first cudaMalloc */
       /* but i can not pass the for loop cudaMalloc     */
       /*  therefore below is not matter at this point   */

       //dim3 dimBlock(X,Y);
       //MatAdd<<<1,dimBlock>>>(c_d);

       //for(i=0;i < X;i++){
       //    cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       //}
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
  cudaFree(c_d);

}

well… thanks for your reply again…

here i’d like to check to see if i am preparing array of pointers in device correctly.
so i think i don’t need to copy any data from host to device. (am i wrong on this?)

please see a note in the middle of the code.

i wonder why the following part is not working…

for(i=0; i<X; i++)
cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );

can i use cudaMallocArray for the memory allocation purpose for my case?

many thanks again in well advance.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <cutil_inline.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 0;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else
  {    
       cutilSafeCall( cudaMalloc((void **) &c_d, X*sizeof(int*)) );
       //for(i=0; i<X; i++) 
       //    cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
  
       /* i noticed that i can pass the first cudaMalloc */
       /* but i can not pass the for loop cudaMalloc     */
       /*  therefore below is not matter at this point   */

       //dim3 dimBlock(X,Y);
       //MatAdd<<<1,dimBlock>>>(c_d);

       //for(i=0;i < X;i++){
       //    cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       //}
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
  cudaFree(c_d);

}

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

After and if the kernel has successfully returned (cudaThreadSynchronize(…), cudaGetLastError(…)), allocate each crray2d[i] ON THE HOST (regular malloc()) and copy from whatever crray2d[i] has been pointing at (PRIOR TO THIS re-initialization) on the device. You’ll probably need one temp integer pointer for this operation.

Just place a stack of pencils to the left of you on the desk, each pencil illustrating an 1d array on host, then move the pencils to the right (device memory), then back, as you go through you code. Make sure the appropriate malloc() or cudaMalloc(…) in the code precedes the move and cudaFree(…) and free(…) follows it, when needed.

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

After and if the kernel has successfully returned (cudaThreadSynchronize(…), cudaGetLastError(…)), allocate each crray2d[i] ON THE HOST (regular malloc()) and copy from whatever crray2d[i] has been pointing at (PRIOR TO THIS re-initialization) on the device. You’ll probably need one temp integer pointer for this operation.

Just place a stack of pencils to the left of you on the desk, each pencil illustrating an 1d array on host, then move the pencils to the right (device memory), then back, as you go through you code. Make sure the appropriate malloc() or cudaMalloc(…) in the code precedes the move and cudaFree(…) and free(…) follows it, when needed.

[quote name=‘cudesnick’ date=‘09 November 2010 - 07:36 PM’ timestamp=‘1289360184’ post=‘1144254’]

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

======================================

Well. thanks again. So much of insightful comments and i am so glad i find this site and your comments…

please bare with me for little longer i think i am getting there…

Since i may need crray2 later usage and you mention a temp array, i coded the following using a temp array.

since you know whats going on here so i skipped before and after…

int *c_d_host; / temp array */

CUDA_SAFE_CALL( cudaMalloc((void **) &c_d, Xsizeof(int)) );

c_d_host = (int**)malloc(Xsizeof(int)); /* assign temp */

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMalloc((void **) &c_d_host[i], Y*sizeof(int)) );   

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMemcpy(c_d[i], c_d_host[i], Y*sizeof(int*), cudaMemcpyHostToDevice) );

the code could not pass the last copying part.

[quote name=‘cudesnick’ date=‘09 November 2010 - 07:36 PM’ timestamp=‘1289360184’ post=‘1144254’]

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

======================================

Well. thanks again. So much of insightful comments and i am so glad i find this site and your comments…

please bare with me for little longer i think i am getting there…

Since i may need crray2 later usage and you mention a temp array, i coded the following using a temp array.

since you know whats going on here so i skipped before and after…

int *c_d_host; / temp array */

CUDA_SAFE_CALL( cudaMalloc((void **) &c_d, Xsizeof(int)) );

c_d_host = (int**)malloc(Xsizeof(int)); /* assign temp */

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMalloc((void **) &c_d_host[i], Y*sizeof(int)) );   

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMemcpy(c_d[i], c_d_host[i], Y*sizeof(int*), cudaMemcpyHostToDevice) );

the code could not pass the last copying part.

[quote name=‘syoon’ date=‘10 November 2010 - 02:14 PM’ timestamp=‘1289412873’ post=‘1144540’]

I don’t understand this piece of code. In my previous posting I was mentioning the temp variable (you don’t need a whole array, but that should work too) w.r.t copying the result data from the device to the host, after the kernel has completed. The last line in your new piece of code clearly attempts to copy from host to device, which I don’t find relevant to my previous posting.

[quote name=‘syoon’ date=‘10 November 2010 - 02:14 PM’ timestamp=‘1289412873’ post=‘1144540’]

I don’t understand this piece of code. In my previous posting I was mentioning the temp variable (you don’t need a whole array, but that should work too) w.r.t copying the result data from the device to the host, after the kernel has completed. The last line in your new piece of code clearly attempts to copy from host to device, which I don’t find relevant to my previous posting.

In previous advice,

“You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy prior to invoking the kernel.”

Now I figured I may need crray2d for later use, I did cudaMalloc using ‘c_d_host’ and cudaMemcpy ‘c_d_host’ to c_d…

did i misunderstood your comments?

Thanks again.

In previous advice,

“You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy prior to invoking the kernel.”

Now I figured I may need crray2d for later use, I did cudaMalloc using ‘c_d_host’ and cudaMemcpy ‘c_d_host’ to c_d…

did i misunderstood your comments?

Thanks again.

You shouldn’t have allocated anything on the host into elements of crray2d prior to kernel invocation: you should have only allocated device arrays into elements of this host array and hence should have initialized crray2d[i] with the device pointers. Obviously, you should have allocated the array crray2d itself on the host.

After the kernel has successfully returned, you should allocate HOST arrays into crray2d[i] and copy device arrays into these host arrays. You must be careful within this step to not loose pointers to device arrays stored within crray2d[i] prior to the allocation of the host arrays hence the use of a temp pointer variable. Once the contents of a device 1d array has been copied to the corresponding host 1d array, that device 1d array should be immediately freed in order to not leak the device RAM.

You shouldn’t have allocated anything on the host into elements of crray2d prior to kernel invocation: you should have only allocated device arrays into elements of this host array and hence should have initialized crray2d[i] with the device pointers. Obviously, you should have allocated the array crray2d itself on the host.

After the kernel has successfully returned, you should allocate HOST arrays into crray2d[i] and copy device arrays into these host arrays. You must be careful within this step to not loose pointers to device arrays stored within crray2d[i] prior to the allocation of the host arrays hence the use of a temp pointer variable. Once the contents of a device 1d array has been copied to the corresponding host 1d array, that device 1d array should be immediately freed in order to not leak the device RAM.