2d array testing in very simple code using CUDA

syoon · November 9, 2010, 2:06am

in the very, very simple code that i am testing for 2d arrays using CUDA, i am getting strange numbers…
i know 1d array structure is better but please let me know if any part is not correct…
many thanks in advance.

====================================

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 3

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 1;

}

int main(int argc, char argv[])
{
int i, j;
int* crray2d = (int**)malloc(X * sizeof(int*));
int** c_d = (int**)malloc(X * sizeof(int*));

  for(i=0; i<X; i++)  
  {
     crray2d[i] = (int*)malloc(Y * sizeof(int));
     cudaMemcpy(c_d, crray2d, Y * sizeof(int*), cudaMemcpyHostToDevice); 
  } 

  dim3 dimBlock(3,3);
  MatAdd<<<1,dimBlock>>>(c_d); 

  for(i=0;i < X;i++){
     cudaMemcpy(crray2d, c_d, Y * sizeof(int*), cudaMemcpyDeviceToHost) ; 
  }
  
  for(i=0; i<X; i++)
    for(j=0; j<Y; j++)
      printf("%d ", crray2d[i][j]);

  return 0;

}

syoon · November 9, 2010, 2:06am

in the very, very simple code that i am testing for 2d arrays using CUDA, i am getting strange numbers…
i know 1d array structure is better but please let me know if any part is not correct…
many thanks in advance.

====================================

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 3

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 1;

}

int main(int argc, char argv[])
{
int i, j;
int* crray2d = (int**)malloc(X * sizeof(int*));
int** c_d = (int**)malloc(X * sizeof(int*));

  for(i=0; i<X; i++)  
  {
     crray2d[i] = (int*)malloc(Y * sizeof(int));
     cudaMemcpy(c_d, crray2d, Y * sizeof(int*), cudaMemcpyHostToDevice); 
  } 

  dim3 dimBlock(3,3);
  MatAdd<<<1,dimBlock>>>(c_d); 

  for(i=0;i < X;i++){
     cudaMemcpy(crray2d, c_d, Y * sizeof(int*), cudaMemcpyDeviceToHost) ; 
  }
  
  for(i=0; i<X; i++)
    for(j=0; j<Y; j++)
      printf("%d ", crray2d[i][j]);

  return 0;

}

cudesnick · November 9, 2010, 3:06am

in the very, very simple code that i am testing for 2d arrays using CUDA, i am getting strange numbers…

i know 1d array structure is better but please let me know if any part is not correct…

many thanks in advance.

====================================

include <stdlib.h>

include <string.h>

include <stdio.h>

define X 3

define Y 3

global void MatAdd(int **c)

{

int i = threadIdx.x;

int j = threadIdx.y;

c[i][j] = 1;

}

int main(int argc, char *argv)

{
  int i, j;

  int** crray2d = (int**)malloc(X * sizeof(int*));

  int** c_d = (int**)malloc(X * sizeof(int*));  
for(i=0; i<X; i++)
  {

     crray2d[i] = (int*)malloc(Y * sizeof(int));

     cudaMemcpy(c_d, crray2d, Y * sizeof(int*), cudaMemcpyHostToDevice); 

  } 
dim3 dimBlock(3,3);
  MatAdd<<<1,dimBlock>>>(c_d); 
for(i=0;i < X;i++){
     cudaMemcpy(crray2d, c_d, Y * sizeof(int*), cudaMemcpyDeviceToHost) ; 

  }
for(i=0; i<X; i++)
    for(j=0; j<Y; j++)

      printf("%d ", crray2d[i][j]);
return 0;

}

I see the following issues with the code:

global function is supposed to operated on a device-based, rather than host-based memory array. The latter should be allocated using cudaMalloc(…), not malloc(…).
Within the loop prior to invocation of the kernel you do allocate crray2d[i] on the host (which looks reasonable), but don’t bother allocating c_d[i] (on the device, of course, using cudaMalloc(…)).
The cudaMemcpy(…) within that same loop prior to kernel allocation should probably copy a chunk of memory from crray2d[i] into c_d[i].
The same as the item above applies to the loop that follows kernel invocation.
dimBlock should be defined in terms of X and Y, not hard-coded constants.
Must check for errors after kernel invocation and other cuda…(…) calls.
I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…).

If I were you, I would try to get this code to work on the CPU with no CUDA involved first. Then I would replace the appropriate malloc(…)s with cudaMalloc(…)s and appropriate memcpy(…)s with cudaMemcpy(…)s. There seem too many errors in your code to get them fixed within fairly complex GPU environment.

Good luck.

cudesnick · November 9, 2010, 3:06am

in the very, very simple code that i am testing for 2d arrays using CUDA, i am getting strange numbers…

i know 1d array structure is better but please let me know if any part is not correct…

many thanks in advance.

====================================

include <stdlib.h>

include <string.h>

include <stdio.h>

define X 3

define Y 3

global void MatAdd(int **c)

{

int i = threadIdx.x;

int j = threadIdx.y;

c[i][j] = 1;

}

int main(int argc, char *argv)

{
  int i, j;

  int** crray2d = (int**)malloc(X * sizeof(int*));

  int** c_d = (int**)malloc(X * sizeof(int*));  
for(i=0; i<X; i++)
  {

     crray2d[i] = (int*)malloc(Y * sizeof(int));

     cudaMemcpy(c_d, crray2d, Y * sizeof(int*), cudaMemcpyHostToDevice); 

  } 
dim3 dimBlock(3,3);
  MatAdd<<<1,dimBlock>>>(c_d); 
for(i=0;i < X;i++){
     cudaMemcpy(crray2d, c_d, Y * sizeof(int*), cudaMemcpyDeviceToHost) ; 

  }
for(i=0; i<X; i++)
    for(j=0; j<Y; j++)

      printf("%d ", crray2d[i][j]);
return 0;

}

I see the following issues with the code:

global function is supposed to operated on a device-based, rather than host-based memory array. The latter should be allocated using cudaMalloc(…), not malloc(…).
Within the loop prior to invocation of the kernel you do allocate crray2d[i] on the host (which looks reasonable), but don’t bother allocating c_d[i] (on the device, of course, using cudaMalloc(…)).
The cudaMemcpy(…) within that same loop prior to kernel allocation should probably copy a chunk of memory from crray2d[i] into c_d[i].
The same as the item above applies to the loop that follows kernel invocation.
dimBlock should be defined in terms of X and Y, not hard-coded constants.
Must check for errors after kernel invocation and other cuda…(…) calls.
I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…).

If I were you, I would try to get this code to work on the CPU with no CUDA involved first. Then I would replace the appropriate malloc(…)s with cudaMalloc(…)s and appropriate memcpy(…)s with cudaMemcpy(…)s. There seem too many errors in your code to get them fixed within fairly complex GPU environment.

Good luck.

syoon · November 9, 2010, 10:29pm

well, i revised the code and in the following, when imethod == 1 no problem but it causes a problem when imethod != 1.

i am testing whether i can get c_d from CUDA.

the way i’m preparing arrays is correct?

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 1;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else if (method == 0)
  {
       cudaMalloc((void **) &c_d, X*sizeof(int*));
       for(i=0; i<X; i++) 
         cudaMalloc((void **) &c_d[i], Y*sizeof(int));
  
       dim3 dimBlock(X,Y);
       MatAdd<<<1,dimBlock>>>(c_d);

       for(i=0;i < X;i++){
         cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       }
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
        
  return 0;

}

syoon · November 9, 2010, 10:29pm

well, i revised the code and in the following, when imethod == 1 no problem but it causes a problem when imethod != 1.

i am testing whether i can get c_d from CUDA.

the way i’m preparing arrays is correct?

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 1;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else if (method == 0)
  {
       cudaMalloc((void **) &c_d, X*sizeof(int*));
       for(i=0; i<X; i++) 
         cudaMalloc((void **) &c_d[i], Y*sizeof(int));
  
       dim3 dimBlock(X,Y);
       MatAdd<<<1,dimBlock>>>(c_d);

       for(i=0;i < X;i++){
         cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       }
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
        
  return 0;

}

cudesnick · November 9, 2010, 11:53pm

well, i revised the code and in the following, when imethod == 1 no problem but it causes a problem when imethod != 1.

i am testing whether i can get c_d from CUDA.

the way i’m preparing arrays is correct?

include <stdlib.h>

include <string.h>

include <stdio.h>

define X 3

define Y 4

global void MatAdd(int **c)

{

int i = threadIdx.x;

int j = threadIdx.y;

c[i][j] = 0;

}

void MatAdd_C(int** c, int num_rows, int num_cols)

{

for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)

  c[i][j] = 1;
}

void cleanup(int** array, int x)

{

int i;

for(i=0; i<x; i++)
  free(array[i]);
free(array);

}

int main(int argc, char *argv)

{
  int i, j, method = 0;

  int **crray2d;  

  int **c_d;
crray2d = (int**)malloc(Xsizeof(int));
  for(i=0; i<X; i++)

      crray2d[i] = (int*)malloc(Y*sizeof(int));
if (method == 1)
       MatAdd_C(crray2d, X, Y);
else if (method == 0)
  {

       cudaMalloc((void **) &c_d, X*sizeof(int*));

       for(i=0; i<X; i++) 

         cudaMalloc((void **) &c_d[i], Y*sizeof(int));
dim3 dimBlock(X,Y);
       MatAdd<<<1,dimBlock>>>(c_d);
for(i=0;i < X;i++){
         cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 

       }

  }
for(i=0; i<X; i++)
  {

    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 

    printf("\n");

  }
cleanup(crray2d, X);

return 0;

}

I cannot see the following items from my original implemented in the new code:

The same as the item above applies to the loop that follows kernel invocation (you copy pointers to data in the device memory, but do not copy the data itself). Note, that this copy would leak host memory, unless you free whatever is pointed at by crray2d[i].
Must check for errors after kernel invocation and other cuda…(…) calls.
I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…) (device memory is not being freed).

In addition to the above, I don’t see the need for if (method == 0). Simple else should suffice in that place.

cudesnick · November 9, 2010, 11:53pm

well, i revised the code and in the following, when imethod == 1 no problem but it causes a problem when imethod != 1.

i am testing whether i can get c_d from CUDA.

the way i’m preparing arrays is correct?

include <stdlib.h>

include <string.h>

include <stdio.h>

define X 3

define Y 4

global void MatAdd(int **c)

{

int i = threadIdx.x;

int j = threadIdx.y;

c[i][j] = 0;

}

void MatAdd_C(int** c, int num_rows, int num_cols)

{

for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)

  c[i][j] = 1;
}

void cleanup(int** array, int x)

{

int i;

for(i=0; i<x; i++)
  free(array[i]);
free(array);

}

int main(int argc, char *argv)

{
  int i, j, method = 0;

  int **crray2d;  

  int **c_d;
crray2d = (int**)malloc(Xsizeof(int));
  for(i=0; i<X; i++)

      crray2d[i] = (int*)malloc(Y*sizeof(int));
if (method == 1)
       MatAdd_C(crray2d, X, Y);
else if (method == 0)
  {

       cudaMalloc((void **) &c_d, X*sizeof(int*));

       for(i=0; i<X; i++) 

         cudaMalloc((void **) &c_d[i], Y*sizeof(int));
dim3 dimBlock(X,Y);
       MatAdd<<<1,dimBlock>>>(c_d);
for(i=0;i < X;i++){
         cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 

       }

  }
for(i=0; i<X; i++)
  {

    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 

    printf("\n");

  }
cleanup(crray2d, X);

return 0;

}

I cannot see the following items from my original implemented in the new code:

The same as the item above applies to the loop that follows kernel invocation (you copy pointers to data in the device memory, but do not copy the data itself). Note, that this copy would leak host memory, unless you free whatever is pointed at by crray2d[i].
Must check for errors after kernel invocation and other cuda…(…) calls.
I consider freeing memory to be good programming style, even when this freeing precedes return from main. This makes code reusable in subroutines, not just main(…) (device memory is not being freed).

In addition to the above, I don’t see the need for if (method == 0). Simple else should suffice in that place.

syoon · November 10, 2010, 1:31am

well… thanks for your reply again…

here i’d like to check to see if i am preparing array of pointers in device correctly.
so i think i don’t need to copy any data from host to device. (am i wrong on this?)

please see a note in the middle of the code.

i wonder why the following part is not working…

…

for(i=0; i<X; i++)
cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
…

can i use cudaMallocArray for the memory allocation purpose for my case?

many thanks again in well advance.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <cutil_inline.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 0;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else
  {    
       cutilSafeCall( cudaMalloc((void **) &c_d, X*sizeof(int*)) );
       //for(i=0; i<X; i++) 
       //    cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
  
       /* i noticed that i can pass the first cudaMalloc */
       /* but i can not pass the for loop cudaMalloc     */
       /*  therefore below is not matter at this point   */

       //dim3 dimBlock(X,Y);
       //MatAdd<<<1,dimBlock>>>(c_d);

       //for(i=0;i < X;i++){
       //    cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       //}
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
  cudaFree(c_d);

}

syoon · November 10, 2010, 1:31am

well… thanks for your reply again…

here i’d like to check to see if i am preparing array of pointers in device correctly.
so i think i don’t need to copy any data from host to device. (am i wrong on this?)

please see a note in the middle of the code.

i wonder why the following part is not working…

…

for(i=0; i<X; i++)
cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
…

can i use cudaMallocArray for the memory allocation purpose for my case?

many thanks again in well advance.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <cutil_inline.h>

#define X 3
#define Y 4

global void MatAdd(int **c)
{
int i = threadIdx.x;
int j = threadIdx.y;
c[i][j] = 0;
}

void MatAdd_C(int** c, int num_rows, int num_cols)
{
for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)
c[i][j] = 0;
}

void cleanup(int** array, int x)
{
int i;

for(i=0; i<x; i++)
free(array[i]);
free(array);
}

int main(int argc, char *argv)
{
int i, j, method = 0;
int **crray2d;
int **c_d;

  crray2d = (int**)malloc(X*sizeof(int*));
  for(i=0; i<X; i++)
      crray2d[i] = (int*)malloc(Y*sizeof(int));

  if (method == 1)      
       MatAdd_C(crray2d, X, Y);

  else
  {    
       cutilSafeCall( cudaMalloc((void **) &c_d, X*sizeof(int*)) );
       //for(i=0; i<X; i++) 
       //    cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
  
       /* i noticed that i can pass the first cudaMalloc */
       /* but i can not pass the for loop cudaMalloc     */
       /*  therefore below is not matter at this point   */

       //dim3 dimBlock(X,Y);
       //MatAdd<<<1,dimBlock>>>(c_d);

       //for(i=0;i < X;i++){
       //    cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 
       //}
  }

  for(i=0; i<X; i++)
  {
    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 
    printf("\n");
  }
  
  cleanup(crray2d, X);
  cudaFree(c_d);

}

cudesnick · November 10, 2010, 3:36am

well… thanks for your reply again…

here i’d like to check to see if i am preparing array of pointers in device correctly.

so i think i don’t need to copy any data from host to device. (am i wrong on this?)

please see a note in the middle of the code.

i wonder why the following part is not working…

…

for(i=0; i<X; i++)

cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );

…

can i use cudaMallocArray for the memory allocation purpose for my case?

many thanks again in well advance.

include <stdlib.h>

include <string.h>

include <stdio.h>

include <cutil_inline.h>

define X 3

define Y 4

global void MatAdd(int **c)

{

int i = threadIdx.x;

int j = threadIdx.y;

c[i][j] = 0;

}

void MatAdd_C(int** c, int num_rows, int num_cols)

{

for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)

  c[i][j] = 0;
}

void cleanup(int** array, int x)

{

int i;

for(i=0; i<x; i++)
  free(array[i]);
free(array);

}

int main(int argc, char *argv)

{
  int i, j, method = 0;

  int **crray2d;  

  int **c_d;
crray2d = (int**)malloc(Xsizeof(int));
  for(i=0; i<X; i++)

      crray2d[i] = (int*)malloc(Y*sizeof(int));
if (method == 1)
       MatAdd_C(crray2d, X, Y);
else
  {    

       cutilSafeCall( cudaMalloc((void **) &c_d, X*sizeof(int*)) );

       //for(i=0; i<X; i++) 

       //    cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
/* i noticed that i can pass the first cudaMalloc */
       /* but i can not pass the for loop cudaMalloc     */

       /*  therefore below is not matter at this point   */
//dim3 dimBlock(X,Y);
       //MatAdd<<<1,dimBlock>>>(c_d);
//for(i=0;i < X;i++){
       //    cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 

       //}

  }
for(i=0; i<X; i++)
  {

    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 

    printf("\n");

  }
cleanup(crray2d, X);
  cudaFree(c_d);
}

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

After and if the kernel has successfully returned (cudaThreadSynchronize(…), cudaGetLastError(…)), allocate each crray2d[i] ON THE HOST (regular malloc()) and copy from whatever crray2d[i] has been pointing at (PRIOR TO THIS re-initialization) on the device. You’ll probably need one temp integer pointer for this operation.

Just place a stack of pencils to the left of you on the desk, each pencil illustrating an 1d array on host, then move the pencils to the right (device memory), then back, as you go through you code. Make sure the appropriate malloc() or cudaMalloc(…) in the code precedes the move and cudaFree(…) and free(…) follows it, when needed.

cudesnick · November 10, 2010, 3:36am

well… thanks for your reply again…

here i’d like to check to see if i am preparing array of pointers in device correctly.

so i think i don’t need to copy any data from host to device. (am i wrong on this?)

please see a note in the middle of the code.

i wonder why the following part is not working…

…

for(i=0; i<X; i++)

cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );

…

can i use cudaMallocArray for the memory allocation purpose for my case?

many thanks again in well advance.

include <stdlib.h>

include <string.h>

include <stdio.h>

include <cutil_inline.h>

define X 3

define Y 4

global void MatAdd(int **c)

{

int i = threadIdx.x;

int j = threadIdx.y;

c[i][j] = 0;

}

void MatAdd_C(int** c, int num_rows, int num_cols)

{

for (int i=0; i < num_rows; i++)
for(int j=0; j < num_cols; j++)

  c[i][j] = 0;
}

void cleanup(int** array, int x)

{

int i;

for(i=0; i<x; i++)
  free(array[i]);
free(array);

}

int main(int argc, char *argv)

{
  int i, j, method = 0;

  int **crray2d;  

  int **c_d;
crray2d = (int**)malloc(Xsizeof(int));
  for(i=0; i<X; i++)

      crray2d[i] = (int*)malloc(Y*sizeof(int));
if (method == 1)
       MatAdd_C(crray2d, X, Y);
else
  {    

       cutilSafeCall( cudaMalloc((void **) &c_d, X*sizeof(int*)) );

       //for(i=0; i<X; i++) 

       //    cutilSafeCall( cudaMalloc((void **) &c_d[i], Y*sizeof(int)) );
/* i noticed that i can pass the first cudaMalloc */
       /* but i can not pass the for loop cudaMalloc     */

       /*  therefore below is not matter at this point   */
//dim3 dimBlock(X,Y);
       //MatAdd<<<1,dimBlock>>>(c_d);
//for(i=0;i < X;i++){
       //    cudaMemcpy(crray2d, c_d, Y*sizeof(int*), cudaMemcpyDeviceToHost) ; 

       //}

  }
for(i=0; i<X; i++)
  {

    for(j=0; j<Y; j++) printf("%d ", crray2d[i][j]); 

    printf("\n");

  }
cleanup(crray2d, X);
  cudaFree(c_d);
}

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

After and if the kernel has successfully returned (cudaThreadSynchronize(…), cudaGetLastError(…)), allocate each crray2d[i] ON THE HOST (regular malloc()) and copy from whatever crray2d[i] has been pointing at (PRIOR TO THIS re-initialization) on the device. You’ll probably need one temp integer pointer for this operation.

Just place a stack of pencils to the left of you on the desk, each pencil illustrating an 1d array on host, then move the pencils to the right (device memory), then back, as you go through you code. Make sure the appropriate malloc() or cudaMalloc(…) in the code precedes the move and cudaFree(…) and free(…) follows it, when needed.

syoon · November 10, 2010, 6:14pm

[quote name=‘cudesnick’ date=‘09 November 2010 - 07:36 PM’ timestamp=‘1289360184’ post=‘1144254’]

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

======================================

Well. thanks again. So much of insightful comments and i am so glad i find this site and your comments…

please bare with me for little longer i think i am getting there…

Since i may need crray2 later usage and you mention a temp array, i coded the following using a temp array.

since you know whats going on here so i skipped before and after…

int *c_d_host; / temp array */

…

CUDA_SAFE_CALL( cudaMalloc((void **) &c_d, Xsizeof(int)) );

c_d_host = (int**)malloc(Xsizeof(int)); /* assign temp */

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMalloc((void **) &c_d_host[i], Y*sizeof(int)) );

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMemcpy(c_d[i], c_d_host[i], Y*sizeof(int*), cudaMemcpyHostToDevice) );

the code could not pass the last copying part.

syoon · November 10, 2010, 6:14pm

[quote name=‘cudesnick’ date=‘09 November 2010 - 07:36 PM’ timestamp=‘1289360184’ post=‘1144254’]

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

======================================

Well. thanks again. So much of insightful comments and i am so glad i find this site and your comments…

please bare with me for little longer i think i am getting there…

Since i may need crray2 later usage and you mention a temp array, i coded the following using a temp array.

since you know whats going on here so i skipped before and after…

int *c_d_host; / temp array */

…

CUDA_SAFE_CALL( cudaMalloc((void **) &c_d, Xsizeof(int)) );

c_d_host = (int**)malloc(Xsizeof(int)); /* assign temp */

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMalloc((void **) &c_d_host[i], Y*sizeof(int)) );

for(i=0; i<X; i++)

   CUDA_SAFE_CALL( cudaMemcpy(c_d[i], c_d_host[i], Y*sizeof(int*), cudaMemcpyHostToDevice) );

the code could not pass the last copying part.

cudesnick · November 10, 2010, 7:19pm

[quote name=‘syoon’ date=‘10 November 2010 - 02:14 PM’ timestamp=‘1289412873’ post=‘1144540’]

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

======================================

Well. thanks again. So much of insightful comments and i am so glad i find this site and your comments…

please bare with me for little longer i think i am getting there…

Since i may need crray2 later usage and you mention a temp array, i coded the following using a temp array.

since you know whats going on here so i skipped before and after…

int *c_d_host; / temp array */

…

CUDA_SAFE_CALL( cudaMalloc((void **) &c_d, Xsizeof(int)) );

c_d_host = (int**)malloc(Xsizeof(int)); /* assign temp */

for(i=0; i<X; i++)
   CUDA_SAFE_CALL( cudaMalloc((void **) &c_d_host[i], Y*sizeof(int)) );   
for(i=0; i<X; i++)
   CUDA_SAFE_CALL( cudaMemcpy(c_d[i], c_d_host[i], Y*sizeof(int*), cudaMemcpyHostToDevice) );
the code could not pass the last copying part.

I don’t understand this piece of code. In my previous posting I was mentioning the temp variable (you don’t need a whole array, but that should work too) w.r.t copying the result data from the device to the host, after the kernel has completed. The last line in your new piece of code clearly attempts to copy from host to device, which I don’t find relevant to my previous posting.

cudesnick · November 10, 2010, 7:19pm

[quote name=‘syoon’ date=‘10 November 2010 - 02:14 PM’ timestamp=‘1289412873’ post=‘1144540’]

When you do cudaMalloc((void **) &c_d[i], Y*sizeof(int)), the code that runs on the CPU (NOT GPU) allocates the memory within the device space and attempts to write the pointer to the newly allocated memory into the device memory space c_d[i], which has been previously allocated by another cudaMalloc. This memory c_d[i] does NOT exist within the process space on the host at all (had I mentioned it was within the device space?), and hence this write would most likely cause a segfault in the host code. You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy (you don’t have this copy coded yet) prior to invoking the kernel.

======================================

Well. thanks again. So much of insightful comments and i am so glad i find this site and your comments…

please bare with me for little longer i think i am getting there…

Since i may need crray2 later usage and you mention a temp array, i coded the following using a temp array.

since you know whats going on here so i skipped before and after…

int *c_d_host; / temp array */

…

CUDA_SAFE_CALL( cudaMalloc((void **) &c_d, Xsizeof(int)) );

c_d_host = (int**)malloc(Xsizeof(int)); /* assign temp */

for(i=0; i<X; i++)
   CUDA_SAFE_CALL( cudaMalloc((void **) &c_d_host[i], Y*sizeof(int)) );   
for(i=0; i<X; i++)
   CUDA_SAFE_CALL( cudaMemcpy(c_d[i], c_d_host[i], Y*sizeof(int*), cudaMemcpyHostToDevice) );
the code could not pass the last copying part.

I don’t understand this piece of code. In my previous posting I was mentioning the temp variable (you don’t need a whole array, but that should work too) w.r.t copying the result data from the device to the host, after the kernel has completed. The last line in your new piece of code clearly attempts to copy from host to device, which I don’t find relevant to my previous posting.

syoon · November 10, 2010, 8:03pm

In previous advice,

“You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy prior to invoking the kernel.”

Now I figured I may need crray2d for later use, I did cudaMalloc using ‘c_d_host’ and cudaMemcpy ‘c_d_host’ to c_d…

did i misunderstood your comments?

Thanks again.

syoon · November 10, 2010, 8:03pm

In previous advice,

“You want to write the result of your cudaMalloc into the host array crray2d[i] instead of c_d[i] and then copy that whole array of device pointers crray2d to device array c_d using cudaMemcpy prior to invoking the kernel.”

Now I figured I may need crray2d for later use, I did cudaMalloc using ‘c_d_host’ and cudaMemcpy ‘c_d_host’ to c_d…

did i misunderstood your comments?

Thanks again.

cudesnick · November 10, 2010, 8:54pm

You shouldn’t have allocated anything on the host into elements of crray2d prior to kernel invocation: you should have only allocated device arrays into elements of this host array and hence should have initialized crray2d[i] with the device pointers. Obviously, you should have allocated the array crray2d itself on the host.

After the kernel has successfully returned, you should allocate HOST arrays into crray2d[i] and copy device arrays into these host arrays. You must be careful within this step to not loose pointers to device arrays stored within crray2d[i] prior to the allocation of the host arrays hence the use of a temp pointer variable. Once the contents of a device 1d array has been copied to the corresponding host 1d array, that device 1d array should be immediately freed in order to not leak the device RAM.

cudesnick · November 10, 2010, 8:54pm

You shouldn’t have allocated anything on the host into elements of crray2d prior to kernel invocation: you should have only allocated device arrays into elements of this host array and hence should have initialized crray2d[i] with the device pointers. Obviously, you should have allocated the array crray2d itself on the host.

After the kernel has successfully returned, you should allocate HOST arrays into crray2d[i] and copy device arrays into these host arrays. You must be careful within this step to not loose pointers to device arrays stored within crray2d[i] prior to the allocation of the host arrays hence the use of a temp pointer variable. Once the contents of a device 1d array has been copied to the corresponding host 1d array, that device 1d array should be immediately freed in order to not leak the device RAM.

Topic		Replies	Views
multi dimension array CUDA Programming and Performance	26	32766	February 12, 2010
How to cudaMalloc two-dimensional array ? CUDA Programming and Performance	46	66110	September 7, 2023
CUDA 2D Array Problem Need help to manipulate 2D arrays in CUDA CUDA Programming and Performance	4	26436	March 17, 2011
Passing a multidimensional array to kernel how to allocate space in host and pass to device? CUDA Programming and Performance	12	16181	November 22, 2014
Can we do malloc inside a __global__ function CUDA Programming and Performance	26	9650	February 21, 2010
2D Array Not Updated CUDA Programming and Performance	6	5235	May 4, 2010
Using unified memory for 2Dim and 3 Dim array CUDA Programming and Performance	2	528	November 18, 2018
Troubles converting float array to float2 array CUDA Programming and Performance	11	3510	July 3, 2017
Help with cuda 2d array CUDA Programming and Performance	6	7447	September 29, 2014
Setting up 3d arryas I have some questions about how to use 3d arrays and cudaArrays CUDA Programming and Performance	10	27882	April 5, 2010

2d array testing in very simple code using CUDA

Related topics