How to cudaMalloc two-dimensional array ?

Hi everyone

I am having problems with my code when passing it to the device, now my host matrix is ​​created without problems in the allocate_init_2Dmatrix call function and in turn it already places the values ​​inside the matrix, when creating the ad pointer corresponding to the one that will go to the device in the solver function is not carrying anything and immediately leaves the function, I have placed a printf inside the solver to verify the value of the pointer but I do not print anything, this is my code, I have implemented it as the options I found but the problem I think lies in that to go to the kernel should be the address of a pointer float, ie a float ***, this is my code, if someone can help me I would appreciate it a lot

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define MAX_ITER 1000000
#define MAX 100 //maximum value of the matrix element
#define TOL 0.000001

// Generate a random float number with the maximum value of max
float rand_float(int max){
  return ((float)rand()/(float)(RAND_MAX)) * max;
}


// Allocate 2D matrix
void allocate_init_2Dmatrix(float ***mat,  int n, int m){
  int i, j;
  *mat = (float **) malloc(n * sizeof(float *));
  for(i = 0; i < n; i++) {
    (*mat)[i] = (float *)malloc(m * sizeof(float));
    for (j = 0; j < m; j++)
      (*mat)[i][j] = rand_float(MAX);
  }

}


// solver
__global__ void solver(float ***matd, int n, int m){
  float diff = 0, temp;
  int done = 0, cnt_iter = 0;
 int j= blockIdx.x*blockDim.x + threadIdx.x;
 int i= blockIdx.y*blockDim.y + threadIdx.y;
printf("valor de matd:%f\n",***matd);
  while (!done && (cnt_iter < MAX_ITER)){
    diff = 0;
      if (i < n - 1 && j < m - 1){
       temp = (*matd)[i][j];
        (*matd)[i][j] = 0.2 * ((*matd)[i][j] + (*matd)[i][j - 1] + (*matd)[i - 1][j] + (*matd)[i][j + 1] + (*matd)[i + 1][j]);
        diff += abs((*matd)[i][j] - temp);
  //      printf("diff:%f\n",diff);
      }

    if (diff/n/n < TOL)
      done = 1;
    cnt_iter ++;
  }

  if (done)
    printf("Solver converged after %d iterations\n", cnt_iter);
  else
    printf("Solver not converged after %d iterations\n", cnt_iter);

}

int main(int argc, char *argv[]) {
  int n;
  float **a,**ad;
struct timeval start, end,start1, end1;
    double mtime, seconds, useconds,x,mtime1, seconds1, useconds1,y;
    gettimeofday(&start, NULL);
        dim3 DimGrid(10);
        dim3 DimBlock(128);
  if (argc < 2) {
    printf("Call this program with two parameters: matrix_size communication \n");
    printf("\t matrix_size: Add 2 to a power of 2 (e.g. : 18, 1026)\n");

    exit(1);
  }

  n = atoi(argv[1]);
  float *temph[n];
  printf("Matrix size = %d\n", n);
  allocate_init_2Dmatrix(&a, n, n);





//  cudaMalloc(&ad,n*n*sizeof(float ));
//  cudaMemcpy(ad,a,n*n*sizeof(float),cudaMemcpyHostToDevice);
// Allocate 2D array in Device
  cudaMalloc((void **)&ad,n*sizeof(float *));
 for (int i = 0; i < n; i++){
        cudaMalloc(&temph[i], n*sizeof(float));
}
  cudaMemcpy(ad,temph,n*sizeof(float *),cudaMemcpyHostToDevice);

 for (int i = 0; i < n; i++){
  cudaMemcpy(temph[i],a[i],n*sizeof(float *),cudaMemcpyHostToDevice);
}


gettimeofday(&start1, NULL);
//printf("valor de ad:%f\n",&a);
  solver<<<DimGrid, DimBlock>>>(&ad, n, n);
cudaMemcpy(a,ad,n*n*sizeof(float),cudaMemcpyDeviceToHost);
cudaFree(ad);
gettimeofday(&end1, NULL);
    seconds1 = end1.tv_sec - start1.tv_sec;
    useconds1 = end1.tv_usec - start1.tv_usec;
    mtime1 = ((seconds1)*1000+ useconds1/1000);
    y=mtime1/1000;
    printf("\nTiempo calculo de funcion solver es: %g segundos", y);
gettimeofday(&end, NULL);
    seconds = end.tv_sec - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;
    mtime = ((seconds)*1000+ useconds/1000);
    x=mtime/1000;
    printf("\nTiempo total de programa: %g segundos\n", x);

  return 0;
}

In the future I suggest you start a new thread, rather than adding to this thread that is about 6.5 years old.

There are a variety of errors in your code.

  1. You cannot pass &ad as an argument to your kernel. The address of that variable is a host pointer. Attempting to dereference a host pointer in device code is illegal. Furthermore, this construct is not needed for the 2D allocation scheme you seem to have in mind.

  2. We need to convert your kernel code to use (matd) directly, not (*matd)

  3. Your kernel appears to be doing jacobi iteration/relaxation, and is using the four neighbors at each point. However your bounds checking is only testing for 2 edges, you need to test for all 4 edges, to prevent out-of-bounds access when accessing the -1 and +1 neighbors.

  4. Your device allocation code was not quite correct. When actually allocating for data for each pointer, you want to allocate for (float) not (float *) storage.

  5. The copy of data from device to host was not correct. The copy there needs to be similar to the copy in the direction from host to device (i.e. with a loop, using your temph pointer storage).

Here’s a modified code that runs without execution errors (I have no idea if the calculations are correct or not). In the future I suggest using proper CUDA error checking and run your code with cuda-memcheck, before asking others for help:

$ cat t367.cu
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define MAX_ITER 1000000
#define MAX 100 //maximum value of the matrix element
#define TOL 0.000001

// Generate a random float number with the maximum value of max
float rand_float(int max){
  return ((float)rand()/(float)(RAND_MAX)) * max;
}

// Allocate 2D matrix
void allocate_init_2Dmatrix(float ***mat,  int n, int m){
  int i, j;
  *mat = (float **) malloc(n * sizeof(float *));
  for(i = 0; i < n; i++) {
    (*mat)[i] = (float *)malloc(m * sizeof(float));
    for (j = 0; j < m; j++)
      (*mat)[i][j] = rand_float(MAX);
  }

}

// solver
__global__ void solver(float **matd, int n, int m, bool debug){
  float diff = 0, temp;
  int done = 0, cnt_iter = 0;
 int j= blockIdx.x*blockDim.x + threadIdx.x;
 int i= blockIdx.y*blockDim.y + threadIdx.y;
 if (debug) printf("valor de matd:%f\n",**matd);
  while (!done && (cnt_iter < MAX_ITER)){
    diff = 0;
      if ((i < n - 1) && (j < m - 1) && (i > 0) && (j > 0)){
       temp = (matd)[i][j];
        (matd)[i][j] = 0.2 * ((matd)[i][j] + (matd)[i][j - 1] + (matd)[i - 1][j] + (matd)[i][j + 1] + (matd)[i + 1][j]);
        diff += abs((matd)[i][j] - temp);
  //      printf("diff:%f\n",diff);
      }

    if (diff/n/n < TOL)
      done = 1;
    cnt_iter ++;
  }
  if (debug){
    if (done)
      printf("Solver converged after %d iterations\n", cnt_iter);
    else
      printf("Solver not converged after %d iterations\n", cnt_iter);
  }
}

int main(int argc, char *argv[]) {
  int n;
  float **a,**ad;
struct timeval start, end,start1, end1;
    double mtime, seconds, useconds,x,mtime1, seconds1, useconds1,y;
    gettimeofday(&start, NULL);
        dim3 DimGrid(10);
        dim3 DimBlock(128);
  if (argc < 2) {
    printf("Call this program with two parameters: matrix_size communication \n");
    printf("\t matrix_size: Add 2 to a power of 2 (e.g. : 18, 1026)\n");

    exit(1);
  }

  n = atoi(argv[1]);
  float *temph[n];
  printf("Matrix size = %d\n", n);
  allocate_init_2Dmatrix(&a, n, n);

// Allocate 2D array in Device
  cudaMalloc((void **)&ad,n*sizeof(float *));
 for (int i = 0; i < n; i++){
        cudaMalloc(&temph[i], n*sizeof(float));
}
  cudaMemcpy(ad,temph,n*sizeof(float *),cudaMemcpyHostToDevice);

 for (int i = 0; i < n; i++){
  cudaMemcpy(temph[i],a[i],n*sizeof(float),cudaMemcpyHostToDevice);
}

Hello Robert, thank you very much for your prompt reply, I have already been able to fix the inconvenient and I have been able to pass the matrix to the GPU, the problem I have now is about the iterations since it uses Gauss-Seidel I do not know how to implement the condition so that it works Well, I’ve put the code that I have so far in a new topic as you recommended me, I’m new to this of Cuda so I hope you can guide me a bit on how to implement it

https://devtalk.nvidia.com/default/topic/1046233/cuda-programming-and-performance/gauss-seidel-iterations-gpu/

float **cpuArray = someValid2DHostarray;

for(int i=0; i<5; i++)

{

   cudaMemcpy(h_temp[i], cpuArray[i], 3*sizeof(float), cudaMemcpyHostToDevice);

}

I know this thread is ancient however this memcpy throws "argument of type “float” is incompatible with parameter of type “void *”. I am completely lost of how this is solveable.
I hope you are having a good time all these years.

with a proper definition of h_temp. You haven’t shown that. The definition you need is something like:

float *h_temp[5];

as shown here

you are right, I wonder why that is. I tried dynamically allocating it

You can do a dynamic allocation if you wish, then you need to start with something like

float **h_temp;

The topic of multidimensional array allocation has a lot of variety and has been covered in many places. here is a summary post.

1 Like