Gauss-Seidel iterations GPU

Hello, I am new to this and I would like some help with a question. In my code I pass a matrix to a function called "solver" that applies Gauss-Seidel iterations, and I am now at the point of porting it to CUDA. When I enter a matrix size of 2 it seems to work fine, but with a size of 3 or greater nothing is printed anymore and the reported kernel execution time is 0.
I do not know where it is failing or what I am doing wrong. This is the complete code I have implemented so far; if someone can tell me how I must modify the solver function to make it work correctly, I would appreciate it.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define MAX_ITER 1000000
#define MAX 100 //maximum value of the matrix element
#define TOL 0.000001

// Generate a random float number with the maximum value of max


float rand_float(int max){
  return ((float)rand()/(float)(RAND_MAX)) * max;
}


// Allocate 2D matrix


void allocate_init_2Dmatrix(float ***mat,  int n, int m){
  int i, j;
  *mat = (float **) malloc(n * sizeof(float *));
  for(i = 0; i < n; i++) {
    (*mat)[i] = (float *)malloc(m * sizeof(float));
    for (j = 0; j < m; j++)
      (*mat)[i][j] = rand_float(MAX);
  }
}


// solver


__global__ void solver(float **matd, int n, int m){

  float diff = 0, temp;
  int done = 0, cnt_iter = 0;
 int j= blockIdx.y*blockDim.y + threadIdx.y;
 int i= blockIdx.x*blockDim.x + threadIdx.x;

 while (!done && (cnt_iter < MAX_ITER)){
    diff = 0;
  if( i > 0 && j > 0 && (i=1) < (n-1) && (j=1) <(m-1)){
       temp = (matd)[i][j];
        printf("temp[%d][%d]: %f\n",i,j,temp);
        (matd)[i][j] = 0.2 * ((matd)[i][j] + (matd)[i][j-1 ] + (matd)[i-1 ][j] + (matd)[i][j + 1] + (matd)[i + 1][j]);
        diff += abs((matd)[i][j] - temp);
//        printf("diff:%f\n",diff);
      }

    if (diff/n/n < TOL)
      done = 1;
    cnt_iter ++;
  }

    if (done)
      printf("Solver converged after %d iterations\n", cnt_iter);

    else
      printf("Solver not converged after %d iterations\n", cnt_iter);
}


int main(int argc, char *argv[]) {

  int n;
  float **a,**ad;
struct timeval start, end,start1, end1;
    double mtime, seconds, useconds,x,mtime1, seconds1, useconds1,y;
    gettimeofday(&start, NULL);
        dim3 DimGrid(2,2);
        dim3 DimBlock(2,2);
cudaEvent_t startcuda;
cudaEvent_t stopcuda;
  if (argc < 2) {
    printf("Call this program with two parameters: matrix_size communication \n");
    printf("\t matrix_size: Add 2 to a power of 2 (e.g. : 18, 1026)\n");
    exit(1);
  }

  n = atoi(argv[1]);
 // float **temp;
 float **tem;
  printf("Matrix size = %d\n", n);
  allocate_init_2Dmatrix(&a, n, n);

// Allocate 2D array in Device

cudaMalloc(&ad, n*sizeof(float *));
for(int i=0;i<n;i++){
 cudaMalloc(&tem[i],n*sizeof(float));
 cudaMemcpy(tem[i],a[i],n*sizeof(float),cudaMemcpyHostToDevice);
 }
cudaMemcpy(ad,tem,n*n*sizeof(float),cudaMemcpyHostToDevice);
cudaEventCreate(&startcuda);
cudaEventCreate(&stopcuda);
gettimeofday(&start1, NULL);
cudaEventRecord(startcuda,0);
  solver<<<DimGrid, DimBlock>>>(ad, n, n);
cudaEventRecord(stopcuda,0);

/* for (int i = 0; i < n; i++){
  cudaMemcpy(a[i],temp[i],n*sizeof(float),cudaMemcpyHostToDevice);
}*/
cudaFree(ad);
cudaEventSynchronize(stopcuda);
float tiempocuda;
cudaEventElapsedTime(&tiempocuda,startcuda,stopcuda);
printf("Kernel time: %f milliseconds\n",tiempocuda);
gettimeofday(&end1, NULL);
    seconds1 = end1.tv_sec - start1.tv_sec;
    useconds1 = end1.tv_usec - start1.tv_usec;
    mtime1 = ((seconds1)*1000+ useconds1/1000);
    y=mtime1/1000;
    printf("\nSolver function time: %g seconds", y);
gettimeofday(&end, NULL);
    seconds = end.tv_sec - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;
    mtime = ((seconds)*1000+ useconds/1000);
    x=mtime/1000;
    printf("\nTotal program time: %g seconds\n", x);

  return 0;
}

This is not correct:

if( i > 0 && j > 0 && (i=1) < (n-1) && (j=1) <(m-1)){

study the example I gave you here:

https://devtalk.nvidia.com/default/topic/392370/cuda-programming-and-performance/how-to-cudamalloc-two-dimensional-array-/post/5308738/#5308738

Furthermore you cannot iterate inside the kernel the way you are trying to do. CUDA threads do not all execute at the same time.

For understanding, I suggest you start by:

  1. separate the input and output matrix for the kernel
  2. perform just a single update/iteration in the kernel
  3. move the while loop outside the kernel - call the kernel from a while loop in host code

what might be better than trying to fix all the issues with your code is if you study this:

http://developer.download.nvidia.com/compute/developertrainingmaterials/samples/cuda_c/Jacobi_Optimization.zip

The only thing missing from that code is the convergence stop/test.

Yes, that condition was only a test: in the sequential version those indices must start at 1, since the variable temp saves the value of matd[1][1] to perform the iterations. I have created variables to divide the matrix inside the kernel so that each block performs its own iterations. I tried to move the while loop out to the host but I cannot find a way to do it: the diff variable that the convergence condition evaluates lives inside the kernel and I could not get it out, and if I leave the loop inside the kernel, the variables done and cnt_iter would have to be declared on both sides and would not hold the same value.
I know that my code is not correct, but I would like to know how to perform these iterations. I am also unsure what values the variables i and j should take: blockIdx.x * blockDim.x + threadIdx.x, or the per-block starting value that I have defined with the variables istart and istop.

This is my code in the kernel

__global__ void solver(float **matd, int n, int m){

  float diff = 0, temp;
  
 int j= threadIdx.x;
 int i= threadIdx.x;
 int nthreads,istart,istop,ngrid;
 nthreads = threadIdx.x;
 ngrid = blockDim.x;
 istart = nthreads*(n-1)/ngrid;
 istop = (nthreads+1)*(m-1)/ngrid;

  
    diff = 0;
        i=istart+1;
        j=istart+1;
  if( i > 0 && j > 0 && i < (istop-1) && j <(istop-1)){
       temp = (matd)[i][j];
        printf("temp[%d][%d]: %f\n",i,j,temp);
        (matd)[i][j] = 0.2 * ((matd)[i][j] + (matd)[i][j-1 ] + (matd)[i-1 ][j] + (matd)[i][j + 1] + (matd)[i + 1][j]);
        diff += abs((matd)[i][j] - temp);
//        printf("diff:%f\n",diff);
      }
}

And this is how I tried to call it from the host code:

while (!done && (cnt_iter < MAX_ITER)){
 cudaEventRecord(startcuda,0);
 solver<<<DimGrid, DimBlock>>>(ad, n, n);
 cudaEventRecord(stopcuda,0);
 if (diff/n/n < TOL)
      done = 1;
    cnt_iter ++;
  }
    if (done)
      printf("Solver converged after %d iterations\n", cnt_iter);

    else
      printf("Solver not converged after %d iterations\n", cnt_iter);

Here is a simplified example of what I was suggesting:

$ cat t374.cu
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define MAX_ITER 1000
#define MAX 100 //maximum value of the matrix element
#define TOL 0.1

// Generate a random float number with the maximum value of max


float rand_float(int max){
  return ((float)rand()/(float)(RAND_MAX)) * max;
}


// Allocate 2D matrix


void allocate_init_2Dmatrix(float ***mat,  int n, int m){
  int i, j;
  *mat = (float **) malloc(n * sizeof(float *));
  for(i = 0; i < n; i++) {
    (*mat)[i] = (float *)malloc(m * sizeof(float));
    for (j = 0; j < m; j++)
      (*mat)[i][j] = rand_float(MAX);
  }
}


// solver


__global__ void solver(float **matdi, float **matdo, int n, int m, float *diff){

 int i= blockIdx.y*blockDim.y + threadIdx.y;
 int j= blockIdx.x*blockDim.x + threadIdx.x;

  if( (i > 0) && (j > 0) && (i < (n-1)) && (j <(m-1))){
        (matdo)[i][j] = 0.2 * ((matdi)[i][j] + (matdi)[i][j-1 ] + (matdi)[i-1 ][j] + (matdi)[i][j + 1] + (matdi)[i + 1][j]);
        atomicAdd(diff, abs((matdo)[i][j] - (matdi)[i][j]));
      }


}


int main(int argc, char *argv[]) {

  int n, cnt_iter=0;
  float **a,**adi, **ado, *d_diff;
  if (argc < 2) {
    printf("Call this program with two parameters: matrix_size communication \n");
    printf("\t matrix_size: Add 2 to a power of 2 (e.g. : 18, 1026)\n");
    exit(1);
    }

  n = atoi(argv[1]);
 // float **temp;
  float **temi, **temo;
  temi = new float*[n];
  temo = new float*[n];
  printf("Matrix size = %d\n", n);
  allocate_init_2Dmatrix(&a, n, n);

// Allocate 2D array in Device

  cudaMalloc(&adi, n*sizeof(float *));
  cudaMalloc(&ado, n*sizeof(float *));
  for(int i=0;i<n;i++){
    cudaMalloc(&(temi[i]),n*sizeof(float));
    cudaMalloc(&(temo[i]),n*sizeof(float));
    cudaMemcpy(temi[i],a[i],n*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(temo[i],a[i],n*sizeof(float),cudaMemcpyHostToDevice);
    }
  cudaMemcpy(adi,temi,n*sizeof(float *),cudaMemcpyHostToDevice);
  cudaMemcpy(ado,temo,n*sizeof(float *),cudaMemcpyHostToDevice);
  float h_diff = n*n;
  dim3 DimBlock(32,8);
  dim3 DimGrid((n+DimBlock.x-1)/DimBlock.x,(n+DimBlock.y-1)/DimBlock.y);
  cudaMalloc(&d_diff, sizeof(float));
  cudaMemset(d_diff, 0, sizeof(float));
  while ((cnt_iter < MAX_ITER) && ((h_diff/(n*n)) > TOL)) {
    solver<<<DimGrid, DimBlock>>>(adi, ado, n, n, d_diff);
    cudaMemcpy(&h_diff, d_diff, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemset(d_diff, 0, sizeof(float));
    cnt_iter++;
    float **adt = adi; // ping-pong input and output buffers
    adi = ado;
    ado = adt;
    }
  printf("cnt_iter = %d, diff = %f\n", cnt_iter, h_diff/(n*n));
  for (int i = 0; i < n; i++){
    cudaMemcpy(a[i],temi[i],n*sizeof(float),cudaMemcpyDeviceToHost);
  }

  return 0;
}
$ nvcc t374.cu -o t374
$ cuda-memcheck ./t374 18
========= CUDA-MEMCHECK
Matrix size = 18
cnt_iter = 18, diff = 0.097259
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck ./t374 100
========= CUDA-MEMCHECK
Matrix size = 100
cnt_iter = 18, diff = 0.095983
========= ERROR SUMMARY: 0 errors
$

Thank you very much for your help Robert. Regarding the calculation implementation, what I would like to know is whether it is possible to do it with a single pointer entering the kernel, since each iteration depends on the previous values; that is why the temp variable was introduced in the first place. I also need to measure the execution time of the kernel alone, and of the whole application, while changing the block size. I tested with square block sizes from 4x4 up to 32x32, and what I do not understand is that from 8x8 threads onwards the time tends to grow compared to the first measurement with a 4x4 block. I do not know if there is some kind of divergence between the threads.

I don’t think you really understand the nature of CUDA thread execution. There is no defined order of thread execution. Threads are not all executing at the same time, in lockstep. Therefore doing it with a single pointer (having some threads modify the input before other threads have read those input values) is not going to work. Your temp variable doesn’t sort this out.