Device function call from global

Hi. I need to implement a global function (in CUDA C) that calls a serial function many times. The serial function is recursive and returns an array of 5 elements; I want the global function to return that array for each thread. The code should be something like this:

struct matrix {
int rig;
int col;
float* N;
int* M;
};

__device__ float serial_func(struct matrix* A, struct matrix* B){
        if(-----)
               for (int j = 0; j < 5; j++) {
                       A->N[j] = B->N[j];
               }
               serial_func(A, B);
        else
              for (int j = 0; j < 5; j++) {
                       A->N[j] = B->N[j];
               }
       return 0;
}

__global__ void kernelFunc(?){
       int i = threadIdx.x;
       if(i<100)
            serial_func();
}

How can I implement it? Which parameters do I have to insert inside the global function?

Hello @john455, welcome to the NVIDIA developer forums!

I think this question is perfectly suited for our dedicated CUDA coding category.

I hope you don’t mind me moving this over to that category.

Thanks!

How would you write the code in C++?

If you make the following compilable and matching your intent, I expect someone can help you with whatever remains:

struct matrix {
int rig;
int col;
float* N;
int* M;
};

float serial_func(struct matrix* A, struct matrix* B){
        if(-----)
               for (int j = 0; j < 5; j++) {
                       A->N[j] = B->N[j];
               }
               serial_func(A, B);
        else
              for (int j = 0; j < 5; j++) {
                       A->N[j] = B->N[j];
               }
       return 0;
}

void kernelFunc(int i, ?){
       if(i<100)
            serial_func();
}

int main(){

  int max = 100;
  for (int i = 0; i < max; i++) kernelFunc(i, ?);
}

And if you intend to have every thread work on the same A and B, then this isn’t going to work. You can’t have multiple threads updating the same locations in memory:

                   A->N[j] = B->N[j];

without a lot more work than what is shown here.
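One common way around the shared-write problem is to give every thread its own output slice. The following is only a minimal sketch under assumptions not in the original post: the names `copy_kernel`, `copy_per_thread` and `ELEMS` are hypothetical, the source array is treated as read-only, and the results are stored in one flat array with 5 floats per thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define ELEMS 5  // each thread produces 5 values, as in the question

// Every thread reads the shared, read-only source and writes only to its
// own slice of out, so no two threads touch the same output location.
__global__ void copy_kernel(const float* src, float* out, int num_threads) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_threads) {
        for (int j = 0; j < ELEMS; j++) {
            out[i * ELEMS + j] = src[j];
        }
    }
}

// Hypothetical host helper: launches the kernel and returns all slices.
// src must hold at least ELEMS values. CUDA error checking is omitted.
std::vector<float> copy_per_thread(const std::vector<float>& src, int num_threads) {
    float* d_src = nullptr;
    float* d_out = nullptr;
    size_t out_bytes = (size_t)num_threads * ELEMS * sizeof(float);
    cudaMalloc(&d_src, ELEMS * sizeof(float));
    cudaMalloc(&d_out, out_bytes);
    cudaMemcpy(d_src, src.data(), ELEMS * sizeof(float), cudaMemcpyHostToDevice);

    int block = 128;
    int grid = (num_threads + block - 1) / block;
    copy_kernel<<<grid, block>>>(d_src, d_out, num_threads);

    // The blocking copy back also waits for the kernel to finish.
    std::vector<float> out((size_t)num_threads * ELEMS);
    cudaMemcpy(out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_out);
    return out;
}
```

Calling `copy_per_thread(src, 100)` from `main` would return 500 floats, with thread `i` owning elements `5*i` through `5*i + 4`.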

Actually, I did it in C, but I am having trouble doing the same in CUDA C. The idea would be to create a global function where each thread calls the serial (device) function once. Also, it seems that device functions don’t support recursion.

Right, I imagined you had, and what I was suggesting was that you fill out the code I already provided to make it complete. If you want to do that, I’ll take another look. If not, perhaps someone else will be able to help.

You can certainly do a recursive device function in CUDA, and there are multiple forum posts about it if you want to see an example. It does raise the issue of stack depth, which needs to be carefully considered.
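To illustrate the point about recursion and stack depth, here is a minimal sketch, not the original poster's function. The names `recursive_copy`, `recurse_kernel` and `run_recursive` are hypothetical, and the recursion depth is passed in explicitly so that termination is guaranteed. The per-thread stack size is set with `cudaDeviceSetLimit(cudaLimitStackSize, ...)` before the launch; the right value depends on the real function's frame size and maximum depth.

```cuda
#include <cuda_runtime.h>

// Recursive device function: descends depth levels, then copies 5 values.
// Returns how many levels were descended below this call.
__device__ int recursive_copy(float* dst, const float* src, int depth) {
    if (depth > 0) {
        return 1 + recursive_copy(dst, src, depth - 1);
    }
    for (int j = 0; j < 5; j++) {
        dst[j] = src[j];
    }
    return 0;
}

// Each thread recurses independently and writes to its own 5-float slice.
__global__ void recurse_kernel(const float* src, float* out, int* levels,
                               int depth, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        levels[i] = recursive_copy(out + i * 5, src, depth);
    }
}

// Hypothetical host helper: returns the depth reported by thread 0,
// or -1 if setting the stack limit or running the kernel fails.
int run_recursive(int depth, size_t stack_bytes) {
    const int n = 100;
    float h_src[5] = {1.f, 2.f, 3.f, 4.f, 5.f};
    float* d_src = nullptr;
    float* d_out = nullptr;
    int* d_levels = nullptr;
    int h_level = -1;

    // Per-thread stack size in bytes; must be set before the launch.
    if (cudaDeviceSetLimit(cudaLimitStackSize, stack_bytes) != cudaSuccess) {
        return -1;
    }
    cudaMalloc(&d_src, sizeof(h_src));
    cudaMalloc(&d_out, n * 5 * sizeof(float));
    cudaMalloc(&d_levels, n * sizeof(int));
    cudaMemcpy(d_src, h_src, sizeof(h_src), cudaMemcpyHostToDevice);

    recurse_kernel<<<1, 128>>>(d_src, d_out, d_levels, depth, n);
    if (cudaDeviceSynchronize() == cudaSuccess) {
        cudaMemcpy(&h_level, d_levels, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_src);
    cudaFree(d_out);
    cudaFree(d_levels);
    return h_level;
}
```

For example, `run_recursive(8, 4096)` asks for 4096 bytes of stack per thread and recurses 8 levels in each of the 100 active threads. A deeper recursion or a larger stack frame would need a larger limit.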

struct matrix {
int rig;
int col;
float* N;
int* M;
};

float serial_func(struct matrix* A, struct matrix* B){
    if(A->N[n] != -2)
           for (int j = 0; j < 5; j++) {
                   A->N[j] = B->N[j];
           }
           serial_func(A, B);
    else
          for (int j = 0; j < 5; j++) {
                   A->N[j] = B->N[j];
           }
   return 0;
}

void kernelFunc(int i, struct matrix* A, struct matrix* B){
   if(i<100)
        serial_func(A, B);
}

int main(){

  int max = 100;
  for (int i = 0; i < max; i++) kernelFunc(i, &A, &B);
}

This should make sense.

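That code won’t compile: n is never declared, the if/else bodies need braces (as written, the else has no matching if), and A and B are never declared in main.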