Convert HOST Loop to CUDA

Good Afternoon,

This may be a simple/stupid question, but can anyone tell me how to convert a loop into a CUDA kernel using the following simple/toy C-code?

#include <stdio.h>
#include <stdlib.h>  // malloc, free

int main(int argc, char **argv){
  float result1 = 0.0f;
  float result2 = 0.0f;
  float *array;
  int numElems = 128;

  array = (float*)malloc(numElems*sizeof(float));
  // initialize with sample data
  for(int i = 0; i < numElems; ++i){
    array[i] = i * 1.5f;
  }

  // how to convert this to CUDA ?
  for(int i = 0; i < numElems; ++i){
    result1 = result1 * array[i];
    result2 = result2 * (array[i]*0.5f);
  }

  free(array);

  return 0;
}

In the end I don’t want to modify the input array to compute the final result1 and result2 variables.

I have a feeling that the loop defined in the above C-code may not be as straightforward as it seems.

Thanks in advance for any help.

You don’t need CUDA or a loop to process this. The answer is zero for both result1 and result2, because both start at 0.0f and the loop only ever multiplies them.

In the more general case where you have initialized result1 and result2 to something other than zero (and, for completeness, let’s assume your array is not a simple arithmetic sequence), this can be accomplished with a parallel reduction.

Considering the calculation for result1, the only difference from the typical treatise on parallel reduction:

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

is that you are taking the product in every case, instead of the sum. If you sort that out, I think the method to calculate result2 will be immediately obvious.
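As a rough illustration of the pattern described in that PDF, here is a host-side C++ sketch of a pairwise (tree) reduction using multiplication instead of addition. It is not a CUDA kernel: the inner loop over tid stands in for the threads of one block, and the scratch buffer stands in for shared memory. The function name tree_product and the scale parameter are made up for this example; a scale of 1.0f corresponds to the result1 product and 0.5f to the result2 product. The input array is only read, never written.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: product of (in[i] * scale) over n elements, computed
// with the same halving-stride pattern a CUDA block would use in shared memory.
float tree_product(const float* in, std::size_t n, float scale) {
    if (n == 0) return 1.0f;  // multiplicative identity for an empty input

    // Round the working width up to a power of two and pad with 1.0f,
    // so the padding does not change the product.
    std::size_t width = 1;
    while (width < n) width *= 2;
    std::vector<float> scratch(width, 1.0f);

    // "Load" step: the per-element transform is applied here,
    // leaving the input array untouched.
    for (std::size_t i = 0; i < n; ++i) scratch[i] = in[i] * scale;

    // Each pass halves the number of active "threads".
    for (std::size_t stride = width / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            scratch[tid] *= scratch[tid + stride];
        }
    }
    return scratch[0];
}
```

The loop results would then be something like result1 = result1 * tree_product(array, numElems, 1.0f) and result2 = result2 * tree_product(array, numElems, 0.5f). In a real kernel each pass of the stride loop would be separated by __syncthreads(), and the per-block partial products would need a second reduction step.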

Thanks,

You are correct, reduction is the likely answer. Also, thank you for the link.

The information for reduction was very good - thank you.

I know that CUB has some operations for something similar - do you have any recommendation(s) regarding using CUB to reduce? Any sample/example code using CUB for the type of reduction(s) I am attempting would be greatly appreciated.

Thanks again.

Here’s sample code for CUB’s device-wide reduce that shows the use of a user-supplied reduction operator:

https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html#aa4adabeb841b852a7a5ecf4f99a2daeb

So make that operator multiplication.
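A minimal sketch of what such an operator could look like is below. The names ProductOp, HOST_DEVICE and reduce_reference are invented for this example and are not part of CUB. The guard macro lets the same functor compile with a plain C++ compiler for checking on the host, and with nvcc for use on the device. The reduce_reference function is only a sequential stand-in for checking results; it is not how CUB works internally.

```cpp
// Expand to CUDA decorators only when compiled by nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical binary operator: multiplication instead of min/sum.
struct ProductOp {
    template <typename T>
    HOST_DEVICE T operator()(const T& a, const T& b) const {
        return a * b;
    }
};

// Sequential host reference, useful for comparing against the GPU result.
template <typename T, typename Op>
T reduce_reference(const T* in, int n, Op op, T init) {
    T acc = init;
    for (int i = 0; i < n; ++i) acc = op(acc, in[i]);
    return acc;
}

// With CUB, the functor would be passed roughly like this (assumption:
// check the documentation for the exact signature in your CUB version):
//   cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes,
//                             d_in, d_out, num_items, ProductOp(), 1.0f);
```

The initial value passed alongside the operator should be the multiplicative identity (1.0f here), or your own starting value for result1 if it is not 1.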

To do this in the complex domain, the easiest way would be to include

#include <thrust/complex.h>

in your code. A bit of googling will show you how to use thrust::complex

Thanks again for all help - it is greatly appreciated.

Unfortunately I can’t use thrust.

The one caveat I have with regard to reduction is that the array elements can’t be modified. I suppose I could make a copy of the array and run the reduction on that, but that seems like a lot of extra overhead?

You can construct your own complex multiply operator without thrust. Read about complex number arithmetic on Wikipedia.
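For reference, the product of two complex numbers is (a + bi)(c + di) = (ac - bd) + (ad + bc)i. A minimal sketch of an operator built on that rule follows. The Complex struct, the ComplexMulOp functor and the HOST_DEVICE macro are names made up for this example; your own code may already have a complex type you would use instead.

```cpp
// Expand to CUDA decorators only when compiled by nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical plain complex type: real and imaginary parts.
struct Complex {
    float re;
    float im;
};

// Hypothetical reduction operator: complex multiplication.
struct ComplexMulOp {
    HOST_DEVICE Complex operator()(const Complex& a, const Complex& b) const {
        Complex r;
        r.re = a.re * b.re - a.im * b.im;  // ac - bd
        r.im = a.re * b.im + a.im * b.re;  // ad + bc
        return r;
    }
};
```

For a product reduction the initial value would be the complex identity, i.e. a Complex with re = 1.0f and im = 0.0f.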

The reduction operation does not modify the array elements.

Cool. I guess I need to read up more before asking, thank you for being patient.

Quick question: Is there any problem with passing an array to CUB reduction that is already allocated on the DEVICE?

Thanks again.

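That would be the only way to use CUB here: the device-wide reduce expects its input and output pointers to refer to device-accessible memory.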

Cool. Just making sure. Thank you

I got it working, thank you very much Robert_Crovella. I posted the final working version at:

https://devtalk.nvidia.com/default/topic/1062732/gpu-accelerated-libraries/cub-reduction-with-complex-number-and-multiplication/?offset=2#5381589

Just in case others could make use of the solution.