I’m porting an algorithm that was developed for the CPU to CUDA, and it contains some nested function calls. Some of the functions write to output arrays that are passed in by pointer.
For example, suppose there are two device functions:
__device__ void local_device_function(float *A, float *B) // A is an input, B is an output.
{
    for (int i = 0; i < 10; i++) {
        // ... blah blah ...
        B[i] = A[i];
    }
}
__device__ void caller_device_function(float *A)
{
    // ... blah blah ...
    local_device_function(A, B);
}
However, this does not seem to work. I think it is because B is not allocated with cudaMalloc in the host code; it works only when I allocate it there with cudaMalloc.
How can I pass the array data from local_device_function back to the caller function? Do I have to allocate enough memory in the host code with cudaMalloc just for temporary storage?
Because there are many functions like this, I would like to find another way.
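A minimal sketch of one possible alternative, assuming the temporary array is small and its size is known at compile time (10 here, matching the loop above): declare B as a local array inside the caller, so that each thread gets its own private copy and no cudaMalloc is needed for it. The names N, test_kernel, and run_example, and the summation over B, are hypothetical and only there to make the example runnable; error checking is mostly omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 10  // assumed fixed temporary size, matching the loop bound above

// Copies N floats from A to B. B can point to any memory the calling
// thread may access, including a local array of the caller.
__device__ void local_device_function(const float *A, float *B)
{
    for (int i = 0; i < N; i++) {
        B[i] = A[i];
    }
}

// B is a per-thread local array here, not a cudaMalloc'ed buffer.
// Returning the sum of B is a hypothetical way to make the result observable.
__device__ float caller_device_function(const float *A)
{
    float B[N];
    local_device_function(A, B);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += B[i];
    }
    return sum;
}

// Hypothetical test kernel: each thread runs the caller on the same input.
__global__ void test_kernel(const float *A, float *out)
{
    out[threadIdx.x] = caller_device_function(A);
}

// Runs the kernel with 4 threads and returns thread 0's result,
// or -1.0f if copying the result back fails.
float run_example()
{
    float hA[N], hOut[4];
    for (int i = 0; i < N; i++) {
        hA[i] = (float)i;
    }
    float *dA = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, sizeof(hA));      // only the kernel's input and output
    cudaMalloc(&dOut, sizeof(hOut));  // need cudaMalloc, not the temporary B
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    test_kernel<<<1, 4>>>(dA, dOut);
    cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOut);
    return (err == cudaSuccess) ? hOut[0] : -1.0f;
}

int main()
{
    printf("thread 0 result: %f\n", run_example());
    return 0;
}
```

If the temporary is too large for a local array, or its size is only known at run time, this sketch does not apply as written, and a buffer allocated once on the host and passed down may still be needed.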