I am implementing a parallel prefix sum based on the paper “Efficient Parallel Scan Algorithms for GPUs” by Shubhabrata Sengupta, Mark Harris, and Michael Garland.
My question is really about CUDA programming. I have a CUDA kernel that I call with two device pointers: d_ptr, on which it performs a parallel multi-block scan, and block_result, which collects the partial result from the last thread of every block. The host code appears to continue immediately after the kernel invocation, which is fine according to the programming guide, since kernel launches are asynchronous. However, my host code needs both device pointers, especially block_result: if the number of elements spans several blocks, I make a recursive call to compute the prefix sum of block_result.

I use a separate kernel for the multi-block scan because all thread blocks are guaranteed to have finished once the kernel completes. The problem is that right after launching this kernel I call another host function, which launches another kernel on block_result. By the time this function is called, d_ptr and block_result seem to contain garbage. Debugging with printf, I can see that my kernels have not finished executing when the host function is called with the pointers. Shouldn't CUDA notice that a host function is using a pointer that a running kernel also uses, and wait for the kernel to finish before calling the host function?
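To make the setup concrete, here is a minimal sketch of the kind of kernel described above: each block scans its own slice of d_ptr and its last thread writes the block total to block_result. This is a naive Hillis-Steele scan used purely for illustration, not the algorithm from the paper. The names scan_block_sketch, run_scan_sketch, and BLOCK, the use of int instead of T, and the test data are all assumptions made for this sketch.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 128;  // assumed block size; the kernel must be launched with blockDim.x == BLOCK

// Naive per-block inclusive scan (Hillis-Steele, double-buffered in shared memory).
// Illustrative only: NOT the algorithm from the paper.
__global__ void scan_block_sketch(int *d_ptr, int *block_result, int n)
{
    __shared__ int buf[2][BLOCK];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    int in = 0;

    buf[in][tid] = (gid < n) ? d_ptr[gid] : 0;  // pad the last block with zeros
    __syncthreads();

    for (int offset = 1; offset < BLOCK; offset *= 2) {
        int out = 1 - in;
        buf[out][tid] = buf[in][tid] + (tid >= offset ? buf[in][tid - offset] : 0);
        __syncthreads();
        in = out;
    }

    if (gid < n)
        d_ptr[gid] = buf[in][tid];
    if (tid == BLOCK - 1)
        block_result[blockIdx.x] = buf[in][tid];  // last thread stores the block total
}

// Hypothetical host driver: returns 0 if the GPU output matches a CPU reference.
int run_scan_sketch()
{
    const int n = 300;                                 // spans three blocks, the last one partial
    const int num_blocks = (n + BLOCK - 1) / BLOCK;
    int h_in[n], h_out[n], h_blocks[num_blocks];
    for (int i = 0; i < n; ++i)
        h_in[i] = i % 7;

    int *d_ptr = nullptr, *block_result = nullptr;
    if (cudaMalloc(&d_ptr, n * sizeof(int)) != cudaSuccess)
        return 1;
    if (cudaMalloc(&block_result, num_blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(d_ptr);
        return 1;
    }
    cudaMemcpy(d_ptr, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    scan_block_sketch<<<num_blocks, BLOCK>>>(d_ptr, block_result, n);

    // These blocking copies are issued after the kernel in the default stream.
    cudaError_t e1 = cudaMemcpy(h_out, d_ptr, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_blocks, block_result, num_blocks * sizeof(int),
                                cudaMemcpyDeviceToHost);
    cudaFree(d_ptr);
    cudaFree(block_result);
    if (e1 != cudaSuccess || e2 != cudaSuccess)
        return 2;

    // CPU reference: inclusive scan that restarts at every block boundary.
    int running = 0;
    for (int i = 0; i < n; ++i) {
        if (i % BLOCK == 0)
            running = 0;
        running += h_in[i];
        if (h_out[i] != running)
            return 3;
        if ((i % BLOCK == BLOCK - 1 || i == n - 1) && h_blocks[i / BLOCK] != running)
            return 4;
    }
    return 0;
}
```

Calling run_scan_sketch() from main and compiling with nvcc is enough to try it. The per-block totals in block_result are what the recursive call described above would then scan.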
An example to clarify:
//inside some host function
T *d_ptr;
T *block_result;
//call multiblock scan kernel
scan_multi_block<<< ..... >>>(d_ptr, block_result);
//blocks are synchronized when the kernel finishes; the launch itself is asynchronous
//call host function to process block_result
foo(block_result);
__host__ void foo(T *block_result)
{
    //call another kernel with block_result
    another_kernel<<< ... >>>(block_result);
    ...
}
With the code above, it seems that scan_multi_block has not finished writing its data to d_ptr and block_result before foo is called.
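One way this observation could be checked is to test for errors after each launch and to wait explicitly for the first kernel before launching the second. The following is only a sketch under stated assumptions: write_block_ids and double_values are hypothetical stand-ins for scan_multi_block and another_kernel, check_launch_order is a made-up helper name, and the cudaDeviceSynchronize call is there for debugging rather than as a recommended design.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for scan_multi_block: the last thread of each block writes one value.
__global__ void write_block_ids(int *block_result)
{
    if (threadIdx.x == blockDim.x - 1)
        block_result[blockIdx.x] = blockIdx.x + 1;
}

// Hypothetical stand-in for another_kernel: doubles every value in place.
__global__ void double_values(int *block_result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        block_result[i] *= 2;
}

// Returns 0 on success, non-zero if a CUDA call failed or a value is wrong.
int check_launch_order()
{
    const int num_blocks = 8;
    int *block_result = nullptr;
    if (cudaMalloc(&block_result, num_blocks * sizeof(int)) != cudaSuccess)
        return 1;

    write_block_ids<<<num_blocks, 64>>>(block_result);
    cudaError_t err = cudaGetLastError();   // reports launch errors, e.g. a bad configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // debugging aid: waits for the kernel and reports execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "first kernel: %s\n", cudaGetErrorString(err));
        cudaFree(block_result);
        return 2;
    }

    double_values<<<1, num_blocks>>>(block_result, num_blocks);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "second kernel: %s\n", cudaGetErrorString(err));
        cudaFree(block_result);
        return 3;
    }

    // Blocking copy in the default stream: returns only after the preceding kernels are done.
    int host[num_blocks];
    err = cudaMemcpy(host, block_result, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(block_result);
    if (err != cudaSuccess)
        return 4;

    for (int i = 0; i < num_blocks; ++i)
        if (host[i] != 2 * (i + 1))
            return 5;
    return 0;
}
```

If a check like this reports an error for the first kernel, the garbage in block_result would point to a failed launch or a fault inside the kernel rather than to the order in which the host and device run.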
What am I doing wrong? How can I achieve what I described?