Hello,
I’m new to CUDA programming and want to run an algorithm until a certain value is reached. The problem is that I don’t want to transfer the data back to the CPU after every cycle just to decide whether another cycle is necessary, and my GPU doesn’t support dynamic parallelism. Is it possible to return a single value from a CUDA kernel? Is there a way to call a kernel recursively? And is it possible to call cuBLAS functions from inside a kernel without dynamic parallelism?
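A kernel has no return value (it must be declared void), so a result has to be written to memory that the host can read. You can get data from a kernel back to the host via the cudaMemcpy… API calls, or you can use zero-copy pinned memory. Another method is Unified Memory (UM) on Kepler or newer GPUs, although UM is effectively doing cudaMemcpy-type operations under the hood.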
Kernels can call device functions recursively, but a kernel cannot launch a kernel (itself or another) unless the GPU supports dynamic parallelism. The only way to call cuBLAS functions from inside a kernel is also via dynamic parallelism, so neither is an option on your GPU.
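To illustrate the first point, here is a minimal sketch of a recursive device function called from a kernel. The names factorial and fill are made up for the example, and it assumes a GPU of compute capability 2.0 or higher; error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical recursive __device__ function (illustration only).
__device__ unsigned int factorial(unsigned int k)
{
    return (k <= 1u) ? 1u : k * factorial(k - 1u);
}

// Each thread computes the factorial of its own thread index.
__global__ void fill(unsigned int *out)
{
    out[threadIdx.x] = factorial(threadIdx.x);
}

int main()
{
    const int n = 8;
    unsigned int h_out[n];
    unsigned int *d_out = nullptr;

    cudaMalloc((void **)&d_out, n * sizeof(unsigned int));
    fill<<<1, n>>>(d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%d! = %u\n", i, h_out[i]);

    cudaFree(d_out);
    return 0;
}
```

Deep recursion uses per-thread stack space, so for anything beyond a toy case you may need to raise the limit with cudaDeviceSetLimit(cudaLimitStackSize, ...).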
Why not just use cudaMemcpy after the kernel call? It’s likely to cost a few microseconds for two floats.
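A rough sketch of what that could look like: the bulk data stays on the GPU, and only a single int flag is copied back after each cycle. The kernel step, the halving update and the threshold are placeholders I made up to stand in for the real algorithm; error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for one cycle of the algorithm: halve every element and
// raise the flag if any element is still above the threshold.
__global__ void step(float *data, int n, float threshold, int *notDone)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;
        if (data[i] > threshold)
            *notDone = 1;   // every writer stores the same value
    }
}

int main()
{
    const int n = 1 << 20;
    const float threshold = 1e-3f;   // assumed stopping criterion

    std::vector<float> h_data(n, 1.0f);
    float *d_data = nullptr;
    int *d_notDone = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMalloc((void **)&d_notDone, sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int notDone = 1;
    int cycles = 0;
    while (notDone) {
        cudaMemset(d_notDone, 0, sizeof(int));   // clear the flag on the device
        step<<<(n + 255) / 256, 256>>>(d_data, n, threshold, d_notDone);
        // Only 4 bytes cross the bus; the call also waits for the kernel.
        cudaMemcpy(&notDone, d_notDone, sizeof(int), cudaMemcpyDeviceToHost);
        ++cycles;
    }
    printf("stopped after %d cycles\n", cycles);

    cudaFree(d_data);
    cudaFree(d_notDone);
    return 0;
}
```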
Are you optimizing your application now where you are trying to get rid of a few microseconds? If so, then zero-copy might be quicker, but it also depends on your exact access patterns to those values in your kernel.
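For comparison, a minimal zero-copy sketch under the same made-up assumptions (placeholder kernel step, halving update, assumed threshold, no error checking). The flag lives in mapped pinned host memory, so the kernel writes it directly and the host reads it after synchronizing, with no explicit cudaMemcpy for the flag.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same placeholder cycle as before; notDone points to mapped host memory.
__global__ void step(float *data, int n, float threshold, int *notDone)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 0.5f;
        if (data[i] > threshold)
            *notDone = 1;
    }
}

int main()
{
    // Needed on older setups before any other CUDA call so that pinned
    // allocations can be mapped into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1 << 20;
    const float threshold = 1e-3f;   // assumed stopping criterion

    std::vector<float> h_data(n, 1.0f);
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int *h_notDone = nullptr;   // host view of the flag
    int *d_notDone = nullptr;   // device view of the same memory
    cudaHostAlloc((void **)&h_notDone, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_notDone, h_notDone, 0);

    int cycles = 0;
    do {
        *h_notDone = 0;
        step<<<(n + 255) / 256, 256>>>(d_data, n, threshold, d_notDone);
        cudaDeviceSynchronize();   // make sure the kernel's write has landed
        ++cycles;
    } while (*h_notDone);
    printf("stopped after %d cycles\n", cycles);

    cudaFreeHost(h_notDone);
    cudaFree(d_data);
    return 0;
}
```

Both versions still synchronize with the host once per cycle, so timing them on your own kernel is the only way to tell which is faster in your case.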
Really it seems like we’re splitting hairs here. With a little bit of effort, you can try both methods in a short period of time.