Dear CUDA fellows,
If this topic has been answered before, my apologies, but the search function doesn't work for me.
My question is about how to launch kernel(s) to process data whose length is not a power of two.
I know two approaches.
The first is to launch one kernel on the closest power of two below the data length and then launch a second kernel on the rest of the data, which avoids launching idle threads.
The second is to launch a single kernel that calculates the thread id (aka tid) and wraps all of its code in an if statement:
if ( tid < Total_length )
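To make the second approach concrete, here is a minimal CPU-side sketch in plain C++ rather than an actual kernel. The names `blocks_for`, `run_guarded`, and `block_size` are hypothetical, and the two nested loops only stand in for what `blockIdx.x` and `threadIdx.x` would provide on the device. The grid is rounded up with a ceiling division, and the guard filters out the extra threads in the last block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ceiling division: the smallest number of blocks that covers total_length elements.
std::size_t blocks_for(std::size_t total_length, std::size_t block_size) {
    assert(block_size > 0);
    return (total_length + block_size - 1) / block_size;
}

// CPU model of one guarded launch (not real device code).
// Every "thread" that passes the guard increments its element;
// the return value counts the threads that fail the guard and stay idle.
std::size_t run_guarded(std::vector<int>& data, std::size_t block_size) {
    const std::size_t total_length = data.size();
    const std::size_t num_blocks = blocks_for(total_length, block_size);
    std::size_t idle_threads = 0;

    for (std::size_t block = 0; block < num_blocks; ++block) {        // stands in for blockIdx.x
        for (std::size_t thread = 0; thread < block_size; ++thread) { // stands in for threadIdx.x
            const std::size_t tid = block * block_size + thread;
            if (tid < total_length) {
                data[tid] += 1;   // the real work would go here
            } else {
                ++idle_threads;   // thread launched but has nothing to do
            }
        }
    }
    return idle_threads;
}
```

With this rounding, the number of idle threads is always smaller than `block_size`, whatever the data length is.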
The first approach never launches an idle thread, but it requires two kernel launches, and the second kernel must be set up to access the rest of the data through an offset. The second approach avoids the second kernel and any offset calculation, but it launches idle threads in the last block.
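The bookkeeping for the two-launch approach can be sketched in the same way. This is only an illustration under one assumption that is not stated above: the split point is taken as the largest multiple of the block size rather than strictly a power of two, so that the first launch consists of full blocks only. `Split` and `split_launch` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Launch parameters for the two-kernel approach (hypothetical helper type).
struct Split {
    std::size_t full_blocks;  // blocks in the first launch, all completely full
    std::size_t offset;       // index of the first element left for the second launch
    std::size_t remainder;    // elements (and threads) in the second launch
};

// Assumption: split at the largest multiple of block_size that fits in total_length.
Split split_launch(std::size_t total_length, std::size_t block_size) {
    assert(block_size > 0);
    Split s;
    s.full_blocks = total_length / block_size;     // integer division rounds down
    s.offset = s.full_blocks * block_size;         // first launch covers [0, offset)
    s.remainder = total_length - s.offset;         // second launch covers [offset, total_length)
    return s;
}
```

The second launch would then use a single block of `remainder` threads, each adding `offset` to its index, and it can be skipped entirely when `remainder` is zero.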
I am just looking for a clean and efficient method. What do you guys usually do to handle this situation?