I understand that CUDA offers an instruction-level parallel execution engine, but in practice we often have two functions with no dependencies between them, so if we want to improve the efficiency of the program, we need to implement procedure-level parallelism.
Can we use CUDA to do this?
It may be too hard for CUDA C, but how about PTX?
Technically you can do something like this:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < 2048) {
    // do procedure 1
} else if (tid < 4096) {
    // do procedure 2
} else ...
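As a sketch of this pattern, here is a minimal self-contained kernel. The device functions procedure1 and procedure2 and the buffer names are hypothetical, standing in for whatever independent work the two procedures actually do:

```cuda
// Hypothetical independent tasks; the names and the placeholder
// arithmetic are illustrative, not from the original post.
__device__ void procedure1(float *out, int i) {
    out[i] = i * 2.0f;            // placeholder work for task 1
}

__device__ void procedure2(float *out, int i) {
    out[i] = i * 3.0f + 1.0f;     // placeholder work for task 2
}

// One fused kernel launched with 4096 threads total;
// the global thread index selects which task a thread runs.
__global__ void fusedTasks(float *out1, float *out2) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < 2048) {
        procedure1(out1, tid);          // threads 0..2047 run task 1
    } else if (tid < 4096) {
        procedure2(out2, tid - 2048);   // threads 2048..4095 run task 2
    }
}

// Launch, e.g.: fusedTasks<<<16, 256>>>(d_out1, d_out2);
```

Note that with a block size that divides 2048 (e.g. 256), every block falls entirely on one side of the tid < 2048 split, so warps never diverge on the task-selection branch; the scheduler simply assigns whole blocks of task 1 or task 2 to whichever SM is free.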
That’s a poor man’s task parallelism. CUDA is smart enough to do load balancing in this situation. I’ve tested it on something like:
if (tid < 2048) {
    // compute a hundred MADs and store the result
} else if (tid < 4096) {
    // compute two hundred MADs and store the result
}
This code turned out to be exactly as fast as one that calculated 150 MADs with all 4096 threads.
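A hedged reconstruction of that experiment (the MAD counts and the 4096-thread total are from the post above; the kernel names, constants, and loop structure are assumptions):

```cuda
// Unbalanced kernel: half the 4096 threads do 100 MADs, half do 200.
__global__ void unbalanced(float *out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float x = 1.0f;
    const float a = 1.000001f, b = 0.000001f;
    if (tid < 2048) {
        for (int i = 0; i < 100; ++i) x = x * a + b;  // 100 MADs
    } else if (tid < 4096) {
        for (int i = 0; i < 200; ++i) x = x * a + b;  // 200 MADs
    }
    out[tid] = x;  // store so the compiler can't eliminate the work
}

// Balanced reference: all 4096 threads do 150 MADs (same total work).
__global__ void balanced(float *out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float x = 1.0f;
    const float a = 1.000001f, b = 0.000001f;
    for (int i = 0; i < 150; ++i) x = x * a + b;      // 150 MADs
    out[tid] = x;
}
```

If block scheduling balances the load as described, timing unbalanced<<<16, 256>>> against balanced<<<16, 256>>> should show roughly equal runtimes.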
Now, there are a few problems with this:
- your kernels get huge and ugly
- your register usage is for the worst-case branch
- for non-trivial kernels, load balancing may not turn out so great
- you have to manually partition the data and computation
We’ve been waiting for parallel kernel execution for a long time but apparently the GPU logic isn’t smart enough yet.