How to implement parallelism among procedures?

I think CUDA offers an instruction-level parallel execution engine, but in typical problems we often have two functions with no dependencies between them, so if we want to improve the efficiency of the program we need procedure-level parallelism.

Can we use CUDA to do this?

It may be too hard for CUDA C, but how about PTX?

Technically, you can do something like this:

int tid = threadIdx.x + blockIdx.x * blockDim.x;

if (tid < 2048) {
  // do procedure 1
} else if (tid < 4096) {
  // do procedure 2
} else {
  // ...
}
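
Fleshed out into something compilable, that pattern might look roughly like the sketch below. The procedure bodies, array names, and thread counts are placeholders I made up; the point is just the tid-based partitioning and the single launch covering both halves.

#include <cuda_runtime.h>

// Two independent "procedures" fused into one kernel; each half of the
// grid runs one of them. The bodies are arbitrary placeholders.
__global__ void fusedKernel(float *a, float *b, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < n) {
        // procedure 1 works on its own array
        a[tid] = a[tid] * 2.0f + 1.0f;
    } else if (tid < 2 * n) {
        // procedure 2 is independent of procedure 1
        int i = tid - n;                  // re-index into the second array
        b[i] = sqrtf(b[i]) + 3.0f;
    }
}

int main()
{
    const int n = 2048;                   // threads per procedure
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // one launch covering both partitions: 2*n threads in total
    int block = 256;
    int grid  = (2 * n + block - 1) / block;
    fusedKernel<<<grid, block>>>(a, b, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    return 0;
}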

That’s a poor man’s task parallelism. CUDA is smart enough to do load balancing in this situation. I’ve tested it on something like:

if (tid < 2048) {
  // compute a hundred MADs and store the result
} else if (tid < 4096) {
  // compute two hundred MADs and store the result
}

This code turned out to be exactly as fast as one that calculated 150 MADs with all 4096 threads.
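
For reference, a timing harness along those lines could look like the following. This is a reconstruction with made-up MAD loops and cudaEvent timing, not the exact code from that test.

#include <cstdio>
#include <cuda_runtime.h>

// Branching version: half the threads do 100 MADs, the other half 200.
__global__ void branched(float *out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float x = 1.0f;
    if (tid < 2048) {
        for (int i = 0; i < 100; ++i) x = x * 1.0001f + 0.5f;   // 100 MADs
    } else if (tid < 4096) {
        for (int i = 0; i < 200; ++i) x = x * 1.0001f + 0.5f;   // 200 MADs
    }
    out[tid] = x;   // store so the compiler keeps the loops
}

// Uniform version: every thread does 150 MADs.
__global__ void uniform(float *out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float x = 1.0f;
    for (int i = 0; i < 150; ++i) x = x * 1.0001f + 0.5f;       // 150 MADs
    out[tid] = x;
}

int main()
{
    float *out;
    cudaMalloc(&out, 4096 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time the branched kernel: 4096 threads, 256 per block
    cudaEventRecord(start);
    branched<<<16, 256>>>(out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msBranched;
    cudaEventElapsedTime(&msBranched, start, stop);

    // time the uniform kernel with the same configuration
    cudaEventRecord(start);
    uniform<<<16, 256>>>(out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msUniform;
    cudaEventElapsedTime(&msUniform, start, stop);

    printf("branched: %.3f ms, uniform: %.3f ms\n", msBranched, msUniform);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}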

Now, there are a few problems with this:

  • your kernels get huge and ugly

  • your register usage is determined by the worst-case branch

  • for non-trivial kernels, load balancing may not turn out so well

  • you have to manually partition the data and computation

We’ve been waiting for parallel kernel execution for a long time, but apparently the GPU scheduling logic isn’t smart enough for it yet.