nested parallelism

I am trying to implement an algorithm that needs to perform approximately 10000 dot products on independent vectors of approximately 5000 elements each. As a program written for a CPU, this would be a nested loop where the inner loop computes one dot product while the outer loop traverses the 10000 vectors that need dot products.

I am new to CUDA and GPGPU, and I am wondering what the best way of parallelising this would be. From what I understand, it is possible to parallelise the dot-product operation itself, and also to process the 10000 vectors simultaneously. But I am not sure whether both can be done at the same time. I would greatly appreciate it if someone could point me to a good example.

Each iteration of your inner FOR loop could be mapped to a thread.

If you run out of threads, each thread can run a FOR loop of its own, so that together the threads cover the entire set of iterations…
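
To make that concrete, here is a rough sketch of one way to do both levels at once: one block per dot product (the outer loop), with the threads of that block striding over the elements (the inner loop) and then combining their partial sums in shared memory. The names batchedDot and batchedDotHost, the 256-thread block size, and the layout (vector pairs stored back to back in two flat arrays) are my own assumptions, not something from the original problem. Error checking on the CUDA calls is left out to keep it short.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; must be a power of two for the reduction below.
constexpr int THREADS = 256;

// One block per vector pair (outer loop). The threads of the block share
// the inner loop: thread t handles elements t, t + THREADS, t + 2*THREADS, ...
__global__ void batchedDot(const float* a, const float* b, float* out, int len)
{
    __shared__ float partial[THREADS];

    // Assumed layout: vector v occupies elements [v*len, (v+1)*len).
    const float* x = a + (size_t)blockIdx.x * len;
    const float* y = b + (size_t)blockIdx.x * len;

    // Each thread loops because len (about 5000) exceeds the thread count.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < len; i += blockDim.x)
        sum += x[i] * y[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper: copies the inputs over, launches one block
// per dot product, and returns the results.
std::vector<float> batchedDotHost(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  int numVectors, int len)
{
    size_t bytes = (size_t)numVectors * len * sizeof(float);
    float *dA, *dB, *dOut;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dOut, numVectors * sizeof(float));
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    batchedDot<<<numVectors, THREADS>>>(dA, dB, dOut, len);

    std::vector<float> out(numVectors);
    cudaMemcpy(out.data(), dOut, numVectors * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    return out;
}

int main()
{
    const int numVectors = 10000, len = 5000;
    std::vector<float> a((size_t)numVectors * len, 1.0f);
    std::vector<float> b((size_t)numVectors * len, 2.0f);

    std::vector<float> out = batchedDotHost(a, b, numVectors, len);

    // Compare against the known answer for this constant test data.
    float maxErr = 0.0f;
    for (float v : out)
        maxErr = fmaxf(maxErr, fabsf(v - 2.0f * len));
    printf("max abs error: %g\n", maxErr);
    return 0;
}
```

With 10000 blocks there is plenty of outer-level parallelism to keep the GPU busy, so it may turn out that fewer threads per block work just as well. That is worth measuring rather than guessing.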

Recently, I parallelized code that had 4 nested FOR loops. We mapped one thread to each quadruple <outerForloopIndex, middleLoopIndex1, MiddleLoopindex2, InnerLoopIndex>. It just worked like a breeze.
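
For what it is worth, the mapping looked roughly like the sketch below. This is not the actual code: the kernel name fourLoops, the wrapper runFourLoops, and the placeholder loop body are made up for illustration. The idea is to flatten the four indices into one linear index, give each thread one such index, and let the thread loop if there are more iterations than threads. This only works when the iterations are independent of each other.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each linear index t stands for one quadruple (i, j, k, l) of the
// original four nested loops with bounds n0, n1, n2, n3.
__global__ void fourLoops(int* out, int n0, int n1, int n2, int n3)
{
    long long total  = (long long)n0 * n1 * n2 * n3;
    long long stride = (long long)gridDim.x * blockDim.x;

    // If there are fewer threads than iterations, every thread keeps
    // stepping by the total thread count until all indices are covered.
    for (long long t = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         t < total; t += stride) {
        // Recover the four loop indices from the linear index.
        int l = (int)(t % n3);
        int k = (int)((t / n3) % n2);
        int j = (int)((t / ((long long)n3 * n2)) % n1);
        int i = (int)(t / ((long long)n3 * n2 * n1));

        // Placeholder body: encode the indices so the mapping can be checked.
        out[t] = i * 1000 + j * 100 + k * 10 + l;
    }
}

// Hypothetical host wrapper: launches the kernel with the given grid
// shape and returns one result per loop iteration.
std::vector<int> runFourLoops(int n0, int n1, int n2, int n3,
                              int blocks, int threads)
{
    size_t total = (size_t)n0 * n1 * n2 * n3;
    int* dOut;
    cudaMalloc(&dOut, total * sizeof(int));

    fourLoops<<<blocks, threads>>>(dOut, n0, n1, n2, n3);

    std::vector<int> out(total);
    cudaMemcpy(out.data(), dOut, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return out;
}

int main()
{
    // 2*3*4*5 = 120 iterations on only 64 threads, so the stride loop
    // inside the kernel is actually exercised.
    std::vector<int> out = runFourLoops(2, 3, 4, 5, 2, 32);

    // Check every entry against the plain nested loops on the CPU.
    int mismatches = 0;
    size_t t = 0;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 4; ++k)
                for (int l = 0; l < 5; ++l, ++t)
                    if (out[t] != i * 1000 + j * 100 + k * 10 + l)
                        ++mismatches;
    printf("mismatches: %d\n", mismatches);
    return 0;
}
```

The innermost index changes fastest with the thread index, so neighbouring threads touch neighbouring memory, which is usually what you want for coalesced access.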

Best regards