I am trying to implement an algorithm that needs to perform approximately 10,000 independent dot products, each on vectors of approximately 5,000 elements. Written for a CPU, this would be a nested loop: the inner loop computes a single dot product, while the outer loop iterates over the 10,000 dot products to be computed.
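For reference, a minimal sketch of the CPU version described above might look like the following. The function name `batched_dot`, the use of `float`, and the data layout are assumptions made for illustration: the vectors are taken to be stored one after another in two flat arrays, with the i-th dot product pairing the i-th vector of `a` with the i-th vector of `b`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout (an assumption, not a given): a and b each hold
// num_vectors vectors of length vec_len, stored contiguously back to back.
std::vector<float> batched_dot(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t num_vectors,
                               std::size_t vec_len) {
    assert(a.size() == num_vectors * vec_len);
    assert(b.size() == num_vectors * vec_len);

    std::vector<float> out(num_vectors, 0.0f);

    // Outer loop: one iteration per dot product (about 10,000 in my case).
    for (std::size_t i = 0; i < num_vectors; ++i) {
        float sum = 0.0f;
        // Inner loop: accumulate one dot product (about 5,000 terms in my case).
        for (std::size_t k = 0; k < vec_len; ++k) {
            sum += a[i * vec_len + k] * b[i * vec_len + k];
        }
        out[i] = sum;
    }
    return out;
}
```

Each outer iteration is independent of the others, and each inner loop is a sum reduction, so these are the two levels of parallelism I am asking about.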
I am new to CUDA and GPGPU, and I am wondering what the best way to parallelize this would be. From what I understand, it is possible to parallelize each individual dot product, and also to compute the 10,000 dot products simultaneously, but I am not sure whether both can be done at the same time. I would greatly appreciate it if someone could point me to a good example.