Hi, I’m a CUDA novice, and I’m happy to have found this forum.
I’m trying to parallelize my code with CUDA.
My code has a computationally intensive routine. In this routine, there are three for-loops, which are repeated 5x5x4 times.
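To make the loop structure concrete, here is a minimal sketch of one possible mapping, not my actual code. It assumes the three loops are nested with bounds 5, 5 and 4 (which loop has which bound is a guess), that the iterations are independent of each other, and it uses a hypothetical placeholder `body(i, j, k)` for the real per-iteration work. Each (i, j, k) iteration is given to one thread of a single 5x5x4 thread block, and the result is checked against the serial loops.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Loop bounds from the post; the assignment of bounds to loops is an assumption.
constexpr int NI = 5, NJ = 5, NK = 4;

// Hypothetical stand-in for the real per-iteration computation.
__host__ __device__ inline float body(int i, int j, int k) {
    return static_cast<float>(i + 10 * j + 100 * k);
}

// One thread per (i, j, k) iteration; assumes iterations are independent.
__global__ void loopKernel(float* out) {
    int i = threadIdx.x, j = threadIdx.y, k = threadIdx.z;
    if (i < NI && j < NJ && k < NK)
        out[(k * NJ + j) * NI + i] = body(i, j, k);
}

int main() {
    const int n = NI * NJ * NK;  // 100 iterations in total
    float host[n];
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return 1;

    // A single block of 5x5x4 = 100 threads replaces the three loops.
    loopKernel<<<1, dim3(NI, NJ, NK)>>>(dev);
    if (cudaMemcpy(host, dev, n * sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    // Compare against the original serial loop nest.
    int mismatches = 0;
    for (int k = 0; k < NK; ++k)
        for (int j = 0; j < NJ; ++j)
            for (int i = 0; i < NI; ++i)
                if (host[(k * NJ + j) * NI + i] != body(i, j, k)) ++mismatches;

    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

Note that 100 threads is only a few warps, so a mapping like this on its own would leave most of the GPU idle unless the work inside each iteration is large or can itself be parallelized.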
So, as a first step, I’d like to parallelize this routine. Here is my question.
Which one is more efficient?
Also, could you suggest a better way, if there is one?