I have a loop that runs one million times. To port the code to CUDA, I need to map the loop iterations onto threads. As far as I can tell, the data the calculations operate on is data parallel.
When I call the kernel, though, I can't create one million threads; at most I seem to be able to launch about 60,000 to 65,000. Given that limit, how should I assign the work to threads, and how do I port my plain C code to CUDA?
All the examples I have seen so far have fewer than 60,000 points in total, so mapping threads to the data and launching the kernel was never a problem in them.
Can anyone please help me with this?
Thanks in advance.