How to apply threads to a for loop over one million points

I have a loop that runs a million times. To port the code to CUDA I have to use threads. As far as I can tell, the data the calculations work on is data parallel.
When I call the kernel, I can't create one million threads. At most I can launch 60,000 to 65,000 threads. Please tell me how to apply threads in this case and how to port my normal C code to CUDA.
All the examples I have seen so far have fewer than 60,000 points in total, so there was no problem applying threads and launching the kernel function.
Can anyone please help me with this?

Thanks in advance

You can have a grid of up to 65535 x 65535 blocks, each containing up to 512 threads. That is about 2.2e12 threads in total, i.e. over two million million.
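
For the one-million-point loop, a minimal sketch might look like the following. It assumes every loop iteration is independent of the others. The names processPoints and blocksFor and the 2*x + 1 calculation are placeholders, not taken from the original code. Each thread handles one loop index, computed from its block and thread number, and with 256 threads per block the launch needs only 3907 blocks, far below the per-dimension block limit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder for the body of the original for loop
// (assumption: each point can be computed independently).
__global__ void processPoints(const float *in, float *out, int n)
{
    // Global index of this thread across all blocks; it replaces the loop counter.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // the last block may contain threads past the end of the data
        out[i] = 2.0f * in[i] + 1.0f;
}

// Number of blocks needed to cover n points, rounded up.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1000000;
    const int threadsPerBlock = 256;
    const int blocks = blocksFor(n, threadsPerBlock);
    const size_t bytes = n * sizeof(float);

    // Host buffers with some example input.
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    // Device buffers, plus a copy of the input to the GPU.
    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // One thread per point: blocks * threadsPerBlock >= n.
    processPoints<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Compare against the plain C loop.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i] + 1.0f)
            ++mismatches;
    printf("blocks = %d, mismatches = %d\n", blocks, mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return mismatches != 0;
}
```

This should build with nvcc. If a problem ever needed more blocks than one grid dimension allows, the grid could be made two-dimensional, or each thread could loop over several points instead of one.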

“You can’t have more than 60k threads” as in “you don’t know how” or “it hangs if you try”?

Thanks for the reply, but this didn't help me.
Can you please help me with some sample code or a method showing how to do it?
Thanks