Implementing 3D rotations with CUDA

I am implementing protein docking. Since it takes quite a long time, I want to switch to CUDA. The code runs independently for each set of Euler angles (psi, theta, phi). Until now I have used Python's multiprocessing module to run the code in parallel, one task per set of angles.
1. How can I implement the same thing using CUDA?
2. Can I distribute my calculations among threads so that each thread gives me an independent result?

I have 1024 threads per block.

My code:

# Build the list of all (psi, theta, phi) orientations on a 15-degree grid
dataset = []
for psi in range(0, 360, 15):
    for theta in range(0, 180, 15):
        for phi in range(0, 360, 15):
            dataset.append([psi, theta, phi])

def main_program(angles):
    psi, theta, phi = angles  # the i-th element of dataset
    result = ...              # run program (placeholder for the docking computation)
    return result             # give output

Now I want to give each element (set of angles) of the dataset to a different thread so that they all run in parallel, using the 1024 threads available in each block.
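
To make the intended mapping concrete, here is a minimal sketch in plain Python (no GPU involved) of how a block index and a thread index could be turned into one set of angles. It only simulates the index arithmetic a kernel would do. The names angles_for_thread, THREADS_PER_BLOCK, and the *_STEPS constants are made up for illustration, and the one-dimensional grid of blocks is an assumption, not a requirement.

```python
import math

STEP = 15
PSI_STEPS = len(range(0, 360, STEP))    # 24
THETA_STEPS = len(range(0, 180, STEP))  # 12
PHI_STEPS = len(range(0, 360, STEP))    # 24
TOTAL = PSI_STEPS * THETA_STEPS * PHI_STEPS  # number of angle sets

THREADS_PER_BLOCK = 1024  # from the question
# Round up so that every angle set gets a thread
NUM_BLOCKS = math.ceil(TOTAL / THREADS_PER_BLOCK)


def angles_for_thread(block_idx, thread_idx):
    """Return the (psi, theta, phi) handled by one thread, or None if idle."""
    # Flat global index, as a 1D kernel would compute it
    i = block_idx * THREADS_PER_BLOCK + thread_idx
    if i >= TOTAL:
        # The last block has more threads than remaining work
        return None
    # Unflatten in the same order as the nested loops (phi varies fastest)
    psi_i, rem = divmod(i, THETA_STEPS * PHI_STEPS)
    theta_i, phi_i = divmod(rem, PHI_STEPS)
    return (psi_i * STEP, theta_i * STEP, phi_i * STEP)


# Simulate every thread of every block and collect the angle sets
assigned = [
    angles_for_thread(b, t)
    for b in range(NUM_BLOCKS)
    for t in range(THREADS_PER_BLOCK)
]
work = [a for a in assigned if a is not None]
```

With a 15-degree step there are 24 * 12 * 24 = 6912 angle sets, so 7 blocks of 1024 threads would cover them, with the surplus threads in the last block doing nothing. The bounds check is what keeps those extra threads from reading past the end of the dataset.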