I am implementing protein docking. Since it takes quite a long time, I want to switch to CUDA. For a particular set of euler angles(psi,theta,phi), the code runs independently. Earlier I used multiprocessing modules of python to run the code parallel for every set of angles.

1.How to implement the same using CUDA?

2.Can I distribute my calculations among threads so that each thread gives me an independent result?

I have 1024 threads per block.

MY_CODE:

```
for psi in range(0,360,15):
for theta in range(0,180,15):
for phi in range(0,360,15):
dataset.append([psi, theta, phi])
main_program( ith element of dataset i.e [psi,theta,phi] ) {
run program
give output
}
```

Now I want to provide each element( set of angles) of the dataset to different threads and make it parallel where I can use 1024 threads of each block.