I am implementing protein docking. Since it takes quite a long time to run, I want to move the computation to CUDA. The code runs independently for each set of Euler angles (psi, theta, phi). Until now I have used Python's multiprocessing module to run the code in parallel, one task per set of angles.
1. How can I implement the same thing using CUDA?
2. Can I distribute my calculations among threads so that each thread gives me an independent result?
I have 1024 threads per block.
MY_CODE:

    dataset = []
    for psi in range(0, 360, 15):
        for theta in range(0, 180, 15):
            for phi in range(0, 360, 15):
                dataset.append([psi, theta, phi])

    def main_program(angles):
        # angles is the i-th element of dataset, i.e. [psi, theta, phi]
        psi, theta, phi = angles
        # run program
        # give output
        pass
Now I want to hand each element of the dataset (one set of angles) to a different thread so that the sets are processed in parallel, using the 1024 threads available in each block.
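
To make the intended mapping concrete, here is a minimal sketch in plain Python (not GPU code) of how a block index and a thread index could be turned into one set of angles. The function names (total_work_items, blocks_needed, angles_for_thread) are hypothetical and only for illustration; the sketch assumes a one-dimensional launch where the global index is block index times threads per block plus thread index, and that threads past the end of the dataset simply do nothing.

```python
import math

STEP = 15                   # angle step in degrees, as in the loops above
N_PSI = 360 // STEP         # 24 psi values
N_THETA = 180 // STEP       # 12 theta values
N_PHI = 360 // STEP         # 24 phi values
THREADS_PER_BLOCK = 1024    # from the question


def total_work_items():
    """Number of (psi, theta, phi) sets, i.e. len(dataset)."""
    return N_PSI * N_THETA * N_PHI


def blocks_needed(n_items, threads_per_block=THREADS_PER_BLOCK):
    """Smallest number of blocks that gives every item its own thread."""
    return math.ceil(n_items / threads_per_block)


def angles_for_thread(block_idx, thread_idx,
                      threads_per_block=THREADS_PER_BLOCK):
    """Return (psi, theta, phi) for one thread, or None if it has no work."""
    # Assumed 1-D indexing: global index = block * block size + thread.
    i = block_idx * threads_per_block + thread_idx
    if i >= total_work_items():
        return None  # surplus thread in the last block
    # Unflatten i in the same order as the nested loops (phi varies fastest).
    psi_i, rem = divmod(i, N_THETA * N_PHI)
    theta_i, phi_i = divmod(rem, N_PHI)
    return (psi_i * STEP, theta_i * STEP, phi_i * STEP)
```

With these step sizes there are 24 * 12 * 24 = 6912 sets of angles, so 7 blocks of 1024 threads would cover them, with the last block only partly used. In an actual kernel, the same index arithmetic would presumably be done on the device and main_program would become the kernel body.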