Device management between 8 cores and 4 GPUs

I have five boxes, each with eight cores and four GPU cards (in addition to the video card), for a total of 40 cores and 20 GPUs.
I’m using MPI to run parallel code on the cluster.
I want to be able to use all 40 cores in parallel, with each core executing a kernel (roughly) simultaneously. I have not been able to find an example of how to use the device management API calls to use the devices as they become available. (Also, devices 0, 2, 3, and 4 are the GPU cards that I want to use, but device 1 is the video card that I DON'T want to use.)
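
For illustration only, here is a minimal sketch of one possible static mapping, not a tested solution for this cluster. It assumes each MPI process knows its rank within its own box (for example from `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`), and that the chosen ID would then be passed to `cudaSetDevice` before launching any kernel. The helper name `pick_device` is hypothetical. With eight processes and four usable GPUs per box, two processes end up sharing each GPU. This does not pick devices dynamically "as they are available"; it only spreads the processes evenly and skips device 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: choose a GPU for a process from its rank on the local box.
// usable_devices lists the device IDs to use, e.g. {0, 2, 3, 4}
// (device 1, the video card, is simply left out of the list).
int pick_device(int local_rank, const std::vector<int>& usable_devices) {
    assert(local_rank >= 0);
    assert(!usable_devices.empty());
    // Round-robin: local ranks 0..7 map onto the 4 usable devices twice over.
    std::size_t slot = static_cast<std::size_t>(local_rank) % usable_devices.size();
    return usable_devices[slot];
}

// In the real program (assumption, not shown here so the sketch stays standalone):
//   int device = pick_device(local_rank, {0, 2, 3, 4});
//   cudaSetDevice(device);   // then allocate memory and launch kernels as usual
```

With the list `{0, 2, 3, 4}`, local ranks 0 through 7 would get devices 0, 2, 3, 4, 0, 2, 3, 4 in that order.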