cudaErrorNoDevice when submitting an MPI job to multiple nodes

Hi,
I’m testing MPI and CUDA performance on a RHEL7 cluster in which each node has four P100 cards. One odd issue is that the GPUs are not detected when I submit an MPI job that spans multiple nodes; single-node jobs are fine. For example,
bsub -q devgpu.q -m "devicegpu01" -n 2 mpirun my_program
works as expected, but the corresponding two-node command
bsub -q devgpu.q -m "devicegpu01 devicegpu02" -R "span[ptile=1]" -n 2 mpirun my_program
fails with cudaErrorNoDevice errors.
My other programs, which do not use CUDA, run fine with both of the above commands, so I suspect the problem is in my CUDA setup rather than in MPI or the scheduler. Do you have any suggestions?
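
One way to narrow this down might be to run a small check in place of my_program under both submissions and compare what each rank sees. The following is only a minimal sketch in Python, not a confirmed diagnosis. The function name gpu_env_report is hypothetical, and it assumes the NVIDIA driver exposes its devices as /dev/nvidia0, /dev/nvidia1, and so on, as is usual on Linux. It reports the host name, the value of CUDA_VISIBLE_DEVICES (an empty value hides every GPU from the CUDA runtime), and the device files visible to the process.

```python
import glob
import os
import socket


def gpu_env_report(environ=None, dev_glob="/dev/nvidia[0-9]*"):
    """Collect what this process sees that affects CUDA device discovery.

    Hypothetical helper for illustration; dev_glob assumes the usual
    Linux naming of NVIDIA device files (/dev/nvidia0, /dev/nvidia1, ...).
    """
    environ = os.environ if environ is None else environ
    return {
        "host": socket.gethostname(),
        # None means the variable is unset; an empty string hides all GPUs.
        "CUDA_VISIBLE_DEVICES": environ.get("CUDA_VISIBLE_DEVICES"),
        # Per-GPU device files that this process can see on this node.
        "device_files": sorted(glob.glob(dev_glob)),
    }


if __name__ == "__main__":
    # Each MPI rank prints one line, so the hosts can be compared directly.
    print(gpu_env_report())
```

If this is saved to a file and launched through mpirun with the same bsub options as above, each rank prints one line. If a rank on one host shows an empty CUDA_VISIBLE_DEVICES or no device files, that would point to the job environment or that node's driver setup rather than to my_program.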
Thanks.