Profiling MPI + CUDA

How can I profile CUDA kernels that run inside an MPI program?