Computeprof with an MPI application

I have an MPI application that uses GPUs on individual machines to perform some tasks.
I would like to profile the GPU part using computeprof, but I am unable to do so.
In the session settings panel, I set the launch command to mpiexec and the arguments to “-configfile myconfig”.
The myconfig file has the configuration to run the application, in this case just one process.
It fails. Any ideas on how to make it work?
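
For reference, myconfig is just a plain mpiexec configfile along these lines (the executable name here is only a placeholder, and the exact syntax depends on the MPI implementation; this is the MPICH-style format with one executable specification per line):

# one executable specification per line; '#' lines are comments
-n 1 ./my_gpu_app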
Sunil

Under Linux, I profile my MPI/CUDA applications in terminal mode this way:

export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CSV=1
export COMPUTE_PROFILE_CONFIG=~/script/prof_counter1.sh
export COMPUTE_PROFILE_LOG=computeprof1.csv

The prof_counter1.sh file looks like this (I took the profiler counters available for my compute capability 1.3 GPU from Compute_Profiler.txt in /usr/local/cuda/doc):

timestamp               #: Time stamps for kernel launches and memory transfers
gpustarttimestamp       #: Time stamp when kernel starts execution on the GPU
#gpuendtimestamp        #: Time stamp when kernel ends execution on the GPU
gridsize                #: Number of blocks in a grid along the X and Y dimensions
threadblocksize         #: Number of threads in a block along the X, Y and Z dimensions
dynsmemperblock         #: Size of dynamically allocated shared memory per block in bytes
stasmemperblock         #: Size of statically allocated shared memory per block in bytes
regperthread            #: Number of registers used per thread for a kernel launch
memtransferdir          #: Memory transfer direction (0 = host to device, 1 = device to host)
memtransfersize         #: Memory transfer size in bytes
memtransferhostmemtype  #: Host memory type (pageable or page-locked)
streamid                #: Stream Id for a kernel launch
local_load              #: Number of executed local load instructions per warp in a SM
local_store             #: Number of executed local store instructions per warp in a SM
gld_request             #: Number of executed global load instructions per warp in a SM
gst_request             #: Number of executed global store instructions per warp in a SM
#divergent_branch       #: Number of unique branches that diverge
#branch                 #: Number of unique branch instructions in program
#sm_cta_launched        #: Number of thread blocks executed on a SM
#gld_incoherent         #: Non-coalesced (incoherent) global memory loads
#gld_coherent           #: Coalesced (coherent) global memory loads
#gld_32b                #: 32-byte global memory load transactions
#gld_64b                #: 64-byte global memory load transactions
#gld_128b               #: 128-byte global memory load transactions
#gst_incoherent         #: Non-coalesced (incoherent) global memory stores
#gst_coherent           #: Coalesced (coherent) global memory stores
#gst_32b                #: 32-byte global memory store transactions
#gst_64b                #: 64-byte global memory store transactions
#gst_128b               #: 128-byte global memory store transactions
#instructions           #: Instructions executed
#warp_serialize         #: Number of thread warps that serialize on address conflicts
#cta_launched           #: Number of thread blocks executed
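
With those variables exported, you launch as usual; whether the environment actually reaches the ranks depends on your MPI implementation (Open MPI's mpirun, for instance, can forward variables with -x). Something along these lines, where my_gpu_app is only a placeholder:

mpiexec -n 1 ./my_gpu_app
# with Open MPI you may need e.g.:
# mpirun -n 1 -x COMPUTE_PROFILE -x COMPUTE_PROFILE_CSV -x COMPUTE_PROFILE_CONFIG -x COMPUTE_PROFILE_LOG ./my_gpu_app

When the run finishes, the profiler writes computeprof1.csv into the working directory.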

But be aware that all your MPI processes will try to write to the same “computeprof1.csv”. One workaround I used was to call the setenv() function in C and set COMPUTE_PROFILE_LOG to a value that includes the MPI rank.
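
A minimal sketch of that trick (the file name pattern and the device choice are placeholders; the setenv() has to happen before the first CUDA call so the new value is read when the context is created):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    char logname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one CSV per rank: computeprof_rank0.csv, computeprof_rank1.csv, ... */
    snprintf(logname, sizeof(logname), "computeprof_rank%d.csv", rank);
    setenv("COMPUTE_PROFILE_LOG", logname, 1);  /* 1 = overwrite any existing value */

    /* first CUDA call creates the context; the profiler picks up the variable here */
    cudaSetDevice(0);  /* placeholder: select the device as your application does */

    /* ... kernels and memory transfers go here and land in the per-rank CSV ... */

    cudaDeviceReset();  /* destroys the context so the profiler flushes its log */
    MPI_Finalize();
    return 0;
}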

Good luck!