I have an MPI application which uses the GPUs on individual machines to perform some tasks.
I would like to profile the GPU part using computeprof, but I am unable to do so.
In the session settings panel, I use mpiexec as the launch command and “-configfile myconfig” as the arguments.
The myconfig file contains the configuration to run the application, in this case with just 1 process.
It fails. Any ideas on how to make it work?
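For reference, myconfig is just a standard mpiexec configfile with the launch arguments on a line; roughly something like this (the executable name here is only a placeholder, not my real path):

-n 1 ./my_cuda_app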
Sunil
Under Linux, I profile my MPI/CUDA applications in terminal mode this way:
export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CSV=1
export COMPUTE_PROFILE_CONFIG=~/script/prof_counter1.sh
export COMPUTE_PROFILE_LOG=computeprof1.csv
The prof_counter1.sh file looks like this (I took the profiler counters available for my compute capability 1.3 GPU from Compute_Profiler.txt in /usr/local/cuda/doc):
timestamp #: Time stamps for kernel launches and memory transfers.
gpustarttimestamp #: Time stamp when kernel starts execution in GPU.
#gpuendtimestamp #: Time stamp when kernel ends execution in GPU.
gridsize #: Number of blocks in a grid along the X and Y dimensions
threadblocksize #: Number of threads in a block along the X, Y and Z dimensions
dynsmemperblock #: Size of dynamically allocated shared memory per block in bytes
stasmemperblock #: Size of statically allocated shared memory per block in bytes
regperthread #: Number of registers used per thread for a kernel launch.
memtransferdir #: Memory transfer direction, a value of 0 is used for host->device and 1 for device->host transfers
memtransfersize #: Memory transfer size in bytes
memtransferhostmemtype #: Host memory type (pageable or page-locked)
streamid #: Stream Id for a kernel launch
local_load #: Number of executed local load instructions per warp in a SM
local_store #: Number of executed local store instructions per warp in a SM
gld_request #: Number of executed global load instructions per warp in a SM
gst_request #: Number of executed global store instructions per warp in a SM
#divergent_branch #: Number of unique branches that diverge
#branch #: Number of unique branch instructions in program
#sm_cta_launched #: Number of thread blocks executed on a SM
#gld_incoherent #: Non-coalesced (incoherent) global memory loads
#gld_coherent #: Coalesced (coherent) global memory loads
#gld_32b #: 32-byte global memory load transactions
#gld_64b #: 64-byte global memory load transactions
#gld_128b #: 128-byte global memory load transactions
#gst_incoherent #: Non-coalesced (incoherent) global memory stores
#gst_coherent #: Coalesced (coherent) global memory stores
#gst_32b #: 32-byte global memory store transactions
#gst_64b #: 64-byte global memory store transactions
#gst_128b #: 128-byte global memory store transactions
#instructions #: Instructions executed
#warp_serialize #: Number of thread warps that serialize on address conflicts
#cta_launched #: Number of thread blocks executed
But you must take care: all your MPI processes will try to write to the same “computeprof1.csv”. One workaround I used was to call the setenv() function in C and set COMPUTE_PROFILE_LOG to a value which included the MPI rank.
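A minimal sketch of that workaround (the log file name is just my choice, and as far as I remember the profiler picks up the variable when the CUDA context is created, so the setenv() call has to happen before the first CUDA call in each process):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    char logname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one profiler log per MPI rank: computeprof_0.csv, computeprof_1.csv, ... */
    snprintf(logname, sizeof(logname), "computeprof_%d.csv", rank);
    setenv("COMPUTE_PROFILE_LOG", logname, 1);  /* 1 = overwrite an existing value */

    /* ... cudaSetDevice(), kernel launches, etc. go after this point ... */

    MPI_Finalize();
    return 0;
}

Each rank then writes its own CSV file, which you can look at separately.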
Good luck!