I am quite interested in this topic. If you are using only one GPU, it seems fairly simple to me: launching your MPI application through the CUDA Visual Profiler should work, shouldn't it?
The results you get this way should be consistent, since the profiler will select your device's hardware counters.
In the past I tried doing this from the terminal with:
[codebox]export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_CONFIG=~/.cudaprof1.config
[/codebox]
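With several MPI ranks, it may help to give each rank its own log file so they do not all write to the same profiler output. This is only a minimal sketch under assumptions: that the command-line profiler honors `CUDA_PROFILE_LOG`, and that the launcher exports a rank variable (`OMPI_COMM_WORLD_RANK` for Open MPI, `PMI_RANK` for MPICH-style launchers). The function name `profile_log_name` and the log file naming are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build a per-rank profiler log name.
# Assumes the MPI launcher exports OMPI_COMM_WORLD_RANK (Open MPI)
# or PMI_RANK (MPICH-style); falls back to rank 0 when neither is set.
profile_log_name() {
    rank="${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-0}}"
    printf 'cuda_profile_rank%s.csv\n' "$rank"
}

# Same settings as in the post, plus one log file per rank.
export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_CONFIG="$HOME/.cudaprof1.config"
# Assumption: CUDA_PROFILE_LOG selects the profiler output file.
export CUDA_PROFILE_LOG="$(profile_log_name)"
```

Saved as a wrapper script (say `profile_wrapper.sh`, a hypothetical name) with `exec "$@"` added as the last line, it could be launched as `mpirun -np 4 ./profile_wrapper.sh ./my_mpi_app`, where `my_mpi_app` stands for your own binary.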
The .config file lists the options and counters you want to profile. I modified the NVIDIA documentation so that I only have to comment or uncomment what I need in the following file:
[codebox]#The profiler supports the following options:
#Time stamps for kernel launches and memory transfers.
#This can be used for timeline analysis.
timestamp
#Number of blocks in a grid along the X and Y dimensions for a kernel launch
gridsize
#Number of threads in a block along the X, Y and Z dimensions for a kernel launch
threadblocksize
#Size of dynamically allocated shared memory per block in bytes for a kernel launch
dynsmemperblock
#Size of statically allocated shared memory per block in bytes for a kernel launch
stasmemperblock
#Number of registers used per thread for a kernel launch
regperthread
#Memory transfer direction
#a direction value of 0 is used for host->device memory copies and a value of 1 is used for device->host
memtransferdir
#Memory copy size in bytes
memtransfersize
#Stream Id for a kernel launch
streamid
##The profiler supports logging of following counters during kernel execution
##There is a max of 4 profiler counters
##Non-coalesced (incoherent) global memory loads (always zero on compute capability 1.3)
#gld_incoherent
##Coalesced (coherent) global memory loads
#gld_coherent
##32-byte global memory load transactions
gld_32b
##64-byte global memory load transactions
gld_64b
##128-byte global memory load transactions
gld_128b
##Global memory loads (invalid on compute capability 1.3)
#gld_request
##Non-coalesced (incoherent) global memory stores (always zero on compute capability 1.3)
#gst_incoherent
##Coalesced (coherent) global memory stores
#gst_coherent
##32-byte global memory store transactions
gst_32b
##64-byte global memory store transactions
gst_64b
##128-byte global memory store transactions
gst_128b
##Global memory stores (invalid on compute capability 1.3)
#gst_request
##Local memory loads
local_load
##Local memory stores
local_store
##Branches taken by threads executing a kernel
branch
##Divergent branches taken by threads executing a kernel
divergent_branch
##Instructions executed
instructions
##Number of thread warps that serialize on address conflicts to either shared or constant memory
warp_serialize
##Number of thread blocks executed
cta_launched[/codebox]
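The comments in the file mention a maximum of 4 profiler counters, while the file as posted leaves more than 4 counter lines uncommented. I am not sure how the profiler handles the excess, so one option is to generate the file per run and refuse more than 4 counters. A minimal sketch, assuming the limit applies only to the hardware counters and not to the basic options; the function name `write_profile_config` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a profiler config with the basic options
# plus the requested counters, refusing more than 4 counters
# (the limit stated in the config comments).
# Usage: write_profile_config <output-file> [counter ...]
write_profile_config() {
    out="$1"
    shift
    if [ "$#" -gt 4 ]; then
        echo "error: $# counters requested, at most 4 allowed" >&2
        return 1
    fi
    {
        # Basic options (assumed not to count against the limit).
        printf '%s\n' timestamp gridsize threadblocksize \
            dynsmemperblock stasmemperblock regperthread \
            memtransferdir memtransfersize streamid
        # Hardware counters for this run, one per line.
        for counter in "$@"; do
            printf '%s\n' "$counter"
        done
    } > "$out"
}
```

For example, `write_profile_config ~/.cudaprof1.config gld_32b gld_64b gld_128b instructions` would cover the load transactions in one run, and a second run with a different set of 4 could cover the stores or the branch counters.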
Please share your feedback on what you have tried and what worked.
Thanks