Using the CUDA Visual Profiler with MPI?

Hi,

I have an MPI code that uses GPUs. I need to profile the code, and was wondering how one does this with the CUDA Visual Profiler; it doesn't seem obvious how to do it. My code in question uses only a single GPU, but it has a separate process running simultaneously.

Thanks!

I am quite interested in this topic. If you are using only one GPU, it seems to me that it is quite simple: launching your MPI application through the CUDA Visual Profiler should work, shouldn't it?

The results you obtain this way should be consistent, since the profiler will read your device's hardware counters.

In the past, I tried doing this in terminal mode with:

[codebox]export CUDA_PROFILE=1

export CUDA_PROFILE_CSV=1

export CUDA_PROFILE_CONFIG=~/.cudaprof1.config

[/codebox]
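
One thing to watch with MPI: if I remember correctly, the default log name (cuda_profile_%d.log) only varies with the device number, not with the process, so several ranks on the same node can end up writing to the same file. A possible workaround, just a sketch assuming Open MPI (which exports OMPI_COMM_WORLD_RANK to every launched process; the wrapper name below is made up), is to set CUDA_PROFILE_LOG per rank in a small launch script:

[codebox]#!/bin/bash
# profile_wrap.sh -- hypothetical wrapper giving each MPI rank its own profiler log
# (assumes Open MPI, which exports OMPI_COMM_WORLD_RANK to every launched process)

export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_CONFIG=~/.cudaprof1.config

# one CSV log per rank so the processes do not overwrite each other's output
export CUDA_PROFILE_LOG=cuda_profile_rank${OMPI_COMM_WORLD_RANK}.csv

# run the real application with whatever arguments were passed to the wrapper
exec "$@"
[/codebox]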

The .config file contains the counters you want to profile. I modified the list from the NVIDIA documentation so that I just have to comment or uncomment what I need in the following file:

[codebox]#The profiler supports the following options:

#Time stamps for kernel launches and memory transfers.

#This can be used for timeline analysis.

timestamp

#Number of blocks in a grid along the X and Y dimensions for a kernel launch

gridsize

#Number of threads in a block along the X, Y and Z dimensions for a kernel launch

threadblocksize

#Size of dynamically allocated shared memory per block in bytes for a kernel launch

dynsmemperblock

#Size of statically allocated shared memory per block in bytes for a kernel launch

stasmemperblock

#Number of registers used per thread for a kernel launch

regperthread

#Memory transfer direction

#a direction value of 0 is used for host->device memory copies and a value of 1 is used for device->host

memtransferdir

#Memory copy size in bytes

memtransfersize

#Stream Id for a kernel launch

streamid

##The profiler supports logging of the following counters during kernel execution

##There is a max of 4 profiler counters

##Non-coalesced (incoherent) global memory loads (always zero on compute capability 1.3)

#gld_incoherent

##Coalesced (coherent) global memory loads

#gld_coherent

##32-byte global memory load transactions

gld_32b

##64-byte global memory load transactions

gld_64b

##128-byte global memory load transactions

gld_128b

##Global memory loads invalid on compute capability 1.3

#gld_request

##Non-coalesced (incoherent) global memory stores (always zero on compute capability 1.3)

#gst_incoherent

##Coalesced (coherent) global memory stores

#gst_coherent

##32-byte global memory store transactions

gst_32b

##64-byte global memory store transactions

gst_64b

##128-byte global memory store transactions

gst_128b

##Global memory stores invalid on compute capability 1.3

#gst_request

##Local memory loads

local_load

##Local memory stores

local_store

##Branches taken by threads executing a kernel

branch

##Divergent branches taken by threads executing a kernel

divergent_branch

##Instructions executed

instructions

##Number of thread warps that serialize on address conflicts to either shared or constant memory

warp_serialize

##Number of thread blocks executed

cta_launched[/codebox]
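
With the wrapper approach above, a profiled run would then look roughly like this (again just a sketch; Open MPI syntax, and ./my_mpi_app is a placeholder for your binary):

[codebox]# launch through the wrapper so the profiler variables are set per rank
mpirun -np 2 ./profile_wrap.sh ./my_mpi_app

# one CSV log per rank is left behind; only the rank that actually runs
# kernels should contain kernel and memcpy entries
ls cuda_profile_rank*.csv[/codebox]

From there the CSV logs can be read directly or post-processed however you like.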

Please share your feedback on what you have experimented with and what has worked for you.

Thanks