There were previous posts like this one about it: http://news.nvidia.com:8080/t/136150/13391286/5749/0/
I haven’t succeeded in using cudaprof with MPI directly through the CUDA Visual Profiler.
But you can generate cudaprof output files in terminal mode with:
export CUDA_PROFILE=1 → enables CUDA profiling
export CUDA_PROFILE_CSV=1 → outputs the profiling data in CSV format, which the Visual Profiler can read
export CUDA_PROFILE_CONFIG=~/script/cudaprofile.sh → sets the values you want to profile (you can only profile 4 hardware counters per run)
[codebox]#The profiler supports the following options:
#Time stamps for kernel launches and memory transfers.
#This can be used for timeline analysis.
timestamp
#Number of blocks in a grid along the X and Y dimensions for a kernel launch
gridsize
#Number of threads in a block along the X, Y and Z dimensions for a kernel launch
threadblocksize
#Size of dynamically allocated shared memory per block in bytes for a kernel launch
dynsmemperblock
#Size of statically allocated shared memory per block in bytes for a kernel launch
stasmemperblock
#Number of registers used per thread for a kernel launch
regperthread
#Memory transfer direction
#a direction value of 0 is used for host->device memory copies and a value of 1 is used for device->host
memtransferdir
#Memory copy size in bytes
memtransfersize
#Stream Id for a kernel launch
streamid
##The profiler supports logging of following counters during kernel execution
##There is a max of 4 profiler counters
##Non-coalesced (incoherent) global memory loads (always zero on compute capability 1.3)
#gld_incoherent
##Coalesced (coherent) global memory loads
#gld_coherent
##32-byte global memory load transactions
#gld_32b
##64-byte global memory load transactions
gld_64b
##128-byte global memory load transactions
gld_128b
##Global memory load requests (valid only on compute capability 1.2 and 1.3)
#gld_request
##Non-coalesced (incoherent) global memory stores (always zero on compute capability 1.3)
#gst_incoherent
##Coalesced (coherent) global memory stores
#gst_coherent
##32-byte global memory store transactions
#gst_32b
##64-byte global memory store transactions
#gst_64b
##128-byte global memory store transactions
#gst_128b
##Global memory store requests (valid only on compute capability 1.2 and 1.3)
#gst_request
##Local memory loads
#local_load
##Local memory stores
#local_store
##Branches taken by threads executing a kernel
#branch
##Divergent branches taken by threads executing a kernel
divergent_branch
##Instructions executed
instructions
##Number of thread warps that serialize on address conflicts to either shared or constant memory
#warp_serialize
##Number of thread blocks executed
#cta_launched
[/codebox]
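Putting the exports and the config file together, a minimal terminal session might look like the sketch below (the config path is just the placeholder used above; the uncommented lines pick 4 counters, the maximum per run, plus the timestamp option, which does not count against the limit):

```shell
# Enable profiling, CSV output, and point the profiler at a config file.
export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_CONFIG=~/script/cudaprofile.sh

# Write a config with timestamps plus 4 hardware counters.
mkdir -p "$(dirname "$CUDA_PROFILE_CONFIG")"
cat > "$CUDA_PROFILE_CONFIG" <<'EOF'
timestamp
gld_64b
gld_128b
divergent_branch
instructions
EOF

# Then launch your MPI program as usual (launcher name depends on your MPI):
# mpirun -np 4 ./my_cuda_app
```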
export CUDA_PROFILE_LOG=output_%d.csv → sets the name of the output file; the %d should give you one file per GPU.
(The %d trick didn’t work for me, so in my MPI program I used the following code:
[codebox]char *gpu_prof_log;
gpu_prof_log = getenv("CUDA_PROFILE_LOG");
if (gpu_prof_log) {
    char tmp[256];
    /* prefix the log name with the MPI rank so each process writes its own file */
    snprintf(tmp, sizeof(tmp), "process%d_%s", mpi_rank, gpu_prof_log);
    setenv("CUDA_PROFILE_LOG", tmp, 1);
}
[/codebox]