Problem with cudaprof when executing a multi-process program

Hi,

I am trying to profile a multi-process CUDA program. The multi-process management is coded with MPI (OpenMPI). For now, it is a test program, which is pretty simple: each process does the same thing, a sequence of cudaMemcpy calls and kernel launches. These CUDA functions are executed in parallel by each process, leaving the scheduling to the CUDA runtime.
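
For reference, here is a minimal sketch of the kind of test program I mean (the kernel, names, sizes and build line below are placeholders, not my actual code):

[codebox]// Minimal sketch of the test case (illustrative only: kernel, sizes and
// names are placeholders). Build e.g.: nvcc -ccbin mpicxx test.cu -o test
// (or however you usually combine MPI and nvcc).
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void dummy_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    /* Each process runs the same sequence of copies and kernel launches;
       the scheduling on the (shared) GPU is left to the CUDA runtime. */
    for (int iter = 0; iter < 10; ++iter) {
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    }

    printf("rank %d done\n", rank);
    cudaFree(d);
    free(h);
    MPI_Finalize();
    return 0;
}[/codebox]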

What I would like to do is visualize this scheduling. The “GPU Time Width Plot” in cudaprof seemed to be the right tool for that, but after many trials, it seems that cudaprof does not support multi-process programs:

[list=1]

[*] cudaprof does not always finish the execution: sometimes it displays the following error: “Error in reading profiler output.” I assume there is a race condition when the processes write to the profiler output file.

[*] When it does finish, the “GPU Time Width Plot” only shows one process. And when I look at the “Summary Table”, I can see that it only takes into account the CUDA functions of one process.

I have several questions:

[list=a]

[*] am I missing something here?

[*] is there any option I could set in cudaprof to make it work?

[*] is there an NVIDIA reference saying that cudaprof is not designed to do that, and/or a “work in progress” reference about this issue?

[*] is there another profiling tool that can do this? (a Linux tool if possible)

Thank you

There were previous posts like this one about it: http://news.nvidia.com:8080/t/136150/13391286/5749/0/

I haven’t succeeded in using cudaprof with MPI directly in the CUDA Visual Profiler.

However, you can generate the cudaprof files in terminal mode with:

export CUDA_PROFILE=1 → enables CUDA profiling

export CUDA_PROFILE_CSV=1 → outputs the profiling data in CSV format, which is readable by the Visual Profiler

export CUDA_PROFILE_CONFIG=~/script/cudaprofile.sh → sets the values you want to profile (since you can only profile 4 hardware counters per run)

[codebox]#The profiler supports the following options:

#Time stamps for kernel launches and memory transfers.
#This can be used for timeline analysis.
timestamp

#Number of blocks in a grid along the X and Y dimensions for a kernel launch
gridsize

#Number of threads in a block along the X, Y and Z dimensions for a kernel launch
threadblocksize

#Size of dynamically allocated shared memory per block in bytes for a kernel launch
dynsmemperblock

#Size of statically allocated shared memory per block in bytes for a kernel launch
stasmemperblock

#Number of registers used per thread for a kernel launch
regperthread

#Memory transfer direction
#a direction value of 0 is used for host->device memory copies and a value of 1 is used for device->host
memtransferdir

#Memory copy size in bytes
memtransfersize

#Stream ID for a kernel launch
streamid

##The profiler supports logging of the following counters during kernel execution.
##There is a maximum of 4 profiler counters per run.

##Non-coalesced (incoherent) global memory loads (always zero on compute capability 1.3)
#gld_incoherent

##Coalesced (coherent) global memory loads
#gld_coherent

##32-byte global memory load transactions
#gld_32b

##64-byte global memory load transactions
gld_64b

##128-byte global memory load transactions
gld_128b

##Global memory loads (not valid on compute capability 1.3)
#gld_request

##Non-coalesced (incoherent) global memory stores (always zero on compute capability 1.3)
#gst_incoherent

##Coalesced (coherent) global memory stores
#gst_coherent

##32-byte global memory store transactions
#gst_32b

##64-byte global memory store transactions
#gst_64b

##128-byte global memory store transactions
#gst_128b

##Global memory stores (not valid on compute capability 1.3)
#gst_request

##Local memory loads
#local_load

##Local memory stores
#local_store

##Branches taken by threads executing a kernel
#branch

##Divergent branches taken by threads executing a kernel
divergent_branch

##Instructions executed
instructions

##Number of thread warps that serialize on address conflicts to either shared or constant memory
#warp_serialize

##Number of thread blocks executed
#cta_launched
[/codebox]

export CUDA_PROFILE_LOG=output_%d.csv → sets the name of the output file; the %d should allow you to have one file per GPU.

(The %d trick didn’t work for me, so in my MPI program I used the following code:

[codebox]/* Prepend the MPI rank to the profiler log file name so that each
   process writes its own CSV file (requires <stdio.h> and <stdlib.h>;
   mpi_rank comes from MPI_Comm_rank). */
char *gpu_prof_log = getenv("CUDA_PROFILE_LOG");

if (gpu_prof_log) {
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "process%d_%s", mpi_rank, gpu_prof_log);
    setenv("CUDA_PROFILE_LOG", tmp, 1);
}
[/codebox]
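
Note: as far as I understand, the profiler picks up CUDA_PROFILE_LOG when the CUDA context is created, so this setenv() has to run after MPI_Init() (so that mpi_rank is known) but before the first CUDA runtime call in the process.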