Hi,
I’m trying to profile my kernel with the NVIDIA OpenCL profiler,
but I can’t find anything about the code needed to enable profiling.
Does anyone know what code or command is needed?
Thanks.
Not sure what you need, but you don’t need to modify your code to use the profiler. Just run it, select your binary and its options, and it activates the hardware counter collection for you.
Actually, I’m trying to use the OpenCL Visual Profiler. It takes an .oclpj file and shows a table of kernel performance data.
I don’t know how to get the .oclpj file from my visual project.
The .oclpj file is the format in which your project is saved after profiling.
If your CUDA app is on the same machine as the profiler, it’s easy to use: just click ‘Start’ and fill in the information about your application.
Otherwise, you will need to do this (Linux syntax):
export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_CONFIG=~/script/cudaprofile.sh
where cudaprofile.sh contains the options and counters you need:
[codebox]#The profiler supports the following options:
#Time stamps for kernel launches and memory transfers.
#This can be used for timeline analysis.
timestamp
#Number of blocks in a grid along the X and Y dimensions for a kernel launch
gridsize
#Number of threads in a block along the X, Y and Z dimensions for a kernel launch
threadblocksize
#Size of dynamically allocated shared memory per block in bytes for a kernel launch
dynsmemperblock
#Size of statically allocated shared memory per block in bytes for a kernel launch
stasmemperblock
#Number of registers used per thread for a kernel launch
regperthread
#Memory transfer direction
#a direction value of 0 is used for host->device memory copies and a value of 1 is used for device->host
memtransferdir
#Memory copy size in bytes
memtransfersize
#Stream Id for a kernel launch
streamid
##The profiler supports logging of following counters during kernel execution
##There is a max of 4 profiler counters
##Non-coalesced (incoherent) global memory loads (always zero on compute capability 1.3)
#gld_incoherent
##Coalesced (coherent) global memory loads
#gld_coherent
##32-byte global memory load transactions
#gld_32b
##64-byte global memory load transactions
gld_64b
##128-byte global memory load transactions
gld_128b
##Global memory loads invalid on compute capability 1.3
#gld_request
##Non-coalesced (incoherent) global memory stores (always zero on compute capability 1.3)
#gst_incoherent
##Coalesced (coherent) global memory stores
#gst_coherent
##32-byte global memory store transactions
#gst_32b
##64-byte global memory store transactions
#gst_64b
##128-byte global memory store transactions
#gst_128b
##Global memory stores invalid on compute capability 1.3
#gst_request
##Local memory loads
#local_load
##Local memory stores
#local_store
##Branches taken by threads executing a kernel
#branch
##Divergent branches taken by threads executing a kernel
divergent_branch
##Instructions executed
instructions
##Number of thread warps that serialize on address conflicts to either shared or constant memory
#warp_serialize
##Number of thread blocks executed
#cta_launched
[/codebox]
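To put the steps above together, here is a minimal sketch of a wrapper script. It assumes the three environment variables behave as described in this thread; the config file location and the `APP` variable are placeholders I made up for illustration, not anything the profiler requires. The config keeps only four counters, to stay within the limit mentioned in the file above.

```shell
#!/bin/sh
# Hypothetical location for the config file; any readable path should do.
config="${TMPDIR:-/tmp}/cudaprofile.cfg"

# One option or counter name per line.
# The last four entries are counters (the file above notes a maximum of 4).
cat > "$config" <<'EOF'
timestamp
gridsize
threadblocksize
gld_64b
gld_128b
divergent_branch
instructions
EOF

# Same variables as in the post above, pointing at the file just written.
export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_CONFIG="$config"

# APP is a placeholder for your own binary; nothing is launched if it is unset.
if [ -n "${APP:-}" ]; then
    "$APP"
fi
```

You would run it with something like `APP=./my_app sh profile.sh`, where both names are examples only. The application has to be started from the same shell (or a child of it) so that it inherits the exported variables.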