Some progress. What I did was create a profiler .csv file before running the profiler
and then import this file via the File menu. This avoids trying to run my exe via the profiler.
That is, I use the Linux commands
setenv CUDA_PROFILE 1
setenv CUDA_PROFILE_CSV 1
setenv CUDA_PROFILE_LOG test.csv
setenv CUDA_PROFILE_CONFIG myfile
setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":/usr/local/cuda/computeprof/bin
run my CUDA program
and then run /usr/local/cuda/computeprof/bin/computeprof
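For anyone on bash rather than csh, the same setup would look like the sketch below (the `setenv` lines translated to `export`; paths as in the post, the program and launch commands left as comments since they depend on your own executable):

```shell
# bash equivalents of the csh setenv commands above
export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_LOG=test.csv
export CUDA_PROFILE_CONFIG=myfile
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH":/usr/local/cuda/computeprof/bin

# then run your program and open the resulting test.csv via computeprof's File menu:
#   ./my_cuda_program                              # hypothetical name for your exe
#   /usr/local/cuda/computeprof/bin/computeprof
```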
I am of course still not sure what the numbers mean, but attached should be a nice plot showing
the best compute performance I have squeezed out of half of a GTX 295 (265 instructions per microsecond).
I think this is the actual performance for one of the GTX 295's multiprocessor blocks. Given the clock is 1.24 GHz,
this seems to mean an average of one instruction every 4.7 clock ticks.
This is for an artificial compute-bound kernel with zero divergence and zero warp serialisation (shared memory is
used, but no constants).
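For the record, the 4.7 figure is just the shader clock divided by the measured throughput; a quick sanity check of that arithmetic (numbers taken from above):

```shell
# 1.24 GHz shader clock = 1240 clock ticks per microsecond;
# divide by the measured 265 instructions per microsecond
# to get average clock ticks per instruction.
awk -v clocks_per_us=1240 -v instr_per_us=265 \
    'BEGIN { printf "%.1f clocks per instruction\n", clocks_per_us / instr_per_us }'
# prints: 4.7 clocks per instruction
```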
In contrast, the kernel I actually want to use works out at about 51 instructions per microsecond.
Does anyone else have figures they are prepared to share?
PS Under CentOS, computeprof's help seems to create an assistant process which, when computeprof is exited,
often goes into a CPU-hogging loop and has to be killed by hand (kill PID).
PPS I should have said the above is for 32 threads per block. It increases to 374 instructions per microsecond
with 96 threads per block (and the same for 128, 256 and 512).
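Applying the same arithmetic (mine, not the profiler's) to the 96-thread figure gives roughly one instruction every 3.3 clock ticks:

```shell
# same calculation as before, with the 96-threads-per-block throughput
awk -v clocks_per_us=1240 -v instr_per_us=374 \
    'BEGIN { printf "%.1f clocks per instruction\n", clocks_per_us / instr_per_us }'
# prints: 3.3 clocks per instruction
```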