cuda profiler cudaMemcpy linux cuda visual profiler breaks working program

I have a working CUDA program which I would like to profile.
When I run it via NVIDIA Compute Visual Profiler Version 3.2.0,
the profiler's bottom window says that the exe has an “unknown error”
on the line with the first cudaMemcpy(). This comes after cudaMalloc()
etc., which do not report errors. The profiler says my program
“failed, exit code:255”.
As I've said, the same image works fine when used outside the profiler.

BTW why does this interface mess with what I type so badly?

Many thanks
Bill

Some progress. What I did was create a profiler .csv file before running the profiler
and then import this file via the File menu. This avoids trying to run my exe via the profiler.

That is, I use the Linux commands
setenv CUDA_PROFILE 1
setenv CUDA_PROFILE_CSV 1
setenv CUDA_PROFILE_LOG test.csv
setenv CUDA_PROFILE_CONFIG myfile
setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":/usr/local/cuda/computeprof/bin
run my cuda program
and then run /usr/local/cuda/computeprof/bin/computeprof
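
For anyone who prefers to script this rather than type the setenv lines, here is a minimal Python sketch of the same steps. The variable names and values are the ones listed above; the function names profiler_env and run_profiled are just made up for illustration, and I am assuming the program is launched from the directory where test.csv should end up.

```python
import os
import subprocess


def profiler_env(log_file="test.csv", config_file="myfile",
                 prof_lib="/usr/local/cuda/computeprof/bin"):
    """Return a copy of the current environment with the profiler variables set."""
    env = dict(os.environ)
    env["CUDA_PROFILE"] = "1"
    env["CUDA_PROFILE_CSV"] = "1"
    env["CUDA_PROFILE_LOG"] = log_file
    env["CUDA_PROFILE_CONFIG"] = config_file
    # Append the profiler directory, as the setenv LD_LIBRARY_PATH line does.
    old = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = old + ":" + prof_lib if old else prof_lib
    return env


def run_profiled(command):
    """Run the given command (a list of strings) with profiling enabled."""
    return subprocess.run(command, env=profiler_env()).returncode
```

Usage would be something like run_profiled(["./my_cuda_program"]), where the program name is a placeholder, followed by starting computeprof and importing test.csv as before.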

I am of course still not sure what the numbers mean. But attached should be a nice plot showing
the best compute performance I have squeezed out of half of a GTX 295 (265 instructions per microsecond).
I think this is the actual performance for one of the GTX 295's multiprocessor blocks. Given the clock is 1.24 GHz
(i.e. 1240 clock ticks per microsecond), this seems to mean an average of 1 instruction every 4.7 clock ticks.
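
The arithmetic behind that figure, as a small Python sketch. The function name clocks_per_instruction is just for illustration; the inputs are the 265 instructions per microsecond and 1.24 GHz quoted above.

```python
def clocks_per_instruction(instr_per_us, clock_ghz):
    """Convert instructions per microsecond into clock ticks per instruction."""
    clocks_per_us = clock_ghz * 1000.0  # 1 GHz = 1000 clock ticks per microsecond
    return clocks_per_us / instr_per_us


# 1240 ticks per microsecond divided by 265 instructions per microsecond
print(round(clocks_per_instruction(265, 1.24), 1))
```

The same function can be used with any other instructions-per-microsecond figure from the profiler.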

This is for an artificial compute-bound kernel with 0 divergence, 0 warp serialisation (but shared memory is
used), and no use of constants.

In contrast, the kernel I actually want to use works out at about 51 instructions per microsecond.

Does anyone else have figures they are prepared to share?
Bill

ps Under CentOS, computeprof's help seems to create an assistant process which, when computeprof is exited,
often goes into a CPU-hogging loop and has to be killed by hand (kill PID).

pps I should have said the above is for 32 threads per block. It increases to 374 instructions per microsecond
with 96 threads per block (and the same for 128, 256 and 512).

Sorry, could not get “attachments” to work. Instead the picture is here
http://www.cs.ucl.ac.uk/staff/W.Langdon/gpu_gp_1/295GTX_compute.gif
Bill