In NSight Eclipse, there seems to be a profile reference to ‘/usr/local/cuda/include/cudart’ which doesn’t exist. There is a directory ‘/usr/local/cuda/include/crt’. Should we either rename crt or merely add ‘…/include/cudart’?
I am not entirely sure what “profile” you mean. Is it an include path?
I think I have a similar, if not the exact same issue:
Description: Invalid project path: Include path not found (/usr/local/cuda/include/cudart).
Location: pathentry
Type: Path Entry Problem
My project builds and runs - but the results are much slower than if I were to just run my code in a terminal (running on Scientific Linux). I’m not sure if this is a separate issue or not.
That path entry should have absolutely no bearing on how your project is built. It is only used by the IDE for things like code completion.
The performance issue is concerning. What do you mean by “running slower”? Do you have any benchmarks? Are you sure you build it the same way under the IDE and in bash (i.e., do you pass the -g -G flags in the terminal to generate debug information)? Note that the Console view in the IDE shows how the build is done; please compare it to the commands you use to build your application from the command line.
Please note that the path issue is confirmed and being worked on. The path actually comes from the nvcc.profile file residing in the same folder as nvcc itself. It is used both by the IDE and when calling nvcc from the command line.
Thanks for the quick response.
It runs significantly slower in eclipse, for example I’ll share benchmarks for my first kernel (the rest see a similar slowdown):
In a terminal:
Average time (ms): 5.583680E-02
Average bandwidth (GB/s): 34.979169
In Eclipse:
Average time (ms): 3.214832E-01
Average bandwidth (GB/s): 6.075356
I’m not sure that it’s built the same way in both cases, and that is likely my issue. I don’t pass anything special to the compiler, I just do the following:
When I build my project in Eclipse, the console shows the following:
**** Build of configuration Debug for project ****
make: Nothing to be done for `all’.
**** Build Finished ****
When I go to the profiler it gives a warning saying that I have a “low compute utilization” of 0%, which makes sense given the results above. In the profiler view it lists the Tesla C2050 that the code should be running on and lists things like cudaMemcpy() and cudaEventCreate() but doesn’t list any of my kernels.
I’d greatly appreciate any Nsight you have
(1) The issue with the invalid path mentioned at the beginning of this thread is being looked at.
(2) The performance difference appears to be caused by using a debug build when building from Eclipse (per the build output shown) and a release build when building from the command line (the compiler default is to build with full optimization).
Thanks, I appreciate the help.
So now when I use a release build:
**** Build of configuration Release for project finiteDifference_pencils ****
Building file: …/filename.cu
Invoking: NVCC Compiler
nvcc -O3 -gencode arch=compute_20,code=sm_20 -odir “” -M -o “projectname.d” “…/filename.cu”
nvcc --compile -O3 -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20 -o “filename.o” “…/filename.cu”
Finished building: …/filename.cu
Building target: projectname
Invoking: NVCC Linker
nvcc -link -o “projectname” ./filename.o
Finished building target: projectname
**** Build Finished ****
And my results are better, but still worse than that of the terminal:
Average time (ms): 7.707520E-02
Average bandwidth (GB/s): 25.340512
What I’m more concerned with, though, is how I cannot see any of my kernels in the profiler window. Any ideas as to why this might be the case?
I am not familiar with the Eclipse environment. How are you measuring the execution time? Is it possible it includes some startup cost that is larger when the app is running with Eclipse / Nsight? You might want to ask in the Parallel Nsight forum “next door”, as participants there are more likely to have insight into specific issues with this setup.
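The only remaining difference between your command-line build (plain nvcc with no flags) and the IDE build appears to be the target GPU architecture: the IDE passes -gencode options for compute_20/sm_20, while nvcc on its own uses its default target.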
Please right-click your project and select “Properties”. In the dialog, select Build/CUDA in the tree. Confirm that the drop-down at the top of the dialog says “Release”. Uncheck all PTX and GPU code options and press OK.
Clean your project and build it anew (make sure you use release configuration). Rerun your benchmark.
There is a slight chance the difference you see is caused by the compiler (SM1.x and SM2.x are slightly different). If this is indeed the case, we would appreciate it if you could send us your CUDA code for inspection.
Alternatively, you could compile for SM2.0 from command line:
nvcc -O3 -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20 “…/filename.cu”
(Make sure the path to your source file is correct.) The “-O3” switch sets the optimization level.
The Eclipse IDE does not introduce any additional code when the application is run outside of the debugger (Ctrl+F11).
Can you try the command-line profiler (nvprof) to confirm your environment is properly configured?
The code I’m working with is actually a translation of some CUDA Fortran from Nvidia to CUDA C. I’ve copied the way in which they measure the execution time, which is simply to use cudaEventRecord() around a loop that calls a kernel a bunch of times and averages the results.
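For reference, the timing approach described above might look roughly like the sketch below. The kernel (copyKernel), the array size, and the repetition count are hypothetical placeholders, not the poster's actual code; the warm-up launch is an added assumption so that one-time startup cost is not averaged into the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: copies one array to another.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Effective bandwidth in GB/s for a given byte count and elapsed time in ms.
static double bandwidthGBs(double bytes, double ms)
{
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

int main()
{
    const int n = 1 << 22;       // assumed problem size
    const int reps = 20;         // assumed number of timed launches
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *in = 0, *out = 0;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    // Warm-up launch, excluded from the timing.
    copyKernel<<<blocks, threads>>>(in, out, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events around the whole loop, then average over the launches.
    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; ++r)
        copyKernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has completed

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);
    double avgMs = totalMs / reps;

    // Each launch reads n floats and writes n floats.
    double bytes = 2.0 * n * sizeof(float);
    printf("Average time (ms): %E\n", avgMs);
    printf("Average bandwidth (GB/s): %f\n", bandwidthGBs(bytes, avgMs));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Because the events are recorded on the GPU timeline, this measurement should not include process startup or IDE launch overhead, only the time between the two events.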
I went the route of changing the project properties and the benchmarks match now, thank you! It seems odd that SM2.0 would run slower than SM1.0.
You mean run the profiler in terminal where I ran the code originally before using eclipse? If so, see below
$ nvprof ./a.out
======== NVPROF is profiling a.out…
======== Command: a.out
(Output from my code)
======== Profiling result:
No kernels were profiled.
Seems odd. I’m new to profiling code, so thanks for bearing with me.
There are many possible explanations for this. Some examples:
(1) Different compilers are used for sm_1x and for sm_2x/sm_3x: Open64 for sm_1x, NVVM (an LLVM-based compiler) for sm_2x and sm_3x.
(2) If this is single-precision code: sm_1x always uses flush-to-zero and approximate division, reciprocal, and square root operations, while sm_2x/sm_3x by default provide denormal support and IEEE-rounded division, reciprocal, and square root. These are slower. You can approximate the sm_1x behavior by specifying -ftz=true -prec-sqrt=false -prec-div=false.
(3) If on a 64-bit platform: some pointer optimizations that were valid for sm_1x, since no sm_1x GPU has more than 4GB of memory, no longer apply to sm_2x and higher, which support GPUs with more than 4GB of memory. This leads to a higher instruction count and increased register pressure.
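To see whether the single-precision flags matter for a given kernel, one option is to build the same source twice and compare timings. The sketch below is only an illustration: the kernel (divSqrtKernel), the helper runDemo, and the file name div_sqrt_demo.cu are made up for this example, and -arch=sm_20 is assumed because it matches the Tesla C2050 in this thread (later toolkits dropped that target).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Build with the sm_2x defaults (denormal support, IEEE-rounded div/sqrt):
//   nvcc -O3 -arch=sm_20 -o precise div_sqrt_demo.cu
// Build approximating the sm_1x single-precision behavior:
//   nvcc -O3 -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -o approx div_sqrt_demo.cu

// Hypothetical kernel that exercises single-precision sqrt and division.
__global__ void divSqrtKernel(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sqrtf(a[i]) / b[i];
}

// Runs the kernel once on constant inputs (a = 2, b = 3).
// Returns 0 on success and stores out[0] in *sample.
int runDemo(float *sample)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> hA(n, 2.0f), hB(n, 3.0f);

    float *a = 0, *b = 0, *out = 0;
    if (cudaMalloc(&a, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&b, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&out, bytes) != cudaSuccess) return 1;
    cudaMemcpy(a, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hB.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    divSqrtKernel<<<(n + threads - 1) / threads, threads>>>(a, b, out, n);
    cudaError_t err = cudaDeviceSynchronize();

    // Copy back a single element so the result can be inspected.
    if (err == cudaSuccess)
        err = cudaMemcpy(sample, out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    float sample = 0.0f;
    if (runDemo(&sample) != 0) {
        fprintf(stderr, "CUDA error while running the demo\n");
        return 1;
    }
    printf("out[0] = %f\n", sample);
    return 0;
}
```

Both builds should print nearly the same value; any difference would show up in the last bits of the result and, for division-heavy kernels, in the run time.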
In the nvcc.profile
TOP = $(_HERE_)/…
LD_LIBRARY_PATH += $(TOP)/lib:$(TOP)/extools/lib:
PATH += $(TOP)/open64/bin:$(TOP)/nvvm:$(_HERE_):
INCLUDES += "-I$(TOP)/include" "-I$(TOP)/include/cudart" $(_SPACE_)
LIBRARIES =+ $(_SPACE_) "-L$(TOP)/lib$(_TARGET_SIZE_)" -lcudart
… I changed “-I$(TOP)/include/cudart” in the INCLUDES line to “-I$(TOP)/include/crt”, to conform with the directory structure of CUDA 5.
Can you try creating a new project in Nsight using the “CUDA Runtime Project” template (use File->New->CUDA C/C++ Project)? Try building and profiling that.
One possibility is explained in the “Profiler User’s Guide” which will be released soon (but is not in the preview release):
To reduce profiling overhead, the profiling tools collect and record profile information into internal buffers. These buffers are then flushed asynchronously to disk with low priority to avoid perturbing application behavior. To avoid losing profile information that has not yet been flushed, the application being profiled should call cudaDeviceReset() before exiting. Doing so forces all buffered profile information to be flushed.
With the preview release I would also add cudaDeviceSynchronize() before the cudaDeviceReset() call.
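As a rough sketch of that advice, the end of a profiled program could look like the following. The kernel (scaleKernel) and the helper finishProfiledRun are hypothetical names used only for illustration; the two calls that matter are cudaDeviceSynchronize() and cudaDeviceReset().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel so the profiler has something to record.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Waits for outstanding GPU work, then resets the device so that
// buffered profile information is flushed before the process exits.
cudaError_t finishProfiledRun()
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "sync failed: %s\n", cudaGetErrorString(err));
    cudaError_t resetErr = cudaDeviceReset();
    return (err != cudaSuccess) ? err : resetErr;
}

int main()
{
    const int n = 1 << 20;
    float *data = 0;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(data, 0, n * sizeof(float));

    const int threads = 256;
    scaleKernel<<<(n + threads - 1) / threads, threads>>>(data, 2.0f, n);

    // Free device memory before the reset; the reset destroys the context.
    cudaFree(data);

    return finishProfiledRun() == cudaSuccess ? 0 : 1;
}
```

If nvprof still reports “No kernels were profiled.” with these calls in place, the cause is probably elsewhere, for example in the environment or the build.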