I am currently using gprof to profile a Linux C program. I have identified some functions (subprograms) in this system that would be amenable to a CUDA rewrite. It seems that the CUDA profiler only works with CUDA code and gprof only works with C code.
Is there a way to get gprof to work with a program that is mostly C, but has a few routines written in CUDA? I want an overall picture of the program’s operation, not just the C or CUDA parts separately. I have seen on this forum and in my Google searches that gprof can work with a program partly written in CUDA.
How is this done, and does gprof lose any of its functionality in the process?
Newport_j
No, gprof cannot be used to profile code running on the GPU. You can, however, use CUDA profiling and gprof together:
avidday@cuda:~$ nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu
./pi.cu(41): Advisory: Loop was not unrolled, cannot deduce loop trip count
avidday@cuda:~$ CUDA_PROFILE=1 ./pi
0 0 3.141297500000
1 57600000 3.141359513889
2 115200000 3.141452916667
3 172800000 3.141435989583
4 230400000 3.141470236111
5 288000000 3.141491261574
6 345600000 3.141498273810
7 403200000 3.141496935764
8 460800000 3.141521033951
9 518400000 3.141504215278
avidday@cuda:~$ cat cuda_profile_1.log
# CUDA_PROFILE_LOG_VERSION 1.6
# CUDA_DEVICE 1 GeForce GTX 275
# TIMESTAMPFACTOR fffff733c90896d8
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 4.000 ] cputime=[ 2.000 ]
method=[ memcpyHtoD ] gputime=[ 132894.719 ] cputime=[ 133219.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4064.640 ] cputime=[ 14.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 21.000 ]
method=[ memcpyHtoD ] gputime=[ 133782.656 ] cputime=[ 134101.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4061.600 ] cputime=[ 7.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.584 ] cputime=[ 22.000 ]
method=[ memcpyHtoD ] gputime=[ 133944.547 ] cputime=[ 134277.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4060.448 ] cputime=[ 6.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 21.000 ]
method=[ memcpyHtoD ] gputime=[ 133849.594 ] cputime=[ 134171.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4058.368 ] cputime=[ 7.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 20.000 ]
method=[ memcpyHtoD ] gputime=[ 133888.797 ] cputime=[ 134206.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4061.664 ] cputime=[ 6.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 22.000 ]
method=[ memcpyHtoD ] gputime=[ 133583.688 ] cputime=[ 133899.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4063.904 ] cputime=[ 6.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 22.000 ]
method=[ memcpyHtoD ] gputime=[ 133613.859 ] cputime=[ 133921.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4057.984 ] cputime=[ 6.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 21.000 ]
method=[ memcpyHtoD ] gputime=[ 133725.469 ] cputime=[ 134040.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4059.744 ] cputime=[ 6.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 20.000 ]
method=[ memcpyHtoD ] gputime=[ 134345.766 ] cputime=[ 134658.016 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4058.976 ] cputime=[ 7.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.360 ] cputime=[ 21.000 ]
method=[ memcpyHtoD ] gputime=[ 134613.188 ] cputime=[ 134934.000 ]
method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4064.352 ] cputime=[ 7.000 ] occupancy=[ 0.750 ]
method=[ memcpyDtoH ] gputime=[ 3.488 ] cputime=[ 21.000 ]
avidday@cuda:~$ gprof -C ./pi
/home/avidday/pi.cu:51: (_Z15generateSamplesjP6float2:0x41baec) 11 executions
/opt/cuda-3.0/bin/../include/cuda_runtime.h:760: (_Z10cudaLaunchIcE9cudaErrorPT_:0x41bfb5) 10 executions
/opt/cuda-3.0/bin/../include/vector_types.h:523: (_ZN4dim3C1Ejjj:0x41bf46) 20 executions
/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_Z41__static_initialization_and_destruction_0ii:0x410f17) 1 executions
/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_GLOBAL__I__Z15generateSamplesjP6float2:0x410f59) 1 executions
/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_ZL74__sti____cudaRegisterAll_37_tmpxft_00004446_00000000_4_pi_cpp1_ii_d2d32138v:0x410f73) 1 executions
/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:9: (_Z46__device_stub__Z12withinCircleP6float2P6ulong2P6float2P6ulong2:0x41101d) 10 executions
/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:11: (_Z12withinCircleP6float2P6ulong2:0x411095) 10 executions
/usr/include/c++/4.3/iomanip:98: (_ZSt11setiosflagsSt13_Ios_Fmtflags:0x41bf7f) 10 executions
/usr/include/c++/4.3/iomanip:209: (_ZSt12setprecisioni:0x41bf9a) 10 executions
There you see a trivial test kernel compiled for host profiling, and then run with CUDA profiling enabled, which produces separate call statistics for both device and host code.
In the line:
nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu
I know what everything means except -arch=sm_13. Google did not help.
What does it have to do with the compilation?
-g and -pg are used for profiling with the GNU toolchain, but why put them in quotes?
Newport_j
The “-arch” flag specifies the NVIDIA GPU architecture to compile for, and “sm_13” corresponds to compute capability 1.3 devices, so you are free to change this to match the architecture of the device you are using. The “-Xcompiler” flag passes options through to the so-called “host” compiler (gcc in this particular case), which is the compiler nvcc uses to compile the code that executes on the CPU, once the segments to be compiled for the GPU have been extracted. The gprof-related options passed to gcc have to be quoted so that the shell interprets them as a single token; otherwise it would pass them to nvcc as two separate tokens, and nvcc would in turn think that “-g” should be passed to gcc while “-pg” was an option intended for nvcc itself.
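The shell-level effect of the quoting can be seen without invoking nvcc at all; here printf simply prints each argument it receives as a separate bracketed token, which is exactly how nvcc would see them:

```shell
# With quotes: "-g -pg" is a single token, so -Xcompiler forwards
# both options to gcc together.
printf '[%s] ' -Xcompiler "-g -pg" -arch=sm_13; echo
# [-Xcompiler] [-g -pg] [-arch=sm_13]

# Without quotes: -g and -pg arrive as separate tokens, so -Xcompiler
# would only pick up -g, and nvcc would try to interpret -pg itself.
printf '[%s] ' -Xcompiler -g -pg -arch=sm_13; echo
# [-Xcompiler] [-g] [-pg] [-arch=sm_13]
```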
A search of the CUDA documentation will turn it up. But to save you the trouble: it means compile for the compute 1.3 architecture. Quoting directly from the output of nvcc:
--gpu-architecture <gpu architecture name> (-arch)
Specify the name of the class of nVidia GPU architectures for which the cuda
input files must be compiled.
With the exception as described for the shorthand below, the architecture
specified with this option must be a virtual architecture (such as compute_10),
and it will be the assumed architecture during the nvopencc compilation stage.
This option will cause no code to be generated (that is the role of nvcc
option '--gpu-code', see below); rather, its purpose is to steer the nvopencc
stage, influencing the architecture of the generated ptx intermediate.
For convenience in case of simple nvcc compilations the following shorthand
is supported: if no value for option '--gpu-code' is specified, then the
value of this option defaults to the value of '--gpu-architecture'. In this
situation, as only exception to the description above, the value specified
for '--gpu-architecture' may be a 'real' architecture (such as a sm_13),
in which case nvcc uses the closest virtual architecture as effective architecture
value. For example, 'nvcc -arch=sm_13' is equivalent to 'nvcc -arch=compute_13
-code=sm_13'.
Allowed values for this option: 'compute_10','compute_11','compute_12','compute_13',
'compute_20','sm_10','sm_11','sm_12','sm_13','sm_20'.
Because I want to ensure both of them are passed to gcc when it compiles the host code, and quoting is the way to achieve that: it ensures both are treated as a single argument to nvcc’s -Xcompiler option.
When using the CUDA_PROFILE=1 flag, am I able to define where the log is stored?
Try nvprof. It is new in CUDA 5 and provides an experience similar to gprof for CUDA kernels - useful for quick benchmarks when you don’t need all the capabilities of the visual profiler.