Using gprof with CUDA: can this profiler be used with Linux C and CUDA?

I am currently using gprof when I want to profile a Linux C program. I have identified some functions (subprograms) in this system that would be amenable to rewriting in CUDA. It seems that the CUDA profiler only works with CUDA programs and gprof only works with C programs.

Is there a way to get gprof to work with a program that is mostly C but has a few subprograms written in CUDA? I want to get an overall picture of the program's operation, not just the C or CUDA parts separately. I have seen on this forum and in my Google searches that gprof can work with programs that are partly written in CUDA.

How is this done, and does gprof lose any of its functionality in the process?

Newport_j

No, gprof cannot be used to profile code running on the GPU. You can use CUDA profiling and gprof together, however:

avidday@cuda:~$ nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu 

./pi.cu(41): Advisory: Loop was not unrolled, cannot deduce loop trip count

avidday@cuda:~$ CUDA_PROFILE=1 ./pi

0 0 3.141297500000

1 57600000 3.141359513889

2 115200000 3.141452916667

3 172800000 3.141435989583

4 230400000 3.141470236111

5 288000000 3.141491261574

6 345600000 3.141498273810

7 403200000 3.141496935764

8 460800000 3.141521033951

9 518400000 3.141504215278

avidday@cuda:~$ cat cuda_profile_1.log

# CUDA_PROFILE_LOG_VERSION 1.6

# CUDA_DEVICE 1 GeForce GTX 275

# TIMESTAMPFACTOR fffff733c90896d8

method,gputime,cputime,occupancy

method=[ memcpyHtoD ] gputime=[ 4.000 ] cputime=[ 2.000 ] 

method=[ memcpyHtoD ] gputime=[ 132894.719 ] cputime=[ 133219.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4064.640 ] cputime=[ 14.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 133782.656 ] cputime=[ 134101.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4061.600 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.584 ] cputime=[ 22.000 ] 

method=[ memcpyHtoD ] gputime=[ 133944.547 ] cputime=[ 134277.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4060.448 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 133849.594 ] cputime=[ 134171.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4058.368 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 20.000 ] 

method=[ memcpyHtoD ] gputime=[ 133888.797 ] cputime=[ 134206.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4061.664 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 22.000 ] 

method=[ memcpyHtoD ] gputime=[ 133583.688 ] cputime=[ 133899.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4063.904 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 22.000 ] 

method=[ memcpyHtoD ] gputime=[ 133613.859 ] cputime=[ 133921.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4057.984 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 133725.469 ] cputime=[ 134040.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4059.744 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 20.000 ] 

method=[ memcpyHtoD ] gputime=[ 134345.766 ] cputime=[ 134658.016 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4058.976 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.360 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 134613.188 ] cputime=[ 134934.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4064.352 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.488 ] cputime=[ 21.000 ] 

avidday@cuda:~$ gprof -C ./pi

/home/avidday/pi.cu:51: (_Z15generateSamplesjP6float2:0x41baec) 11 executions

/opt/cuda-3.0/bin/../include/cuda_runtime.h:760: (_Z10cudaLaunchIcE9cudaErrorPT_:0x41bfb5) 10 executions

/opt/cuda-3.0/bin/../include/vector_types.h:523: (_ZN4dim3C1Ejjj:0x41bf46) 20 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_Z41__static_initialization_and_destruction_0ii:0x410f17) 1 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_GLOBAL__I__Z15generateSamplesjP6float2:0x410f59) 1 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_ZL74__sti____cudaRegisterAll_37_tmpxft_00004446_00000000_4_pi_cpp1_ii_d2d32138v:0x410f73) 1 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:9: (_Z46__device_stub__Z12withinCircleP6float2P6ulong2P6float2P6ulong2:0x41101d) 10 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:11: (_Z12withinCircleP6float2P6ulong2:0x411095) 10 executions

/usr/include/c++/4.3/iomanip:98: (_ZSt11setiosflagsSt13_Ios_Fmtflags:0x41bf7f) 10 executions

/usr/include/c++/4.3/iomanip:209: (_ZSt12setprecisioni:0x41bf9a) 10 executions

There you see a trivial test kernel compiled for host profiling, and then run with CUDA profiling enabled, which produces separate call statistics for both device and host code.
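To make that easier to reproduce, here is a rough sketch of a comparable test program (this is not the actual pi.cu from the session above; the kernel body, names and sizes are invented for illustration). The host-side generateSamples() loop is compiled by gcc with -g -pg and so shows up in gprof, while the withinCircle kernel and the memcpys show up in the cuda_profile_*.log:

// pi_sketch.cu -- illustrative only, not the original pi.cu.
// Monte Carlo estimate of pi: host code generates random points,
// the kernel tests which of them fall inside the unit circle.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void withinCircle(const float2 *samples, unsigned int *hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        hits[i] = (samples[i].x * samples[i].x +
                   samples[i].y * samples[i].y <= 1.0f) ? 1u : 0u;
}

// Plain host code: instrumented by gcc's -pg, so it appears in the gprof output.
static void generateSamples(float2 *samples, int n)
{
    for (int i = 0; i < n; ++i) {
        samples[i].x = (float)rand() / (float)RAND_MAX;
        samples[i].y = (float)rand() / (float)RAND_MAX;
    }
}

int main(void)
{
    const int n = 1 << 20;
    float2 *h_samples = (float2 *)malloc(n * sizeof(float2));
    unsigned int *h_hits = (unsigned int *)malloc(n * sizeof(unsigned int));
    float2 *d_samples;
    unsigned int *d_hits;
    cudaMalloc((void **)&d_samples, n * sizeof(float2));
    cudaMalloc((void **)&d_hits, n * sizeof(unsigned int));

    unsigned long long total = 0, inside = 0;
    for (int pass = 0; pass < 10; ++pass) {
        generateSamples(h_samples, n);
        cudaMemcpy(d_samples, h_samples, n * sizeof(float2), cudaMemcpyHostToDevice);
        withinCircle<<<(n + 255) / 256, 256>>>(d_samples, d_hits, n);
        cudaMemcpy(h_hits, d_hits, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i)
            inside += h_hits[i];
        total += n;
        printf("%d %llu %.12f\n", pass, total, 4.0 * (double)inside / (double)total);
    }

    cudaFree(d_samples); cudaFree(d_hits);
    free(h_samples); free(h_hits);
    return 0;
}

It is built and run exactly as above: nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu, then CUDA_PROFILE=1 ./pi, then gprof -C ./pi.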

In the line:

nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu

I know what everything means except -arch=sm_13. Google did not help.

What does it have to do with the compilation?

-g and -pg are used for profiling with the GNU tools, but why put them in quotes?

Newport_j

The "-arch" flag specifies the NVIDIA GPU architecture to compile for, and "sm_13" corresponds to compute capability 1.3 devices, so you are free to change this to reflect the architecture of the device you are using. The "-Xcompiler" flag passes options through to the so-called "host" compiler (gcc in this particular case), which is the compiler nvcc uses to build the code that runs on the CPU once the segments destined for the GPU have been extracted. The gprof-related options have to be quoted so that the shell hands them to nvcc as a single token; otherwise nvcc would receive two tokens and conclude that "-g" is the argument to -Xcompiler while "-pg" is an option intended for nvcc itself.
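If you are not sure which compute capability your card has, a small host-only program along these lines will print it (just a sketch using the runtime call cudaGetDeviceProperties; the file name checkarch.cu is arbitrary):

// checkarch.cu (name is arbitrary): print each device's compute capability
// so you know which -arch=sm_XY value matches your hardware.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* e.g. major=1, minor=3 -> compile with -arch=sm_13 */
        printf("Device %d: %s (compute capability %d.%d -> -arch=sm_%d%d)\n",
               dev, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}

Build it with a plain nvcc -o checkarch checkarch.cu and map the reported capability X.Y onto -arch=sm_XY.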

A search in the CUDA documentation will turn it up. But to save you the trouble, it means compile for the compute capability 1.3 architecture. Quoting directly from the nvcc help output:

--gpu-architecture <gpu architecture name>  (-arch)						   

		Specify the name of the class of nVidia GPU architectures for which the cuda

		input files must be compiled.

		With the exception as described for the shorthand below, the architecture

		specified with this option must be a virtual architecture (such as compute_10),

		and it will be the assumed architecture during the nvopencc compilation stage.

		This option will cause no code to be generated (that is the role of nvcc

		option '--gpu-code', see below); rather, its purpose is to steer the nvopencc

		stage, influencing the architecture of the generated ptx intermediate.

		For convenience in case of simple nvcc compilations the following shorthand

		is supported: if no value for option '--gpu-code' is specified, then the

		value of this option defaults to the value of '--gpu-architecture'. In this

		situation, as only exception to the description above, the value specified

		for '--gpu-architecture' may be a 'real' architecture (such as a sm_13),

		in which case nvcc uses the closest virtual architecture as effective architecture

		value. For example, 'nvcc -arch=sm_13' is equivalent to 'nvcc -arch=compute_13

		-code=sm_13'.

		Allowed values for this option:  'compute_10','compute_11','compute_12','compute_13',

		'compute_20','sm_10','sm_11','sm_12','sm_13','sm_20'.

Because I want to ensure that both of them are passed to gcc when it compiles the host code, and that is the way to achieve it. The quoting ensures both are treated as a single argument to nvcc's -Xcompiler option.
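For what it is worth, nvcc also accepts the host compiler options as a comma-separated list or as repeated -Xcompiler flags, which sidesteps the quoting question; assuming a reasonably recent nvcc, these should all be equivalent:

nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu

nvcc -Xcompiler -g,-pg -arch=sm_13 -o pi pi.cu

nvcc -Xcompiler -g -Xcompiler -pg -arch=sm_13 -o pi pi.cu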

When using the CUDA_PROFILE=1 environment variable, am I able to define where the log is stored?

Try nvprof. It is new in CUDA 5 and provides an experience similar to gprof for CUDA kernels; it is useful for quick benchmarks when you don't need all the capabilities of the Visual Profiler.
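In its simplest form it just wraps the executable; assuming the pi binary built above, something like this prints a summary of kernel and memcpy times, and --print-gpu-trace gives a per-launch trace:

nvprof ./pi

nvprof --print-gpu-trace ./pi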