Using gprof with CUDA: can this profiler be used with Linux C and CUDA?

I am currently using gprof when I want to profile a Linux C program. I have identified some functions (subprograms) in this system that would be amenable to rewriting in CUDA. It seems that the CUDA profiler only works with CUDA programs and gprof only works with C programs.

Is there a way to get gprof to work with a program that is mostly C but has a few subprograms written in CUDA? I want to get an overall picture of the program's operation, not just the C or CUDA parts separately. I have seen on this forum and in my Google searches that gprof can work with programs that are partly written in CUDA.

How is this done, and does gprof lose any of its functionality in the process?

Newport_j

No, gprof cannot be used to profile code running on the GPU. You can use CUDA profiling and gprof together, however:

avidday@cuda:~$ nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu 

./pi.cu(41): Advisory: Loop was not unrolled, cannot deduce loop trip count

avidday@cuda:~$ CUDA_PROFILE=1 ./pi

0 0 3.141297500000

1 57600000 3.141359513889

2 115200000 3.141452916667

3 172800000 3.141435989583

4 230400000 3.141470236111

5 288000000 3.141491261574

6 345600000 3.141498273810

7 403200000 3.141496935764

8 460800000 3.141521033951

9 518400000 3.141504215278

avidday@cuda:~$ cat cuda_profile_1.log

# CUDA_PROFILE_LOG_VERSION 1.6

# CUDA_DEVICE 1 GeForce GTX 275

# TIMESTAMPFACTOR fffff733c90896d8

method,gputime,cputime,occupancy

method=[ memcpyHtoD ] gputime=[ 4.000 ] cputime=[ 2.000 ] 

method=[ memcpyHtoD ] gputime=[ 132894.719 ] cputime=[ 133219.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4064.640 ] cputime=[ 14.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 133782.656 ] cputime=[ 134101.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4061.600 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.584 ] cputime=[ 22.000 ] 

method=[ memcpyHtoD ] gputime=[ 133944.547 ] cputime=[ 134277.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4060.448 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 133849.594 ] cputime=[ 134171.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4058.368 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 20.000 ] 

method=[ memcpyHtoD ] gputime=[ 133888.797 ] cputime=[ 134206.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4061.664 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 22.000 ] 

method=[ memcpyHtoD ] gputime=[ 133583.688 ] cputime=[ 133899.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4063.904 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.392 ] cputime=[ 22.000 ] 

method=[ memcpyHtoD ] gputime=[ 133613.859 ] cputime=[ 133921.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4057.984 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 133725.469 ] cputime=[ 134040.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4059.744 ] cputime=[ 6.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.456 ] cputime=[ 20.000 ] 

method=[ memcpyHtoD ] gputime=[ 134345.766 ] cputime=[ 134658.016 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4058.976 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.360 ] cputime=[ 21.000 ] 

method=[ memcpyHtoD ] gputime=[ 134613.188 ] cputime=[ 134934.000 ] 

method=[ _Z12withinCircleP6float2P6ulong2 ] gputime=[ 4064.352 ] cputime=[ 7.000 ] occupancy=[ 0.750 ] 

method=[ memcpyDtoH ] gputime=[ 3.488 ] cputime=[ 21.000 ] 

avidday@cuda:~$ gprof -C ./pi

/home/avidday/pi.cu:51: (_Z15generateSamplesjP6float2:0x41baec) 11 executions

/opt/cuda-3.0/bin/../include/cuda_runtime.h:760: (_Z10cudaLaunchIcE9cudaErrorPT_:0x41bfb5) 10 executions

/opt/cuda-3.0/bin/../include/vector_types.h:523: (_ZN4dim3C1Ejjj:0x41bf46) 20 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_Z41__static_initialization_and_destruction_0ii:0x410f17) 1 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_GLOBAL__I__Z15generateSamplesjP6float2:0x410f59) 1 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:4: (_ZL74__sti____cudaRegisterAll_37_tmpxft_00004446_00000000_4_pi_cpp1_ii_d2d32138v:0x410f73) 1 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:9: (_Z46__device_stub__Z12withinCircleP6float2P6ulong2P6float2P6ulong2:0x41101d) 10 executions

/tmp/tmpxft_00004446_00000000-1_pi.cudafe1.stub.c:11: (_Z12withinCircleP6float2P6ulong2:0x411095) 10 executions

/usr/include/c++/4.3/iomanip:98: (_ZSt11setiosflagsSt13_Ios_Fmtflags:0x41bf7f) 10 executions

/usr/include/c++/4.3/iomanip:209: (_ZSt12setprecisioni:0x41bf9a) 10 executions

There you see a trivial test kernel compiled for host profiling, and then run with CUDA profiling enabled, which produces separate call statistics for both device and host code.
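To make that easier to reproduce, here is a rough sketch of a comparable test program (this is not the actual pi.cu from the session above; the kernel body, names and sizes are invented for illustration). The host-side generateSamples() loop is compiled by gcc with -g -pg and so shows up in gprof, while the withinCircle kernel and the memcpys show up in the cuda_profile_*.log:

// pi_sketch.cu -- illustrative only, not the original pi.cu.
// Monte Carlo estimate of pi: host code generates random points,
// the kernel tests which of them fall inside the unit circle.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void withinCircle(const float2 *samples, unsigned int *hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        hits[i] = (samples[i].x * samples[i].x +
                   samples[i].y * samples[i].y <= 1.0f) ? 1u : 0u;
}

// Plain host code: instrumented by gcc's -pg, so it appears in the gprof output.
static void generateSamples(float2 *samples, int n)
{
    for (int i = 0; i < n; ++i) {
        samples[i].x = (float)rand() / (float)RAND_MAX;
        samples[i].y = (float)rand() / (float)RAND_MAX;
    }
}

int main(void)
{
    const int n = 1 << 20;
    float2 *h_samples = (float2 *)malloc(n * sizeof(float2));
    unsigned int *h_hits = (unsigned int *)malloc(n * sizeof(unsigned int));
    float2 *d_samples;
    unsigned int *d_hits;
    cudaMalloc((void **)&d_samples, n * sizeof(float2));
    cudaMalloc((void **)&d_hits, n * sizeof(unsigned int));

    unsigned long long total = 0, inside = 0;
    for (int pass = 0; pass < 10; ++pass) {
        generateSamples(h_samples, n);
        cudaMemcpy(d_samples, h_samples, n * sizeof(float2), cudaMemcpyHostToDevice);
        withinCircle<<<(n + 255) / 256, 256>>>(d_samples, d_hits, n);
        cudaMemcpy(h_hits, d_hits, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i)
            inside += h_hits[i];
        total += n;
        printf("%d %llu %.12f\n", pass, total, 4.0 * (double)inside / (double)total);
    }

    cudaFree(d_samples); cudaFree(d_hits);
    free(h_samples); free(h_hits);
    return 0;
}

It is built and run exactly as above: nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu, then CUDA_PROFILE=1 ./pi, then gprof -C ./pi.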

In the line:

nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu

I know what everything means except -arch=sm_13. Google did not help.

What does it have to do with the compilation?

-g and -pg are used for profiling with the GNU tools, but why put them in quotes?

Newport_j

The "-arch" flag specifies the NVIDIA GPU architecture to compile for, and "sm_13" corresponds to compute capability 1.3 devices, so you are free to change this to reflect the architecture of the device you are using. The "-Xcompiler" flag passes options through to the so-called "host" compiler (gcc in this particular case), which is the compiler nvcc uses to build the code that runs on the CPU once the segments destined for the GPU have been extracted. The gprof-related options have to be quoted so that the shell hands them to nvcc as a single token; otherwise nvcc would receive two tokens and conclude that "-g" is the argument to -Xcompiler while "-pg" is an option intended for nvcc itself.
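If you are not sure which compute capability your card has, a small host-only program along these lines will print it (just a sketch using the runtime call cudaGetDeviceProperties; the file name checkarch.cu is arbitrary):

// checkarch.cu (name is arbitrary): print each device's compute capability
// so you know which -arch=sm_XY value matches your hardware.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* e.g. major=1, minor=3 -> compile with -arch=sm_13 */
        printf("Device %d: %s (compute capability %d.%d -> -arch=sm_%d%d)\n",
               dev, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}

Build it with a plain nvcc -o checkarch checkarch.cu and map the reported capability X.Y onto -arch=sm_XY.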

A search in the CUDA documentation will turn it up. But to save you the trouble, it means compile for the compute capability 1.3 architecture. Quoting directly from the nvcc help output:

--gpu-architecture <gpu architecture name>  (-arch)						   

		Specify the name of the class of nVidia GPU architectures for which the cuda

		input files must be compiled.

		With the exception as described for the shorthand below, the architecture

		specified with this option must be a virtual architecture (such as compute_10),

		and it will be the assumed architecture during the nvopencc compilation stage.

		This option will cause no code to be generated (that is the role of nvcc

		option '--gpu-code', see below); rather, its purpose is to steer the nvopencc

		stage, influencing the architecture of the generated ptx intermediate.

		For convenience in case of simple nvcc compilations the following shorthand

		is supported: if no value for option '--gpu-code' is specified, then the

		value of this option defaults to the value of '--gpu-architecture'. In this

		situation, as only exception to the description above, the value specified

		for '--gpu-architecture' may be a 'real' architecture (such as a sm_13),

		in which case nvcc uses the closest virtual architecture as effective architecture

		value. For example, 'nvcc -arch=sm_13' is equivalent to 'nvcc -arch=compute_13

		-code=sm_13'.

		Allowed values for this option:  'compute_10','compute_11','compute_12','compute_13',

		'compute_20','sm_10','sm_11','sm_12','sm_13','sm_20'.

Because I want to ensure that both of them are passed to gcc when it compiles the host code, and that is the way to achieve it. The quoting ensures both are treated as a single argument to nvcc's -Xcompiler option.
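For what it is worth, nvcc also accepts the host compiler options as a comma-separated list or as repeated -Xcompiler flags, which sidesteps the quoting question; assuming a reasonably recent nvcc, these should all be equivalent:

nvcc -Xcompiler "-g -pg" -arch=sm_13 -o pi pi.cu

nvcc -Xcompiler -g,-pg -arch=sm_13 -o pi pi.cu

nvcc -Xcompiler -g -Xcompiler -pg -arch=sm_13 -o pi pi.cu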

When using the CUDA_PROFILE=1 environment variable, am I able to define where the log is stored?

Try nvprof. It is new in CUDA 5 and provides an experience similar to gprof for CUDA kernels; it is useful for quick benchmarks when you don't need all the capabilities of the Visual Profiler.
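In its simplest form it just wraps the executable; assuming the pi binary built above, something like this prints a summary of kernel and memcpy times, and --print-gpu-trace gives a per-launch trace:

nvprof ./pi

nvprof --print-gpu-trace ./pi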