TK1 very slow GPU initialization

Hi

I’m working on a arch-Linux based image for TK1 (apalis-tk1 by toradex) its basically a Jetson-TK1. The system boots, modules load, firmware appears to load - all seems good, however after boot when running any of the CUDA samples the first run takes very long time, meanwhile no CPU activity. I’ve identified the culprit “cuInit” takes about 4 minutes to return, but after the first run it performs the same as on the ubuntu-jetpack image.

using strace, shows the time in cuInit is spent in ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x41, 0x2, 0x18) = 0 where in ubuntu the the call looks a bit different ioctl(4, AGPIOC_RELEASE or APM_IOC_SUSPEND or SNDRV_PCM_IOCTL_TSTAMP, 0xbec70ef0) = 0

running as root makes no difference, so its not a permissions issue

ls -l /dev/nv* shows the same device handles on both systems (arch/ubuntu)

tegra init script is running doing boot

dmesg shows similar output

Kernel is the same, i.e build from the same source checkout using the same kernel configuration.

Question is, am I missing something for initialization or could there be another explanation. Seems this its not an uncommon issue on both tegra and at least in past on desktop GPUs. However slow CUDA init usually refers to a couple of seconds not 4 minutes.

[alarm@alarm ~]$ time ./deviceQuery (or deviceQueryDrv same result)
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GK20A”
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.2
Total amount of global memory: 1924 MBytes (2017759232 bytes)
( 1) Multiprocessors, (192) CUDA Cores/MP: 192 CUDA Cores
GPU Clock rate: 852 MHz (0.85 GHz)
Memory Clock rate: 924 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 131072 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GK20A
Result = PASS

real 4m0.953s
user 0m0.025s
sys 0m0.130s

second and further runs give ~
real 0m0.070s
user 0m0.013s
sys 0m0.033s

Hi,

Have you maximized the device performance first?

sudo ./jetson_clocks.sh

If you still meet this issue after maximizing the clocks, could you share the profiling result with us?

nvprof [app]

Thanks.

Hi

thanks for the input.

First - I’m not sure it was clear in the first post, but running on the Ubuntu image there is no performance penalty on the first run.

I have no jetson_clocks.sh in the system neither in the arch or the Ubuntu system.

however exploring your hint a bit further, from /sys/kernel/debug/… i can see the GPU on both systems are running at the same frequency. This should indicate its not a clock issue, unless changing it triggers something else than changing the clock itself ?

Next step using this script, assume its the same as jetson_clocks.sh
https://github.com/dusty-nv/jetson-scripts/blob/master/jetson_max_l4t.sh

GPU clocks should be the important part in this issue.
however before the first run of a CUDA program I can’t write
to “/sys/kernel/debug/clock/override.gbus/rate” afterwards I can, this is the case on both the ubuntu and the arch system.

“/sys/kernel/debug/clock/override.emc/rate” i can write both before and after first cuda run.

Going the other way, trying to minimize the clocks on the ubuntu system - this puts the time spent up to 1.5 sec. (cant do this test on the first run due to the override.gbus/rate issue)

profiling on the arch system:
time nvprof ./NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQueryDrv/deviceQueryDrv

before first run and clocks not maximized
==328== Profiling application: NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQueryDrv/deviceQueryDrv
==328== Profiling result:
No kernels were profiled.

==328== API calls:
Time(%) Time Calls Avg Min Max Name
91.70% 214.42us 42 5.1050us 1.0000us 96.667us cuDeviceGetAttribute
3.85% 9.0000us 2 4.5000us 1.5000us 7.5000us cuDeviceGetCount
1.50% 3.5000us 1 3.5000us 3.5000us 3.5000us cuDeviceTotalMem
1.07% 2.5000us 1 2.5000us 2.5000us 2.5000us cuDeviceGetName
0.82% 1.9170us 1 1.9170us 1.9170us 1.9170us cuDeviceComputeCapability
0.53% 1.2500us 1 1.2500us 1.2500us 1.2500us cuDriverGetVersion
0.53% 1.2500us 1 1.2500us 1.2500us 1.2500us cuDeviceGet

real 4m1.636s
user 0m1.941s
sys 0m9.888s

after first run and clocks maximized
==596== Profiling application: ./NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQueryDrv/deviceQueryDrv
==596== Profiling result:
No kernels were profiled.

==596== API calls:
Time(%) Time Calls Avg Min Max Name
91.15% 184.42us 42 4.3900us 1.0830us 69.667us cuDeviceGetAttribute
3.29% 6.6660us 2 3.3330us 1.6660us 5.0000us cuDeviceGetCount
2.06% 4.1660us 1 4.1660us 4.1660us 4.1660us cuDeviceTotalMem
1.36% 2.7500us 1 2.7500us 2.7500us 2.7500us cuDeviceGetName
0.99% 2.0000us 1 2.0000us 2.0000us 2.0000us cuDeviceComputeCapability
0.62% 1.2500us 1 1.2500us 1.2500us 1.2500us cuDeviceGet
0.54% 1.0830us 1 1.0830us 1.0830us 1.0830us cuDriverGetVersion

real 0m0.526s
user 0m0.056s
sys 0m0.382s

“jetson_clocks.sh” is found in “~ubuntu/jetson_clocks.sh” and “~nvidia/jetson_clocks.sh”. Since it requires sudo to run, I’m not sure why it is there. Probably “/usr/local/bin/” would be a better place.

Got that but searched the entire system, it’s not there.

Find / -name jetson_clocks.sh

It’s not there, but that’s not too important- I tried to do what the script is supposed to do, but the gbus variable is not writeable prior to the first cuda run. (See post above)

Both systems have the same GPU clock after boot, but only one has the time hit on first call to cuInit, this points to clocks not being the issue.

If jetson_clocks.sh is missing, then the install was probably incomplete (this file should always be present in the home directories of both user “ubuntu” and user “nvidia”). Do you get any errors listed from:

sha1sum -c /etc/nv_tegra_release

Not sure there is an “nvidia” user - it’s a jetson based board so there might be some slight differences in the image. If you have a TK1, could you check which package the file belongs to?

dpkg -S /path/to/file

Thanks

Don’t have the board with me right now. But do know there is all the library files listed with checksum s. I verifyed it earlier, but will recheck later though I don’t see how a missing or wrong library would make it slow on the first cuInit.? With a missing or broken library it should not work at all.

verified

sha1sum -c /etc/nv_tegra_release

all OK

FYI, the “Linux for Tegra” (L4T) is a stock Ubuntu system with some NVIDIA-specific hardware access files added to it. When flashing on command line it is the “apply_binaries.sh” step which adds those files. The “sha1sum -c /etc/nv_tegra_release” tool verifies many of the files which might have been overwritten by outside packages. Regardless of carrier board, if you use L4T (and everything Jetson uses L4T), then you have those files. The “nvidia” account is in “/home/nvidia/”, and is added for all L4T releases. If this is missing, then your L4T release is incomplete. Home directories can be named with “~” and the name, so “~ubuntu/” is equal to “/home/ubuntu/”, and “~nvidia” is equivalent to “/home/nvidia/”. “jetson_clocks.sh” should exist in the home directory of both “ubuntu” and “nvidia”. I find this location odd, and wish it was moved into “/usr/local/bin/” instead. “dpkg” will have no knowledge of this file since it isn’t added via the package manager.

So we know your basic install is valid, at least for the default files.

You will still need to run jetson_clocks.sh before any kind of benchmark. Normally the Jetson will boot with some energy savings enabled. Running “jetson_clocks.sh” without arguments will set the clocks to max and disable energy savings. In some cases of energy savings (don’t know about this particular case) the DVFS tables may have a lower max clock speed than if jetson_clocks.sh is run. A higher max clock could occur and speeds could go beyond your previous max clock.

I just realized, after seeing Honey_Patouceul’s post, the TK1 predates “jetson_clocks.sh” (and early releases won’t even have the “nvidia” account). There are other ways of boosting clocks though. Adding some scripts I’ve used in the past which do similar functions…you’ll need to remove the “.txt” suffix.
performance_ls.txt (447 Bytes)
performance_set_default.txt (736 Bytes)
performance_set_max.txt (736 Bytes)

Which version are you running ?

head -n /etc/nv_tegra_release

Not sure which version if any jetson_clocks.sh appeared for TK1, but be sure to use latest L4T release for your TK1 if possible (more bugs fixed).
If not available, similar scripts for boosting clocks as you mentionned may do the same.

What is not clear is why you are profiling deviceQuery ?

Anyway be aware for profiling that first iteration may not be representative of the final performance. Thnigs take time to load and run at first, but then will be faster and even faster if cache gets useful…So take a significant number of loops (> 100), and compare first ones with average, you may see differences.

It seems the isse is not clear

I have two identical boards one with stock ubuntu image, the other with an arch based image created from arch rootfs, same kernel (build from same source and config on both systems)

The issue is the Arch stalls for 4 minuts on the first call to the cuInit, the Ubuntu system performs normal every time. The issue is not performance of CUDA itself apart from the first init they perform identical.

@linuxdev those are only for the cpu clock, regardless the clock settings for cpu&GPU are identical on the two boards. the CPU has no workload during the 4 minutes, tegrastats stalls until cuInit is done => don’t think it has anything to do with clock’s

@Honey_Patouceul
its R21.7

What is not clear is why you are profiling deviceQuery ? -> because AastaLLL asked for it.

Could difference in glibc version be the issue ? - I can see from valgrind a lot more symbol lookups are being done on the arch system.

view files with:

kcachegrind filename

callgrind_arch.out.txt (79.3 KB)
callgrind_ubuntu.out.txt (157 KB)

Nobody here has Arch Linux, so you’ll find a desire to make as much as possible a match to the known Ubuntu/L4T setup.

The CPU sets up everything, including talking to all of the libraries you see profiled. Even if your clocks are identical on the two boards, there may be other software or settings you don’t see which trigger depending on conditions (e.g., energy savings software might monitor DVFS table settings). When you forcibly max out CPU cores you know those other profiles are not running. Without the CPU maxed out you only know the core clocks are the same, you don’t know what else might be triggering. Some software is designed to alter performance depending on temperature or current clock settings…Arch and L4T will not have the same behavior at intermediate DVFS settings because this isn’t determined by the kernel in the case of matching kernels…the surrounding software matters and making sure cores are maxed avoids having that question. Maybe they are the same without maxed core clocks…but you won’t know for certain. Thus maxing out clocks.

Consider also that if you were to purposely lower the clock somewhat, and compare time to when clocks are maxed, then if the time remains the same you’ve just proven you were waiting on a timer, i.e., the delay was intentional via software set to a specific time. If time delays are longer with slower clocks, and shorter with purposely increased clock settings, then you know the bottleneck is CPU and perhaps not a timeout via a clock (intentional timeouts will be set by elapsed time, but bottlenecks will have time determined via CPU cycles…elapsed time remains the same at two different clock settings, CPU cycle time changes).

One thing which stick out is that dynamic library symbol lookup times are about 10 seconds longer in Arch. Actual time spent in libcuda.so.1.1 is a bit longer in L4T despite the delay for CUDA init. It isn’t a smoking gun, but it looks like the extra time might be from the linker. I have not looked at the call graph, but I will suggest going to the symbol lookup and examine the call graph to see why linking is taking longer in one versus the other. There isn’t anything you can do with CUDA until the linker is finished.