I do not understand the Jetson TX2 well, so I would be happy if anyone could answer this question.
When I run a program, using the GPU and CPU together takes longer than using the CPU only. I wonder why this happens.
I think the cause is that GPU usage is very low, for example 10–20%. I would also like to know the meaning of the memory reported by gtop. Is it the sum of GPU memory and CPU memory? If so, are the Jetson TX2's GPU memory and CPU memory in the same place?
Embedded GPUs all share memory with the CPU. A more likely cause of the slowdown, though, is that a GPU is designed to run a large batch of operations, not individual operations. You can run individual operations, but the latency to load, run, and then send results back to the CPU makes them seem slower. On the other hand, there is essentially no extra latency if you run a batch of 128 operations on the GPU, whereas running those 128 operations one at a time would be a dramatic slowdown. Whether this pays off depends on your data.
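As a rough sketch of what I mean (names and sizes are made up for illustration, not taken from your program), compare launching one tiny kernel per element with launching a single kernel over the whole batch. The per-element loop pays the kernel launch latency 128 times; the batched launch pays it once:

```cuda
// Hypothetical sketch: per-element launches vs. one batched launch.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int batch = 128;
    float *d_data = nullptr;
    cudaMalloc(&d_data, batch * sizeof(float));
    cudaMemset(d_data, 0, batch * sizeof(float));

    // Slow pattern: 128 separate launches, each paying launch latency.
    for (int i = 0; i < batch; ++i)
        scaleKernel<<<1, 1>>>(d_data + i, 1, 2.0f);
    cudaDeviceSynchronize();

    // Fast pattern: one launch covering the whole batch; the same
    // latency is paid once and amortized over all 128 elements.
    scaleKernel<<<1, batch>>>(d_data, batch, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("done\n");
    return 0;
}
```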
There are also different ways of allocating memory, e.g., memory shared between GPU and CPU, which can minimize slowdowns after allocation is complete.
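For example (a minimal sketch, assuming a standard CUDA toolkit; the kernel and variable names are invented for illustration), managed memory and zero-copy mapped host memory both let the CPU and GPU touch the same allocation without an explicit cudaMemcpy, which is often a good fit on an integrated-memory device like the TX2:

```cuda
// Hypothetical sketch: two ways of sharing an allocation between CPU and GPU.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    // Allow mapped (zero-copy) host allocations; must be set before
    // the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1024;

    // Option 1: unified (managed) memory. One pointer is valid on both
    // CPU and GPU; the driver handles coherence.
    float *managed = nullptr;
    cudaMallocManaged(&managed, n * sizeof(float));
    for (int i = 0; i < n; ++i) managed[i] = 0.0f;
    addOne<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();
    printf("managed[0] = %f\n", managed[0]);
    cudaFree(managed);

    // Option 2: zero-copy (mapped pinned) host memory. The GPU reads and
    // writes the host allocation directly, so no explicit copy is needed.
    float *host = nullptr, *dev = nullptr;
    cudaHostAlloc(&host, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dev, host, 0);
    for (int i = 0; i < n; ++i) host[i] = 0.0f;
    addOne<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();
    printf("host[0] = %f\n", host[0]);
    cudaFreeHost(host);

    return 0;
}
```

Which option is faster depends on the access pattern, so treat this as a starting point rather than a recommendation.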
Incidentally, most laptop GPUs, even in "gaming" laptops, also share memory between the system and the GPU without any dedicated video RAM. On a discrete GPU, operations that stay within VRAM have an enormous speed advantage, but the TX2 integrates the GPU directly with the memory controller. This means there is no PCIe bus to act as a bottleneck, and actual transfers can be faster than they would be over PCIe.
Thank you for answering me. I understand that the GPU shares memory with the CPU, but I would like to ask one question.
I wonder why the latency to load, run, and then send results back to the CPU would make it seem slower. I think that would not happen if the GPU shares memory with the CPU. (Maybe my understanding is immature.)
When I run an LSTM, the CPU-only run is about 100 seconds shorter than the GPU+CPU run. Maybe this is because of this particular LSTM program.
Keep in mind that embedded systems generally use slower memory than a desktop PC. I can't say that is the case here, but very few embedded systems can actually compete with the data throughput of a PC.
I don't know enough to offer optimization advice, but there is more than one possible memory model, and the best one depends on the nature of what you are doing. Having your program use the right number of threads/cores to get the most out of one batch of CUDA operations can make a huge difference in how the overhead of those memory transfers is spread out. In a pipeline of operations it is quite common for the initial frames to take longer than the average frame once the pipeline is running. So, for example, if the Jetson is starting from a lower power state and ramping up to a higher performance mode, there will be a lag that drops off with later memory loads (running jetson_clocks to maximize clocks prior to operation could help by forcing the high performance mode before the data is actually used).
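A small sketch of what I mean about warm-up (the kernel here is just a placeholder for your real workload, not your LSTM): run a few untimed launches first so clock ramp-up and first-touch overhead do not skew the measurement, then time the steady-state part with CUDA events.

```cuda
// Hypothetical sketch: excluding warm-up iterations from a timing measurement.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 1.0001f + 0.5f;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launches: let clocks ramp up and initial overheads settle.
    for (int i = 0; i < 10; ++i)
        work<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Timed region: steady-state performance only.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)
        work<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per launch: %f ms\n", ms / 100.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```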
I can't do it for you, but if you describe in detail the nature of the data (the workflow between CPU and GPU), when you measured the delay, and how much data passes through in total, then someone may be able to offer advice on how to optimize.