I do not understand the Jetson TX2 well, so I would be happy if anyone could answer this question.
When I run a program, using the GPU and CPU together takes longer than using the CPU only. I wonder why this happens.
I think the cause is that GPU usage is very low, for example 10–20%. I would also like to know the meaning of the memory reported by gtop. Is it the sum of GPU memory and CPU memory? If so, are the Jetson TX2's GPU memory and CPU memory in the same place?
Embedded GPUs all share memory with the CPU. A more likely cause of the slowdown, though, is that a GPU is designed to run a large batch of operations, not individual operations. You can run individual operations, but the latency to load, run, and then send results back to the CPU makes them seem slower. On the other hand, there is essentially no extra latency if you run a batch of 128 operations on the GPU, whereas running those 128 operations one at a time would be a dramatic slowdown. Whether this pays off depends on your data.
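As a rough sketch of what I mean (names and sizes are made up for illustration, not taken from your program), compare launching one tiny kernel per element with launching a single kernel over the whole batch. The per-element loop pays the kernel launch latency 128 times; the batched launch pays it once:

```cuda
// Hypothetical sketch: per-element launches vs. one batched launch.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int batch = 128;
    float *d_data = nullptr;
    cudaMalloc(&d_data, batch * sizeof(float));
    cudaMemset(d_data, 0, batch * sizeof(float));

    // Slow pattern: 128 separate launches, each paying launch latency.
    for (int i = 0; i < batch; ++i)
        scaleKernel<<<1, 1>>>(d_data + i, 1, 2.0f);
    cudaDeviceSynchronize();

    // Fast pattern: one launch covering the whole batch; the same
    // latency is paid once and amortized over all 128 elements.
    scaleKernel<<<1, batch>>>(d_data, batch, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("done\n");
    return 0;
}
```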
There are also different ways of allocating memory, e.g., memory shared between GPU and CPU, which can minimize slowdowns after allocation is complete.
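For example (a minimal sketch, assuming a standard CUDA toolkit; the kernel and variable names are invented for illustration), managed memory and zero-copy mapped host memory both let the CPU and GPU touch the same allocation without an explicit cudaMemcpy, which is often a good fit on an integrated-memory device like the TX2:

```cuda
// Hypothetical sketch: two ways of sharing an allocation between CPU and GPU.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    // Allow mapped (zero-copy) host allocations; must be set before
    // the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1024;

    // Option 1: unified (managed) memory. One pointer is valid on both
    // CPU and GPU; the driver handles coherence.
    float *managed = nullptr;
    cudaMallocManaged(&managed, n * sizeof(float));
    for (int i = 0; i < n; ++i) managed[i] = 0.0f;
    addOne<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();
    printf("managed[0] = %f\n", managed[0]);
    cudaFree(managed);

    // Option 2: zero-copy (mapped pinned) host memory. The GPU reads and
    // writes the host allocation directly, so no explicit copy is needed.
    float *host = nullptr, *dev = nullptr;
    cudaHostAlloc(&host, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dev, host, 0);
    for (int i = 0; i < n; ++i) host[i] = 0.0f;
    addOne<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();
    printf("host[0] = %f\n", host[0]);
    cudaFreeHost(host);

    return 0;
}
```

Which option is faster depends on the access pattern, so treat this as a starting point rather than a recommendation.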
Incidentally, most laptop GPUs, even in "gaming" laptops, also share memory between the system and the GPU without any dedicated video RAM. On a discrete GPU, operations that stay within VRAM have an enormous speed advantage, but the TX2 integrates the GPU directly with the memory controller. This means there is no PCIe bus to act as a bottleneck, and actual transfers can be faster than they would be over PCIe.
Thank you for answering me. I understand that the GPU shares memory with the CPU, but I would like to ask one question.
I wonder why the latency to load, run, and then send results back to the CPU would make it seem slower. I think that would not happen if the GPU shares memory with the CPU. (Maybe my understanding is immature.)
When I run an LSTM, the CPU-only run is about 100 seconds shorter than the GPU+CPU run. Maybe this is because of this particular LSTM program.
Keep in mind that embedded systems generally use slower memory than a desktop PC. I can't say that is the case here, but very few embedded systems can actually compete with the data throughput of a PC.
I don't know enough to offer optimization advice, but there is more than one possible memory model, and the best one depends on the nature of what you are doing. Having your program use the right number of threads/cores to get the most out of one batch of CUDA operations can make a huge difference in how the overhead of those memory transfers is spread out. In a pipeline of operations it is quite common for the initial frames to take longer than the average frame once the pipeline is running. So, for example, if the Jetson is starting from a lower power state and ramping up to a higher performance mode, there will be a lag that drops off with later memory loads (running jetson_clocks to maximize clocks prior to operation could help by forcing the high performance mode before the data is actually used).
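A small sketch of what I mean about warm-up (the kernel here is just a placeholder for your real workload, not your LSTM): run a few untimed launches first so clock ramp-up and first-touch overhead do not skew the measurement, then time the steady-state part with CUDA events.

```cuda
// Hypothetical sketch: excluding warm-up iterations from a timing measurement.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 1.0001f + 0.5f;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launches: let clocks ramp up and initial overheads settle.
    for (int i = 0; i < 10; ++i)
        work<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Timed region: steady-state performance only.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)
        work<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per launch: %f ms\n", ms / 100.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```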
I can't do it for you, but if you describe in detail the nature of the data (the workflow between CPU and GPU), when you measured the delay, and how much data passes through in total, then someone may be able to offer advice on how to optimize.