I’m currently attempting to overlap the memory copies between the CPU and the GPU with GPU kernel execution. However, I’ve noticed that doing so slows both operations down, as discussed in this forum thread.
Is there any API or technique to isolate the bandwidth used by the memory copies from the bandwidth used by the kernel, so that the kernel keeps its original performance?
The original performance is presumably the kernel duration when it runs without any other activity. The code is probably memory bound and already uses approximately the full memory bandwidth. When you add a copy operation, some of that bandwidth is consumed by the copy, so the kernel runs slower.
No, there is no hidden reserve of memory bandwidth you can tap into so that the kernel runs as if it had full bandwidth while some other activity is using part of it. Partitioning or isolating the bandwidth would simply reduce the bandwidth available to the kernel anyway.
If you really want the kernel to run at full speed as the top priority, then put the copy operations in question in the same stream as the kernel. That forces those copies to run before or after, but not during, the kernel execution.
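A minimal sketch of that approach, assuming a simple stand-in kernel and pinned host buffers (all names and sizes here are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for whatever memory-bound kernel you are running.
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in, bytes);   // pinned host memory, so the async copies are truly asynchronous
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations are enqueued in the same stream, so they execute
    // strictly one after another: the copies never compete for bandwidth
    // during the kernel, they only run before or after it.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("h_out[0] = %f\n", h_out[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```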
You can always reduce the required memory bandwidth (not only for the copies, but also during the kernel run) by reading and writing only an artificially small range of memory, so that the working set fits entirely into the L2 cache. Of course you would not get correct numeric results as output, but it can help with determining bottlenecks or theoretical speed numbers for the isolated computational part, if that is what you want.
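A rough sketch of that trick, using a vector-add kernel as a stand-in (only the kernel is shown; the 4 MiB window is an arbitrary assumption and should be sized below your GPU's L2 capacity):

```cpp
// Every thread still performs the same arithmetic, but all reads and writes
// are wrapped into a small window that fits in L2, so DRAM traffic collapses.
// The numeric output is wrong by design; only the timing is of interest.
__global__ void add_l2_only(const int* a, const int* b, int* c, int n, int window)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = i % window;   // confine all traffic to the first `window` elements
        c[j] = a[j] + b[j];
    }
}

// Example launch: a window of 4 MiB worth of ints instead of the full n elements.
// add_l2_only<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n, (4 << 20) / sizeof(int));
```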
Regarding this, I want to estimate how much GPU bandwidth my kernel is using.
For example, let’s assume a simple vector addition kernel that calculates a + b = c.
If a, b, and c are each int vectors of 50 MB and this kernel takes 0.2 ms,
is it reasonable to say that this kernel uses approximately 150 MB * (1000 / 0.2) = 750,000 MB/s = 750 GB/s of bandwidth?
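For reference, this is roughly how I am measuring it (a simplified sketch; the kernel, sizes, and launch configuration are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Plain vector add: reads a and b, writes c.
__global__ void vecAdd(const int* a, const int* b, int* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    // ~50 MB per vector, as in the example above.
    const int n = 50 * 1000 * 1000 / sizeof(int);
    const size_t bytes = n * sizeof(int);

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the timed run does not include one-time overheads.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaEventRecord(start);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Two reads + one write per element = 3 * bytes of DRAM traffic.
    double traffic = 3.0 * bytes;
    printf("time: %.3f ms, effective bandwidth: %.1f GB/s (%.1f GiB/s)\n",
           ms,
           traffic / (ms * 1e-3) / 1e9,        // decimal GB/s
           traffic / (ms * 1e-3) / (1 << 30)); // binary GiB/s

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```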
That sounds right. You can check with Nsight Compute, too. Also, when talking about GB, the difference between GB and GiB is about 7%. But feel free to ignore that for a first estimation.
There are some interfaces that can do either reads or writes (but not both at the same time), and some interfaces that can do both in parallel.
Global memory within the GPU has to handle read and write accesses serially, whereas PCIe communication to/from the host is full duplex and can do concurrent reads and writes. Keep this distinction in mind when comparing to official numbers.
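For example, a sketch of driving both PCIe directions at once (buffer sizes are arbitrary; overlap needs pinned host memory and a GPU with a copy engine per direction):

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 << 20;   // 256 MiB per direction (arbitrary size)

    // Pinned host buffers are required for truly asynchronous PCIe transfers.
    void *h_up, *h_down, *d_up, *d_down;
    cudaMallocHost(&h_up, bytes);
    cudaMallocHost(&h_down, bytes);
    cudaMalloc(&d_up, bytes);
    cudaMalloc(&d_down, bytes);

    cudaStream_t s_up, s_down;
    cudaStreamCreate(&s_up);
    cudaStreamCreate(&s_down);

    // One host-to-device and one device-to-host copy issued in different
    // streams: with a copy engine per direction, both transfers can be in
    // flight at the same time over the full-duplex PCIe link.
    cudaMemcpyAsync(d_up, h_up, bytes, cudaMemcpyHostToDevice, s_up);
    cudaMemcpyAsync(h_down, d_down, bytes, cudaMemcpyDeviceToHost, s_down);

    cudaStreamSynchronize(s_up);
    cudaStreamSynchronize(s_down);

    cudaStreamDestroy(s_up);
    cudaStreamDestroy(s_down);
    cudaFreeHost(h_up); cudaFreeHost(h_down);
    cudaFree(d_up); cudaFree(d_down);
    return 0;
}
```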