Let me start by saying that I am fairly new to CUDA, so I might be missing something obvious!
When I allocate a huge chunk of memory (7 GB) at once and pass the kernel pointers into the beginning of the allocated space, the kernel performs as expected. However, when I pass pointers into the end of the allocated chunk, I get worse performance, sometimes up to 2x slower.
cuda-memcheck isn't reporting any errors, and I even checked the pointers to verify that they actually point to device memory, which they do. I have posted some dummy code that exhibits the problem. On my machine the first kernel invocation runs in about 5.6 ms while the second one takes 7.6 ms. I can't think of any reason why the second one would be slower. Any ideas?
I haven't initialized the memory since I am only interested in testing performance, but I have verified, in a separate file, that both the original kernel and the dummy one produce correct output.
I am using a GTX 1080, CUDA 11.2, and Ubuntu 20.04.
Here’s a pastebin link to the code. Any help would be greatly appreciated!
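For context, here is a rough, hypothetical sketch of the kind of setup described (it is not the actual pastebin code; the kernel body, buffer sizes, and launch configuration are placeholders): one large cudaMalloc, with the same kernel timed once on pointers at the start of the allocation and once on pointers near the end.

```
// Hypothetical sketch of the setup described above (NOT the actual pastebin
// code): one large cudaMalloc, the same kernel launched once on pointers at
// the start of the allocation and once on pointers near the end.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(float *dst, const float *src, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i] * 2.0f;   // placeholder work, not the real kernel
}

int main()
{
    const size_t total = 7ULL << 30;     // ~7 GB, as in the question
    const size_t n     = 64ULL << 20;    // elements per buffer (placeholder)
    char *base = nullptr;
    cudaMalloc(&base, total);

    float *lo = (float *)base;                                     // start of chunk
    float *hi = (float *)(base + total - 2 * n * sizeof(float));   // near the end

    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    dim3 block(256), grid((unsigned)((n + 255) / 256));
    float ms;

    cudaEventRecord(beg);
    dummy<<<grid, block>>>(lo + n, lo, n);     // pointers at the start
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);
    printf("start of chunk: %.3f ms\n", ms);

    cudaEventRecord(beg);
    dummy<<<grid, block>>>(hi + n, hi, n);     // pointers at the end
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);
    printf("end of chunk:   %.3f ms\n", ms);

    cudaFree(base);
    return 0;
}
```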
It is entirely possible and not unusual for the runtime of memory-intensive code to vary somewhat depending on the addresses of the data objects involved, as memory access patterns interact with multiple mechanisms in a fairly complex memory hierarchy. It may change how the accesses distribute across memory banks, for example, and how many bank conflicts result.
In a quick experiment, the variations in your code based on address seem to be more like 10%, not 2x, though. For particular configurations, the CUDA profiler may allow you to pinpoint relevant differences by drilling down on memory-specific metrics.
A few words on benchmarking methodology:
The two kernel invocations in the posted code are configured differently (224 vs 226). From a quick experiment, the runtime appears to vary outside measurement noise when I change that template parameter. What happens if you configure the two kernel invocations identically?
It is never a good idea to operate on uninitialized memory. For floating-point operations in particular, execution time could differ depending on the data being processed, e.g. when special operands like NaNs are involved.
When benchmarking, you would want to account for cold-start effects and measurement noise. Memory-intensive code typically shows larger run-to-run variations than compute-intensive code like your kernel. A standard approach to benchmarking is therefore to execute the code under test multiple times in a row and record the fastest time (see the sketch below).
It should be noted that GPUs use dynamic clocking, so processor operating frequency can vary quite widely based on temperature and power draw, with obvious impacts on performance. In this case the kernels are executed in close temporal proximity, so the effect is likely small. With some GPUs one can try to fix processor clocks at specific application clocks settable with nvidia-smi, but that is usually not supported on consumer-grade GPUs like the GTX 1080.
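As a concrete illustration of the "repeat and keep the fastest time" approach, here is a minimal, hypothetical timing helper based on CUDA events. It reuses the placeholder dummy kernel from the sketch earlier in the thread; the warm-up launch and the repetition count are arbitrary choices.

```
// Hypothetical timing helper illustrating the "repeat and keep the fastest
// time" approach: one untimed warm-up launch, then several timed repetitions,
// returning the minimum. Reuses the placeholder dummy kernel from the sketch
// above.
float time_kernel_min_ms(int reps, float *dst, const float *src, size_t n,
                         dim3 grid, dim3 block)
{
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    dummy<<<grid, block>>>(dst, src, n);   // warm-up, not timed
    cudaDeviceSynchronize();

    float best = 1e30f;
    for (int r = 0; r < reps; ++r) {
        cudaEventRecord(beg);
        dummy<<<grid, block>>>(dst, src, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        float ms;
        cudaEventElapsedTime(&ms, beg, end);
        if (ms < best) best = ms;
    }
    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    return best;
}
```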
Hi and thank you for your input! I will initialize the memory and will report back if that fixes it.
The two kernel invocations in the posted code are configured differently (224 vs 226).
That's true! I forgot to change the other one. But the performance difference doesn't change much.
In a quick experiment, the variations in your code based on address seem to be more like 10%, not 2x, though. For particular configurations, the CUDA profiler may allow you to pinpoint relevant differences by drilling down on memory-specific metrics.
The original kernel, which I unfortunately can't post, shows a larger variation: the first kernel invocation completes in 5.6 ms while the second one takes 11 ms, which is significant. It has a similar access pattern to the dummy kernel that I posted, but it uses the L2 cache heavily.
Now what's interesting here is that with the original kernel, NVP reports 1.7 GB of device memory reads when I pass pointers at the end of the allocated space, but only 900 MB when I pass pointers at the beginning of the space. I know it's hard to come to a conclusion without the code at hand, but is this normal behaviour? What could cause this?
EDIT: I forgot to add that I tried allocating just the amount of memory I needed to run the kernel, just below the large allocation, and it ran fine. So the problem seems to depend not on the absolute device memory location but on where the pointers are relative to the start of the big allocation, if that makes sense.
I am not capable of diagnosing code that I can’t see. I think you might want to control for two mechanisms in your experiments:
(1) Basic interactions with the physical mechanisms that make up the memory hierarchy. For this it would be useful to ensure that all pointers involved are aligned to 4KB pages, for example.
(2) Conflicts within the memory hierarchy (caches, TLBs, memory banks) caused by accesses to two different data objects. For this it would be useful to ensure that the difference between pointers 1 and 2 is the same as the difference between pointers 3 and 4. In C++ there is a special type for such pointer differences called ptrdiff_t.
I think there is a high likelihood that you will find that some form of conflict is the root cause of the performance differences; the stated magnitude of the observed differences (2x) strongly suggests that. The memory-specific metrics of the CUDA profiler may allow you to pinpoint the specific nature of this conflict, but (never having needed to) I have no hands-on experience with that, so I cannot offer more specific advice.
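Below is a minimal sketch of how such a controlled experiment could be set up, assuming the sub-buffers are carved out of the single large cudaMalloc allocation; the buffer size and variable names are placeholders. Both cases use 4 KB aligned pointers and keep the same distance between destination and source.

```
// Hypothetical setup for the controlled experiment suggested above: carve the
// sub-buffers out of the single big cudaMalloc allocation so that (1) every
// pointer is 4 KB aligned and (2) the distance between dst and src is the
// same in the "start of chunk" and "end of chunk" cases.
#include <cstdint>
#include <cuda_runtime.h>

static char *align_up(char *p, size_t a)   // a must be a power of two
{
    uintptr_t u = reinterpret_cast<uintptr_t>(p);
    return reinterpret_cast<char *>((u + a - 1) & ~(uintptr_t)(a - 1));
}

int main()
{
    const size_t total = 7ULL << 30;     // big allocation as in the question
    const size_t bytes = 256ULL << 20;   // per-buffer size (placeholder)
    char *base = nullptr;
    cudaMalloc(&base, total);

    // Case 1: buffers at the start of the chunk, 4 KB aligned.
    char *src1 = align_up(base, 4096);
    char *dst1 = src1 + bytes;           // dst1 - src1 == bytes

    // Case 2: buffers near the end of the chunk, same alignment and the
    // same ptrdiff_t between dst and src as in case 1.
    char *src2 = align_up(base + total - 2 * bytes - 4096, 4096);
    char *dst2 = src2 + bytes;           // dst2 - src2 == bytes as well

    // ... launch the identical kernel on (dst1, src1) and on (dst2, src2),
    //     timing each as discussed above ...

    cudaFree(base);
    return 0;
}
```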
So I think I solved this, or at least I found out why the kernel underperforms depending on the pointer location.
The pointers need to be 256-byte aligned relative to the address returned by cudaMalloc (which I assume is 256-byte aligned as well) to get the expected performance. Any ideas why this could be happening?
@njuffa Thank you for pointing me in the right direction with your comments!
“Memory allocated through the CUDA Runtime API, such as via cudaMalloc(), is guaranteed to be aligned to at least 256 bytes. Therefore, choosing sensible thread block sizes, such as multiples of the warp size (i.e., 32 on current GPUs), facilitates memory accesses by warps that are properly aligned.”
256-byte alignment has been a basic preference for memory alignment on GPUs for ages, which is why cudaMalloc() returns memory blocks with that alignment. Also, strictly required for textures, if I recall correctly.
I stopped keeping track of memory transaction sizes across GPU architectures many years ago, but if memory serves, transaction sizes have been large (128 bytes, maybe bigger?) in general to make the best use of GDDR memory bandwidth. And presumably those wide transactions need to be naturally aligned for best performance, with non-aligned accesses being split into parts when they cross alignment boundaries. Sorry to be vague, I really haven't been paying attention recently; I retired in 2014. Maybe Robert Crovella has deeper insights.
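To make the workaround concrete, here is a minimal, hypothetical bump sub-allocator that hands out pointers from the large cudaMalloc block while keeping every returned pointer 256-byte aligned relative to the base; the struct name and interface are made up for illustration.

```
// Hypothetical bump sub-allocator illustrating the fix: every pointer handed
// to a kernel stays 256-byte aligned relative to the base returned by
// cudaMalloc (which is itself aligned to at least 256 bytes).
#include <cstddef>

struct BumpAllocator {
    char  *base;     // pointer returned by cudaMalloc
    size_t size;     // total bytes in the big allocation
    size_t offset;   // current bump offset, starts at 0

    void *alloc(size_t bytes)
    {
        offset = (offset + 255) & ~(size_t)255;    // round offset up to 256 B
        if (offset + bytes > size) return nullptr; // out of space
        void *p = base + offset;
        offset += bytes;
        return p;
    }
};
```

With this kind of scheme, sub-buffers placed near the end of the big allocation keep the same 256-byte alignment as those at the start, which should avoid the misalignment penalty described above.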