Let’s start with the fact that I am fairly new to CUDA, so I might be missing something obvious!
When I allocate a huge chunk of memory (7 GB) at once and pass the kernel pointers that point to the beginning of the allocated space, the kernel performs as expected. However, when I pass pointers that point to the end of the allocated chunk, I get worse performance, sometimes up to 2x slower.
cuda-memcheck isn't reporting any errors, and I even checked the pointers to confirm that they actually point to device memory, which they do. I have included some dummy code below that exhibits the problem. On my machine the first kernel invocation runs in about 5.6 ms, while the second one takes 7.6 ms. I can't think of any reason why the second one is slower. Any ideas?
I haven't initialised the memory, as I am only interested in testing performance, but I have made sure, in a separate file, that both the original kernel and the dummy one produce the correct output.
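To make the setup concrete, here is a minimal sketch of the kind of test I mean. It is not the exact code I ran: the kernel `dummyKernel`, the helper `timeLaunch`, and the 64 MiB per-array size are placeholders made up for illustration, so the timings it prints may well differ from the numbers above. It allocates one 7 GiB block, then times the same kernel once with pointers at the start of the block and once with pointers at the end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: reads one array and writes another.
__global__ void dummyKernel(const float* in, float* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Times a single launch with CUDA events; returns elapsed milliseconds.
static float timeLaunch(const float* in, float* out, size_t n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);

    cudaEventRecord(start);
    dummyKernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t totalBytes = 7ULL << 30;   // 7 GiB in one allocation
    const size_t n = (size_t)1 << 24;       // assumed: 16M floats (64 MiB) per array

    char* base = nullptr;
    if (cudaMalloc((void**)&base, totalBytes) != cudaSuccess) {
        printf("cudaMalloc of %zu bytes failed\n", totalBytes);
        return 1;
    }

    // Pointers at the very beginning of the allocation.
    float* headIn  = (float*)base;
    float* headOut = headIn + n;

    // Pointers at the very end of the allocation.
    float* tailOut = (float*)(base + totalBytes) - n;
    float* tailIn  = tailOut - n;

    // The memory is left uninitialised on purpose; only timing matters here.
    timeLaunch(headIn, headOut, n);         // warm-up launch, result discarded

    printf("start of chunk: %.3f ms\n", timeLaunch(headIn, headOut, n));
    printf("end of chunk:   %.3f ms\n", timeLaunch(tailIn, tailOut, n));

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(base);
    return 0;
}
```

Assuming the file is saved as `repro.cu`, it should build with `nvcc -O2 -o repro repro.cu`. The warm-up launch is there so that one-time startup cost does not get counted against whichever case happens to run first.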
I am using a GTX 1080, CUDA 11.2, and Ubuntu 20.04.
Here’s a pastebin link to the code. Any help would be greatly appreciated!