Accessing/caching access to global/pinned memory

TrailingStop · May 9, 2023, 10:16am

Hi,

In my kernel all threads needs to access the same 64 bit value in memory. I can save this value in global or pinned memory. I like to know if there is a recommended way to access this cell without too much latency. I think of copying the value into a register value, doing some work in the kernel and after that check this 64 bit value:

volatile uint64_t* pGlobalMemoryFlags; // pointer to value in global/pinnen memory

uint64_t l_ui64Flags = *pGlobalMemoryFlags;

do_stuff_that_takes_some_time();

if( l_ui64Flags & 0x1 )
{
do_other_stuff();
}

Would that work or is there any better or recommended way to hide the accesstime to the global memory.

Thanks,
Daniel

Robert_Crovella · May 9, 2023, 1:38pm

If its read-only, I would send it as a kernel parameter.

TrailingStop · May 9, 2023, 1:47pm

Hi,

yes, I send pGlobalMemoryFlags as parameter to the kernel function. I already made a few tests and noticed that constant memory is by far the fastest I can get. Without reading the value I can run about 500 kernels per seconds. If I have a single 64 bit value in global memory that I read twice in the kernel I end up with 50 kernels per second. Similar results for pinned memory. If I use constant memory I get again almost 500 kernels per second. Seems to me that I either do not have enough work to hide the latency - or the compiler removes my first read - knowing that I need the value much later.

Anyway - constant memory seems to be the way to go. Is there any way to allocate constant memory during runtime dynamically? Unfortunately I need to replace the kernel once in a few minutes - it would be nice if I can allocate constant memory once and use it for all kernels I load.

Thanks,
Daniel

Robert_Crovella · May 9, 2023, 1:55pm

Sure, you can allocate constant memory directly.

Here is an example. If you want to update the contents in-between kernel launches, you can do that with cudaMemcpyToSymbol() in a fashion similar to usage of a __device__ variable. With a bit of googling you will find many examples.

TrailingStop · May 9, 2023, 2:03pm

I was somewhat unclear. I mean something like cuMemAlloc/cuMemFree. In the sample you show static constant memory that is already allocated with the kernel.

The global memory I need I allocate once when the application starts and I can use it for all kernel i will load - just passing the device pointer to the kernel function. So far I have not found functions to allocate/free constant memory dynamically so the I also can allocate it once and use it for all kernels.

Otherwise I need to get the var pointer after loading the kernel, setup the constant memory and run the kernel. If I load a new kernel I need to :

invalidate the current constant memory pointer
load the new kernel
the the constant memory address
copy the data I need into the memory

But my guess is that I cannot allocate constant memory from host code.

Thanks,
Daniel

Robert_Crovella · May 9, 2023, 2:12pm

You can’t do a dynamic allocation of constant memory (other than what transpires under the hood with a kernel parameter), but if you have a single 64-bit value I’m at a loss to understand what the issue may be. I don’t know why a static allocation would not work for that. I don’t know what “is already allocated with the kernel” means.

A __constant__ definition is a global (file-scope) entity. It is not per-kernel or anything else. It is allocated once, statically, and can be used by any kernel.

TrailingStop · May 9, 2023, 2:57pm

Thanks again for your information. I should make my points more clear. I use this 64 bit value to ‘control’ the running kernel. I change the value from the hostside while the kernel is active. This gives me a little control how the kernel does its job - you pointed me to an article how one can exchange information from gpu ↔ cpu. From my understanding this will not work with a static value passed to the kernel as parameter - by value.

I tried for this global/pinned and constant memory. The result was that global and pinned came with some performance penalties while the constant memory have almost no impact - performance wise.
All my kernels are jit compiled - depending on the current requirements. So they do not share a single file - the code is created in memory, compiled (nvrtc), loaded into the gpu and then used.

I understand that parameters of the global function uses constant memory - that is a very useful information. I was for some reason under the impression that is uses global memory.

Thanks a lot for clarifying,

Daniel

Robert_Crovella · May 9, 2023, 8:24pm

While it may work, I don’t think __constant__ memory (in any flavor) is a good choice for signalling a running kernel. The reason for this is because of the __constant__ cache. This is a per-SM resource that caches data, and presumes that the data will not change while the kernel is running – a write to the underlying data does not invalidate the cache. I haven’t tried combining volatile with __constant__, but I don’t think that is the right way to go either. It seems counter productive.

TrailingStop · May 10, 2023, 2:58am

Yes, I agree and that is the reason for my original question. In my tests I run my kernel and checked the contents on the flags several times. I printed the location when the flag value changes for the kernel. I saw it changes during the runtime of the kernel - so it seems to work. Maybe that there is a huge latency for the const flags to reflect the value set from the host. The ptx for the access looks like expected:

ld.const.u64 %rd37, [%rd1];

%rd1 contains the address of the flags. So the volatile keyword I use seems not to have any impact.

__device__ __constant__ volatile uint64_t dev_ui64Flags[CUDA_STREAMS_PRIMARY];

I run the same kernel with an allocated global and pinned memory (cuMemAlloc/cuMemAllocHost). Using these memory as flags came with a very large performance penalty - and this is the reason for my question. Can I somehow improve the performance to access this single 64 bit value? I tried to assign the value to a register before I access the flags like this to reduce the latency of the global memory:

uint64_t l_ui64Flags = *pGlobalMemoryFlags;
do_stuff_that_takes_some_time();

if( l_ui64Flags & 0x1 )
{
do_other_stuff();
}

But that didn’t gave me some improvements. So is there something else I can do to improve the performance? I need to prefetch the value from memory, but the way I tried didn’t work.

This flag is the only global memory I need to access in my sample kernel - so no other global data needs to be loaded (if that is somehow important).

Thanks.

Robert_Crovella · May 10, 2023, 4:16am

The short answer is that I don’t have any suggestions to improve the latency of access to global space with the volatile keyword.

The rest of this is just some musings, not very important.

I generally don’t suggest doing much reasoning based on PTX. It goes thru another compilation step that can change code structure significantly.
I’m not sure what sort of “impact” you expect from volatile. The primary effect that it should have is to not allow a value to be “optimized into a register”, but instead require that each and every access involve an LD or ST instruction. Since your constant load is witnessing an actual ld instruction (albeit PTX) we can’t support any reasoning that suggests that volatile didn’t have the desired effect. (For a global load, we would also expect to see a modification to bypass the L1 cache, but there is no corresponding instruction for constant AFAIK)
lets also keep in mind that the PTX constant ld instruction you have focused on tells you nothing about caching behavior. That ld can still theoretically “hit” in the cache.
for me the bottom line is still that a write to the backing store for constant does not (AFAIK) invalidate the cache, therefore whether it appears to work, that methodology is unreliable, in my view. Do as you wish of course. I wouldn’t write code depending on that mechanism.
It’s entirely plausible that if you compare the performance of repeatedly accessing data from the constant cache, vs. repeatedly accessing volatile global data, that there will be a substantial discrepancy. One is hitting in the SM cache, and one is hitting (at best) in the L2 cache. There is going to be difference in latency I’m fairly sure. The performance comparison you describe doesn’t surprise me.
In the final analysis, triggering on data in global space must involve trips to global. Whatever cost that is, will have to be incurred. Thus my initial statement.

Curefab · May 29, 2023, 12:25am

Perhaps you can split/defuse the kernel?

Topic		Replies	Views
constant cache no faster than global mem? constant memory access CUDA Programming and Performance	5	5710	March 6, 2008
access constant memory CUDA Programming and Performance	1	711	August 8, 2015
Constant memory allocation and initialization CUDA Programming and Performance	12	82718	November 20, 2010
Constant Vs Global Memory Performance CUDA Programming and Performance	2	2646	October 31, 2015
Why does it need constant memory CUDA Programming and Performance	4	819	October 10, 2013
Optimizing a Kernel with a lot of variables - memory allocation. CUDA Programming and Performance	6	746	August 5, 2016
__constant__ memory failed to load. CUDA Programming and Performance	3	2586	August 25, 2007
Measuring global memory access speed CUDA Programming and Performance	9	2125	October 25, 2018
Some problems with constant memory CUDA Programming and Performance	1	681	December 9, 2013
dynamic allocation of constant memory possible? CUDA Programming and Performance	4	12010	June 26, 2008

Accessing/caching access to global/pinned memory

Related topics