In my kernel, all threads need to access the same 64-bit value in memory. I can store this value in global or pinned memory. I'd like to know if there is a recommended way to access this location without too much latency. My idea is to copy the value into a register, do some work in the kernel, and only afterwards check this 64-bit value:
volatile uint64_t* pGlobalMemoryFlags; // pointer to value in global/pinned memory
uint64_t l_ui64Flags = *pGlobalMemoryFlags;
if( l_ui64Flags & 0x1 )
Would that work, or is there a better or recommended way to hide the access time to global memory?
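In full, the idea in the question might look something like this (a sketch only; the kernel name and the "work" are placeholders, not the poster's actual code):

```cuda
#include <cstdint>

// Issue the flag load as early as possible, do independent work, and
// only consume the value at the end, hoping the latency is overlapped.
__global__ void workKernel(volatile uint64_t* pGlobalMemoryFlags,
                           float* data, int n)
{
    // Load started early so its latency can overlap the math below.
    uint64_t l_ui64Flags = *pGlobalMemoryFlags;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f + 1.0f;  // independent placeholder work
    }

    // Only now branch on the flag value.
    if (l_ui64Flags & 0x1) {
        return;  // e.g. take an early-exit path
    }
}
```

Note that the compiler is free to sink the load toward its first use unless something (such as volatile) prevents it, which is one of the possibilities raised later in this thread.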
If it's read-only, I would pass it as a kernel parameter.
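For a value that never changes during a launch, that is the simplest option; kernel parameters are placed in constant memory by the driver. A minimal sketch (names are illustrative):

```cuda
#include <cstdint>

// The 64-bit value is passed by value; no memory load from a
// user-managed buffer is needed inside the kernel.
__global__ void workKernel(uint64_t ui64Flags, float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && (ui64Flags & 0x1)) {
        data[idx] += 1.0f;
    }
}

// Host side:
// workKernel<<<grid, block>>>(hostFlags, d_data, n);
```

The trade-off, discussed below, is that a by-value parameter is fixed for the duration of the launch and cannot be changed by the host while the kernel runs.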
Yes, I pass pGlobalMemoryFlags as a parameter to the kernel function. I have already run a few tests and noticed that constant memory is by far the fastest option. Without reading the value I can run about 500 kernels per second. If I read a single 64-bit value from global memory twice in the kernel, I end up with 50 kernels per second; similar results for pinned memory. With constant memory I again get almost 500 kernels per second. It seems to me that either I do not have enough work to hide the latency, or the compiler removes my first read, knowing that I need the value much later.
Anyway, constant memory seems to be the way to go. Is there any way to allocate constant memory dynamically at runtime? Unfortunately I need to replace the kernel every few minutes; it would be nice if I could allocate constant memory once and use it for all the kernels I load.
Sure, you can allocate constant memory directly. Here is an example. If you want to update the contents in between kernel launches, you can do that with cudaMemcpyToSymbol(), in a fashion similar to the usage of a __device__ variable. With a bit of googling you will find many examples.
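A minimal sketch of that pattern (variable and kernel names are illustrative, and error checking is omitted):

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Statically allocated constant memory, visible to every kernel in
// this translation unit.
__constant__ uint64_t c_ui64Flags;

__global__ void workKernel(float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && (c_ui64Flags & 0x1)) {
        data[idx] += 1.0f;
    }
}

int main()
{
    uint64_t hostFlags = 0x1;
    // Update the constant in between launches (not while a kernel
    // that reads it is running).
    cudaMemcpyToSymbol(c_ui64Flags, &hostFlags, sizeof(hostFlags));
    // workKernel<<<grid, block>>>(d_data, n);
    return 0;
}
```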
I was somewhat unclear. I mean something like cuMemAlloc/cuMemFree. In the sample you show, the constant memory is static and already allocated with the kernel.
The global memory I need is allocated once when the application starts, and I can use it for all kernels I will load, just by passing the device pointer to the kernel function. So far I have not found functions to allocate/free constant memory dynamically, so that I can likewise allocate it once and use it for all kernels.
Otherwise I need to get the variable's address after loading the kernel, set up the constant memory, and then run the kernel. If I load a new kernel I need to:
- invalidate the current constant memory pointer
- load the new kernel
- get the constant memory address
- copy the data I need into the memory
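With the driver API, the per-module steps above could be sketched roughly like this (function and symbol names are placeholders; with NVRTC, the symbol name may need to be the lowered/mangled name, obtainable via nvrtcGetLoweredName; error checking omitted):

```cuda
#include <cstdint>
#include <cuda.h>

// Load a freshly JIT-compiled module and copy the current flag value
// into that module's __constant__ variable.
void loadModuleAndSetFlags(const void* cubinImage, uint64_t hostFlags,
                           CUmodule* pModule, CUfunction* pKernel)
{
    // 1. Load the new module.
    cuModuleLoadData(pModule, cubinImage);

    // 2. Look up the module's __constant__ variable by name.
    CUdeviceptr dptr;
    size_t bytes;
    cuModuleGetGlobal(&dptr, &bytes, *pModule, "dev_ui64Flags");

    // 3. Copy the flag value into the module's constant memory.
    cuMemcpyHtoD(dptr, &hostFlags, sizeof(hostFlags));

    // 4. Fetch the kernel entry point for launching.
    cuModuleGetFunction(pKernel, *pModule, "workKernel");
}
```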
But my guess is that I cannot allocate constant memory from host code.
You can’t do a dynamic allocation of constant memory (other than what transpires under the hood with a kernel parameter), but if you have a single 64-bit value I’m at a loss to understand what the issue may be. I don’t know why a static allocation would not work for that. I don’t know what “is already allocated with the kernel” means.
A __constant__ definition is a global (file-scope) entity. It is not per-kernel or anything else. It is allocated once, statically, and can be used by any kernel.
Thanks again for your information. I should make my points clearer. I use this 64-bit value to 'control' the running kernel: I change the value from the host side while the kernel is active, which gives me a little control over how the kernel does its job. You pointed me to an article on how one can exchange information between GPU and CPU. From my understanding, this will not work with a static value passed to the kernel as a parameter, by value.
I tried global, pinned, and constant memory for this. The result was that global and pinned memory came with some performance penalties, while constant memory had almost no impact performance-wise.
All my kernels are JIT-compiled, depending on the current requirements. So they do not share a single file; the code is created in memory, compiled (NVRTC), loaded onto the GPU, and then used.
I understand now that parameters of a __global__ function use constant memory; that is very useful information. For some reason I was under the impression that they use global memory.
Thanks a lot for clarifying.
While it may work, I don't think __constant__ memory (in any flavor) is a good choice for signalling a running kernel. The reason for this is the __constant__ cache. This is a per-SM resource that caches the data and presumes that the data will not change while the kernel is running; a write to the underlying data does not invalidate the cache. I haven't tried combining volatile with __constant__, but I don't think that is the right way to go either. It seems counterproductive.
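The usual pattern for signalling a running kernel (as in the GPU ↔ CPU article mentioned earlier in the thread, though this sketch is not that article's exact code) is volatile access to pinned, mapped memory rather than __constant__:

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// The host writes to pinned, mapped memory; the kernel polls it
// through a volatile pointer, so every read goes back to memory
// instead of a stale cache or register.
__global__ void pollingKernel(volatile uint64_t* pFlags, float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    while (!(*pFlags & 0x1)) {               // re-read on every iteration
        if (idx < n) data[idx] += 1.0f;      // keep doing work meanwhile
        __threadfence_system();              // order accesses vs. the host
    }
}

int main()
{
    uint64_t* hFlags;
    cudaHostAlloc(&hFlags, sizeof(uint64_t), cudaHostAllocMapped);
    *hFlags = 0;

    uint64_t* dFlags;
    cudaHostGetDevicePointer(&dFlags, hFlags, 0);

    pollingKernel<<<1, 32>>>(dFlags, nullptr, 0);
    *hFlags = 0x1;            // host-side signal: tell the kernel to stop
    cudaDeviceSynchronize();  // kernel can now observe the flag and exit
    cudaFreeHost(hFlags);
    return 0;
}
```

As the rest of this thread discusses, those volatile global-space reads are exactly where the latency cost comes from.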
Yes, I agree, and that is the reason for my original question. In my tests I ran my kernel and checked the contents of the flags several times, printing the location when the flag value changed for the kernel. I saw it change during the runtime of the kernel, so it seems to work. Maybe there is a huge latency before the constant flags reflect the value set from the host. The PTX for the access looks as expected:
ld.const.u64 %rd37, [%rd1];
%rd1 contains the address of the flags. So the volatile keyword I use seems not to have any impact.
__device__ __constant__ volatile uint64_t dev_ui64Flags[CUDA_STREAMS_PRIMARY];
I ran the same kernel with allocated global and pinned memory (cuMemAlloc/cuMemAllocHost). Using this memory for the flags came with a very large performance penalty, and that is the reason for my question: can I somehow improve the performance of accessing this single 64-bit value? I tried to assign the value to a register before I access the flags, like this, to reduce the latency of the global memory access:
uint64_t l_ui64Flags = *pGlobalMemoryFlags;
if( l_ui64Flags & 0x1 )
But that didn't give me any improvement. So is there something else I can do to improve the performance? I would need to prefetch the value from memory, but the way I tried didn't work.
This flag is the only global memory I need to access in my sample kernel, so no other global data needs to be loaded (if that is somehow important).
The short answer is that I don't have any suggestions to improve the latency of access to the global space.
The rest of this is just some musings, not very important.
I generally don't suggest doing much reasoning based on PTX. It goes through another compilation step that can change the code structure significantly.
I'm not sure what sort of "impact" you expect from volatile. Its primary effect should be to not allow a value to be "optimized into a register", but instead to require that each and every access involve an LD or ST instruction. Since your constant load shows up as an actual ld instruction (albeit PTX), we can't support any reasoning that suggests volatile didn't have the desired effect. (For a global load, we would also expect to see a modification to bypass the L1 cache, but there is no corresponding instruction for constant, AFAIK.)
Let's also keep in mind that the PTX constant ld instruction you have focused on tells you nothing about caching behavior. That ld can still theoretically "hit" in the cache.
For me the bottom line is still that a write to the backing store for constant does not (AFAIK) invalidate the cache; therefore, whether or not it appears to work, that methodology is unreliable, in my view. Do as you wish, of course. I wouldn't write code depending on that mechanism.
It's entirely plausible that if you compare the performance of repeatedly accessing data from the constant cache vs. repeatedly accessing volatile global data, there will be a substantial discrepancy. One is hitting in the SM cache, and one is hitting (at best) in the L2 cache. There is going to be a difference in latency, I'm fairly sure. The performance comparison you describe doesn't surprise me.
In the final analysis, triggering on data in the global space must involve trips to global memory. Whatever that cost is, it will have to be incurred. Thus my initial statement.
Perhaps you can split up the kernel?
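One way to read that suggestion: instead of one long-running kernel that polls a device-side flag, launch the work in slices and let the host check the flag between launches, where the read is cheap. A sketch under those assumptions (names and slice size are illustrative):

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// One slice of the overall work.
__global__ void workSlice(float* data, int n, int sliceOffset)
{
    int idx = sliceOffset + blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] * 2.0f + 1.0f;
}

// The host checks the flag between launches; no device-side flag
// read is needed at all.
void runSliced(float* d_data, int n, volatile uint64_t* hostFlags)
{
    const int slice = 1 << 20;
    for (int off = 0; off < n; off += slice) {
        if (*hostFlags & 0x1) break;          // host-side check only
        int count = (n - off < slice) ? (n - off) : slice;
        int block = 256;
        int grid  = (count + block - 1) / block;
        workSlice<<<grid, block>>>(d_data, n, off);
        cudaStreamSynchronize(0);             // or pipeline with events
    }
}
```

The trade-off is the per-launch overhead, which matters here given the ~500 launches per second figure quoted earlier in the thread.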