Currently I have a small buffer (an array of 100 or so floats used as a reference for all calculations in my kernel function). I transfer this array from the host to the device initially, and pass it in as a parameter to the kernel for direct read-only use.
I am pretty sure there is a better way: making it constant and accessible from cache rather than global memory, since the array is never changed, only read from.
What can I do to make it faster than a simple global access in the kernel?
I think the __constant__ keyword is what I wanted :) With that, all accesses are cached on chip and read much faster, am I correct? A read from global memory takes 600 clock cycles or so, and I really want to eliminate that. Each thread accesses arbitrary random positions in the array multiple times, so I can't really exploit coalescing or spatial locality…
Indeed, constant memory is cached, and will boost the performance significantly compared to straight global memory.
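For reference, here is a minimal sketch of the __constant__ approach; the names `c_ref` and `h_ref` are illustrative, not from your code. A 100-float table fits easily within the 64 KB constant memory limit.

```cuda
// Constant memory must be declared at file scope.
__constant__ float c_ref[100];

__global__ void kernel(float *out, int n)
{
    // Reads go through the constant cache; fastest when all threads
    // in a warp read the same index in the same instruction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_ref[i % 100] * 2.0f;
}

// Host side: copy the table into constant memory before launching.
// float h_ref[100] = { ... };
// cudaMemcpyToSymbol(c_ref, h_ref, sizeof(h_ref));
```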
However, to optimize to the next level, you need to be aware that constant memory performs optimally only when all threads in a warp are reading the same value from constant memory (it's just the way the cache was built). If your threads are reading randomly from the data as you say, you may get better performance by doing a coalesced load of those ~100 floats into shared memory at the beginning of the kernel and then reading randomly from that shared memory. This will likely be faster than constant memory for reads that diverge from thread to thread.
I.e. (assuming a 1-D block)
__global__ void kernel(..... float *d_const, int num_const)
{
    extern __shared__ float c_const[];

    // load the const data in a sliding window, blockDim.x elements at a time
    for (unsigned int start = 0; start < num_const; start += blockDim.x)
        if (start + threadIdx.x < num_const)
            c_const[start + threadIdx.x] = d_const[start + threadIdx.x]; // fully coalesced load

    __syncthreads(); // make sure the whole array is loaded before any thread reads it

    // continue with existing kernel reading from c_const instead of d_const.
}
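One detail worth flagging: with extern __shared__, the size of the array is set at launch time via the third execution-configuration parameter. A hedged example, with grid/block sizes and the remaining kernel parameters as placeholders:

```cuda
int num_const = 100;
size_t smem_bytes = num_const * sizeof(float); // bytes for the extern __shared__ array
kernel<<<grid_size, block_size, smem_bytes>>>(/* other args, */ d_const, num_const);
```

If you forget the third parameter, the shared array has zero size and the loads above write out of bounds.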
Oh. The for loop is there because I copied and pasted that code from a kernel where I do something similar. In that case, the array can potentially be larger than the block size, so the for loop populates the array blockDim.x elements at a time. I'm just programming for the general case. If you know your array is fixed at 100 elements, you can simplify the code a bit by using a static shared array and always remembering not to make your block size smaller than 100.
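For that fixed-size case, the simplification might look like the sketch below (parameter list abbreviated; assumes blockDim.x >= 100):

```cuda
#define NUM_CONST 100

__global__ void kernel(float *d_const /* , other params */)
{
    __shared__ float c_const[NUM_CONST]; // static allocation; no launch parameter needed

    // One element per thread; valid only when blockDim.x >= NUM_CONST.
    if (threadIdx.x < NUM_CONST)
        c_const[threadIdx.x] = d_const[threadIdx.x];
    __syncthreads();

    // ... kernel body reads c_const[...] instead of d_const[...] ...
}
```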