Small const array accessible globally? Is it easy and possible?


Currently I have a small buffer (an array of 100 or so floats used as a reference for all calculations in my kernel function). I transfer this array from the host to the device initially, and pass it in as a parameter to the kernel for direct read-only use.

I am pretty sure there is a better way of making it constant and accessible from cache rather than global memory, since the array is never written, only read.

What can I do to make it faster than a simple global access in the kernel?

Quick example:
CUDA_SAFE_CALL(cudaMalloc((void**)&buffer_sim_dev_, buffer_sim_size_*sizeof(float)));
CUDA_SAFE_CALL(cudaMemcpy(buffer_sim_dev_, buffer_sim_, buffer_sim_size_*sizeof(float), cudaMemcpyHostToDevice));

kernel<<<grids, blocks>>>(buffer_sim_dev_);

// inside the kernel:
float x = buffer_sim_dev_[some_random_access_id];


Do all threads access the same some_random_access_id at the same time? If so, you want to put your array in constant memory.
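For reference, a minimal constant-memory sketch might look like this (the `BUF_SIZE` macro and the names `const_buf`/`upload` are illustrative, not from the original code):

```cuda
#include <cuda_runtime.h>

#define BUF_SIZE 100  // illustrative: the poster's array is ~100 floats

// __constant__ arrays live in constant memory and are cached on chip
__constant__ float const_buf[BUF_SIZE];

__global__ void kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // fastest when all threads of a warp read the SAME index
    // (the constant cache broadcasts a single address per cycle)
    if (i < n)
        out[i] = const_buf[i % BUF_SIZE];
}

void upload(const float *host_buf)
{
    // copy from host into the constant-memory symbol; no cudaMalloc needed
    cudaMemcpyToSymbol(const_buf, host_buf, BUF_SIZE * sizeof(float));
}
```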

If not, then either binding the array to a texture and reading with tex1Dfetch or loading the array into shared memory at the beginning of each block will be your optimal choice.
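A sketch of the texture route, using the texture-reference API that matches this (CUDA 2.x era) discussion; the names `tex_buf` and `setup` are illustrative, and `buffer_sim_dev_`/`buffer_sim_size_` follow the earlier snippet:

```cuda
#include <cuda_runtime.h>

// file-scope texture reference bound to linear device memory
texture<float, 1, cudaReadModeElementType> tex_buf;

__global__ void kernel(float *out, int n, int buf_size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex_buf, i % buf_size);  // cached read, tolerant of random indices
}

void setup(float *buffer_sim_dev_, int buffer_sim_size_)
{
    // bind the existing device buffer to the texture reference
    cudaBindTexture(0, tex_buf, buffer_sim_dev_, buffer_sim_size_ * sizeof(float));
}
```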

I think the __constant__ keyword is what I wanted :) With that, all accesses are cached on chip and read much faster, am I correct? A read from global memory takes 600 clock cycles or so, and I really want to avoid that. Each thread accesses arbitrary random positions in the array multiple times, so I can’t really exploit coalescing or spatial locality…

Indeed, constant memory is cached, and will boost the performance significantly compared to straight global memory.

However, to optimize to the next level, you need to be aware that constant memory performs optimally only when all threads in a warp are reading the same value from constant memory (it’s just the way the cache was built). If your threads are reading randomly from the data, as you say, you may get better performance by doing a coalesced load of those ~100 floats into shared memory at the beginning of the kernel and then reading randomly from that shared memory. This will likely be faster than constant memory for reads that diverge from thread to thread.

I.e. (assuming a 1-D block)

__global__ void kernel(..... float *d_const, int num_const)
{
	  extern __shared__ float c_const[];

	  // load the const data in a sliding window
	  for (unsigned int start = 0; start < num_const; start += blockDim.x)
	  {
		  if (start + threadIdx.x < num_const)
			   c_const[start + threadIdx.x] = d_const[start + threadIdx.x];	// fully coalesced load
	  }

	  __syncthreads();	// wait until every element is loaded before any thread reads

	  // continue with existing kernel reading from c_const instead of d_const.
}


That’s what I’m going to try now, but I’ve got a question about that coalesced load: why is the for loop there? The array will be populated the same way regardless, right?

Also, how would I declare shared memory with a dynamic size? It only seems to work if I set the size to a constant, not through a pointer (i.e. __shared__ float *f). Is there a shared-memory-specific malloc?

Check out the section on __shared__ in the 2.1 programming guide. For example, to get the equivalent of

short array0[128];

float array1[64];

int array2[256];

in dynamically allocated shared memory, declare:

extern __shared__ char array[];

__device__ void func() // __device__ or __global__ function
{
	short* array0 = (short*)array;

	float* array1 = (float*)&array0[128];

	int* array2 = (int*)&array1[64];
}

Then you need to pass the amount of shared memory to the kernel as the 3rd configuration parameter. See section 4.2.3 Execution Configuration:

Func<<< Dg, Db, Ns >>>(parameter);
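Concretely for the ~100-float buffer in this thread, the launch might look like this (a sketch; the names follow the earlier snippets):

```cuda
// third execution-configuration parameter = bytes of dynamic shared memory per block
size_t shmem_bytes = buffer_sim_size_ * sizeof(float);
kernel<<<grids, blocks, shmem_bytes>>>(buffer_sim_dev_, buffer_sim_size_);
```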

Oh. The for loop is there because I copied and pasted that code from a kernel where I do something similar. In that case, the array can potentially be larger than the block size, so the for loop populates the array blockDim.x elements at a time. I’m just programming for the general case. If you know your array is fixed at 100 elements, you can simplify the code a bit by using a static shared array and always remembering not to make your block size smaller than 100.
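For that fixed 100-element case, a simplified sketch without the loop or the dynamic allocation (assuming, as noted above, that blockDim.x is at least 100):

```cuda
#define NUM_CONST 100  // known fixed size of the reference array

__global__ void kernel(float *d_const /*, ... other parameters ... */)
{
    // statically sized shared array: no third launch parameter needed
    __shared__ float c_const[NUM_CONST];

    // one coalesced load; requires blockDim.x >= NUM_CONST
    if (threadIdx.x < NUM_CONST)
        c_const[threadIdx.x] = d_const[threadIdx.x];

    __syncthreads();  // all elements loaded before any thread reads them

    // ... kernel body reads c_const[some_random_access_id] ...
}
```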