Constant Arrays

One of the most serious performance issues I have run into so far with CUDA is constant arrays. These arrays are stored in symbol memory along with all the other constants used in a CUDA program. CUDA is very good at accessing constants as long as every thread of a warp reads the same element. The performance problem arises when different threads in a warp index different elements of a constant array: each distinct address costs an additional read cycle, and the resulting hit is quite large. I am not sure whether these extra read cycles can be hidden by other computation instructions.

The code I am working on uses a very small array of just 256 elements, which easily fits into the 8 KB per-multiprocessor constant cache. From what I have read, the current GPU hardware does not support register or cache indexing. If it did, the threads in a warp could fetch any element from a short array without incurring additional read cycles.

Are there any tricks I can use to speed up access to short arrays? Would moving them to texture memory help at all? Can we expect any improvements in this area in future hardware or CUDA revisions?

-Mark Granger
NewTek

You could try putting the array into global memory and reading it into shared memory at the start of your kernel, making sure the read is coalesced, of course. I'm not sure it is faster than using constant memory, but it might be.

With CUDA, there is no substitute for experimentation. Of course, any benchmark is only good at testing how fast the benchmark runs: latency hiding from real computations and/or application-specific memory access patterns WILL change the results significantly. But I post this anyway.

I made a simple benchmark that does nothing but memory accesses through a small 256-element uint array, using each value read as the index of the next read. See the code below. I tested it in two modes: one where all threads in a warp always access the same element, and one where the accesses are completely random. Timing is done with CUDA_PROFILER=1.

Here are the results, including an “effective” memory bandwidth counting all reads from the 256 element array:

Random mode:

                  Const        Tex      Shmem
gputime (us)   95951.84  123481.02   42771.76
Greads/s          13.34      10.37      29.93
GBytes/s           49.7      38.62      111.5

Warp-coherent mode:

                  Const        Tex      Shmem
gputime (us)   17542.92      70186      20436
Greads/s          72.97      18.24      62.64
GBytes/s         271.84      67.95     233.36

I’m a bit disappointed by the texture cache. Doesn’t the guide say that values in the cache have no access latency? Why, then, are random accesses from an array that fits entirely in the cache so slow, even compared to the global memory bandwidth of 86.4 GB/s? I guess this is just another example that the texture cache is really only useful when a single warp is accessing a very local region of the texture (i.e., much smaller than a 1 KB spread). Temporal locality from one read to the next seems to have no bearing on texture cache performance.

Anyway, I digress. For small random-access arrays, the way to go is definitely shared memory, which here is more than twice as fast as constant memory. If bank conflicts can be avoided even with the random accesses, that speed should increase further.

Here is the code:

#include <stdio.h>
#include <stdlib.h>

__constant__ unsigned int c_vals[256];
texture<unsigned int, 1, cudaReadModeElementType> tex;

__global__ void const_speedtest(unsigned int *d_outvals)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int cur = c_vals[threadIdx.x];
    for (int i = 0; i < 1000; i++) {
        // pointer chase through constant memory, unrolled 10x
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
        cur = c_vals[cur] & 0xff;
    }
    d_outvals[idx] = cur;
}

__global__ void tex_speedtest(unsigned int *d_outvals)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int cur = tex1Dfetch(tex, threadIdx.x);
    for (int i = 0; i < 1000; i++) {
        // pointer chase through the texture, unrolled 10x
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
        cur = tex1Dfetch(tex, cur) & 0xff;
    }
    d_outvals[idx] = cur;
}

__global__ void shmem_speedtest(unsigned int *d_outvals, unsigned int *d_invals)
{
    __shared__ unsigned int shared[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    shared[threadIdx.x] = d_invals[threadIdx.x];
    __syncthreads(); // make the whole table visible before the random reads
    unsigned int cur = shared[threadIdx.x];
    for (int i = 0; i < 1000; i++) {
        // pointer chase through shared memory, unrolled 10x
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
        cur = shared[cur] & 0xff;
    }
    d_outvals[idx] = cur;
}

int main(void)
{
    int n_blocks = 500;
    int block_size = 256;

    // allocate global memory for output
    unsigned int *d_outvals;
    cudaMalloc((void**)&d_outvals, sizeof(unsigned int) * n_blocks * block_size);

    // host memory for values
    unsigned int vals[256];

    // RANDOM MODE: every element is an independent random hop target
    for (int i = 0; i < 256; i++)
        vals[i] = rand() & 0xff;

    // global memory for values
    unsigned int *d_vals;
    cudaMalloc((void**)&d_vals, sizeof(unsigned int) * 256);

    // copy the table to the device in all three flavors
    cudaMemcpy(d_vals, vals, sizeof(unsigned int) * 256, cudaMemcpyHostToDevice);
    cudaBindTexture(0, tex, d_vals, sizeof(unsigned int) * 256);
    cudaMemcpyToSymbol(c_vals, vals, sizeof(unsigned int) * 256);

    // run the kernels
    for (int i = 0; i < 10; i++)
        const_speedtest<<< n_blocks, block_size >>>(d_outvals);
    cudaThreadSynchronize();
    for (int i = 0; i < 10; i++)
        tex_speedtest<<< n_blocks, block_size >>>(d_outvals);
    cudaThreadSynchronize();
    for (int i = 0; i < 10; i++)
        shmem_speedtest<<< n_blocks, block_size >>>(d_outvals, d_vals);
    cudaThreadSynchronize();

    // WARP-COHERENT MODE: runs of 32 equal values keep each warp uniform
    for (int i = 0; i < 256; i += 32) {
        vals[i] = rand() & 0xff;
        for (int j = i + 1; j < i + 32; j++)
            vals[j] = vals[i];
    }

    // copy the new table to the device (the texture stays bound to d_vals)
    cudaMemcpy(d_vals, vals, sizeof(unsigned int) * 256, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_vals, vals, sizeof(unsigned int) * 256);

    // run the kernels again
    for (int i = 0; i < 10; i++)
        const_speedtest<<< n_blocks, block_size >>>(d_outvals);
    cudaThreadSynchronize();
    for (int i = 0; i < 10; i++)
        tex_speedtest<<< n_blocks, block_size >>>(d_outvals);
    cudaThreadSynchronize();
    for (int i = 0; i < 10; i++)
        shmem_speedtest<<< n_blocks, block_size >>>(d_outvals, d_vals);
    cudaThreadSynchronize();

    return 0;
}

I don’t know many details of how a bound texture and the global memory behind it interact, but is there any chance that tex_speedtest warmed the texture cache, and the shmem_speedtest reads from global memory then hit in the texture cache (even though they weren’t done through the texture)?

No, standard global memory reads are not cached and use a different instruction than a texture read. Besides, the initial global memory read in that kernel is coalesced, so its performance cannot get any better than it is, and the benchmarking loop is so long that you’d never notice the time the initial read takes anyway.

What is CUDA_PROFILER, and how do I use it?

CUDA_PROFILER is an environment variable. Read the documentation that came with the toolkit download. You may also want to check out the new features in the 1.1 profiler; it can do useful things like tell you how many uncoalesced global memory reads your kernel performs.

This is a very interesting test. I have a basic coding question about shmem_speedtest: suppose the input data is stored somewhere (device memory, constant memory, or texture memory) and we want to load it into shared memory as quickly as possible. If the number of values equals the number of threads, we can write:

shared[threadIdx.x] = d_invals[threadIdx.x];

But if they are not equal, how should the code be written efficiently?

For example, say there are 256 threads but only about 20 floats that need to be loaded into shared memory for use by all threads. In that case, do we still want to load them into shared memory at all? It seems that bank conflicts would then be inevitable.

A bank conflict only happens when threads read from different addresses in the same bank. For 20 floats, we could just pass them as kernel parameters (I don’t know whether nvcc supports parameter-space addressing yet, but PTX does). For more values than that but fewer than 256, a single load guarded by an if is usually sufficient. For more still, use a load inside a for loop.

Thank you. I have a question about constant memory: when declaring it, we apparently have to specify the size. What if the size depends on the host application? Can we still use constant memory?

Another question: if all threads attempt to load the same value from device memory into shared memory, will that be slow? It seems the memory accesses would have to be serialized.

Don’t have all threads read the same value from global memory: that is uncoalesced and slow. Instead, have only threadIdx.x == 0 load the value into shared memory, then __syncthreads(). As long as the value is allocated with cudaMalloc, it will have the correct alignment to be read “coalesced” even though there is only one read.

There are many old threads on this forum discussing dynamic constant memory. You have to manage the allocation yourself: nvcc only supports constant memory sized at compile time.

I searched and read those posts. The answer is that we cannot use truly dynamic constant memory, since we have to set a size at declaration.

I will have to use texture memory instead, because textures don’t require a size at declaration, and at least the size of the texture array can be changed from the CPU side.

Thank you,

According to Section 5.1.2.3 of the Programming Guide v1.1, I would guess the texture cache’s performance may improve if you test 2D arrays, since texture memory is optimized for 2D locality.