Is it possible to dynamically allocate constant memory?
My program has four arrays that are statically sized to 1KB each. These have always been this way and always will be - that part is easy. I have an additional six arrays that are fixed for a given problem (i.e. fixed for the duration of the computation but variable across different executions). Can I allocate these dynamically at the start of execution to the appropriate size? I currently have them allocated to a fixed 10k each, but my program would be much more flexible if I could dynamically allocate each one…
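For reference, here is roughly what I do now (names and sizes are placeholders): each table is declared at a fixed maximum size in constant memory and filled once at start-up with cudaMemcpyToSymbol, copying only the bytes that are actually needed.

#include <cuda_runtime.h>

#define MAX_TABLE_ELEMS 2560                 // 10KB of floats, my current fixed size

__constant__ float d_table0[MAX_TABLE_ELEMS];

// Host-side setup: n is the problem-dependent size (n <= MAX_TABLE_ELEMS).
void uploadTable0(const float* h_table, size_t n)
{
    cudaMemcpyToSymbol(d_table0, h_table, n * sizeof(float));
}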
What is the performance of textures versus constant memory for lookup tables?
This is for the same set of arrays - all ten arrays are one-dimensional. I saw a substantial speed improvement by switching the aforementioned arrays from global to constant memory. Should I expect an additional gain by changing them to textures? If so, can anyone point me towards an appropriate tutorial on using textures for a similar purpose?
What is the access pattern for your lookup table? Constant memory is very fast at broadcasting the same value to many threads. Textures might be faster for linear reads, and I don't think either will be very good at random reads. (For random access, finding a way to stage your lookup tables in shared memory is the biggest win, if you can make them fit.)
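By staging in shared memory I mean something along these lines (a minimal sketch; the table size and array names are made up): the block copies the table from global memory once, and the random reads then hit shared memory instead.

#define TABLE_SIZE 256

__global__ void kernelWithStagedTable(const float* g_table, const int* indices,
                                      float* out, int n)
{
    __shared__ float s_table[TABLE_SIZE];

    // Cooperative copy: each thread loads a few entries of the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = g_table[i];
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = s_table[indices[gid]];    // random reads now come from shared memory
}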
To be honest, memory access patterns are one of the subjects I am slightly lost on; I've never considered or studied them before. I will describe the way I use the tables, and I hope this will be useful…
I ported my CUDA code from existing C code, so I kept most of the existing structure. My code works on a number of 3D I/O arrays which have been allocated as 1D arrays. The I/O arrays are accessed in column-major fashion (i.e. index = x*NY*NZ + y*NZ + z). There are also a number of input-only arrays. Some of these are sized and accessed exactly as the I/O arrays, but I will work to speed these up later if possible. The arrays I am currently referring to are two sets of true 1D arrays.
The first set contains four floating point arrays sized to 256 elements each. These are basically accessed randomly. The second set of arrays is sized to NX, NY, and NZ respectively. To access these, I decompose the incoming index into an x, y, and z component.
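To be concrete, the indexing looks roughly like this (NX, NY, and NZ are the grid dimensions; the helper names are just for illustration):

// Linear index used throughout the code: index = x*NY*NZ + y*NZ + z
__host__ __device__ inline int linearIndex(int x, int y, int z, int NY, int NZ)
{
    return x * NY * NZ + y * NZ + z;
}

// Decomposition used for the 1D tables sized NX, NY, and NZ respectively.
__device__ inline void decompose(int index, int NY, int NZ, int* x, int* y, int* z)
{
    *x = index / (NY * NZ);
    *y = (index / NZ) % NY;
    *z = index % NZ;
}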
You must read the several pages in the Programming Guide that discuss this. It is the most important thing in CUDA, and there are some rules for each type of memory.
What's missing from your description is what the access pattern looks like in a single instruction across threads versus what it looks like across successive instructions. Usually, the pattern within a simultaneous instruction is the most important, but for memory types that are cached you also have to think about the pattern over time.
If the threads in a warp access the same table entry in an instruction, constant memory will be lightning fast. If they don't, the threads will access the constant memory one by one (i.e. 16x slower), and performance from constants and textures will be comparable (I'm not sure which will be faster exactly).
To give a bit more detail, the various arrays are accessed as follows:
I/O Arrays
During a single execution of the kernel, each thread accesses several I/O arrays. It reads one value from its own (x, y, z) location in each array and one value each at (x+1, y, z), (x, y+1, z), and (x, y, z+1). It ultimately writes back to one of these I/O arrays at its (x, y, z) location (the incoming index). Across multiple executions of the kernel, a thread with a given index will always access the same array locations. This seems like the place where shared memory might be useful, since neighboring threads are reading data from the same locations. However, I don't have a clue how to take advantage of this fact…
Input Array Set 1
These are the arrays that are sized exactly the same as the I/O arrays. They are initialized prior to the first iteration, and they remain constant throughout execution. They are too large for constant memory, so I am very interested in any thoughts anyone has on speeding up their usage. During a single execution of the kernel, each thread accesses each array at its (x, y, z) location (the incoming index). Across multiple executions of the kernel, a thread with a given index will always access the same array locations.
Input Array Set 2
These are the arrays sized to 256 elements. During a single execution of the kernel, each thread accesses these arrays randomly; there is no relationship across threads. Obviously, many threads will access a single location simultaneously. Across multiple executions of the kernel, a thread with a given index will always access the same array locations.
Input Array Set 3
The arrays sized to NX, NY, and NZ. After decomposing the incoming index into an x, y, and z location, there is a read from each array. Across threads, there would have to be some sort of shared pattern to the reads… Presumably, many threads in the warp are working on the same x, y, or z location simultaneously (note the singular dimension, not all three coordinates simultaneously). I will need to give some more thought to that… Again, each thread accesses the same location across executions of the kernel.
I hope this is the information you were looking for. If not, I will be more than glad to try to refine further. I really appreciate the help!
You have a lot of coalescing, for the most part, which is good.
Are you able to replace multiple kernel calls with a for() loop inside the kernel? This way, you only have to read all those values in once (into registers, not shared mem).
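Something along these lines (only a sketch, and only valid if one iteration does not need results written by other threads; names are made up):

// Move the outer iteration loop inside the kernel so per-thread values are
// loaded into registers once and reused across iterations.
__global__ void iterateInsideKernel(float* io, const float* in, int n, int numSteps)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;

    float value = io[gid];            // read once into a register
    float coeff = in[gid];            // read once into a register

    for (int step = 0; step < numSteps; ++step)
        value = value * coeff;        // placeholder for the real per-step update

    io[gid] = value;                  // single write-back at the end
}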
reading (x,y,z) and (x+1,y,z) could be sped up using shared memory. let all threads of a block (successive x) read (x,y,z) into shared mem, and let the last thread of the block also read (x+1,y,z). then use the shared mem to access the values. this saves you roughly half of those global fetches (almost one of the two reads per thread).
btw: make sure that all threads within a block have the same y and z; use padding if necessary.
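roughly like this (a sketch only; assumes a 1D block of successive x with one y and z per block, the usual x*NY*NZ + y*NZ + z indexing, and made-up names):

// every thread loads its own (x,y,z) into shared memory; the last thread of
// the block also loads the (x+1,y,z) halo value. launch with
// (blockDim.x + 1) * sizeof(float) bytes of dynamic shared memory.
__global__ void stencilX(const float* in, float* out, int NX, int NY, int NZ)
{
    extern __shared__ float s_line[];

    int x   = blockIdx.x * blockDim.x + threadIdx.x;
    int y   = blockIdx.y;
    int z   = blockIdx.z;
    int idx = x * NY * NZ + y * NZ + z;

    if (x < NX)
        s_line[threadIdx.x] = in[idx];
    if (threadIdx.x == blockDim.x - 1 && x + 1 < NX)
        s_line[threadIdx.x + 1] = in[idx + NY * NZ];   // the (x+1,y,z) neighbour
    __syncthreads();

    if (x + 1 < NX)
        out[idx] = s_line[threadIdx.x] + s_line[threadIdx.x + 1];  // placeholder combine
}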
i can’t see any speed-up other than making sure they are coalesced here. :-(
sounds like they are perfect for constant memory. if you don’t have enough const mem left, use texture mem.
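for the texture route, something like this sketch using the current texture-object API (names are made up; older code would use texture references instead):

#include <cuda_runtime.h>

// bind an existing device buffer of floats to a texture object
cudaTextureObject_t makeTableTexture(float* d_table, size_t numElems)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_table;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = numElems * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

__global__ void useTable(cudaTextureObject_t tex, const int* indices, float* out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = tex1Dfetch<float>(tex, indices[gid]);   // cached read via the texture path
}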
if you get each thread block to have a single y and z, you could speed up the reads of the y and z tables by using shared mem: this way, only one warp has to read those two values instead of every warp.
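i.e. something like this (a sketch; assumes one y and z per block and made-up table names):

__global__ void useYZTables(const float* tableY, const float* tableZ,
                            float* out, int NY, int NZ)
{
    __shared__ float s_y, s_z;

    int y = blockIdx.y;                 // same y for the whole block
    int z = blockIdx.z;                 // same z for the whole block

    if (threadIdx.x == 0) {
        s_y = tableY[y];                // one global read per block instead of per warp
        s_z = tableZ[z];
    }
    __syncthreads();

    int x   = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = x * NY * NZ + y * NZ + z;
    out[idx] = s_y + s_z;               // placeholder use of the two values
}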
i hope this helps get you those missing seconds. :-p