Is it possible to dynamically allocate constant memory?
My program has four arrays that are statically sized to 1KB each. These have always been this way and always will be - that part is easy. I have an additional six arrays that are fixed for a given problem (i.e. fixed for the duration of the computation but variable across different executions). Can I allocate these dynamically at the start of execution to the appropriate size? I currently have them allocated to a fixed 10k each, but my program would be much more flexible if I could dynamically allocate each one…
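For reference, here is roughly what I do now (names and sizes are placeholders): each table is declared at a fixed maximum size in constant memory and filled once at start-up with cudaMemcpyToSymbol, copying only the bytes that are actually needed.

#include <cuda_runtime.h>

#define MAX_TABLE_ELEMS 2560                 // 10KB of floats, my current fixed size

__constant__ float d_table0[MAX_TABLE_ELEMS];

// Host-side setup: n is the problem-dependent size (n <= MAX_TABLE_ELEMS).
void uploadTable0(const float* h_table, size_t n)
{
    cudaMemcpyToSymbol(d_table0, h_table, n * sizeof(float));
}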
What is the performance of textures versus constant memory for lookup tables?
This is for the same set of arrays - all ten arrays are one-dimensional. I saw a substantial speed improvement by switching the aforementioned arrays from global to constant memory. Should I expect an additional gain by changing them to textures? If so, can anyone point me towards an appropriate tutorial on using textures for a similar purpose?
What is the access pattern for your lookup table? Constant memory is very fast at broadcasting the same value to many threads. Textures might be faster for linear reads, and I don't think either will be very good at random reads. (For random access, finding a way to stage your lookup tables in shared memory is the biggest win, if you can make them fit.)
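By staging in shared memory I mean something along these lines (a minimal sketch; the table size and array names are made up): the block copies the table from global memory once, and the random reads then hit shared memory instead.

#define TABLE_SIZE 256

__global__ void kernelWithStagedTable(const float* g_table, const int* indices,
                                      float* out, int n)
{
    __shared__ float s_table[TABLE_SIZE];

    // Cooperative copy: each thread loads a few entries of the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = g_table[i];
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = s_table[indices[gid]];    // random reads now come from shared memory
}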
To be honest, memory access patterns are one of the subjects I am slightly lost on; I've never considered or studied them before. I will describe the way I use the tables, and I hope this will be useful…
I ported my CUDA code from existing C code, so I kept most of the existing structure. My code works on a number of 3D I/O arrays which have been allocated as 1D arrays. The I/O arrays are accessed in column-major fashion (i.e. index = x*NY*NZ + y*NZ + z). There are also a number of input-only arrays. Some of these are sized and accessed exactly as the I/O arrays, but I will work to speed these up later if possible. The arrays I am currently referring to are two sets of true 1D arrays.
The first set contains four floating point arrays sized to 256 elements each. These are basically accessed randomly. The second set of arrays is sized to NX, NY, and NZ respectively. To access these, I decompose the incoming index into an x, y, and z component.
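To be concrete, the indexing looks roughly like this (NX, NY, and NZ are the grid dimensions; the helper names are just for illustration):

// Linear index used throughout the code: index = x*NY*NZ + y*NZ + z
__host__ __device__ inline int linearIndex(int x, int y, int z, int NY, int NZ)
{
    return x * NY * NZ + y * NZ + z;
}

// Decomposition used for the 1D tables sized NX, NY, and NZ respectively.
__device__ inline void decompose(int index, int NY, int NZ, int* x, int* y, int* z)
{
    *x = index / (NY * NZ);
    *y = (index / NZ) % NY;
    *z = index % NZ;
}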
You must read the several pages in the Programming Guide that discuss this. It is the most important thing in CUDA, and there are some rules for each type of memory.
What's missing from your description is what the access pattern looks like in a single instruction across threads versus what it looks like across successive instructions. Usually, the pattern within a simultaneous instruction is the most important, but for memory types that are cached you also have to think about the pattern over time.
If the threads in a warp access the same table entry in an instruction, constant memory will be lightning fast. If they don't, the threads will access the constant memory one by one (i.e. 16x slower), and performance from constants and textures will be comparable (I'm not sure which will be faster exactly).
To give a bit more detail, the various arrays are accessed as follows:
I/O Arrays
During a single execution of the kernel, each thread accesses several I/O arrays. It reads one value from its own (x, y, z) location in each array and one value each at (x+1, y, z), (x, y+1, z), and (x, y, z+1). It ultimately writes back to one of these I/O arrays at its (x, y, z) location (the incoming index). Across multiple executions of the kernel, a thread with a given index will always access the same array locations. This seems like the place where shared memory might be useful, since neighboring threads are reading data from the same locations. However, I don't have a clue how to take advantage of this fact…
Input Array Set 1
These are the arrays that are sized exactly the same as the I/O arrays. They are initialized prior to the first iteration, and they remain constant throughout execution. They are too large for constant memory, so I am very interested in any thoughts anyone has on speeding up their usage. During a single execution of the kernel, each thread accesses each array at its (x, y, z) location (the incoming index). Across multiple executions of the kernel, a thread with a given index will always access the same array locations.
Input Array Set 2
These are the arrays sized to 256 elements. During a single execution of the kernel, each thread accesses these arrays randomly; there is no relationship across threads. Obviously, many threads will access a single location simultaneously. Across multiple executions of the kernel, a thread with a given index will always access the same array locations.
Input Array Set 3
The arrays sized to NX, NY, and NZ. After decomposing the incoming index into an x, y, and z location, there is a read from each array. Across threads, there would have to be some sort of shared pattern to the reads… Presumably, many threads in the warp are working on the same x, y, or z location simultaneously (note the singular dimension, not all three coordinates simultaneously). I will need to give some more thought to that… Again, each thread accesses the same location across executions of the kernel.
I hope this is the information you were looking for. If not, I will be more than glad to try to refine further. I really appreciate the help!
You have a lot of coalescing, for the most part, which is good.
Are you able to replace multiple kernel calls with a for() loop inside the kernel? This way, you only have to read all those values in once (into registers, not shared mem).
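Something along these lines (only a sketch, and only valid if one iteration does not need results written by other threads; names are made up):

// Move the outer iteration loop inside the kernel so per-thread values are
// loaded into registers once and reused across iterations.
__global__ void iterateInsideKernel(float* io, const float* in, int n, int numSteps)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;

    float value = io[gid];            // read once into a register
    float coeff = in[gid];            // read once into a register

    for (int step = 0; step < numSteps; ++step)
        value = value * coeff;        // placeholder for the real per-step update

    io[gid] = value;                  // single write-back at the end
}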
reading (x,y,z) and (x+1,y,z) could be sped up using shared memory. let all threads of a block (successive x) read (x,y,z) into shared mem, and let the last thread of the block also read (x+1,y,z). then use the shared mem to access the values. this saves you roughly half of those global fetches (almost one of the two reads per thread).
btw: make sure that all threads within a block have the same y and z; use padding if necessary.
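roughly like this (a sketch only; assumes a 1D block of successive x with one y and z per block, the usual x*NY*NZ + y*NZ + z indexing, and made-up names):

// every thread loads its own (x,y,z) into shared memory; the last thread of
// the block also loads the (x+1,y,z) halo value. launch with
// (blockDim.x + 1) * sizeof(float) bytes of dynamic shared memory.
__global__ void stencilX(const float* in, float* out, int NX, int NY, int NZ)
{
    extern __shared__ float s_line[];

    int x   = blockIdx.x * blockDim.x + threadIdx.x;
    int y   = blockIdx.y;
    int z   = blockIdx.z;
    int idx = x * NY * NZ + y * NZ + z;

    if (x < NX)
        s_line[threadIdx.x] = in[idx];
    if (threadIdx.x == blockDim.x - 1 && x + 1 < NX)
        s_line[threadIdx.x + 1] = in[idx + NY * NZ];   // the (x+1,y,z) neighbour
    __syncthreads();

    if (x + 1 < NX)
        out[idx] = s_line[threadIdx.x] + s_line[threadIdx.x + 1];  // placeholder combine
}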
i can’t see any speed-up other than making sure they are coalesced here. :-(
sounds like they are perfect for constant memory. if you don’t have enough const mem left, use texture mem.
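for the texture route, something like this sketch using the current texture-object API (names are made up; older code would use texture references instead):

#include <cuda_runtime.h>

// bind an existing device buffer of floats to a texture object
cudaTextureObject_t makeTableTexture(float* d_table, size_t numElems)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_table;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = numElems * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

__global__ void useTable(cudaTextureObject_t tex, const int* indices, float* out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = tex1Dfetch<float>(tex, indices[gid]);   // cached read via the texture path
}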
if you get each thread block to have a single y and z, you could speed up the reads of the y and z tables by using shared mem: this way, only one warp has to read those two values instead of every warp.
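i.e. something like this (a sketch; assumes one y and z per block and made-up table names):

__global__ void useYZTables(const float* tableY, const float* tableZ,
                            float* out, int NY, int NZ)
{
    __shared__ float s_y, s_z;

    int y = blockIdx.y;                 // same y for the whole block
    int z = blockIdx.z;                 // same z for the whole block

    if (threadIdx.x == 0) {
        s_y = tableY[y];                // one global read per block instead of per warp
        s_z = tableZ[z];
    }
    __syncthreads();

    int x   = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = x * NY * NZ + y * NZ + z;
    out[idx] = s_y + s_z;               // placeholder use of the two values
}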
i hope this helps get you those missing seconds. :-p