The algorithm I’m implementing in CUDA operates on structures of data, which could be single variables or arrays. Each thread needs to operate on one structure.
What I’ve done, so that memory access can be coalesced, is to make a bunch of big arrays, both on the host and on the device. Each array is indexed so that the pieces a particular thread works on are accessed sequentially by consecutive threads, allowing coalesced memory access. In the previous serial CPU algorithm, each structure had about 10 or 15 elements. A lot of the data was redundant and identical for all threads, so I simply put that into constant memory (via symbols), which seems to work fine.

But for some kernels I’m passing something like 10 pointers, each to a different array, and then I have an odd indexing scheme for each thread to figure out where to start in each array. For example, with an array A, each thread may access between 10 and 20 pieces of information. All threads in one warp access the same number of pieces, but different warps may access different numbers of pieces.
It is getting quite tedious to keep track of all these different arrays and their indexing schemes. Is this the only way to do this efficiently? How do other people handle diverse data sets?
The second thing is a quick question about shared memory. Say each thread calculates its thread index and I’d like to store that in shared memory (although I’d probably rather it be in a register). Would I allocate __shared__ int sharedData[32], and each thread in a warp would access one element of that shared data? Then would my thread block have dimensions <32, y, z>, and would I use threadIdx.x to access the shared data like this: sharedData[threadIdx.x]? Or, if I just wanted one int per thread, would I declare it like this: __shared__ int sharedButReallyNot, and each thread would get its own “sharedButReallyNot”?
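For what it’s worth, a single __shared__ scalar is one copy shared by the whole block, not one per thread; a private per-thread scalar is just an ordinary local variable, which the compiler will normally keep in a register. A minimal sketch of both (the kernel name and output array are my own illustration):

```cuda
__global__ void sharedDemo(int* out)
{
    // One array element per thread in the block: every thread writes
    // its own slot, and all threads in the block can read every slot.
    __shared__ int sharedData[32];          // assumes a <32,1,1> block
    sharedData[threadIdx.x] = threadIdx.x;
    __syncthreads();                        // make the writes visible block-wide

    // NOTE: "__shared__ int x;" would be ONE int shared by the whole
    // block, not one per thread. For a private per-thread scalar, an
    // ordinary local variable is what you want -- it lives in a register.
    int myIndex = threadIdx.x;              // register, not shared memory

    out[blockIdx.x * blockDim.x + threadIdx.x] =
        sharedData[threadIdx.x] + myIndex;
}
```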
Lastly, my application uses on the order of thousands of random floats per thread. What I do is generate one huge array of random numbers, enough to satisfy every thread in every block for the entire kernel, using the MT (Mersenne Twister) example from the SDK, then call my kernel. I think this is probably an efficient way of doing it, but correct me if you think there’s a better way. I’m not really running out of memory on the device, so that’s probably not an issue.
Sorry if this is obvious; I’m still learning. On the bright side, my application so far shows HUGE speed-ups, but that’s probably largely because the original algorithm was terrible even for a serial CPU. I think there are bigger improvements to be made. Also, it’s not yet generating the correct results :(… but I’m pretty sure it’s doing all the necessary steps and just needs some better input data. Garbage in = garbage out…
Thanks a bunch,