Is it possible to allocate variables in shared memory that will be private to a particular thread in a block? For instance, what is the best way to go about doing something like this:
var1 = some calculated value from global mem vars[thread];
global_var[thread] = var1;
Global memory is just too slow and often ends being uncoalesced when trying to access different offsets from within the same kernel, for instance:
a[(i+1)+const*(j+1)] = b[i+const*j];
gives me uncoalesced writes. If you flip the assignment (b=a) then it gives uncoalesced reads. Thanks in advance.
I get uncoalesced reads (using gld_incoherent from the profiler). What would be a good way to coalesce the reads without sacrificing the coalesced writes?
As you have said, using shared memory as a buffer can get you coalescing both ways. Have each thread read its coalesced value from b to shared , syncthreads, then have each thread write its coalesced value from (uncoalesced) position in shared to a.
As for your original question, i did start to reply, but then figured i didnt understand your question, since, shared memory is… well… shared!
The solution of simply saying: use shared memory does not work because the alignment of the a array (form above) is not a multiple of 16, so it will never be coalesced. I think I will have to either pad or fan the array out so that every access is a multiple of 16. This is a waste of memory, but as long as I don’t run out then I don’t care. Does this sound right?
Im definitely missing something youre telling me then. As long as you have as many threads as array elements, coalescing from a buffer should always be possible, as far as i can think.
Coalesced read to buffer to random access shared memory write
random acces shared memory read to coalecsed write.
I realize just repeating this is probably not gonna help you in the slightest, so i hope for you that someone else jumps in!
Well, there still is the same issue as with the original question: You give us not the slightest clue what i, j, k and size1, size2 are. At the very least we would have to know how they depend on threadIdx.x.
But either way I think what you are missing is that you can just make thread 0 not read anything and still have a coalesced access, i.e.:
i = threadIdx.x;
if (i) tmp[i-1] = b[i];
__syncthreads();
a[i] = tmp[i];
You will have to think of some way to handle the last element though.
My understanding is that textures are read-only. Unfortunately for me, both of these arrays are written to and read from in at 1 of 5 kernels, sometimes both in 1 kernel. Using texture or constant memory for one of the arrays would require a transfer back and forth from CPU-GPU, which would slow down performance even worse then the uncoalesced reads/writes.
Right now, each array takes ~300us to go from CPU to GPU. Times to for GPU->CPU and back CPU->GPU and then times two for the two arrays: 4*300us = ~1200us. The first kernel (which has uncoalesced stores) takes ~450us altogether. The other kernel (which has uncolasced loads) takes ~300us altogether.
While I agree that coalescing is not always possible, it seems like this is a relatively strait forward candidate for coalescing.
I think what you’re missing is that the same thread does not need to continue to work on the same data, infact in order to coalesce you need to have threads change what they’re working on.
I’m going to go back to your original example:
a[(i+1)+const*(j+1)] = b[i+const*j];
Let’s just put this small array for reference:
[10,19,42,17]
in order to coalesce this you perform the following:
temp[(i+1) + const*(j+1)] = b[i+const*j]; // if i == 0, working with “10”; i==1, Working with “19”
__syncthreads();
if(i) a[i+const*j] = temp[i+const*j]; // same thread, if i == 0, working with null (don’t do that); i==1 Working with “10”
In this scenario you will see all of the array values except for the last item assigned… so you will need a spare thread to pick up the slack; (and you’ll have to assure it doesn’t do anything prior to the syncthreads().
This is incorrect and unsafe, at least if I understand what the OP is trying to do.
CUDA does not guarantee that if you write to the linear memory backing the texture, that the texture cache will remain coherent with those writes. In other words, if you write to the global memory behind a texture, a subsequent texture fetch may or may not show the update - it’s undefined.
So, it’s not safe to treat textures as writable in the sense that you can read the new values back out during the same kernel execution. Furthermore, since there aren’t any consistency guarantees, you may not even consistently get the old data back out - you could get a mix of old and new.
So is global memory. The order in which threads are running is also undefined, so if thread 0 is writing a value that thread 128 needs to read back in, you are also not sure that thread 0 has written the value before thread 128 reads it. This can be overcome with syncthreads() offcourse, but I have been bitten by this before.
Going back to the original question: how can you make a local array using shared memory, which minimizes bank conflicts?
The answer is yes, you can do this with a minimum of bank conflicts.
The trick is to understand that if the array index is unpredictable, so is the bank that it accesses, so you’ll likely get a conflict.
EXCEPT…
you can arrange things so that every access, no matter what index you use, will map to the same bank. This is done by making your thread’s data INTERLEAVED with all other threads, so all of your thread’s data is in the same bank. If your block has 32 threads, you may have thread 0 “own” shared memory index of 0, 32, 64, and 96. Then it can access any of these, yet they’ll all be the same bank #0. Similarly, thread 1 “owns” index 1, 33, 65, and 97, so it always accesses bank #1.
The length of your array can be anything. there are no restrictions. The interleaved spacing takes care of the 1:1 thread to bank mapping for you.
I use this in practice to compute a small local lookup table to speed up the compute of A^N mod M. This kind of integer power function usually uses a loop over the bits of N, but this uses 32 tests and branches… (hurting throughput as half the threads have to wait for others to do a multiply.) Building a table made the throughput double since there’s no longer any branching… but it needed a bank-conflict free table.
__device__ unsigned int Dpowmod(unsigned int a, unsigned int n, unsigned int m)
{
extern __shared__ unsigned int sharedData[];
unsigned int *myTable=&(sharedData[threadIdx.x]);
// In this example, block size is known to be 32, so table is spaced by 32, same as shift <<5
// a 4 entry lookup table of a^0 a^1 a^2 and a^3
// Notice interleaved spacing!
myTable[0<<5]=1;
myTable[1<<5]=a;
myTable[2<<5]=Dmultmod(a,a,m);
myTable[3<<5]=Dmultmod(result, a, m);
// initialize using top 2 bits of N as initial lookup.
unsigned int result=myTable[(n>>30)<<5];
for (int i=28; i>=0; i-=2) {
result=Dmultmod(result, result, m);
result=Dmultmod(result, result, m);
result=Dmultmod(result, myTable[(3&(n>>i))<<5],m); // table lookup
}
return result;
}
Programming gotcha to watch out for: If other functions use local tables as well, you have to make sure they don’t share the same table locations, OR insure that you at least don’t call one of the functions from another and expect to keep your table.
This is because your local table in shared memory is local to your THREAD, not really local to your FUNCTION.