Shared memory question

Is it possible to allocate variables in shared memory that will be private to a particular thread in a block? For instance, what is the best way to go about doing something like this:

var1 = some calculated value from global mem vars[thread];
global_var[thread] = var1;

Global memory is just too slow and often ends up being uncoalesced when trying to access different offsets from within the same kernel, for instance:

a[(i+1)+const*(j+1)] = b[i+const*j];

gives me uncoalesced writes. If you flip the assignment (b=a) then it gives uncoalesced reads. Thanks in advance.

A. no one knows
B. no one cares
C. it’s a stupid question

So which was it? 42 views, no responses.

Let me ask a simpler, clearer question regarding global memory access and coalescing:

For the sample code:

a[i+size1*j+size1*size2*k] = b[(i+1)+(size1+2)*(j+1)+(size1+2)*(size2+2)*k]

I get uncoalesced reads (using gld_incoherent from the profiler). What would be a good way to coalesce the reads without sacrificing the coalesced writes?

As you have said, using shared memory as a buffer can get you coalescing both ways: have each thread do a coalesced read from b into shared memory, __syncthreads(), then have each thread pick up its value from the (uncoalesced) position in shared memory and do a coalesced write to a.
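For example, a minimal 1-D sketch of that staging (the kernel name, the block size of 256, and the single halo element are my assumptions here, not your actual 3-D indexing):

__global__ void stagedShiftCopy(float *a, const float *b, int n)
{
    // One halo element at the front so the shifted read stays inside the block.
    __shared__ float buf[256 + 1];               // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced read: consecutive threads read consecutive elements of b.
    if (i < n)
        buf[threadIdx.x + 1] = b[i];

    // Thread 0 fetches the single halo element from the previous block.
    if (threadIdx.x == 0 && i > 0 && i < n)
        buf[0] = b[i - 1];

    __syncthreads();

    // Coalesced write: a[i] = b[i-1]; the -1 shift is absorbed in shared memory.
    if (i > 0 && i < n)
        a[i] = buf[threadIdx.x];
}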

As for your original question, I did start to reply, but then figured I didn't understand your question, since shared memory is… well… shared!

Using shared memory does not coalesce the accesses for this example. Any other suggestions would be great.

For example:

__shared__ float temp[BLOCK_SIZE][BLOCK_SIZE][9];

for (int loop = 0; loop < 9; ++loop)
    temp[threadIdx.x][threadIdx.y][loop] = a[(i+1) + (size1+2)*(j+1) + (size1+2)*(size2+2)*loop];

__syncthreads();

for (int loop = 0; loop < 9; ++loop)
    b[i + size1*j + size1*size2*loop] = temp[threadIdx.x][threadIdx.y][loop];

This still gives me the correct results but does not coalesce access.

The solution of simply saying "use shared memory" does not work because the alignment of the a array (from above) is not a multiple of 16, so it will never be coalesced. I think I will have to either pad or fan the array out so that every access is a multiple of 16. This is a waste of memory, but as long as I don't run out then I don't care. Does this sound right?
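If you do pad, one hedged way it could look (the rounding macro, the variable names, and the 15-element lead-in are mine, not from this thread):

#define ROUND_UP_16(x) (((x) + 15) & ~15)

int size1 = 256, size2 = 256;

// Round the padded row length (size1 + 2 = 258) up to a multiple of 16 (272)
// so every row starts on a 16-element boundary.
int pitch1 = ROUND_UP_16(size1 + 2);

// The +1 halo column still shifts the interior off that boundary, so waste 15
// leading elements: the interior element (i+1) then sits at offset 16 + i,
// which keeps each half-warp's accesses aligned.
float *big;
cudaMalloc((void **)&big, sizeof(float) * (15 + pitch1 * (size2 + 2) * 9));
float *b_dev = big + 15;

// In the kernel, index with pitch1 instead of (size1+2):
//   b_dev[(i+1) + pitch1*(j+1) + pitch1*(size2+2)*k]

(cudaMallocPitch does this row rounding for you, though for a 3-D layout like yours you may prefer doing it by hand as above.)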

I'm definitely missing something you're telling me then. As long as you have as many threads as array elements, coalescing from a buffer should always be possible, as far as I can think.

Coalesced read from global into the buffer, with a random-access shared memory write;

then a random-access shared memory read feeding a coalesced write.

I realize just repeating this is probably not gonna help you in the slightest, so I hope for you that someone else jumps in!

Well, there is still the same issue as with the original question: you don't give us the slightest clue what i, j, k, size1, and size2 are. At the very least we would have to know how they depend on threadIdx.x.

But either way I think what you are missing is that you can just make thread 0 not read anything and still have a coalesced access, i.e.:

i = threadIdx.x;

if (i) tmp[i-1] = b[i];

__syncthreads();

a[i] = tmp[i];

You will have to think of some way to handle the last element though.
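One hedged way to fill in that boundary (a 1-D sketch with made-up names; the last thread of each block just does its single read straight from global memory):

__global__ void shiftedCopy(float *a, const float *b, int n)
{
    extern __shared__ float tmp[];   // blockDim.x floats, sized at launch

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced read; thread 0 of each block reads nothing, as above.
    if (threadIdx.x > 0 && i < n)
        tmp[threadIdx.x - 1] = b[i];

    __syncthreads();

    // Coalesced write: a[i] = b[i+1].
    if (i + 1 < n) {
        if (threadIdx.x < blockDim.x - 1)
            a[i] = tmp[threadIdx.x];   // staged by thread threadIdx.x + 1
        else
            a[i] = b[i + 1];           // the one leftover, uncoalesced read
    }
}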

Ever thought about changing the layout of the data so that you can coalesce both directions via shared memory transfers?

That did the trick for me.

You cannot get truly thread-private shared memory, but you can schedule the accesses so that every thread touches only its own area.

My apologies, i, j, size1 and size2 are as follows:

i = blockIdx.x*blockDim.x+threadIdx.x

j = blockIdx.y*blockDim.y+threadIdx.y

size1 = size2 = 256 (at the moment)

k = some loop variable from 0 to 9

What you are trying to do isn't going to work in this example, I don't think, because

  1. b[i] is not the same element as b[(i+1)+(size1+2)*(j+1)+(size1)*(size2)*k], so

  2. the results would not be accurate, and

  3. then you have to do something with the elements that were supposed to run on threads that did not run.

I will look into this though, thank you.

Yes, we are looking at this; however, it's not that easy, and if coalescing can be achieved without having to do that, then that would be great.

I have two arrays, one of size 256*256*8, the other of size (256+2)*(256+2)*8.

Bump. I was kind of hoping for some responses to the answers I gave to the people who asked me questions.

Coalescing is not always possible, and even when it is possible it is sometimes not easy. Maybe you will get better performance by fetching the values from a texture?

My understanding is that textures are read-only. Unfortunately for me, both of these arrays are written to and read from in at least 1 of the 5 kernels, sometimes both in 1 kernel. Using texture or constant memory for one of the arrays would require a transfer back and forth between CPU and GPU, which would hurt performance even more than the uncoalesced reads/writes.

Right now, each array takes ~300us to go from CPU to GPU. Times two for GPU->CPU and back CPU->GPU, and then times two for the two arrays: 4*300us = ~1200us. The first kernel (which has uncoalesced stores) takes ~450us altogether. The other kernel (which has uncoalesced loads) takes ~300us altogether.

While I agree that coalescing is not always possible, it seems like this is a relatively straightforward candidate for coalescing.

I think what you're missing is that the same thread does not need to continue to work on the same data; in fact, in order to coalesce you need to have threads change what they're working on.

I’m going to go back to your original example:
a[(i+1)+const*(j+1)] = b[i+const*j];

Let’s just put this small array for reference:
[10,19,42,17]

In order to coalesce this you perform the following:

temp[(i+1) + const*(j+1)] = b[i+const*j]; // if i == 0, working with “10”; if i == 1, working with “19”
__syncthreads();
if (i) a[i+const*j] = temp[i+const*j]; // same thread: if i == 0, working with nothing (don’t do that); if i == 1, working with “10”

In this scenario you will see all of the array values assigned except for the last item… so you will need a spare thread to pick up the slack (and you’ll have to ensure it doesn’t do anything prior to the __syncthreads()).

So essentially what you are saying is that I will need 258*258*9 threads (the size of the larger array). Then something like this:

if (i < 256*256*9) // i < size of the smaller array

    shared_mem[larger_array_stride] = smaller_array[i];

__syncthreads();

if (i > amount of offset between the arrays; for me it's 3: i+1, size+2)

    larger_array[larger_array_stride] = shared_mem[i];

Does that look “about” right? I can work out the details myself if that’s the concept. Let me know if you can; many, many thanks.

Not for textures bound to linear memory.

This is incorrect and unsafe, at least if I understand what the OP is trying to do.

CUDA does not guarantee that, if you write to the linear memory backing the texture, the texture cache will remain coherent with those writes. In other words, if you write to the global memory behind a texture, a subsequent texture fetch may or may not show the update - it’s undefined.

So, it’s not safe to treat textures as writable in the sense that you can read the new values back out during the same kernel execution. Furthermore, since there aren’t any consistency guarantees, you may not even consistently get the old data back out - you could get a mix of old and new.

So is global memory. The order in which threads are running is also undefined, so if thread 0 is writing a value that thread 128 needs to read back in, you are also not sure that thread 0 has written the value before thread 128 reads it. This can be overcome with __syncthreads() of course, but I have been bitten by this before.

Going back to the original question: how can you make a per-thread local array in shared memory while minimizing bank conflicts?

The answer is yes, you can do this with a minimum of bank conflicts.

The trick is to understand that if the array index is unpredictable, so is the bank that it accesses, so you’ll likely get a conflict.

EXCEPT…

you can arrange things so that every access, no matter what index you use, will map to the same bank. This is done by making your thread’s data INTERLEAVED with all other threads, so all of your thread’s data is in the same bank. If your block has 32 threads, you may have thread 0 “own” shared memory index of 0, 32, 64, and 96. Then it can access any of these, yet they’ll all be the same bank #0. Similarly, thread 1 “owns” index 1, 33, 65, and 97, so it always accesses bank #1.

The length of your array can be anything; there are no restrictions. The interleaved spacing takes care of the 1:1 thread-to-bank mapping for you.
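In general form (a sketch; the macro name is made up), element k of the calling thread's private array would live at:

// Works as long as blockDim.x is a multiple of the number of banks (16),
// so every access by a given thread lands in the same bank.
#define THREAD_PRIVATE(shared_base, k) ((shared_base)[threadIdx.x + (k) * blockDim.x])

// e.g. THREAD_PRIVATE(sharedData, 3) = x;   // writes this thread's entry 3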

I use this in practice to compute a small local lookup table to speed up the computation of A^N mod M. This kind of integer power function usually uses a loop over the bits of N, but that takes 32 tests and branches… (hurting throughput, as half the threads have to wait for others to do a multiply). Building a table doubled the throughput since there's no longer any branching… but it needed a bank-conflict-free table.

__device__ unsigned int Dpowmod(unsigned int a, unsigned int n, unsigned int m)

{

  extern __shared__ unsigned int sharedData[];

  unsigned int *myTable=&(sharedData[threadIdx.x]);

 // In this example, block size is known to be 32, so table is spaced by 32, same as shift <<5

// a 4 entry lookup table of a^0 a^1 a^2 and a^3

// Notice interleaved spacing!

  myTable[0<<5]=1;

  myTable[1<<5]=a;

  myTable[2<<5]=Dmultmod(a,a,m);

  myTable[3<<5]=Dmultmod(myTable[2<<5], a, m);  // a^3 = (a^2 * a) mod m

  

// initialize using top 2 bits of N as initial lookup.

  unsigned int result=myTable[(n>>30)<<5];

    

  for (int i=28; i>=0; i-=2) {

      result=Dmultmod(result, result, m);

      result=Dmultmod(result, result, m);

      result=Dmultmod(result, myTable[(3&(n>>i))<<5],m); // table lookup

   } 

  return result;

}

Programming gotcha to watch out for: if other functions use local tables as well, you have to make sure they don’t share the same table locations, OR ensure that you at least don’t call one of the functions from another and expect to keep your table.

This is because your local table in shared memory is local to your THREAD, not really local to your FUNCTION.
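A minimal sketch of keeping two such tables apart (sizes and names are made up; a block size of 32 is assumed, as in the example above):

__device__ void twoTablesSketch(unsigned int a, unsigned int m)
{
    extern __shared__ unsigned int sharedData[];

    // Table 1: 4 interleaved entries per thread, occupying words 0 .. 4*32-1.
    unsigned int *powTable = &sharedData[threadIdx.x];

    // Table 2: starts after table 1, so the two regions can never overlap.
    unsigned int *otherTable = &sharedData[4 * 32 + threadIdx.x];

    powTable[0 << 5] = 1;          // entry 0 of this thread's first table
    otherTable[0 << 5] = a % m;    // entry 0 of its second table, a different word
}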