Shared memory question

Is it possible to allocate variables in shared memory that will be private to a particular thread in a block? For instance, what is the best way to go about doing something like this:

var1 = some calculated value from global mem vars[thread];
global_var[thread] = var1;

Global memory is just too slow and often ends up being uncoalesced when trying to access different offsets from within the same kernel, for instance:

a[(i+1)+const*(j+1)] = b[i+const*j];

gives me uncoalesced writes. If you flip the assignment (b=a) then it gives uncoalesced reads. Thanks in advance.

A. no one knows
B. no one cares
C. it’s a stupid question

So which was it? 42 views, no responses.

Let me ask a simpler, clearer question regarding global memory access and coalescing:

For the sample code:

a[i+size1*j+size1*size2*k] = b[(i+1)+(size1+2)*(j+1)+(size1+2)*(size2+2)*k]

I get uncoalesced reads (using gld_incoherent from the profiler). What would be a good way to coalesce the reads without sacrificing the coalesced writes?

As you have said, using shared memory as a buffer can get you coalescing both ways: have each thread do a coalesced read from b into shared memory, __syncthreads(), then have each thread pick up its value from the (uncoalesced) position in shared memory and do a coalesced write to a.
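For example, a minimal 1-D sketch of that staging (the kernel name, the block size of 256, and the single halo element are my assumptions here, not your actual 3-D indexing):

__global__ void stagedShiftCopy(float *a, const float *b, int n)
{
    // One halo element at the front so the shifted read stays inside the block.
    __shared__ float buf[256 + 1];               // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced read: consecutive threads read consecutive elements of b.
    if (i < n)
        buf[threadIdx.x + 1] = b[i];

    // Thread 0 fetches the single halo element from the previous block.
    if (threadIdx.x == 0 && i > 0 && i < n)
        buf[0] = b[i - 1];

    __syncthreads();

    // Coalesced write: a[i] = b[i-1]; the -1 shift is absorbed in shared memory.
    if (i > 0 && i < n)
        a[i] = buf[threadIdx.x];
}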

As for your original question, I did start to reply, but then figured I didn't understand your question, since shared memory is… well… shared!

Using shared memory does not coalesce the accesses for this example. Any other suggestions would be great.

For example:

__shared__ float temp[BLOCK_SIZE][BLOCK_SIZE][9];

for (int loop = 0; loop < 9; ++loop)
    temp[threadIdx.x][threadIdx.y][loop] = a[(i+1) + (size1+2)*(j+1) + (size1+2)*(size2+2)*loop];

__syncthreads();

for (int loop = 0; loop < 9; ++loop)
    b[i + size1*j + size1*size2*loop] = temp[threadIdx.x][threadIdx.y][loop];

This still gives me the correct results but does not coalesce access.

The solution of simply saying "use shared memory" does not work because the alignment of the a array (from above) is not a multiple of 16, so it will never be coalesced. I think I will have to either pad or fan the array out so that every access is a multiple of 16. This is a waste of memory, but as long as I don't run out then I don't care. Does this sound right?
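If you do pad, one hedged way it could look (the rounding macro, the variable names, and the 15-element lead-in are mine, not from this thread):

#define ROUND_UP_16(x) (((x) + 15) & ~15)

int size1 = 256, size2 = 256;

// Round the padded row length (size1 + 2 = 258) up to a multiple of 16 (272)
// so every row starts on a 16-element boundary.
int pitch1 = ROUND_UP_16(size1 + 2);

// The +1 halo column still shifts the interior off that boundary, so waste 15
// leading elements: the interior element (i+1) then sits at offset 16 + i,
// which keeps each half-warp's accesses aligned.
float *big;
cudaMalloc((void **)&big, sizeof(float) * (15 + pitch1 * (size2 + 2) * 9));
float *b_dev = big + 15;

// In the kernel, index with pitch1 instead of (size1+2):
//   b_dev[(i+1) + pitch1*(j+1) + pitch1*(size2+2)*k]

(cudaMallocPitch does this row rounding for you, though for a 3-D layout like yours you may prefer doing it by hand as above.)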

I'm definitely missing something you're telling me then. As long as you have as many threads as array elements, coalescing from a buffer should always be possible, as far as I can think.

Coalesced read from global into the buffer, with a random-access shared memory write;

then a random-access shared memory read feeding a coalesced write.

I realize just repeating this is probably not gonna help you in the slightest, so I hope for you that someone else jumps in!

Well, there is still the same issue as with the original question: you don't give us the slightest clue what i, j, k, size1, and size2 are. At the very least we would have to know how they depend on threadIdx.x.

But either way I think what you are missing is that you can just make thread 0 not read anything and still have a coalesced access, i.e.:

i = threadIdx.x;

if (i) tmp[i-1] = b[i];

__syncthreads();

a[i] = tmp[i];

You will have to think of some way to handle the last element though.
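One hedged way to fill in that boundary (a 1-D sketch with made-up names; the last thread of each block just does its single read straight from global memory):

__global__ void shiftedCopy(float *a, const float *b, int n)
{
    extern __shared__ float tmp[];   // blockDim.x floats, sized at launch

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced read; thread 0 of each block reads nothing, as above.
    if (threadIdx.x > 0 && i < n)
        tmp[threadIdx.x - 1] = b[i];

    __syncthreads();

    // Coalesced write: a[i] = b[i+1].
    if (i + 1 < n) {
        if (threadIdx.x < blockDim.x - 1)
            a[i] = tmp[threadIdx.x];   // staged by thread threadIdx.x + 1
        else
            a[i] = b[i + 1];           // the one leftover, uncoalesced read
    }
}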

Ever thought about changing the layout of the data so that you can coalesce both directions via shared memory transfers?

That did the trick for me.

You cannot get truly thread-private shared memory, but you can schedule the accesses so that every thread touches only its own area.

My apologies, i, j, size1 and size2 are as follows:

i = blockIdx.x*blockDim.x+threadIdx.x

j = blockIdx.y*blockDim.y+threadIdx.y

size1 = size2 = 256 (at the moment)

k = some loop variable from 0 to 9

What you are trying to do isn't going to work in this example, I don't think, because

  1. b[i] is not the same element as b[(i+1)+(size1+2)*(j+1)+(size1)*(size2)*k], so

  2. the results would not be accurate, and

  3. then you have to do something with the elements that were supposed to run on threads that did not run.

I will look into this though, thank you.

Yes, we are looking at this; however, it's not that easy, and if coalescing can be achieved without having to do that, then that would be great.

I have two arrays, one of size 256*256*8, the other of size (256+2)*(256+2)*8.

Bump. I was kind of hoping for some responses to the answers I gave to the people who asked me questions.

Coalescing is not always possible, and even when it is possible it is sometimes not easy. Maybe you will get better performance by fetching the values from a texture?

My understanding is that textures are read-only. Unfortunately for me, both of these arrays are written to and read from in at least 1 of the 5 kernels, sometimes both in 1 kernel. Using texture or constant memory for one of the arrays would require a transfer back and forth between CPU and GPU, which would hurt performance even more than the uncoalesced reads/writes.

Right now, each array takes ~300us to go from CPU to GPU. Times two for GPU->CPU and back CPU->GPU, and then times two for the two arrays: 4*300us = ~1200us. The first kernel (which has uncoalesced stores) takes ~450us altogether. The other kernel (which has uncoalesced loads) takes ~300us altogether.

While I agree that coalescing is not always possible, it seems like this is a relatively straightforward candidate for coalescing.

I think what you're missing is that the same thread does not need to continue to work on the same data; in fact, in order to coalesce you need to have threads change what they're working on.

I’m going to go back to your original example:
a[(i+1)+const*(j+1)] = b[i+const*j];

Let’s just put this small array for reference:
[10,19,42,17]

In order to coalesce this you perform the following:

temp[(i+1) + const*(j+1)] = b[i+const*j]; // if i == 0, working with “10”; if i == 1, working with “19”
__syncthreads();
if (i) a[i+const*j] = temp[i+const*j]; // same thread: if i == 0, working with nothing (don’t do that); if i == 1, working with “10”

In this scenario you will see all of the array values assigned except for the last item… so you will need a spare thread to pick up the slack (and you’ll have to ensure it doesn’t do anything prior to the __syncthreads()).

So essentially what you are saying is that I will need 258*258*9 threads (the size of the larger array). Then something like this:

if (i < 256*256*9) // i < size of the smaller array

    shared_mem[larger_array_stride] = smaller_array[i];

__syncthreads();

if (i > amount of offset between the arrays; for me it's 3: i+1, size+2)

    larger_array[larger_array_stride] = shared_mem[i];

Does that look “about” right? I can work out the details myself if that’s the concept. Let me know if you can; many, many thanks.

Not for textures bound to linear memory.

This is incorrect and unsafe, at least if I understand what the OP is trying to do.

CUDA does not guarantee that, if you write to the linear memory backing the texture, the texture cache will remain coherent with those writes. In other words, if you write to the global memory behind a texture, a subsequent texture fetch may or may not show the update - it’s undefined.

So, it’s not safe to treat textures as writable in the sense that you can read the new values back out during the same kernel execution. Furthermore, since there aren’t any consistency guarantees, you may not even consistently get the old data back out - you could get a mix of old and new.

So is global memory. The order in which threads are running is also undefined, so if thread 0 is writing a value that thread 128 needs to read back in, you are also not sure that thread 0 has written the value before thread 128 reads it. This can be overcome with __syncthreads() of course, but I have been bitten by this before.

Going back to the original question: how can you make a per-thread local array in shared memory while minimizing bank conflicts?

The answer is yes, you can do this with a minimum of bank conflicts.

The trick is to understand that if the array index is unpredictable, so is the bank that it accesses, so you’ll likely get a conflict.

EXCEPT…

you can arrange things so that every access, no matter what index you use, will map to the same bank. This is done by making your thread’s data INTERLEAVED with all other threads, so all of your thread’s data is in the same bank. If your block has 32 threads, you may have thread 0 “own” shared memory index of 0, 32, 64, and 96. Then it can access any of these, yet they’ll all be the same bank #0. Similarly, thread 1 “owns” index 1, 33, 65, and 97, so it always accesses bank #1.

The length of your array can be anything; there are no restrictions. The interleaved spacing takes care of the 1:1 thread-to-bank mapping for you.
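In general form (a sketch; the macro name is made up), element k of the calling thread's private array would live at:

// Works as long as blockDim.x is a multiple of the number of banks (16),
// so every access by a given thread lands in the same bank.
#define THREAD_PRIVATE(shared_base, k) ((shared_base)[threadIdx.x + (k) * blockDim.x])

// e.g. THREAD_PRIVATE(sharedData, 3) = x;   // writes this thread's entry 3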

I use this in practice to compute a small local lookup table to speed up the computation of A^N mod M. This kind of integer power function usually uses a loop over the bits of N, but that takes 32 tests and branches… (hurting throughput, as half the threads have to wait for others to do a multiply). Building a table doubled the throughput since there's no longer any branching… but it needed a bank-conflict-free table.

__device__ unsigned int Dpowmod(unsigned int a, unsigned int n, unsigned int m)

{

  extern __shared__ unsigned int sharedData[];

  unsigned int *myTable=&(sharedData[threadIdx.x]);

 // In this example, block size is known to be 32, so table is spaced by 32, same as shift <<5

// a 4 entry lookup table of a^0 a^1 a^2 and a^3

// Notice interleaved spacing!

  myTable[0<<5]=1;

  myTable[1<<5]=a;

  myTable[2<<5]=Dmultmod(a,a,m);

  myTable[3<<5]=Dmultmod(myTable[2<<5], a, m);  // a^3 = (a^2 * a) mod m

  

// initialize using top 2 bits of N as initial lookup.

  unsigned int result=myTable[(n>>30)<<5];

    

  for (int i=28; i>=0; i-=2) {

      result=Dmultmod(result, result, m);

      result=Dmultmod(result, result, m);

      result=Dmultmod(result, myTable[(3&(n>>i))<<5],m); // table lookup

   } 

  return result;

}

Programming gotcha to watch out for: if other functions use local tables as well, you have to make sure they don’t share the same table locations, OR ensure that you at least don’t call one of the functions from another and expect to keep your table.

This is because your local table in shared memory is local to your THREAD, not really local to your FUNCTION.
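A minimal sketch of keeping two such tables apart (sizes and names are made up; a block size of 32 is assumed, as in the example above):

__device__ void twoTablesSketch(unsigned int a, unsigned int m)
{
    extern __shared__ unsigned int sharedData[];

    // Table 1: 4 interleaved entries per thread, occupying words 0 .. 4*32-1.
    unsigned int *powTable = &sharedData[threadIdx.x];

    // Table 2: starts after table 1, so the two regions can never overlap.
    unsigned int *otherTable = &sharedData[4 * 32 + threadIdx.x];

    powTable[0 << 5] = 1;          // entry 0 of this thread's first table
    otherTable[0 << 5] = a % m;    // entry 0 of its second table, a different word
}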