Passing data from arrays of different sizes to shared variables

I have a kernel that takes three arrays: the first array, d_A1, holds no input data and is only written to; the other two arrays, d_D1 and d_ST1, hold input data.

The sizes of the three arrays are:

d_A1[13000000]
d_D1[441]
d_ST1[21]

N is 13000000

TestArray<<<...>>>(d_A1, N, d_D1, d_ST1);
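(For reference, the launch configuration is derived from N along these lines; 512 threads per block is just an example value:)

unsigned int threadsPerBlock = 512;                                       // example block size
unsigned int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock; // round up so all N elements are covered
TestArray<<<blocksPerGrid, threadsPerBlock>>>(d_A1, N, d_D1, d_ST1);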

Now I want to pass only the data of d_D1[441] and d_ST1[21] to shared arrays, so I declared the shared arrays as:

__global__ void TestArray(int* d_A1, unsigned int N, int* d_D1, unsigned int* d_ST1)
{

unsigned int __align__(16) tid = threadIdx.x;
unsigned int __align__(16) idx = __umul24(blockDim.x, blockIdx.x) + threadIdx.x;
__shared__ unsigned int __align__(16) s_D1[441]; // Shared array for d_D1
__shared__ unsigned int __align__(16) s_ST1[21]; // Shared array for d_ST1

if (idx < N) //13000000

{

  Q. How do I pass the data of d_D1[441] and d_ST1[21] to s_D1[441] and s_ST1[21]?


  I tried:

  while (idx < 441)
      s_D1[tid] = d_D1[idx];
  __syncthreads();

  while (idx < 21)
      s_ST1[tid] = d_ST1[idx];
  __syncthreads();

but the computer freezes and I have to restart it. I also tried them one at a time, namely only the first while and then only the second while, with no luck.

If I use global memory directly, namely d_D1 and d_ST1, everything works.

So the question is:

How do you pass data to a shared variable/array when the size of the array is not N?

} //End of kernel processing

}

Hmmm. The while loops are the cause of your problem. The thread IDs won't change during execution, so every thread that enters one of those loops runs it again and again, forever. I would think that using your CUDA card for running the display at the same time would get the kernel cancelled by the OS watchdog, but without… When you say you don't get any problems using global memory, I assume you don't use those loops then. What you actually want is an if clause like:

if (tid < 441)
    s_D1[tid] = d_D1[idx];

if (tid < 21)
    s_ST1[tid] = d_ST1[idx];

__syncthreads();

You need to replace idx with tid in those if clauses, because shared memory has the lifetime of a block, so you have to fill it with data in each block of threads.

Btw, I don't see any reason for aligning the variables holding your IDs. They will be placed in registers and don't need to be aligned the way e.g. structs in gmem do. Correct me if I'm wrong, but I also think that (1D) arrays in smem are automatically aligned properly, too.

Thanks for replying. Yes, you are correct: I do not use those loops when I am not using shared memory.

  • You need to replace idx with tid in those if clauses, because shared memory has the lifetime of a block, so you have to fill it with data in each block of threads.

Could you please elaborate on that? It only works with s_ST1[tid] = d_ST1[tid], but I do not understand why. It seems to me that the shared memory does not just have the lifetime of a block: those two shared arrays have to be used for the entire life of N, which is 13000000, i.e. for the whole if (idx < N) { do the test of the array } part. Also, all the code samples I have seen use if (idx < n), so I do not understand why in this case I need if (tid < n). I read every document available on the website but never saw any example using tid in the condition.

  • I don't see any reason for aligning the variables holding your IDs. They will be placed in registers and don't need to be aligned the way e.g. structs in gmem do. Correct me if I'm wrong, but I also think that (1D) arrays in smem are automatically aligned properly, too.

I was not sure whether you could align variables like that; I saw it done in a code sample.

Thank you very much again, you made my day.

From the NVIDIA CUDA Programming Guide 3.0, B.2.3: a __shared__ variable resides in the shared memory space of a thread block, has the lifetime of the block, and is only accessible from all the threads within the block.

So this means that for every block you have to fill your shared array separately. If you want read-only access across multiple blocks, you could use texture memory on compute capability < 2.0 cards, or the L1 cache with compute capability 2.0 (I don't know how it works exactly with Fermi, because I don't have my GTX480 yet, but the use of texture memory isn't advised with Fermi any longer). Because of the lifetime of shared memory, you have to load your data from global into smem once for every block; that is why you need "tid" for indexing your smem array. As for reading the corresponding values from the gmem array, I was mistaken, sorry! You only have 441 elements in there, so of course you don't want to use an index greater than that. But this is also the reason for not using "idx" to index the smem array: with, say, 10 blocks of 512 threads each, idx will get up to 5119, resulting in out-of-bounds errors.
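Putting the correction together, the beginning of your kernel would look something like this (just a sketch; it assumes you launch at least 441 threads per block, otherwise you would need a small loop over the array with a stride of blockDim.x):

if (tid < 441)
    s_D1[tid] = d_D1[tid];   // tid on both sides: smem is per block, and d_D1 only has 441 elements
if (tid < 21)
    s_ST1[tid] = d_ST1[tid];
__syncthreads();             // make sure the smem arrays are complete before any thread reads them

if (idx < N)
{
    // ... per-element work, reading s_D1/s_ST1 instead of d_D1/d_ST1 ...
}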

You can find an example of using smem in the Best Practices Guide 3.0, chapter 3.2.2.3:

__global__ void coalescedMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     transposedTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM +
        threadIdx.x];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    }
    c[row*M+col] = sum;
}
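Note the pattern in there: each thread loads one element into the tiles, __syncthreads() guarantees both tiles are completely filled before any thread reads from them, and having threadIdx.x in the fastest-moving position of the gmem index makes the reads coalesced. That is exactly the pattern you want for loading your two arrays.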

In 3.2.2 you will also find some guidelines for performance optimization regarding smem that you will want to read.

In 3.2.1.2 you can read when you need __align__ and what you need it for, and 3.2.2.2 is also worth a look. You can read about smem bank conflicts in the guidelines I mentioned above… enjoy ;)

Thanks again, I'll read the 3.0 documentation again.

  • In which case would you use if (idx < n) instead of if (tid < n)?

  • To get a coalesced read from global memory for those two shared arrays, would (idx*size + tid), e.g. (idx*21 + tid), work? Or would it rather be (blockIdx.x*size + tid)?

Your variable "idx" is built from blockDim.x * blockIdx.x + threadIdx.x, so with, say, 10 blocks of 512 threads each it ranges from 0 to 5119. But your shared array only contains 441 elements, so you don't want to use "idx"; that would end with out-of-bounds errors and the dead rising from their graves. You use "tid" because tid ranges from 0 to 511 in every block, so with if (tid < 441) you make 441 threads in every block read from gmem and store into smem. It's all about indexing. If you are not yet used to CUDA's programming model, have a look at chapter 2.3 in the CUDA Programming Guide and at the figures there showing how the different sorts of memory are related to blocks and threads.
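To make the two ranges concrete (a sketch, using the 10 x 512 launch from above):

unsigned int tid = threadIdx.x;                           // runs 0..511 in *every* block
unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x; // runs 0..5119 over the whole grid

if (tid < 441) { /* fill this block's smem copy */ }      // true for 441 threads of each block
if (idx < N)   { /* the actual per-element work */ }      // true once per element of d_A1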

Regarding your 2nd question, it comes down to indexing and not going out of bounds again… If your array only consists of 21 elements, why would you use an index greater than that?

Thank you very much indeed again. Yes, I'm new to CUDA.

  • So, would if (tid < 21) s_ST1[tid] = d_ST1[tid]; be enough to get a coalesced read?

Yup.

Thanks and have a nice day!