I’m curious whether some of the problems I’m seeing in a branch of my code might stem from the fact that I am trying to “swap” memory between arrays in __shared__ and arrays in __global__ memory. The idea is to have the thread block loop over each index of the respective arrays and do the typical swap operation:
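Roughly, in pseudo-code (a minimal sketch of the kind of loop I mean; the kernel name, array names, and the fixed 1024-element shared buffer are placeholders, not my actual code):

```cuda
__global__ void swap_shared_global(float *gbl_array, int n)
{
    __shared__ float sh[1024];             // assumes n <= 1024

    // ... code that fills sh[] and computes on it ...

    // Assumes n is a multiple of blockDim.x, so every thread executes the
    // same number of iterations and reaches both barriers together.
    for (int pos = threadIdx.x; pos < n; pos += blockDim.x) {
        float tmp      = sh[pos];          // read shared
        sh[pos]        = gbl_array[pos];   // read global, write shared
        __syncthreads();                   // barrier between the two halves of the swap
        gbl_array[pos] = tmp;              // write global
        __syncthreads();                   // second barrier, before the next pass
    }
}
```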
Update: I have implemented this solution for my main code, and it seems to fix the problem. I’m not sure if things are completely fine, but this definitely cleaned up some otherwise inexplicable pollution in my numbers. I did not end up needing the second __syncthreads() in the pseudo-code above, as the threads will move on to new indices of the arrays in subsequent iterations of the loop. In my actual code, I do have a __syncthreads() further down to catch the tail end of the writes to ensure that the swap is complete once it’s time to swap back.
Thank you, CUDA engineers! __syncthreads() and __syncwarp() memory barriers are truly impressive.
If each block operates on a separate gbl_array, you don’t need any synchronization because different threads access distinct array positions.
If multiple blocks share the same gbl_array, you have a race condition between blocks, which cannot be fixed by __syncthreads() (it only synchronizes the threads within a single block).
What if I have different blocks operating on the same global array, but on different sectors of it? Say block 0 operates on elements 0 through 1023, block 1 operates on elements 1024 through 2047, and so on. I think that’s OK. In effect, the gbl_array pointer is unique to each block, although if a block overruns its bounds (which I am careful to prevent) then yes, __syncthreads() would not help me with that, I agree.
But I did seem to fix a lot of problems when I began adding synchronization between reads from some location in global memory and writes to it, even though any given address is only operated upon by a single thread. Are you sure that no synchronization should be needed?
Yes, when all blocks work on their own non-overlapping sections, it’s fine. (It’s not obvious from the code snippet, since all blocks would use the same pos when only threadIdx.x is used.)
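For illustration, the distinction looks like this (names assumed, not from the original code):

```cuda
__global__ void per_block_sectors(float *gbl_array)
{
    // Indexing with threadIdx.x alone: every block generates the same set
    // of indices, so blocks sharing one gbl_array would collide.
    // int pos = threadIdx.x;

    // Grid-wide indexing: block 0 covers [0, blockDim.x), block 1 covers
    // [blockDim.x, 2 * blockDim.x), and so on. Each element is owned by
    // exactly one thread in the whole grid, so no synchronization is needed.
    int pos = blockIdx.x * blockDim.x + threadIdx.x;

    gbl_array[pos] *= 2.0f;
}
```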
I am sure that if an address is only accessed by a single thread, there cannot be a data race for that address. No synchronization is required.
Yes, sorry I didn’t make the distinct sectors clear in my example. If this is true, then I need to review what I’ve done, because I may have merely created a Heisenbug. It certainly went from “very bad, always bad” to “pretty quiet” as soon as I did what I describe above. And the new synchronization isn’t really much more costly, so far as I can tell, than the old one, so I don’t think I’ve changed the code in a way that would drastically tamp down on some other collision that’s happening. I will continue to investigate…
Have you run this code with compute-sanitizer? It can find many instances of race conditions, though not all of them. It also seems possible that you have an out-of-bounds access somewhere, which may not be obvious if it is an off-by-one error that does not trigger a memory access violation. Adding __syncthreads() may merely be masking the root cause of the observed data corruption, as you already suspected yourself.
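For reference, typical invocations look like this (./my_app is a placeholder for your executable):

```bash
# memcheck (the default tool) flags out-of-bounds and misaligned accesses
compute-sanitizer ./my_app

# racecheck reports shared-memory data hazards between threads of a block
compute-sanitizer --tool racecheck ./my_app
```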
@striker159 @njuffa Thanks for insisting. Indeed, I just removed the excess synchronization, and the code continues to produce identical results after a 21-minute run. All that really happened is that the code finishes about 5% faster.