No speedup with async shared memory in stencil

Hi All,

I have implemented a 1D stencil using LDGSTS, but I am not getting any speedup over the plain version. Is there anything wrong with my code? The profiler output and both kernels are below.

  Time(%)   Total Time (ns)  Instances  Avg (ns)   Min      Max     Name
  25.2         21499903        100  214999.0   182369   230817  stencil_1d_fast_async(double*, double*, double*)
  24.4         20805889        100  208058.9   177025   222497  stencil_1d_slow(double*, double*, double*)
// Note: T, RADIUS and BLOCK_SIZE are defined elsewhere in the full code (T is double, per the profile above).
#include <cuda_pipeline.h>   // required for __pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior

__global__ void stencil_1d_slow(T *in, T *in2, T *out) {
	int gindex = threadIdx.x + blockIdx.x * blockDim.x;
	T result = 0;
	for (int offset = -RADIUS; offset <= RADIUS; offset++)
		result += in[gindex + offset] * in2[gindex - offset];
	out[gindex] = result;
}
__global__ void stencil_1d_fast_async(T *in, T *in2, T *out) {
	__shared__ T temp [BLOCK_SIZE + 2 * RADIUS];
	__shared__ T temp2[BLOCK_SIZE + 2 * RADIUS];
	int gindex = threadIdx.x + blockIdx.x * blockDim.x;
	int lindex = threadIdx.x + RADIUS;

	// Read input elements into shared memory
	__pipeline_memcpy_async(&temp[lindex] , &in[gindex] , sizeof(T));
	__pipeline_memcpy_async(&temp2[lindex], &in2[gindex], sizeof(T));
	// Halo loads: the first RADIUS threads also fetch the elements just outside the block
	if (threadIdx.x < RADIUS) {
		__pipeline_memcpy_async(&temp[lindex - RADIUS]     , &in[gindex - RADIUS]     , sizeof(T));
		__pipeline_memcpy_async(&temp[lindex + BLOCK_SIZE] , &in[gindex + BLOCK_SIZE] , sizeof(T));
		__pipeline_memcpy_async(&temp2[lindex - RADIUS]    , &in2[gindex - RADIUS]    , sizeof(T));
		__pipeline_memcpy_async(&temp2[lindex + BLOCK_SIZE], &in2[gindex + BLOCK_SIZE], sizeof(T));
	}
	// Commit the batch of async copies and wait for this thread's copies to complete
	__pipeline_commit();
	__pipeline_wait_prior(0);
	// All threads must also synchronize, since each thread reads elements copied by its neighbors
	__syncthreads();
	// Apply the stencil
	T result = 0;
	for (int offset = -RADIUS; offset <= RADIUS; offset++)
		result += temp[lindex + offset] * temp2[lindex - offset];
	// Store the result
	out[gindex] = result;
}

I usually suggest that performance questions include a complete test case. You’re welcome to do as you wish, of course. A few comments anyway:

  1. Nothing obviously jumps out at me as “wrong”. If you have verified that you get correct results in each case and have also run your code with compute-sanitizer, you will reduce the likelihood that you are doing something “wrong”. I haven’t studied your code carefully, and wouldn’t do so without a full test case anyway (so that, for example, I could use tools such as compute-sanitizer myself). I’m not going to write my own test harness to wrap around someone else’s kernel code.

  2. It’s not directly the question you asked, but the async memcpy operations should not be expected to perform better than an “ordinary” global->shared load if you do a wait immediately after committing the work (and the full asynchronous character is only available on cc8.0 and later). One of the principal benefits of the async version over the “ordinary” version that people have been doing since day 1 with CUDA is that your kernel can do other, independent work while the async operation proceeds (especially if that is compute-bound work); see the sketch after this list.

  3. Depending on your RADIUS and other factors that are not deducible from what you have shown, I would guess it is entirely possible for the shared-memory-optimized 1D stencil (whether async or not; see item 2 above) to have little benefit over the non-shared version. The async operations are only supported on cc7.x (and higher). Depending on which GPU you are running on, you may have as much as 40MB of L2 cache. If the L2 cache is large relative to your dataset size, or simply large enough to cover the working footprint of the threads resident on your GPU, then shared memory may provide little additional benefit, because both shared memory and L2 help in the case of data reuse. In fact, starting with the Volta generation of GPUs, NVIDIA began touting (for example, see slide 10 here) the possibility that the usual shared-memory optimizations for data reuse might provide diminishing returns, due to the larger L1 and L2 cache structures in newer GPUs.
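To illustrate item 2, here is a minimal sketch (my own, not the poster’s code) of the pattern where the pipeline API can actually pay off: issue the copies, commit, do independent work that does not depend on the copied data, and only then wait. The kernel name stencil_1d_overlap and the extra other/aux_out arguments are hypothetical, added only as a stand-in for “other work”; T, RADIUS and BLOCK_SIZE are assumed to be defined as in the code above.

#include <cuda_pipeline.h>

__global__ void stencil_1d_overlap(T *in, T *in2, T *out, T *other, T *aux_out) {
	__shared__ T temp [BLOCK_SIZE + 2 * RADIUS];
	__shared__ T temp2[BLOCK_SIZE + 2 * RADIUS];
	int gindex = threadIdx.x + blockIdx.x * blockDim.x;
	int lindex = threadIdx.x + RADIUS;

	// Stage the same copies as in stencil_1d_fast_async, but do not wait yet
	__pipeline_memcpy_async(&temp[lindex] , &in[gindex] , sizeof(T));
	__pipeline_memcpy_async(&temp2[lindex], &in2[gindex], sizeof(T));
	if (threadIdx.x < RADIUS) {
		__pipeline_memcpy_async(&temp[lindex - RADIUS]     , &in[gindex - RADIUS]     , sizeof(T));
		__pipeline_memcpy_async(&temp[lindex + BLOCK_SIZE] , &in[gindex + BLOCK_SIZE] , sizeof(T));
		__pipeline_memcpy_async(&temp2[lindex - RADIUS]    , &in2[gindex - RADIUS]    , sizeof(T));
		__pipeline_memcpy_async(&temp2[lindex + BLOCK_SIZE], &in2[gindex + BLOCK_SIZE], sizeof(T));
	}
	__pipeline_commit();

	// Independent work that does not touch temp/temp2 can overlap the copies.
	// This loop is purely illustrative, standing in for real compute-bound work.
	T busy = other[gindex];
	for (int i = 0; i < 32; i++)
		busy = busy * busy + (T)1.0e-6;
	aux_out[gindex] = busy;

	// Now wait for this thread's copies, then sync so neighbors' copies are visible too
	__pipeline_wait_prior(0);
	__syncthreads();

	T result = 0;
	for (int offset = -RADIUS; offset <= RADIUS; offset++)
		result += temp[lindex + offset] * temp2[lindex - offset];
	out[gindex] = result;
}

Even with this structure, whether it helps depends on how much independent work is actually available and on whether the kernel is memory-bound; on cc8.0 and later the copies also avoid staging through registers (LDGSTS), which is a separate, smaller benefit.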