Shared memory inconsistent results

capitanoano95 · July 17, 2023, 1:36pm

Hello,

I am trying to implement a shared memory optimization on a piece of code but I see inconsistent results.

In this code below I am looping on the z dimension for every XY plane in a block, iteratively loading values in the shared memory as i go up in Z.

Over the ex, ey and ez arrays updates, you can see commented out a global memory access to those same values previously loaded into shared memory. Well, that works perfectly.

The problem arises when I use shared memory. I get completely different results at every run.

All the compute sanitizer checks return 0 error/warning/info.

The shared arrays should hold the same exact values as those in gblbal memory, but they do not.

Also, I noted that using less than 4 of these shared arrays does not produce erroneous results (in release mode). Any combination.

At this point I am not even sure how to debug this.

Any help is super appreciated.

#include<stdio.h>

#include "constantsGPU.h"
#define assert( X , i, j, k, str) if ( !(X)) { printf("%s failed at %d, %d, %d \n", str, i, j, k ); return; }

#define BLOCKSIZE 16

__global__ void EStepPMLNew( float* ex, float* ey, float* ez,
	float* hx,float* hy, float* hz, 
	bool dbg)
{
	int bx = blockIdx.x;
	int by = blockIdx.y;
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	int i = bx*blockDim.x + tx;
	int j = by*blockDim.y + ty;

	int index = i + j*wE;

	__shared__ float  currHx[BLOCKSIZE+1][BLOCKSIZE+1];
	__shared__ float  currHy[BLOCKSIZE+1][BLOCKSIZE+1];
	__shared__ float  currHz[BLOCKSIZE+1][BLOCKSIZE+1];
	__shared__ float aboveHx[BLOCKSIZE][BLOCKSIZE];
	__shared__ float aboveHy[BLOCKSIZE][BLOCKSIZE];

	if(i < Nx && j < Ny){

	// load the first plane of H
		currHx[tx][ty] = hx[i+j*wHx];
		currHy[tx][ty] = hy[i+j*wHy];
		currHz[tx][ty] = hz[i+j*wHz];

		if(tx == BLOCKSIZE-1){
			currHy[tx+1][ty] = hy[i+1+j*wHy];
			currHz[tx+1][ty] = hz[i+1+j*wHz];
		}
		if(ty == BLOCKSIZE-1){
			currHx[tx][ty+1] = hx[i+(j+1)*wHx];
			currHz[tx][ty+1] = hz[i+(j+1)*wHz];
		}

	}
	__syncthreads();


	for(int z = 0; z < Nz; z++){

		if (i < Nx && j < Ny) {
			aboveHx[tx][ty] = hx[i+j*wHx+(z+1)*whHx];
			aboveHy[tx][ty] = hy[i+j*wHy+(z+1)*whHy];
		}
		__syncthreads();

		
		if (i < Nx && j < Ny) {

			// assert shared memory is correct
			int k = z;

			// ex[index + z*whE] -= d_dtmu*(d_invdZ*(hy[i+j*wHy+(k+1)*whHy]-hy[i+j*wHy+k*whHy])
			// 	-d_invdY*(hz[i+(j+1)*wHz+k*whHz]-hz[i+j*wHz+k*whHz]));
			ex[index + z*whE] -= d_dtmu*(d_invdZ*(aboveHy[tx][ty]-currHy[tx][ty])
				-d_invdY*(currHz[tx][ty+1]-currHz[tx][ty]));

			// ey[index + z*whE] -= d_dtmu*(d_invdX*(hz[i+1+j*wHz+k*whHz]-hz[i+j*wHz+k*whHz])
			// 	-d_invdZ*(hx[i+j*wHx+(k+1)*whHx]-hx[i+j*wHx+k*whHx]));
			ey[index + z*whE] -= d_dtmu*(d_invdX*(currHz[tx+1][ty]-currHz[tx][ty])
				-d_invdZ*(aboveHx[tx][ty]-currHx[tx][ty]));
			
			// ez[index + z*whE] -= d_dtmu*(d_invdY*(hx[i+(j+1)*wHx+k*whHx]-hx[i+j*wHx+k*whHx])
			// 	-d_invdX*(hy[i+1+j*wHy+k*whHy]-hy[i+j*wHy+k*whHy]));
			ez[index + z*whE] -= d_dtmu*(d_invdY*(currHx[tx][ty+1]-currHx[tx][ty])
				-d_invdX*(currHy[tx+1][ty]-currHy[tx][ty]));
		}
		__syncthreads();

		if (i < Nx && j < Ny) {

			if (z < Nz-1) {	
				if(tx == BLOCKSIZE-1){
					currHy[tx+1][ty] = hy[i+1+j*wHy+(z+1)*whHy];
					currHz[tx+1][ty] = hz[i+1+j*wHz+(z+1)*whHz];
				}
				if(ty == BLOCKSIZE-1){
					currHx[tx][ty+1] = hx[i+(j+1)*wHx+(z+1)*whHx];
					currHz[tx][ty+1] = hz[i+(j+1)*wHz+(z+1)*whHz];
				}
				currHx[tx][ty] = aboveHx[tx][ty];
				currHy[tx][ty] = aboveHy[tx][ty];
				currHz[tx][ty] = hz[i+j*wHz+(z+1)*whHz];
			}

		}
		__syncthreads();

	}
}

Robert_Crovella · July 17, 2023, 1:49pm

Pick a value that ends up wrong. Put in-kernel printf at various points in your code that display the various values used in calculating that value. To avoid getting overwhelmed with print-out data, only print the components for a single value, e.g. for/from a single thread, using conditional checks before printing, based on block indices and thread indices. If it helps, reduce the problem size to the smallest that still shows calculation errors.

You can also do something similar with cuda-gdb. Session 12 of this online tutorial gives enough information to train you to be able to do that. It basically requires knowledge of how to set breakpoints, how to step through code, and how to inspect data, all of which is covered there.

capitanoano95 · July 18, 2023, 8:13am

Thank you. I was already doing it but reducing it to one single thread helped me find the issue. Thanks again

system · August 1, 2023, 8:13am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Inconsistent results across multiple runs? CUDA Programming and Performance	0	4174	October 4, 2008
Question on Shared Memory access by multiple threads CUDA Programming and Performance cuda , kernel	4	58	March 11, 2025
Shared memory coherance problem within a thread CUDA Programming and Performance	3	760	November 17, 2010
Shared Memory Again What is happening Here CUDA Programming and Performance	11	2173	June 5, 2009
Problems when using shared memory CUDA Programming and Performance cuda	1	335	May 15, 2024
bug in cuda-memcheck CUDA-MEMCHECK	1	3049	January 17, 2013
cuda fortran fdtd code and i have problem with shared memory Legacy PGI Compilers	1	1830	August 15, 2016
using a component in the share memory twice CUDA Programming and Performance	4	2615	September 22, 2010
COMPUTE-SANITIZER " Invalid __shared__ write of size 16 bytes " CUDA Programming and Performance	3	1228	August 17, 2022
Read/write out of shared memory CUDA Programming and Performance cuda , programming	3	53	October 9, 2024

Shared memory inconsistent results

Related topics