Using shared memory along with Unified Memory

I’m struggling to write an FDTD code in CUDA. It works when I use Unified Memory, but if I use shared memory inside the kernel it no longer works. Does anyone know whether the two work together?

Yes, they work together.
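
As a minimal sketch of the two used together (a 1D three-point stencil, not the poster's FDTD kernel), the kernel below stages a tile plus one halo cell on each side in shared memory while reading and writing buffers allocated with cudaMallocManaged. TILE, stencil3 and run_stencil_demo are hypothetical names chosen for this example, and it assumes n >= 1.

```cuda
#include <cuda_runtime.h>

// Hypothetical tile size for this sketch; the kernel is launched with TILE threads per block.
#define TILE 256

// out[i] = in[i-1] + in[i] + in[i+1], treating values outside [0, n) as 0.
// Each block stages its TILE inputs plus one halo cell on each side in shared memory.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;  // offset by one to leave room for the left halo

    // Every thread performs its load; out-of-range cells are filled with 0.
    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0 && i - 1 < n) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    __syncthreads();  // reached by every thread in the block, not inside a bounds check

    if (i < n)
        out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}

// Returns the number of mismatches against a CPU reference, or -1 on a CUDA error.
int run_stencil_demo(int n) {
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) { cudaFree(in); return -1; }
    for (int i = 0; i < n; ++i) in[i] = (float)(i % 7);  // the host writes managed memory directly

    stencil3<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait before the host reads `out`

    int mismatches = (err == cudaSuccess) ? 0 : -1;
    for (int i = 0; mismatches >= 0 && i < n; ++i) {
        float ref = (i > 0 ? in[i - 1] : 0.0f) + in[i] + (i + 1 < n ? in[i + 1] : 0.0f);
        if (out[i] != ref) ++mismatches;
    }
    cudaFree(in);
    cudaFree(out);
    return mismatches;
}
```

Shared memory is per-block on-chip storage and is unaffected by how the global buffers were allocated; what matters is that every shared index stays inside its declared extents and that __syncthreads() is reached by the whole block.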

This is part of my kernel's code. It used to work when I used cudaMalloc, cudaMemcpy, etc. When I switched to cudaMallocManaged, it stopped working. It only works when I remove the shared variables:
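
For reference, a minimal sketch of the host side after switching from cudaMalloc/cudaMemcpy to cudaMallocManaged. The posted kernel dereferences `domain->NX` and `domain->h_Ex` on the device, so the struct it points to must be device-accessible as well as the arrays; allocating both with cudaMallocManaged is one way to do that. The struct name Domain and the functions scale_ex and run_managed_domain_demo are hypothetical; only the field names come from the post.

```cuda
#include <cuda_runtime.h>

// Hypothetical struct holding the fields the posted kernel reads
// (NX, NY, NZ, h_Ex); the real definition is not shown in the thread.
struct Domain {
    int NX, NY, NZ;
    float* h_Ex;
};

// Dereferences the struct pointer on the device, as domain->NX and domain->h_Ex do.
__global__ void scale_ex(Domain* domain, float s) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = domain->NX * domain->NY * domain->NZ;
    if (idx < total) domain->h_Ex[idx] *= s;
}

// Returns 0 on success, 1 on wrong results, -1 on a CUDA error.
int run_managed_domain_demo(int nx, int ny, int nz) {
    // Both the struct and the array it points to are managed, so the same
    // pointers are valid on the host and inside the kernel.
    Domain* d = nullptr;
    if (cudaMallocManaged(&d, sizeof(Domain)) != cudaSuccess) return -1;
    d->NX = nx; d->NY = ny; d->NZ = nz;
    int total = nx * ny * nz;
    if (cudaMallocManaged(&d->h_Ex, total * sizeof(float)) != cudaSuccess) { cudaFree(d); return -1; }
    for (int i = 0; i < total; ++i) d->h_Ex[i] = 1.0f;  // no cudaMemcpy needed

    scale_ex<<<(total + 127) / 128, 128>>>(d, 2.0f);
    cudaError_t err = cudaGetLastError();
    // The host must wait for the kernel before touching managed data again.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int result = (err == cudaSuccess) ? 0 : -1;
    for (int i = 0; result == 0 && i < total; ++i)
        if (d->h_Ex[i] != 2.0f) result = 1;
    cudaFree(d->h_Ex);
    cudaFree(d);
    return result;
}
```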

```cuda
// global (x, y, z) coordinates of this thread
int a = blockIdx.x * blockDim.x + threadIdx.x;
int b = blockIdx.y * blockDim.y + threadIdx.y;
int c = blockIdx.z * blockDim.z + threadIdx.z;

// thread's flattened global index
int threadId = a + b*domain->NX + c * domain->NX * domain->NY;

// shared memory index helper variables
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;

// shared memory variables
__shared__ float s_Ex[WIDTH][WIDTH][WIDTH];
__shared__ float s_Ey[WIDTH][WIDTH][WIDTH];
__shared__ float s_Ez[WIDTH][WIDTH][2*WIDTH+1];

if (threadId < (domain->NX*domain->NY*domain->NZ)) {

	// auxiliary variable to calculate the coordinates i,j,k of the vector
	int aux_threadId;

	// flattened (3D to 1D) indices of the neighbours
	// E(i+1,j,k), E(i,j+1,k) and E(i,j,k+1)
	int e_threadId_i, e_threadId_k, e_threadId_j;

	// calculate the index referred to E(i+1,j,k)
	e_threadId_i = threadId + 1;

	// calculate the index referred to E(i,j+1,k)
	e_threadId_j = threadId + domain->NX;

	// calculate the index referred to E(i,j,k+1)
	e_threadId_k = threadId + (domain->NX * domain->NY);

	// setting shared memory variables
	s_Ex[tz][ty][tx] = domain->h_Ex[threadId + WIDTH];
	s_Ey[tz][ty][tx] = domain->h_Ey[threadId + WIDTH];
	s_Ey[tz][ty][tx + WIDTH] = domain->h_Ey[threadId + WIDTH];
	s_Ez[tz][ty][tx] = domain->h_Ez[threadId + WIDTH];
	s_Ez[tz][ty][tx + WIDTH] = domain->h_Ez[threadId + WIDTH];

	// synchronize the threads
```
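
A few things in this fragment are worth checking. `s_Ey` is declared `[WIDTH][WIDTH][WIDTH]`, yet `s_Ey[tz][ty][tx + WIDTH]` indexes past its last dimension (only `s_Ez` has room for `tx + WIDTH`). The reads use `threadId + WIDTH`, which can run past the end of the arrays for the last threads if they hold exactly NX*NY*NZ elements (their allocation isn't shown). The single flattened check lets a thread with `a >= NX` wrap into the next row. And if the synchronization sits inside the `if`, threads that fail the check never reach it. The sketch below shows a 3D tile load that avoids these, assuming a simple forward difference along x rather than the poster's FDTD update; TW, diff_x and run_diff_x_demo are hypothetical names, and only one halo cell in x is staged.

```cuda
#include <cuda_runtime.h>

// Hypothetical tile edge for this sketch; launched with dim3(TW, TW, TW) threads.
#define TW 8

// out[i,j,k] = ex[i+1,j,k] - ex[i,j,k] (0 on the last x-plane), with ex staged in
// shared memory. The x extent has one extra slot for the i+1 halo cell, so every
// shared index stays inside the declared extents.
__global__ void diff_x(const float* ex, float* out, int NX, int NY, int NZ) {
    __shared__ float s_ex[TW][TW][TW + 1];
    int i = blockIdx.x * TW + threadIdx.x;
    int j = blockIdx.y * TW + threadIdx.y;
    int k = blockIdx.z * TW + threadIdx.z;
    int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z;

    // Check each coordinate separately; a flattened-index check lets i >= NX wrap.
    bool inside = (i < NX) && (j < NY) && (k < NZ);
    size_t g = (size_t)i + (size_t)j * NX + (size_t)k * NX * NY;

    s_ex[tz][ty][tx] = inside ? ex[g] : 0.0f;
    // The last thread in x also loads the halo cell i+1, if it exists.
    if (tx == TW - 1)
        s_ex[tz][ty][TW] = (inside && i + 1 < NX) ? ex[g + 1] : 0.0f;

    __syncthreads();  // outside any bounds check: every thread of the block reaches it

    if (inside)
        out[g] = (i + 1 < NX) ? s_ex[tz][ty][tx + 1] - s_ex[tz][ty][tx] : 0.0f;
}

// Returns 0 if the GPU result matches a CPU reference, 1 if not, -1 on a CUDA error.
int run_diff_x_demo(int NX, int NY, int NZ) {
    size_t total = (size_t)NX * NY * NZ;
    float *ex = nullptr, *out = nullptr;
    if (cudaMallocManaged(&ex, total * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&out, total * sizeof(float)) != cudaSuccess) { cudaFree(ex); return -1; }
    for (size_t g = 0; g < total; ++g) ex[g] = (float)(g % 13);

    dim3 threads(TW, TW, TW);
    dim3 blocks((NX + TW - 1) / TW, (NY + TW - 1) / TW, (NZ + TW - 1) / TW);
    diff_x<<<blocks, threads>>>(ex, out, NX, NY, NZ);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int result = (err == cudaSuccess) ? 0 : -1;
    for (int k = 0; result == 0 && k < NZ; ++k)
        for (int j = 0; result == 0 && j < NY; ++j)
            for (int i = 0; result == 0 && i < NX; ++i) {
                size_t g = (size_t)i + (size_t)j * NX + (size_t)k * NX * NY;
                float ref = (i + 1 < NX) ? ex[g + 1] - ex[g] : 0.0f;
                if (out[g] != ref) result = 1;
            }
    cudaFree(ex);
    cudaFree(out);
    return result;
}
```

Domain sizes that are not multiples of TW, such as 19 x 10 x 5, exercise the per-axis bounds checks and the zero-filled halo.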

Start by running the failing case with cuda-memcheck (replaced by compute-sanitizer in newer CUDA toolkits).
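
Alongside the memory checker, checking the status of every runtime call and kernel launch makes failures visible in the program itself instead of showing up only as wrong results. A minimal sketch; check_cuda, CHECK_CUDA, fill and run_checked_launch are hypothetical names. Launch-configuration errors are returned by cudaGetLastError(), while faults during kernel execution surface at the following cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print any CUDA error with file/line and return it unchanged.
static cudaError_t check_cuda(cudaError_t err, const char* what, const char* file, int line) {
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s:%d: %s failed: %s\n", file, line, what, cudaGetErrorString(err));
    return err;
}
#define CHECK_CUDA(call) check_cuda((call), #call, __FILE__, __LINE__)

__global__ void fill(float* p, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

// Returns 0 if allocation, launch and execution all succeed, 1 otherwise (-1 if allocation fails).
int run_checked_launch(int n, int threads_per_block) {
    float* p = nullptr;
    if (CHECK_CUDA(cudaMallocManaged(&p, n * sizeof(float))) != cudaSuccess) return -1;
    fill<<<(n + threads_per_block - 1) / threads_per_block, threads_per_block>>>(p, n, 3.0f);
    cudaError_t err = CHECK_CUDA(cudaGetLastError());               // launch errors
    if (err == cudaSuccess) err = CHECK_CUDA(cudaDeviceSynchronize());  // execution errors
    cudaFree(p);
    return err == cudaSuccess ? 0 : 1;
}
```

For example, run_checked_launch(1000, 2048) asks for more than the 1024 threads per block a block may have, so cudaGetLastError() reports an invalid configuration and the function returns 1.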