I'm struggling to write an FDTD code in CUDA. It works when I use unified memory, but if I also use shared memory inside the kernel, it doesn't work. Does anyone know if they work together?
Yes, they work together.
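Shared memory is on-chip storage private to each thread block, and a kernel fills it with ordinary loads from global memory. It does not matter whether that global pointer came from `cudaMalloc` or `cudaMallocManaged`. Below is a minimal sketch (not the poster's FDTD kernel) of a 1D neighbour-difference kernel that stages data from managed arrays through a `__shared__` tile with one halo cell. The names `diffKernel`, `count_mismatches` and `TILE` are hypothetical. The only step specific to managed memory is the `cudaDeviceSynchronize()` before the host reads the results.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // threads per block; also the shared-memory tile length (hypothetical)

// out[i] = in[i+1] - in[i], with in[] staged through shared memory.
__global__ void diffKernel(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 1];  // +1 halo cell for the right-hand neighbour
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // Every thread loads (0 when out of range) so the barrier below is uniform.
    tile[tid] = (gid < n) ? in[gid] : 0.0f;
    if (tid == blockDim.x - 1) {
        tile[tid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;  // halo load
    }
    __syncthreads();  // reached by all threads of the block

    if (gid < n - 1) {
        out[gid] = tile[tid + 1] - tile[tid];
    }
}

// Host-side check: out[i] should equal (i+1)^2 - i^2 = 2i + 1.
int count_mismatches(const float* out, int n) {
    int bad = 0;
    for (int i = 0; i < n - 1; ++i)
        if (out[i] != 2.0f * i + 1.0f) ++bad;
    return bad;
}

int main() {
    const int n = 1000;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));   // one pointer, valid on host and device
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i * i);  // host writes directly, no cudaMemcpy

    diffKernel<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    cudaError_t err = cudaDeviceSynchronize();  // must finish before the host touches managed data
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int bad = count_mismatches(out, n);
    printf("mismatches: %d\n", bad);
    cudaFree(in);
    cudaFree(out);
    return bad != 0;
}
```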
This is part of my kernel's code. It used to work when I used cudaMalloc, cudaMemcpy, etc. When I switched to cudaMallocManaged, it stopped working. It only works when I remove the shared variables:
```cuda
// helper to calculate the thread's global index
int a = blockIdx.x * blockDim.x + threadIdx.x;
int b = blockIdx.y * blockDim.y + threadIdx.y;
int c = blockIdx.z * blockDim.z + threadIdx.z;
// thread's global (linear) index
int threadId = a + b * domain->NX + c * domain->NX * domain->NY;
// shared memory index helper variables
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;
// shared memory variables
__shared__ float s_Ex[WIDTH][WIDTH][WIDTH];
__shared__ float s_Ey[WIDTH][WIDTH][WIDTH];
__shared__ float s_Ez[WIDTH][WIDTH][2*WIDTH+1];
if (threadId < (domain->NX * domain->NY * domain->NZ)) {
    // auxiliary variable to calculate the coordinates i,j,k of the vector
    int aux_threadId;
    // mapping 3D to 1D: the new indices referring
    // to E(i+1,j,k), E(i,j+1,k) and E(i,j,k+1)
    int e_threadId_i, e_threadId_k, e_threadId_j;
    // calculate the index referring to E(i+1,j,k)
    e_threadId_i = threadId + 1;
    // calculate the index referring to E(i,j+1,k)
    e_threadId_j = threadId + domain->NX;
    // calculate the index referring to E(i,j,k+1)
    e_threadId_k = threadId + (domain->NX * domain->NY);
    // setting shared memory variables
    s_Ex[tz][ty][tx] = domain->h_Ex[threadId + WIDTH];
    s_Ey[tz][ty][tx] = domain->h_Ey[threadId + WIDTH];
    s_Ey[tz][ty][tx + WIDTH] = domain->h_Ey[threadId + WIDTH];
    s_Ez[tz][ty][tx] = domain->h_Ez[threadId + WIDTH];
    s_Ez[tz][ty][tx + WIDTH] = domain->h_Ez[threadId + WIDTH];
    // synchronize the threads
    __syncthreads();
    // ... rest of the kernel ...
}
```
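For comparison, here is a sketch of a 3D shared-memory tile load that avoids the usual pitfalls visible in the fragment above. Everything here is an assumption for illustration: a `Domain` struct with the fields used in the post (`NX`, `NY`, `NZ`, `h_Ex`, `h_Ey`, `h_Ez`) plus a made-up `curlEx` output, `WIDTH = 8` with blocks of `WIDTH x WIDTH x WIDTH` threads, zero field values outside the grid, and unit grid spacing. The stand-in computation is the x-component of curl E, not the poster's actual update. The differences from the fragment: each dimension is bounds-checked separately (with only the linear check, a thread with `a >= NX` aliases a cell in the next row), the shared arrays are sized to include every halo cell that is written (`s_Ey[tz][ty][tx + WIDTH]` writes past the end of a `[WIDTH][WIDTH][WIDTH]` array), global reads are guarded so that offsets like `threadId + WIDTH` cannot run past the end of the field arrays, and `__syncthreads()` sits outside the `if`, because every thread in the block must reach the barrier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WIDTH 8  // assumed tile edge; blockDim must be (WIDTH, WIDTH, WIDTH)

// Hypothetical layout modelled on the post. The struct itself is allocated with
// cudaMallocManaged, so the kernel may dereference d->... directly.
struct Domain {
    int NX, NY, NZ;
    float *h_Ex, *h_Ey, *h_Ez;  // field arrays, x index fastest
    float *curlEx;              // hypothetical output array
};

__host__ __device__ inline int idx3(int i, int j, int k, int NX, int NY) {
    return i + j * NX + k * NX * NY;
}

__global__ void curlExKernel(Domain* d) {
    // Layout [z][y][x]; one extra layer for the k+1 (Ey) and j+1 (Ez) neighbours.
    __shared__ float s_Ey[WIDTH + 1][WIDTH][WIDTH];
    __shared__ float s_Ez[WIDTH][WIDTH + 1][WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z;
    int i = blockIdx.x * WIDTH + tx;
    int j = blockIdx.y * WIDTH + ty;
    int k = blockIdx.z * WIDTH + tz;
    bool inside = (i < d->NX) && (j < d->NY) && (k < d->NZ);  // per-dimension check
    int id = inside ? idx3(i, j, k, d->NX, d->NY) : 0;

    // Every thread fills its own cell (0 outside the grid): no uninitialised reads later.
    s_Ey[tz][ty][tx] = inside ? d->h_Ey[id] : 0.0f;
    s_Ez[tz][ty][tx] = inside ? d->h_Ez[id] : 0.0f;
    // The last layer of threads also loads the neighbour just beyond the tile.
    if (tz == WIDTH - 1)
        s_Ey[WIDTH][ty][tx] = (inside && k + 1 < d->NZ) ? d->h_Ey[id + d->NX * d->NY] : 0.0f;
    if (ty == WIDTH - 1)
        s_Ez[tz][WIDTH][tx] = (inside && j + 1 < d->NY) ? d->h_Ez[id + d->NX] : 0.0f;

    __syncthreads();  // outside the bounds check: all threads reach it

    if (inside) {
        float dEz_dy = s_Ez[tz][ty + 1][tx] - s_Ez[tz][ty][tx];
        float dEy_dz = s_Ey[tz + 1][ty][tx] - s_Ey[tz][ty][tx];
        d->curlEx[id] = dEz_dy - dEy_dz;  // unit spacing assumed
    }
}

int main() {
    Domain* d;
    cudaMallocManaged(&d, sizeof(Domain));
    d->NX = 10; d->NY = 12; d->NZ = 9;  // deliberately not multiples of WIDTH
    size_t n = (size_t)d->NX * d->NY * d->NZ;
    cudaMallocManaged(&d->h_Ex, n * sizeof(float));
    cudaMallocManaged(&d->h_Ey, n * sizeof(float));
    cudaMallocManaged(&d->h_Ez, n * sizeof(float));
    cudaMallocManaged(&d->curlEx, n * sizeof(float));
    for (int k = 0; k < d->NZ; ++k)
        for (int j = 0; j < d->NY; ++j)
            for (int i = 0; i < d->NX; ++i) {
                int id = idx3(i, j, k, d->NX, d->NY);
                d->h_Ex[id] = 0.0f;
                d->h_Ey[id] = 2.0f * k;  // dEy/dz = 2
                d->h_Ez[id] = 3.0f * j;  // dEz/dy = 3
            }

    dim3 block(WIDTH, WIDTH, WIDTH);
    dim3 grid((d->NX + WIDTH - 1) / WIDTH, (d->NY + WIDTH - 1) / WIDTH,
              (d->NZ + WIDTH - 1) / WIDTH);
    curlExKernel<<<grid, block>>>(d);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); return 1; }

    // Away from the upper y/z faces every cell should hold 3 - 2 = 1.
    int bad = 0;
    for (int k = 0; k + 1 < d->NZ; ++k)
        for (int j = 0; j + 1 < d->NY; ++j)
            for (int i = 0; i < d->NX; ++i)
                if (d->curlEx[idx3(i, j, k, d->NX, d->NY)] != 1.0f) ++bad;
    printf("interior mismatches: %d\n", bad);

    cudaFree(d->h_Ex); cudaFree(d->h_Ey); cudaFree(d->h_Ez); cudaFree(d->curlEx);
    cudaFree(d);
    return bad != 0;
}
```

The offsets `+1`, `+NX` and `+NX*NY` produced by `idx3` are the same ones the post uses for `e_threadId_i`, `e_threadId_j` and `e_threadId_k`.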
Start by running the failing case with cuda-memcheck.
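cuda-memcheck (superseded by compute-sanitizer in newer CUDA toolkits) reports which kernel, thread and address performed an invalid access. It also helps to check every runtime call, because kernel errors are reported asynchronously and typically surface only at the next synchronizing call. A minimal sketch follows; `CUDA_CHECK`, `gridSize` and `scale` are hypothetical names, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Number of blocks needed to cover n elements (ceiling division).
static inline int gridSize(int n, int block) { return (n + block - 1) / block; }

__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* x;
    CUDA_CHECK(cudaMallocManaged(&x, n * sizeof(float)));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    scale<<<gridSize(n, 256), 256>>>(x, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

    printf("x[0] = %f\n", x[0]);
    CUDA_CHECK(cudaFree(x));
    return 0;
}
```

Compiling with `nvcc -lineinfo` lets the tool map faults to source lines. Then run the program (called `fdtd` here, a hypothetical name) as `cuda-memcheck ./fdtd`, or `compute-sanitizer ./fdtd` on newer toolkits. Adding `--tool racecheck` looks for shared-memory data races, and `--tool synccheck` flags invalid barrier usage such as a `__syncthreads()` placed inside divergent code.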