Is num_z the total number of elements of the 3D array in the z direction?
If I understand the code right, each thread takes care of [idx, idy, every idz with idz % blockDim.z == threadIdx.z] in one call.
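To check my reading, here is a minimal sketch of that pattern as I understand it. The kernel name `scale3d`, the x-fastest memory layout, and the single block in the z direction of the grid are my assumptions for illustration, not the code from the earlier post.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: name, arguments, and layout are assumptions.
__global__ void scale3d(float *data, int num_x, int num_y, int num_z, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    if (idx >= num_x || idy >= num_y) return;

    // Assumes gridDim.z == 1: each thread walks the z direction in steps of
    // blockDim.z, so it visits every idz with idz % blockDim.z == threadIdx.z.
    for (int idz = threadIdx.z; idz < num_z; idz += blockDim.z) {
        size_t i = ((size_t)idz * num_y + idy) * num_x + idx;  // x is fastest
        data[i] *= factor;
    }
}

// Runs the kernel on an array of ones and returns the number of elements
// that were not scaled exactly once (-1 on a CUDA error).
int run_scale3d()
{
    const int num_x = 8, num_y = 8, num_z = 10;  // num_z is not a multiple of 4
    const size_t n = (size_t)num_x * num_y * num_z;
    std::vector<float> h(n, 1.0f);

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(4, 4, 4);
    dim3 grid((num_x + block.x - 1) / block.x,
              (num_y + block.y - 1) / block.y,
              1);  // one block in z; the loop in the kernel covers num_z
    scale3d<<<grid, block>>>(d, num_x, num_y, num_z, 2.0f);

    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (float v : h)
        if (v != 2.0f) ++bad;
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_scale3d());
    return 0;
}
```

With this layout num_z does not have to be a multiple of blockDim.z, since the loop bound handles the remainder.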
That’s very insightful.
A second thought on shared memory…
Since all the data in c is accessed within one thread block, should I load the whole of c into shared memory, provided shared memory is big enough to hold it?
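What I have in mind is roughly the following. This is only a sketch under my own assumptions: c is a small read-only 1D array of `num_c` floats, and the kernel name `apply_coeffs` and the arithmetic are made up for illustration. The threads of a block copy c into dynamically sized shared memory together, synchronize, and then read only the shared copy.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: c is assumed to be small and read-only.
__global__ void apply_coeffs(const float *in, const float *c, float *out,
                             int n, int num_c)
{
    extern __shared__ float c_s[];  // num_c floats, size passed at launch

    // Cooperative load: the threads of the block stride over c together.
    for (int k = threadIdx.x; k < num_c; k += blockDim.x)
        c_s[k] = c[k];
    __syncthreads();  // c_s must be complete before any thread reads it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * c_s[i % num_c];
}

// Returns the number of wrong outputs (-1 on a CUDA error or if c does not fit).
int run_apply_coeffs()
{
    const int n = 1000, num_c = 16;
    const size_t shmem = num_c * sizeof(float);

    // Check that c fits into the shared memory available to one block.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    if (shmem > prop.sharedMemPerBlock) return -1;

    std::vector<float> h_in(n), h_c(num_c), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;
    for (int k = 0; k < num_c; ++k) h_c[k] = (float)(k + 1);

    float *d_in = nullptr, *d_c = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_c, shmem) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c.data(), shmem, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    // Third launch parameter is the dynamic shared memory size in bytes.
    apply_coeffs<<<blocks, threads, shmem>>>(d_in, d_c, d_out, n, num_c);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_c); cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] * h_c[i % num_c]) ++bad;
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_apply_coeffs());
    return 0;
}
```

The copy costs one pass over c per block, so I would expect it to pay off only when each element of c is read several times within the block.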