Is cudaMalloc3D-allocated memory automatically cached into shared memory when accessing it using the pitched pointer?

Hi everybody,

I have 3 questions:

1st: cudaMalloc3D() is recommended for allocating 3D arrays, but it allocates in global memory, right? That only makes sense to me if there is some automatic caching mechanism behind the scenes that caches into shared memory, because shared memory, as you know, is much faster than global memory. Is that what happens? All I found in the Programming Guide is the following example:
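(I'm reproducing the guide's pitched-pointer example from memory below, so the details may differ slightly from the exact code in the guide:)

```cuda
// Device code: walk the padded allocation slice by slice, row by row.
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char*  devPtr     = (char*)devPitchedPtr.ptr;
    size_t pitch      = devPitchedPtr.pitch;  // bytes per (padded) row
    size_t slicePitch = pitch * height;       // bytes per (padded) slice
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

// Host code: allocate a width x height x depth array of floats.
cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
```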

2nd: Can you please help me understand the above code? As far as I understand, the loop divides the 3D array into slices, then each slice into rows, and then reads each row, but I don't get the following statement: float element = row[x];

What does that mean? Is a new variable called "element" created in each iteration of the innermost loop? And where is it allocated: in a register, or in shared memory?

3rd: Which method gives the best performance:

a) allocating 3D arrays using cudaMalloc3D(), then accessing the elements through the pitched pointer, or

b) allocating the 3D array using plain cudaMalloc() (as a 1D array), then copying it to shared memory as follows (I tried this one and it works, but I need to know whether this is a better approach than approach a):

__global__ void test(int *d_a, int *d_b){

__shared__ int dd_a[8][8][8];

int ix = blockIdx.x*blockDim.x + threadIdx.x;

int iy = blockIdx.y*blockDim.y + threadIdx.y;

int iz = blockIdx.z*blockDim.z + threadIdx.z;

// stage the element through shared memory (this assumes a single
// 8x8x8 block, so ix, iy and iz all stay in [0, 8))
dd_a[ix][iy][iz] = d_a[ix*8*8 + iy*8 + iz];

d_b[ix*8*8 + iy*8 + iz] = dd_a[ix][iy][iz] + 1;
}


Thank you very much