Hello All,

I wanted to profile the performance of global memory access on the Maxwell architecture (Jetson Nano).

I implemented the following kernels, which simply copy the data from a vector A to a vector B.

Using `__ldg`:

```
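__global__ void accessGlobalMemoryLDG(const typeImg* __restrict__ A, typeImg* B, const int n)
{
    // Note: i is computed but not used below, so every thread copies all n elements.
    unsigned i = ((blockIdx.x) * (blockDim.x)) + (threadIdx.x);
    for (int idx = 0; idx < n; idx++)
    {
        // Load through the read-only data cache path.
        B[idx] = __ldg(&A[idx]);
    }
}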
```

Using the L2 cache:

```
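__global__ void accessGlobalMemoryL1L2(typeImg* A, typeImg* B, const int n)
{
    // Note: i is computed but not used below, so every thread copies all n elements.
    unsigned i = ((blockIdx.x) * (blockDim.x)) + (threadIdx.x);
    for (int idx = 0; idx < n; idx++)
    {
        // Ordinary global load.
        B[idx] = A[idx];
    }
}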
```
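
For comparison: both kernels above compute `i` but never use it, so each thread walks the whole array. A variant in which each thread copies distinct elements could look like the sketch below. It is only an illustration under assumptions: `typeImg` is not defined in this post, so `unsigned char` is assumed, and the kernel name `copyGridStride`, the launch configuration, and the use of managed memory are all hypothetical choices, not what was measured here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: typeImg is not defined in the post; unsigned char is used here.
typedef unsigned char typeImg;

// Hypothetical grid-stride copy: thread i handles elements
// i, i + stride, i + 2 * stride, ... so no element is copied twice.
__global__ void copyGridStride(const typeImg* __restrict__ A, typeImg* B, const int n)
{
    const unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned stride = gridDim.x * blockDim.x;
    for (unsigned idx = i; idx < (unsigned)n; idx += stride)
    {
        B[idx] = __ldg(&A[idx]);
    }
}

int main()
{
    const int n = 12 * 1024;
    typeImg* A = nullptr;
    typeImg* B = nullptr;

    // Managed memory keeps the sketch short; error checking is omitted.
    cudaMallocManaged(&A, n * sizeof(typeImg));
    cudaMallocManaged(&B, n * sizeof(typeImg));
    for (int k = 0; k < n; ++k)
    {
        A[k] = (typeImg)(k % 251);
        B[k] = 0;
    }

    copyGridStride<<<8, 64>>>(A, B, n);
    cudaDeviceSynchronize();  // required before the host reads B

    // Verify the copy on the host.
    int mismatches = 0;
    for (int k = 0; k < n; ++k)
    {
        mismatches += (A[k] != B[k]);
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(A);
    cudaFree(B);
    return mismatches != 0;
}
```

With this indexing the total amount of work stays at n element copies regardless of the grid size, whereas in the kernels above it grows with the number of launched threads.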

The results for n = 12*1024 bytes:

| (blocks, threads per block) | LDG | L2 |
|---|---|---|
| (1, 32) | 0.71 | 2.36 |
| (1, 64) | 0.75 | 2.365 |
| (8, 32) | 0.9 | 2.34 |
| (8, 64) | 1.27 | 2.34 |
| (32, 32) | 2.31 | 2.41 |
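
For anyone who wants to reproduce numbers of this kind, here is a minimal timing sketch using CUDA events. It is an assumption about how such measurements can be taken, not a record of the exact method or units behind the table: `typeImg` is assumed to be `unsigned char`, the helper `timeKernelMs` is hypothetical, one warm-up launch precedes each timed launch, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: typeImg is not defined in the post; unsigned char is used here.
typedef unsigned char typeImg;

// Same loop bodies as the kernels in the post: every thread copies all n elements.
__global__ void accessGlobalMemoryLDG(const typeImg* __restrict__ A, typeImg* B, const int n)
{
    for (int idx = 0; idx < n; idx++) B[idx] = __ldg(&A[idx]);
}

__global__ void accessGlobalMemoryL1L2(typeImg* A, typeImg* B, const int n)
{
    for (int idx = 0; idx < n; idx++) B[idx] = A[idx];
}

// Hypothetical helper: returns the elapsed time of one launch in milliseconds.
template <typename Kernel>
float timeKernelMs(Kernel kernel, int blocks, int threads, typeImg* A, typeImg* B, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(A, B, n);  // warm-up launch, not timed

    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(A, B, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 12 * 1024;
    typeImg* A = nullptr;
    typeImg* B = nullptr;
    cudaMalloc(&A, n * sizeof(typeImg));
    cudaMalloc(&B, n * sizeof(typeImg));
    cudaMemset(A, 1, n * sizeof(typeImg));

    // Launch configurations taken from the table: (blocks, threads per block).
    const int configs[5][2] = { {1, 32}, {1, 64}, {8, 32}, {8, 64}, {32, 32} };
    for (int c = 0; c < 5; ++c)
    {
        const int blocks = configs[c][0];
        const int threads = configs[c][1];
        const float ldgMs = timeKernelMs(accessGlobalMemoryLDG, blocks, threads, A, B, n);
        const float l2Ms = timeKernelMs(accessGlobalMemoryL1L2, blocks, threads, A, B, n);
        printf("(%d,%d) LDG: %.3f ms  L2: %.3f ms\n", blocks, threads, ldgMs, l2Ms);
    }

    cudaFree(A);
    cudaFree(B);
    return 0;
}
```

A single timed launch is noisy for kernels this short, so averaging over several launches would likely give more stable figures.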

Can anyone please explain why, for the larger grid and block configuration (32,32), the performance of `__ldg` is close to that of the L2 version?

Thanks.