__ldg profiling against L2 on Maxwell

Hello All,

I wanted to profile the performance of global memory access on the Maxwell architecture (Jetson Nano).

I implemented the following kernel that simply copies the data from a vector A to a vector B,
using __ldg:

__global__ void accessGlobalMemoryLDG(const typeImg* __restrict__ A, typeImg* B, const int n)
{
    unsigned i = (blockIdx.x * blockDim.x) + threadIdx.x;
    for (int idx = 0; idx < n; idx++)
    {
        B[idx] = __ldg(&A[idx]);
    }
}

and using the L2 cache:

__global__ void accessGlobalMemoryL1L2(typeImg* A, typeImg* B, const int n)
{
    unsigned i = (blockIdx.x * blockDim.x) + threadIdx.x;
    for (int idx = 0; idx < n; idx++)
    {
        B[idx] = A[idx];
    }
}

These are the results for n = 12*1024 bytes:

(#Blocks, #Threads)   LDG    L2
(1, 32)               0.71   2.36
(1, 64)               0.75   2.365
(8, 32)               0.9    2.34
(8, 64)               1.27   2.34
(32, 32)              2.31   2.41

Can anyone please explain why, for the larger grid and block configuration (32,32), the performance of __ldg is close to the performance of the L2 path?

Thanks.

I have no idea what the column of performance data is supposed to indicate. Milliseconds of kernel execution time? Gigabytes per second?
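For what it's worth, the usual way to report such numbers is either kernel time or effective bandwidth. A minimal sketch using CUDA events, reusing the kernel and `typeImg` type from the post above (the `d_A`/`d_B` device pointers and the (32,32) launch configuration are placeholders):

```cuda
// Hedged sketch: time one kernel launch with CUDA events and report GB/s.
// Assumes typeImg, accessGlobalMemoryLDG, d_A, d_B, and n are set up as in the post.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
accessGlobalMemoryLDG<<<32, 32>>>(d_A, d_B, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// One read plus one write per copied element:
double gb = 2.0 * (double)n * sizeof(typeImg) / 1e9;
printf("%.3f ms, %.2f GB/s\n", ms, gb / (ms / 1e3));
```

Reporting bandwidth rather than raw time makes results comparable across different n and launch configurations.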

It also seems curious that you would design benchmark code where every thread in the block, indeed every thread in the grid, is moving the exact same piece of data from the same source location to the same destination location, at each iteration of the loop.
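For reference, a copy kernel where each thread handles distinct elements would actually use the computed thread index, typically via a grid-stride loop. This is a sketch of that pattern, not the poster's code:

```cuda
// Grid-stride copy sketch: each thread starts at its global index and
// strides by the total number of threads in the grid, so every element
// is copied exactly once and accesses are coalesced.
__global__ void copyGridStride(const typeImg* __restrict__ A, typeImg* B, const int n)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += gridDim.x * blockDim.x)
    {
        B[idx] = __ldg(&A[idx]);
    }
}
```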

Leaving all that aside, it’s unclear what you were expecting.

Copying data from location X to non-overlapping location Y should be mostly unaffected by caches. The only modifier to this might be if the caches were already populated with some data, such as via a previous cudaMemcpy operation.

__ldg indicates caching via the read-only cache, if possible. Both variants, as you indicate, would populate the L2.
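Note also that when a pointer is qualified `const ... * __restrict__`, as in the first kernel, the compiler is generally free to route loads through the read-only cache on its own, so the explicit intrinsic may be redundant. A small sketch:

```cuda
// With the const __restrict__ qualifiers, a plain dereference may already
// be compiled to a read-only (LDG) load; __ldg() only makes it explicit.
__global__ void copyAutoLDG(const typeImg* __restrict__ A, typeImg* B, const int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        B[idx] = A[idx];  // eligible for the read-only data cache path
}
```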

Generally, I wouldn’t expect much difference between these two methods, barring the above items.