Profiling simple shared memory transactions

Hi,
I am profiling a simple kernel to test shared memory access patterns on several devices, each with compute capability 2.1. The kernel performs just one load and one store to shared memory. Say I launch the kernel like:

kernel<<<1, dim3(32,32)>>>(d_in);

with the global array d_in containing exactly 1024 elements. In the kernel I expect no bank conflicts, and therefore only one transaction per shared load/store request. When I run the profiler like

nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./myExecutable

I notice that the reported metrics scale with the number of SMs; i.e., on a device with 2 SMs the profiler reports 2 transactions each, and on a device with 4 SMs it reports 4 transactions each.

Is it correct that the two metrics above are reported on a per-SM basis? If so, and I’m launching only 1 block, why does each SM report transactions? I would expect only one SM to be active for a one-block kernel launch.
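
For reference, the access pattern is essentially this (a simplified sketch of the kind of kernel I mean, not my exact code; the (32,32) block flattens to 1024 threads, one per element):

__global__ void kernel(int *d_in)
{
  __shared__ int sdata[1024];
  int tid = threadIdx.x + blockDim.x * threadIdx.y;  // 0..1023 within the single block
  sdata[tid] = d_in[tid];   // one shared store per thread; consecutive words, so no bank conflicts
  __syncthreads();
  d_in[tid] = sdata[tid];   // one shared load per thread; consecutive words, so no bank conflicts
}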

Thanks

It’s an artifact of the way the profiler does its sampling to compute the metric. In this case (Fermi), I think the profiler samples the data (shared transactions and warp requests) on a quad-SM (GPC) basis, and then performs various multiplications to scale the numerator and denominator.

If you want a precise explanation, it may have to wait for Greg@NV or someone else to come along.

The usual recommendation is not to try to draw conclusions from profiler data collected on a single threadblock. Ideally you want enough threadblocks to fill all SMs, preferably with no tail effect. Usually, though, just launching a “large” number of threadblocks is enough to get profiler data that is statistically pretty close to “accurate”.
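
One simple way to do that at run time is to scale the grid to the device’s SM count (a host-side sketch; it could replace the hard-coded block count in the test case below, and the 32-blocks-per-SM multiplier is just an arbitrary “large” choice):

int dev = 0, numSMs = 0;
cudaGetDevice(&dev);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
int nBLK = 32 * numSMs;   // enough blocks to keep every SM busy; any tail effect becomes negligible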

Here’s a simple test case:
The Quadro 5000 has, I think, 11 Fermi SMs, but they are grouped into quads (GPCs); the GT 640 has only 2 SMs, and they are Kepler SMs.

$ cat t904.cu
#include <stdio.h>
#include <stdlib.h>   // for atoi()

__global__ void kernel(int *d_in){

  __shared__ int sdata[1024];
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  sdata[threadIdx.x] = d_in[idx];       // shared store: each warp writes 32 consecutive words, so no bank conflicts
  __syncthreads();
  d_in[idx] = sdata[1023-threadIdx.x];  // shared load: each warp reads 32 consecutive words, so no bank conflicts
}

int main(int argc, char* argv[]){

  int nBLK = 1;
  if (argc > 1) nBLK = atoi(argv[1]);
  int *data;
  cudaMalloc(&data, nBLK*1024*sizeof(int));
  kernel<<<nBLK,dim3(32,32)>>>(data);
  cudaDeviceSynchronize();
}
$ nvcc -o t904 t904.cu
$ nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./t904
==7947== NVPROF is profiling process 7947, command: ./t904
==7947== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==7947== Replaying kernel "kernel(int*)" (done)
==7947== Profiling application: ./t904
==7947== Profiling result:
==7947== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Quadro 5000 (0)"
        Kernel: kernel(int*)
          1      shared_load_transactions_per_request  Shared Memory Load Transactions Per Requ    3.656250    3.656250    3.656250
          1     shared_store_transactions_per_request  Shared Memory Store Transactions Per Req    3.656250    3.656250    3.656250
$ CUDA_VISIBLE_DEVICES="1" nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./t904
==7957== NVPROF is profiling process 7957, command: ./t904
==7957== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==7957== Replaying kernel "kernel(int*)" (done)
==7957== Profiling application: ./t904
==7957== Profiling result:
==7957== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GT 640 (0)"
        Kernel: kernel(int*)
          1      shared_load_transactions_per_request  Shared Memory Load Transactions Per Requ    1.000000    1.000000    1.000000
          1     shared_store_transactions_per_request  Shared Memory Store Transactions Per Req    1.000000    1.000000    1.000000
$ nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./t904 1000
==7966== NVPROF is profiling process 7966, command: ./t904 1000
==7966== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==7966== Replaying kernel "kernel(int*)" (done)
==7966== Profiling application: ./t904 1000
==7966== Profiling result:
==7966== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Quadro 5000 (0)"
        Kernel: kernel(int*)
          1      shared_load_transactions_per_request  Shared Memory Load Transactions Per Requ    1.001000    1.001000    1.001000
          1     shared_store_transactions_per_request  Shared Memory Store Transactions Per Req    1.001000    1.001000    1.001000
$ CUDA_VISIBLE_DEVICES="1" nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./t904 1000
==7975== NVPROF is profiling process 7975, command: ./t904 1000
==7975== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==7975== Replaying kernel "kernel(int*)" (done)
==7975== Profiling application: ./t904 1000
==7975== Profiling result:
==7975== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GT 640 (0)"
        Kernel: kernel(int*)
          1      shared_load_transactions_per_request  Shared Memory Load Transactions Per Requ    1.000000    1.000000    1.000000
          1     shared_store_transactions_per_request  Shared Memory Store Transactions Per Req    1.003031    1.003031    1.003031
$

We see that when launching only 1 block, the Q5000 gives some number less than 4, but the GT 640, which actually has 2 Kepler SMs, gives exactly 1. I believe the difference in behavior is due to a difference in how the profiler samples Fermi vs. Kepler SMs.

When launching lots of blocks, the Q5000 and GT 640 numbers are both approximately 1.
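
The metric itself is just total shared transactions divided by total warp-level shared requests, so a conflict-free 4-byte-per-thread access pattern should converge to 1.0. As a sanity check of the metric (not something run above), a deliberately conflicting variant of the kernel should push the number to roughly 2 on these devices, assuming the usual 32 banks of 4-byte words:

__global__ void conflict_kernel(int *d_in){

  __shared__ int sdata[2048];
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  // stride-2 word accesses: a warp's 32 threads hit only the even banks, two threads per bank,
  // i.e. a 2-way conflict, so ~2 transactions per request instead of ~1
  sdata[2*threadIdx.x] = d_in[idx];
  __syncthreads();
  d_in[idx] = sdata[2*threadIdx.x];
}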

Thanks, this makes sense. I’ve run with a larger number of blocks and see numbers closer to what is expected.

Cheers!