Question about dram_read_transactions

Hi,

I am new to CUDA programming and am trying to do some very basic profiling to verify my understanding of device memory reads. As I understand it, memory is read in 128-byte chunks in most cases, so the number of transactions should increase as the stride increases. However, with the following code, I sometimes see fewer transactions with a larger stride than with a smaller one. Below are the code and the command I used to run nvprof.

#include <cstdlib>
#include <iostream>

__global__
void read(float *x, float *y, int stride, int num_reads) {
    int index = 0;
    int lim = num_reads * stride + index;
    for (int i = index; i < lim; i += stride){
        y[0] += x[i];
    }
}
int main(int argc, char **argv)
{
    if (argc != 3) {
        std::cerr << "Usage: " << argv[0] << " <stride> <num_reads>" << std::endl;
        return 1;
    }
    std::cout << "stride = " << atoi(argv[1]) << std::endl;
    int stride = atoi(argv[1]);
    int num_reads = atoi(argv[2]);
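    int N = 1<<20;
    // The kernel reads x[0], x[stride], ..., x[(num_reads - 1) * stride]; reject out-of-range input.
    if (stride < 1 || num_reads < 1 || (long long)(num_reads - 1) * stride >= N) {
        std::cerr << "stride and num_reads must be positive, with (num_reads - 1) * stride < " << N << std::endl;
        return 1;
    }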
    float *x, *y;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, sizeof(float));
    for (int i = 0; i < N; i += 1){
        x[i] = 1.0f;
    }
    y[0] = 0.0f;
    read<<<1, 1>>>(x, y, stride, num_reads);
    cudaDeviceSynchronize();
    std::cout << "Result = " << y[0] << std::endl;
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Profiling command: nvprof --metrics dram_read_transactions ./read_test <stride> <num_reads>

Results:
==1726484== Profiling application: ./read_test 1 1000
==1726484== Profiling result:
==1726484== Metric result:
Invocations  Metric Name             Metric Description               Min  Max  Avg
Device "Tesla V100-SXM2-16GB (0)"
    Kernel: read(float*, float*, int, int)
          1  dram_read_transactions  Device Memory Read Transactions   75   75   75

==1726509== Profiling application: ./read_test 512 1000
==1726509== Profiling result:
==1726509== Metric result:
Invocations  Metric Name             Metric Description               Min  Max  Avg
Device "Tesla V100-SXM2-16GB (0)"
    Kernel: read(float*, float*, int, int)
          1  dram_read_transactions  Device Memory Read Transactions   60   60   60

I greatly appreciate your help on this matter!

When posting code on these forums, please format it.

A simple method would be to edit your post by clicking the pencil icon below it, select the code, then click the </> button at the top of the edit pane, then save your changes.

Please do that now.

Thank you for your suggestion. I have edited my post.

The compiler can either keep y[0] in a register or re-read it from memory on every iteration of the loop. Also, with a single thread running (<<<1,1>>>), the counter may include memory accesses that are not directly caused by the instructions of your kernel. Sometimes a handful of additional operations are performed, probably for initialization purposes.
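
To put numbers on the expectation in the question, here is a minimal host-side sketch that counts how many distinct aligned segments the loop's reads of x would touch. The helper name distinct_segments is hypothetical, and the sketch assumes x starts on a segment boundary and that each read maps to exactly one segment. The 128-byte size comes from the question; whether dram_read_transactions actually counts at that granularity is an assumption, so the segment size is left as a parameter.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper (not part of CUDA or nvprof): counts the distinct
// aligned segments of `segment_bytes` touched when reading `num_reads`
// elements of `elem_bytes` each, spaced `stride` elements apart.
// Assumes the array base address is aligned to `segment_bytes`.
std::size_t distinct_segments(std::size_t stride, std::size_t num_reads,
                              std::size_t elem_bytes, std::size_t segment_bytes) {
    std::set<std::size_t> touched;
    for (std::size_t k = 0; k < num_reads; ++k) {
        // Byte offset of the k-th element read by the loop in the kernel.
        std::size_t byte_offset = k * stride * elem_bytes;
        // Index of the aligned segment containing that offset.
        touched.insert(byte_offset / segment_bytes);
    }
    return touched.size();
}
```

For the two runs above, distinct_segments(1, 1000, sizeof(float), 128) gives 32 and distinct_segments(512, 1000, sizeof(float), 128) gives 1000. Neither matches the measured 75 and 60, which fits the point that the counter is picking up more than the kernel's own loads of x.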