Hi,
I am new to CUDA programming. I am trying to do some very basic profiling to verify my understanding of device memory reads. As I understand it, memory is read in 128-byte chunks in most cases, so the number of transactions should increase as the stride increases. However, with the following code, I sometimes see fewer transactions with a larger stride than with a smaller one. Below are the code and the command I used to run nvprof.
#include <cstdlib>
#include <iostream>

// Single-thread kernel: reads num_reads floats from x, spaced `stride`
// elements apart, and accumulates them into y[0].
__global__
void read(float *x, float *y, int stride, int num_reads) {
    int index = 0;
    int lim = num_reads * stride + index;
    for (int i = index; i < lim; i += stride) {
        y[0] += x[i];
    }
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        std::cerr << "Usage: " << argv[0] << " <stride> <num_reads>" << std::endl;
        return 1;
    }
    int stride = atoi(argv[1]);
    int num_reads = atoi(argv[2]);
    std::cout << "stride = " << stride << std::endl;

    // Note: stride * num_reads must not exceed N, or the kernel reads out of bounds.
    int N = 1<<20;
    float *x, *y;
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, sizeof(float));
    for (int i = 0; i < N; i += 1) {
        x[i] = 1.0f;
    }
    y[0] = 0.0f;

    // One block with one thread, so all reads are issued serially by a single thread.
    read<<<1, 1>>>(x, y, stride, num_reads);
    cudaDeviceSynchronize();
    std::cout << "Result = " << y[0] << std::endl;

    cudaFree(x);
    cudaFree(y);
    return 0;
}
Profiling command: nvprof --metrics dram_read_transactions ./read_test <stride> <num_reads>
Results:
==1726484== Profiling application: ./read_test 1 1000
==1726484== Profiling result:
==1726484== Metric result:
Invocations    Metric Name               Metric Description                 Min    Max    Avg
Device "Tesla V100-SXM2-16GB (0)"
    Kernel: read(float*, float*, int, int)
          1    dram_read_transactions    Device Memory Read Transactions     75     75     75

==1726509== Profiling application: ./read_test 512 1000
==1726509== Profiling result:
==1726509== Metric result:
Invocations    Metric Name               Metric Description                 Min    Max    Avg
Device "Tesla V100-SXM2-16GB (0)"
    Kernel: read(float*, float*, int, int)
          1    dram_read_transactions    Device Memory Read Transactions     60     60     60
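To make the expectation concrete, here is a minimal host-side sketch of the count that the 128-byte model predicts for the two runs. It is not part of the program being profiled. The helper name `expected_chunks` is hypothetical, and the sketch assumes that `x` starts on a chunk boundary and that every distinct chunk touched costs exactly one transaction.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper (not part of the profiled program): counts the
// distinct chunk_bytes-sized chunks touched by num_reads float loads that
// are spaced `stride` elements apart.
// Assumption: the array starts on a chunk boundary.
std::size_t expected_chunks(std::size_t stride, std::size_t num_reads,
                            std::size_t chunk_bytes) {
    std::set<std::size_t> chunks;
    for (std::size_t k = 0; k < num_reads; ++k) {
        // Byte offset of the k-th element read by the kernel.
        std::size_t byte_offset = k * stride * sizeof(float);
        // Index of the chunk that contains this element.
        chunks.insert(byte_offset / chunk_bytes);
    }
    return chunks.size();
}
```

Under these assumptions, `expected_chunks(1, 1000, 128)` gives 32 and `expected_chunks(512, 1000, 128)` gives 1000, so the model predicts far more transactions for stride 512. The measured values of 75 and 60 match neither prediction, which is the source of my confusion.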
I greatly appreciate your help on this matter!