Squeezing max d2d memory bandwidth (GTX 480)

Hi,

I had earlier posted this in the Overclocking and Benchmarking forum, but I think this is a more appropriate forum for my question.

I am a newbie to GPGPU programming and was trying to write a benchmark to test the write bandwidth on a GTX 480 (device-to-device BW). Nvidia states a d2d BW of 177 GBps; however, I ran a small micro-kernel which predominantly does writes to global memory and I am getting a much higher figure (around 190 GBps). Is this possible?

I think I might be making a mistake in my calculations, or my code might be wrong.

[codebox]__global__ void MicroBM_BandwidthTest(float* A, int max_comp)
{
    // max_comp = 1000 in my runs
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // currently not using any shared memory --> but I don't think that is required
    for (int j = 0; j < max_comp; j++)
        A[i]++;
}[/codebox]

Now, since all memory accessed by the above kernel is global, there will be 1 global load and 1000 global stores for each thread and each data element.

The formula for calculating the BW is given below -

float bwinMBps = (num_ld_st /* number of loads/stores in the kernel = 1001 */
                  * ARRAY_SIZE /* total data transferred */)
                 * 1e3f /* because time is in milliseconds */
                 / (elapsedTimeInMs * (float)(1 << 20) /* divide by 1 M */);

By this calculation I get a write BW (d2d) of 190 GBps.

Where am i going wrong?

All help is appreciated.

Is this what you are doing?

float throughput = num_ld_st * ARRAY_SIZE * (1E-9) / (elapsedTimeInMs * (1E-3));

you might also want to:

float throughput = float(num_ld_st * ARRAY_SIZE * (1E-9)) / float(elapsedTimeInMs * (1E-3));

Edit: ARRAY_SIZE should of course be size in bytes…

This is not testing what you think it is testing. The GTX 480 is a compute capability 2.0 device with an L1 and L2 cache. After the initial read from global memory, all of the read and write iterations on A[i] are either serviced by thread-local registers or they are serviced by the L1/L2 cache. (The compiler doesn’t need to flush the local value of A[i] to global memory until the end of the kernel, so the actual location of A[i] during execution depends on the PTX generated by the compiler.) Either way, very little traffic will actually pass over the memory bus.

To benchmark this more properly, you want to create two large arrays in global memory (e.g. 50 MB each) and use a much simpler kernel to test:

__global__ void d2d(float *A, float *B)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  A[i] = B[i];
}

This simple array-to-array copy cannot be accelerated by a cache, so you will actually measure the memory bandwidth. Use 256 threads per block and however many blocks needed to cover the array. (You might also want to experiment and see if having each thread read and write several elements is faster.)
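
If it helps, here is a rough host-side sketch of how I would time that copy kernel (untested; the 50 MB array size, 256-thread blocks, and cudaEvent timing are just one reasonable setup, not the SDK's bandwidthTest):

[codebox]// Rough host-side timing sketch for the d2d copy kernel above.
// Array size, block size, and cudaEvent timing are assumptions, not
// taken from any particular benchmark.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void d2d(float *A, float *B)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    A[i] = B[i];
}

int main()
{
    // ~50 MB per array; N is an exact multiple of 256, so the kernel
    // needs no bounds check.
    const size_t N = 50 * (1 << 20) / sizeof(float);
    float *A, *B;
    cudaMalloc(&A, N * sizeof(float));
    cudaMalloc(&B, N * sizeof(float));
    cudaMemset(B, 0, N * sizeof(float));

    dim3 block(256);
    dim3 grid((unsigned)((N + block.x - 1) / block.x));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    d2d<<<grid, block>>>(A, B);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each element crosses the memory bus twice: one read and one write.
    double bytes = 2.0 * N * sizeof(float);
    printf("Effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaFree(A);
    cudaFree(B);
    return 0;
}[/codebox]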

The CUDA SDK includes a bandwidthTest application that includes a device-to-device test. If you run it, you will notice that, in practice, device-to-device bandwidth tends to be 20-30% lower than the theoretical value.

@seibert: Thanks a lot for your reply.

I checked the definition of D2D BW as given by Nvidia and it appears that you are right: D2D BW is only copying data from one location in global memory to another. However, I am not sure how useful this metric actually is.

Again you are right … however, when I use the Visual Profiler it gives me the number of L2 write misses as 28301758. Multiplying this by 32 (as each miss is a 32-byte transaction) and dividing by the GPU time (8300 microseconds) gives me a bandwidth of 110 GBps. Does this calculation seem correct?

However, I have a question here: the profiler also gives me the number of “gst request” per SM = 840000. Multiplying this by 128 (each gst request is 128 bytes) * 15 (number of SMs) and then dividing by the GPU time (8300 microseconds) gives me a BW of 192 GBps.

It seems the first calculation is correct (110 GBps), but then what is wrong with the second calculation?
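
Just to spell out the two calculations (the only inputs are the counter values and GPU time quoted above; this is only the arithmetic, not a claim about which counter is the right one to use):

[codebox]// Arithmetic only -- counter values and GPU time are the ones quoted above.
#include <cstdio>

int main()
{
    double gpu_time_s = 8300e-6;                       // 8300 microseconds

    // Estimate 1: L2 write misses, 32 bytes per miss
    double bw1 = 28301758.0 * 32.0 / gpu_time_s;

    // Estimate 2: gst requests per SM, 128 bytes per request, 15 SMs
    double bw2 = 840000.0 * 128.0 * 15.0 / gpu_time_s;

    printf("L2 write-miss estimate : %.1f GB/s\n", bw1 / 1e9);
    printf("gst request estimate   : %.1f GB/s\n", bw2 / 1e9);
    return 0;
}[/codebox]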

I ran your kernel but it gives me a BW of only 100 GB/s. :(

What would be a good kernel to test the streaming load/store bandwidth (i.e. testing the L2 bandwidth)?

Thanks again for your help.

Very useful, since many CUDA workloads with low arithmetic intensity (defined as the average number of math operations per element read) are limited by the speed of the bus to global memory. Historically, one of CUDA’s strong points has been looping over large datasets, where the fast GPU memory bus gives a high-end CUDA device a 5-10x performance advantage over the host processor.

You can think of device-to-device as a practical test of the memory bus speed. Even if your kernel is mostly read or mostly write, it should be a reasonable predictor for memory speed limited kernels.

I would be wary of trying to calculate anything absolute from profiler counters. They only report counts from one of the multiprocessors (or perhaps one cluster of multiprocessors?), and can be very confusing to relate to real-world quantities like bandwidth. They are intended to be relative measures to compare kernels as you make changes.

In any case, I don’t think you can infer anything about memory bandwidth from your test. The kernel spends most of its time in a loop incrementing a variable, not reading or writing to global memory. (If you are talking about my kernel here, then the previous warning about interpreting counters applies.)

It is possible that a kernel this simple is not completely optimal. (The hardware scheduler is a bit of a mystery.) You can try playing around with the block size, or doing multiple reads per thread, and see what happens.

I think you could test this with the same kernel I gave, but take the index modulo some number that is smaller than the number of elements that fit in the L2 cache, but larger than what fits in L1. This should ensure that the vast majority of reads and writes are serviced by the L2 cache, but not the L1 cache.
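
For example (rough sketch only; L2_ELEMS is a placeholder you would tune to land between the L1 and L2 capacities, not a value from any benchmark):

[codebox]// Sketch of the modulo idea: wrap the index so the working set fits in L2
// but not in L1. L2_ELEMS is a hypothetical tuning knob -- 128K floats
// = 512 KB, bigger than the 16/48 KB per-SM L1 but smaller than the
// 768 KB L2 on a GTX 480.
#define L2_ELEMS (128 * 1024)

__global__ void l2_copy(float *A, float *B)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = i % L2_ELEMS;   // many threads touch the same element; harmless
                            // for a bandwidth test, a race anywhere else
    A[j] = B[j];
}[/codebox]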

Actually, you can be sure the L1 is not used if you pass to nvcc the compiler option:

-Xptxas -dlcm=cg

Wouldn’t it be more appropriate to say that low arithmetic intensity workloads require a high L2 bandwidth rather than a high global memory to global memory copy BW? Also, after all, it is always the L2 to global memory bandwidth (load/store) that should matter most.

I would beg to differ here - your point would be right if the grid size (i.e. number of blocks) were NOT a multiple of the number of SMs. However, if it is a multiple of the number of SMs, then I think the profiler counters should be quite reliable.

So, in this case, wouldn’t the “number of L2 read/write misses” * 32, divided by the GPU time, give the actual L2 BW achieved by the workload?

I will definitely try optimizing your code and try to get a measure of the bandwidth.

An unrelated question: Does the following formula for average IPC calculation sound correct?

(Instructions Executed * Number of SM’s * 1e6) / (GPU time in microseconds * 1.4 * 1e9) = IPC

L2 bandwidth is unimportant for workloads that don’t benefit from a cache. (Since CUDA devices had no significant caching for the first 3 years of their existence, there are a lot of CUDA programs like this.) These “streaming” workloads involve looping over large datasets in each iteration. BTW, the global memory to global memory copy is a tool for measuring the GPU to global memory bandwidth.

Also, just to be sure I’m using the same terminology as you: When you say “L2 bandwidth” you mean “bandwidth between multiprocessor registers and the L2 cache,” right? If instead you mean “L2 to global memory bandwidth”, then we have been talking about the same thing, basically, without realizing it. :)

I think I need to know what you mean by “L2 Bandwidth”. See above.

What is the factor of 1.4 for?

He he … I think we are talking about the same thing. By L2 BW I mean “L2 to Global Memory BW” :). So, now does the following formula make sense?

L2 BW in GBps = ((num L2 read misses + num L2 write misses) * 32) / ( GPU time in microseconds * (1e3f))

1.4 is due to the Fermi clock frequency (the full factor in the formula is actually 1.4 * 1e9, i.e. 1.4 GHz).
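
In code, this is how I am computing it (the counter values here are placeholders to fill in from the profiler; only the formula is the point):

[codebox]// Placeholders for the counter values -- only the formula comes from
// the discussion above.
#include <cstdio>

int main()
{
    double l2_read_misses  = 1.0e7;   // placeholder, from the profiler
    double l2_write_misses = 1.0e7;   // placeholder, from the profiler
    double gpu_time_us     = 8300.0;  // placeholder, GPU time in microseconds

    // 32 bytes per L2 miss; dividing bytes by (microseconds * 1e3) gives GB/s
    double l2_bw_gbps = (l2_read_misses + l2_write_misses) * 32.0
                        / (gpu_time_us * 1e3);

    printf("L2 <-> global memory BW: %.1f GB/s\n", l2_bw_gbps);
    return 0;
}[/codebox]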
