Squeezing max d2d memory bandwidth (GTX 480)

Hi,

I had earlier posted this in the Overclocking and Benchmarking forum, but I think this is a more appropriate forum for my question.

I am a newbie to GPGPU programming and was trying to write a benchmark to test the write bandwidth on a GTX 480 (device-to-device BW). Nvidia states a d2d BW of 177 GBps; however, I ran a small micro-kernel which predominantly does writes to global memory and I am getting a much higher figure (around 190 GBps). Is this possible?

I think I might be making a mistake in my calculations, or my code might be wrong.

[codebox]__global__ void MicroBM_BandwidthTest(float* A, int max_comp)
{
    // max_comp = 1000 in my runs
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // currently not using any shared memory --> but I don't think that is required
    for (int j = 0; j < max_comp; j++)
        A[i]++;
}[/codebox]

Now, since all memory accessed by the above kernel is global, there will be 1 global load and 1000 global stores for each thread and each data element.

The formula for calculating the BW is given below -

float bwinMBps = (num_ld_st /* number of loads/stores in the kernel = 1001 */
                  * ARRAY_SIZE /* total data transferred */)
                 * 1e3f /* because time is in milliseconds */
                 / (elapsedTimeInMs * (float)(1 << 20) /* divide by 1 M */);

By this calculation I get a write BW (d2d) of 190 GBps.

Where am i going wrong?

All help is appreciated.

Is this what you are doing?

float throughput = num_ld_st * ARRAY_SIZE * (1E-9) / (elapsedTimeInMs * (1E-3));

you might also want to:

float throughput = float(num_ld_st * ARRAY_SIZE * (1E-9)) / float(elapsedTimeInMs * (1E-3));

Edit: ARRAY_SIZE should of course be size in bytes…

This is not testing what you think it is testing. The GTX 480 is a compute capability 2.0 device with an L1 and L2 cache. After the initial read from global memory, all of the read and write iterations on A[i] are either serviced by thread-local registers or they are serviced by the L1/L2 cache. (The compiler doesn’t need to flush the local value of A[i] to global memory until the end of the kernel, so the actual location of A[i] during execution depends on the PTX generated by the compiler.) Either way, very little traffic will actually pass over the memory bus.

To benchmark this more properly, you want to create two large arrays in global memory (e.g. 50 MB each) and use a much simpler kernel to test:

__global__ void d2d(float *A, float *B)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  A[i] = B[i];
}

This simple array-to-array copy cannot be accelerated by a cache, so you will actually measure the memory bandwidth. Use 256 threads per block and however many blocks needed to cover the array. (You might also want to experiment and see if having each thread read and write several elements is faster.)
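
If it helps, here is a rough host-side sketch of how I would time that copy kernel (untested; the 50 MB array size, 256-thread blocks, and cudaEvent timing are just one reasonable setup, not the SDK's bandwidthTest):

[codebox]// Rough host-side timing sketch for the d2d copy kernel above.
// Array size, block size, and cudaEvent timing are assumptions, not
// taken from any particular benchmark.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void d2d(float *A, float *B)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    A[i] = B[i];
}

int main()
{
    // ~50 MB per array; N is an exact multiple of 256, so the kernel
    // needs no bounds check.
    const size_t N = 50 * (1 << 20) / sizeof(float);
    float *A, *B;
    cudaMalloc(&A, N * sizeof(float));
    cudaMalloc(&B, N * sizeof(float));
    cudaMemset(B, 0, N * sizeof(float));

    dim3 block(256);
    dim3 grid((unsigned)((N + block.x - 1) / block.x));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    d2d<<<grid, block>>>(A, B);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each element crosses the memory bus twice: one read and one write.
    double bytes = 2.0 * N * sizeof(float);
    printf("Effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaFree(A);
    cudaFree(B);
    return 0;
}[/codebox]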

The CUDA SDK includes a bandwidthTest application that includes a device-to-device test. If you run it, you will notice that, in practice, device-to-device bandwidth tends to be 20-30% lower than the theoretical value.

@seibert: Thanks a lot for your reply.

I checked the definition of D2D BW as given by Nvidia and it appears that you are right: D2D BW is only copying data from one location in global memory to another. However, I am not sure how useful this metric actually is.

Again you are right … however, when I use the Visual Profiler it gives me the number of L2 write misses as 28301758. Multiplying this by 32 (as each miss is a 32-byte transaction) and dividing by the GPU time (8300 microseconds) gives me a bandwidth of 110 GBps. Does this calculation seem correct?

However, I have a question here: the profiler also gives me the number of “gst request” per SM = 840000. Multiplying this by 128 (each gst request is 128 bytes) * 15 (number of SMs) and then dividing by the GPU time (8300 microseconds) gives me a BW of 192 GBps.

It seems the first calculation is correct (110 GBps), but then what is wrong with the second calculation?
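
Just to spell out the two calculations (the only inputs are the counter values and GPU time quoted above; this is only the arithmetic, not a claim about which counter is the right one to use):

[codebox]// Arithmetic only -- counter values and GPU time are the ones quoted above.
#include <cstdio>

int main()
{
    double gpu_time_s = 8300e-6;                       // 8300 microseconds

    // Estimate 1: L2 write misses, 32 bytes per miss
    double bw1 = 28301758.0 * 32.0 / gpu_time_s;

    // Estimate 2: gst requests per SM, 128 bytes per request, 15 SMs
    double bw2 = 840000.0 * 128.0 * 15.0 / gpu_time_s;

    printf("L2 write-miss estimate : %.1f GB/s\n", bw1 / 1e9);
    printf("gst request estimate   : %.1f GB/s\n", bw2 / 1e9);
    return 0;
}[/codebox]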

I ran your kernel but it gives me a BW of only 100 GB/s. :(

What would be a good kernel to test the streaming load/store bandwidth (i.e. testing the L2 bandwidth)?

Thanks again for your help.

Very useful, since many CUDA workloads with low arithmetic intensity (defined as the average number of math operations per element read) are limited by the speed of the bus to global memory. Historically, one of CUDA’s strong points has been looping over large datasets, where the fast GPU memory bus gives a high-end CUDA device a 5-10x performance advantage over the host processor.

You can think of device-to-device as a practical test of the memory bus speed. Even if your kernel is mostly read or mostly write, it should be a reasonable predictor for memory speed limited kernels.

I would be wary of trying to calculate anything absolute from profiler counters. They only report counts from one of the multiprocessors (or perhaps one cluster of multiprocessors?), and can be very confusing to relate to real-world quantities like bandwidth. They are intended to be relative measures to compare kernels as you make changes.

In any case, I don’t think you can infer anything about memory bandwidth from your test. The kernel spends most of its time in a loop incrementing a variable, not reading or writing to global memory. (If you are talking about my kernel here, then the previous warning about interpreting counters applies.)

It is possible that a kernel this simple is not completely optimal. (The hardware scheduler is a bit of a mystery.) You can try playing around with the block size, or doing multiple reads per thread, and see what happens.

I think you could test this with the same kernel I gave, but take the index modulo some number that is smaller than the number of elements that fit in the L2 cache, but larger than what fits in L1. This should ensure that the vast majority of reads and writes are serviced by the L2 cache, but not the L1 cache.
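
For example (rough sketch only; L2_ELEMS is a placeholder you would tune to land between the L1 and L2 capacities, not a value from any benchmark):

[codebox]// Sketch of the modulo idea: wrap the index so the working set fits in L2
// but not in L1. L2_ELEMS is a hypothetical tuning knob -- 128K floats
// = 512 KB, bigger than the 16/48 KB per-SM L1 but smaller than the
// 768 KB L2 on a GTX 480.
#define L2_ELEMS (128 * 1024)

__global__ void l2_copy(float *A, float *B)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = i % L2_ELEMS;   // many threads touch the same element; harmless
                            // for a bandwidth test, a race anywhere else
    A[j] = B[j];
}[/codebox]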

Actually, you can be sure the L1 is not used if you pass to nvcc the compiler option:

-Xptxas -dlcm=cg

Wouldn’t it be more appropriate to say that low arithmetic intensity workloads require a high L2 bandwidth rather than a high global memory to global memory copy BW? Also, after all, it is always the L2 to global memory bandwidth (load/store) that should matter most.

I would beg to differ here - your point would be right if the grid size (i.e. number of blocks) were NOT a multiple of the number of SMs. However, if it is a multiple of the number of SMs, then I think the profiler counters should be quite reliable.

So, in this case, wouldn’t the “number of L2 read/write misses” * 32, divided by the GPU time, give the actual L2 BW achieved by the workload?

I will definitely try optimizing your code and try to get a measure of the bandwidth.

An unrelated question: Does the following formula for average IPC calculation sound correct?

(Instructions Executed * Number of SM’s * 1e6) / (GPU time in microseconds * 1.4 * 1e9) = IPC

L2 bandwidth is unimportant for workloads that don’t benefit from a cache. (Since CUDA devices had no significant caching for the first 3 years of their existence, there are a lot of CUDA programs like this.) These “streaming” workloads involve looping over large datasets in each iteration. BTW, the global memory to global memory copy is a tool for measuring the GPU to global memory bandwidth.

Also, just to be sure I’m using the same terminology as you: When you say “L2 bandwidth” you mean “bandwidth between multiprocessor registers and the L2 cache,” right? If instead you mean “L2 to global memory bandwidth”, then we have been talking about the same thing, basically, without realizing it. :)

I think I need to know what you mean by “L2 Bandwidth”. See above.

What is the factor of 1.4 for?

He he … I think we are talking about the same thing. By L2 BW I mean “L2 to Global Memory BW” :). So, now does the following formula make sense?

L2 BW in GBps = ((num L2 read misses + num L2 write misses) * 32) / ( GPU time in microseconds * (1e3f))

1.4 is due to the Fermi clock frequency (the full factor in the formula is actually 1.4 * 1e9, i.e. 1.4 GHz).
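
In code, this is how I am computing it (the counter values here are placeholders to fill in from the profiler; only the formula is the point):

[codebox]// Placeholders for the counter values -- only the formula comes from
// the discussion above.
#include <cstdio>

int main()
{
    double l2_read_misses  = 1.0e7;   // placeholder, from the profiler
    double l2_write_misses = 1.0e7;   // placeholder, from the profiler
    double gpu_time_us     = 8300.0;  // placeholder, GPU time in microseconds

    // 32 bytes per L2 miss; dividing bytes by (microseconds * 1e3) gives GB/s
    double l2_bw_gbps = (l2_read_misses + l2_write_misses) * 32.0
                        / (gpu_time_us * 1e3);

    printf("L2 <-> global memory BW: %.1f GB/s\n", l2_bw_gbps);
    return 0;
}[/codebox]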
