global memory bandwidth problem

Hi.

I’m trying to reach the maximum bandwidth of global memory (for a GTX 275: 127 GB/s).

I tested with a simple CUDA program to see what the effective bandwidth is,

but the results (from the Visual Profiler) show that the bandwidth does not reach the maximum.

The program copies 1024*1024 floats, and 1024*1024 threads are launched.

The results are,

Read/Write (read from global mem and write to another global mem addr) : 104.481 GB/s

Read only (from global mem to shared mem) : 67.3892 GB/s

Write only (just writing 1.0 to global mem) : 71.978 GB/s

I would like to ask,

  1. Why does Read/Write not reach the theoretical bandwidth of 127 GB/s?

    What prevents the on-chip memory controller from being fully utilized?

  2. Why do Read only and Write only give poorer bandwidth than Read/Write?

    Where does the difference come from?

Could somebody help me interpret these results?

The code I tested is below.

Thanks!

#include <stdio.h>
#include <cuda.h>

// Read/Write: read a float from A and write it to B
__global__ void kernel_rw(float *A, float *B)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	B[i] = A[i];
	__syncthreads();
}

// Read only: read from global memory into shared memory
__global__ void kernel_ro(float *A, float *B)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	__shared__ float shared_mem[512];
	shared_mem[i%512] = A[i];
	__syncthreads();
}

// Write only: write a constant to B
__global__ void kernel_wo(float *A, float *B)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	B[i] = 1.0;
	__syncthreads();
}

int main(void)
{
	int N = 1024*1024;
	size_t size = N * sizeof(float);

	float *d_A;
	cudaMalloc((void**)&d_A, size);
	float *d_B;
	cudaMalloc((void**)&d_B, size);

	dim3 dimBlock(512, 1, 1);
	dim3 dimGrid(N/512, 1);

	kernel_rw<<<dimGrid, dimBlock>>>(d_A, d_B);
	kernel_ro<<<dimGrid, dimBlock>>>(d_A, d_B);
	kernel_wo<<<dimGrid, dimBlock>>>(d_A, d_B);
	cudaThreadSynchronize();

	cudaFree(d_A);
	cudaFree(d_B);
	printf("Done\n");
	return 0;
}
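
For reference, here is a minimal sketch of how the same numbers could be cross-checked without the profiler, by timing kernel_rw with CUDA events and computing the effective bandwidth by hand (2*size accounts for N floats read plus N floats written; GB here means 10^9 bytes, which may differ slightly from what the profiler reports):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel_rw<<<dimGrid, dimBlock>>>(d_A, d_B);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

// kernel_rw reads N floats and writes N floats -> 2*size bytes of traffic
double gbps = (2.0 * size) / (ms / 1000.0) / 1e9;
printf("Read/Write effective bandwidth: %.3f GB/s\n", gbps);

cudaEventDestroy(start);
cudaEventDestroy(stop);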

Integer modulo is very expensive on current hardware; that might be one reason your read-only kernel is slower than it should be.

Thanks.

shared_mem[i%512] = A[i];

should be changed to

shared_mem[threadIdx.x] = A[i];

With that change, the result is 86.0052 GB/s for Read only.

But the gap is still not trivial.

Any other possibilities?

Substituting __mul24() for the integer multiplication in your indexing calculations will win a few cycles. Full 32-bit multiplication is also very slow on current hardware.
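
For example, the index calculation in the kernels could be written as follows (a sketch; blockDim.x * blockIdx.x stays well within 24 bits for this grid, so __mul24() is safe here):

int i = __mul24(blockDim.x, blockIdx.x) + threadIdx.x;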

Hi, the following methods might be useful if you want to reach peak bandwidth:

  1. reduce the use of shared memory; use registers instead where possible;

  2. do multiple memory reads (or writes) in one thread (see the sketch after this list);

  3. choose the sizes of your dimBlock and dimGrid carefully;

  4. use as much of your global memory as possible (N = 1024*1024 in your program is too small to reach the peak);

  5. avoid the partition camping problem.
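
Regarding point 2, a rough sketch of what multiple memory reads (or writes) per thread could look like for the read/write copy (the kernel name and the factor of 4 are just examples, not something I have benchmarked on your card):

__global__ void kernel_rw_multi(float *A, float *B, int N)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	int stride = blockDim.x * gridDim.x;
	// each thread copies several elements instead of one
	for (int j = i; j < N; j += stride)
		B[j] = A[j];
}

// launched with fewer blocks so each thread does e.g. 4 copies:
// kernel_rw_multi<<<N/(512*4), 512>>>(d_A, d_B, N);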

A bandwidth of 96%-97% of the peak can be achieved on a GT200 card; just have a look at this thread:

http://forums.nvidia.com/index.php?showtop…t=#entry1004107