global memory bandwidth problem

Hi.

I’m trying to reach the maximum bandwidth of global memory (for a GTX 275: 127 GB/s).

I tested with a simple CUDA program to see what the effective bandwidth is,

but the results (from the Visual Profiler) show that the bandwidth does not reach the maximum.

The program copies 1024*1024 floats, and 1024*1024 threads are launched.

The results are,

Read/Write (read from global mem and write to another global mem addr) : 104.481 GB/s

Read only (from global mem to shared mem) : 67.3892 GB/s

Write only (just writing 1.0 to global mem) : 71.978 GB/s

I would like to ask,

  1. Why does Read/Write not reach the theoretical bandwidth of 127 GB/s?

    What prevents the on-chip memory controller from being fully utilized?

  2. Why do Read only and Write only give poorer bandwidth than Read/Write?

    Where does the difference come from?

Could somebody help me interpret these results?

The code I tested is below.

Thanks!

#include <stdio.h>
#include <cuda.h>

// Read/Write: read a float from A and write it to B
__global__ void kernel_rw(float *A, float *B)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	B[i] = A[i];
	__syncthreads();
}

// Read only: read from global memory into shared memory
__global__ void kernel_ro(float *A, float *B)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	__shared__ float shared_mem[512];
	shared_mem[i%512] = A[i];
	__syncthreads();
}

// Write only: write a constant to B
__global__ void kernel_wo(float *A, float *B)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	B[i] = 1.0;
	__syncthreads();
}

int main(void)
{
	int N = 1024*1024;
	size_t size = N * sizeof(float);

	float *d_A;
	cudaMalloc((void**)&d_A, size);
	float *d_B;
	cudaMalloc((void**)&d_B, size);

	dim3 dimBlock(512, 1, 1);
	dim3 dimGrid(N/512, 1);

	kernel_rw<<<dimGrid, dimBlock>>>(d_A, d_B);
	kernel_ro<<<dimGrid, dimBlock>>>(d_A, d_B);
	kernel_wo<<<dimGrid, dimBlock>>>(d_A, d_B);
	cudaThreadSynchronize();

	cudaFree(d_A);
	cudaFree(d_B);
	printf("Done\n");
	return 0;
}
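
For reference, here is a minimal sketch of how the same numbers could be cross-checked without the profiler, by timing kernel_rw with CUDA events and computing the effective bandwidth by hand (2*size accounts for N floats read plus N floats written; GB here means 10^9 bytes, which may differ slightly from what the profiler reports):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel_rw<<<dimGrid, dimBlock>>>(d_A, d_B);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

// kernel_rw reads N floats and writes N floats -> 2*size bytes of traffic
double gbps = (2.0 * size) / (ms / 1000.0) / 1e9;
printf("Read/Write effective bandwidth: %.3f GB/s\n", gbps);

cudaEventDestroy(start);
cudaEventDestroy(stop);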

Integer modulo is very expensive on current hardware; that might be one reason your read-only kernel is slower than it should be.

Thanks.

shared_mem[i%512] = A[i];

should be changed to

shared_mem[threadIdx.x] = A[i];

With that change, the result is 86.0052 GB/s for Read only.

But the gap is still not trivial.

Any other possibilities?

Substituting __mul24() for the integer multiplication in your indexing calculations will win a few cycles. Full 32-bit multiplication is also very slow on current hardware.
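
For example, the index calculation in the kernels could be written as follows (a sketch; blockDim.x * blockIdx.x stays well within 24 bits for this grid, so __mul24() is safe here):

int i = __mul24(blockDim.x, blockIdx.x) + threadIdx.x;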

Hi, the following methods might be useful if you want to reach peak bandwidth:

  1. reduce the use of shared memory; use registers instead where possible;

  2. do multiple memory reads (or writes) in one thread (see the sketch after this list);

  3. choose the sizes of your dimBlock and dimGrid carefully;

  4. use as much of your global memory as possible (N = 1024*1024 in your program is too small to reach the peak);

  5. avoid the partition camping problem.
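
Regarding point 2, a rough sketch of what multiple memory reads (or writes) per thread could look like for the read/write copy (the kernel name and the factor of 4 are just examples, not something I have benchmarked on your card):

__global__ void kernel_rw_multi(float *A, float *B, int N)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	int stride = blockDim.x * gridDim.x;
	// each thread copies several elements instead of one
	for (int j = i; j < N; j += stride)
		B[j] = A[j];
}

// launched with fewer blocks so each thread does e.g. 4 copies:
// kernel_rw_multi<<<N/(512*4), 512>>>(d_A, d_B, N);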

A bandwidth of 96%-97% of the peak can be achieved on a GT200 card; just have a look at this thread:

http://forums.nvidia.com/index.php?showtop…t=#entry1004107