Why does my program's bandwidth exceed the standard bandwidth?

Hello guys:

I have just started getting into the details of CUDA programming. I wrote a 1024 x 1024 matrix multiplication in which each block has 32 x 32 threads, and I run it (Win32 mode) as a Release build in Visual Studio 2012 Ultimate. My kernel, which uses shared memory, takes about 7ms. My device has compute capability 3.0 (EVGA GeForce GTX 680 Classified 4G). I calculate the bandwidth this way:
matrix_size * matrix_size * 32bit / 8 / 1024 / 1024 = 4GB, then 4GB / 0.07s = 528GB/s. However, the whitepaper for my card lists only 192.25GB/s. Did I do something wrong? BTW, I synchronize after the kernel launch and use cudaEvent to measure the execution time.

Any ideas?

Thank you so much!

“Win32 Mode”

this is something i would flag in the back of my head, particularly when things do not work as planned

regardless, the underlying assumption is indeed that the memory copies actually took place
perhaps catch errors prior to and after the kernel launch to verify that the kernel indeed launched
thereafter, look at the output, and note whether there are any patterns in it that may suggest what is going wrong
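
One way to look for patterns in the output is to compare it against a host-side reference. The sketch below is host-only C++ (it should also compile as CUDA host code). The helper names matmul_reference and first_mismatch are hypothetical, and it assumes row-major n x n int matrices that have already been copied back to the host into std::vector<int>.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference product C = A * B for row-major n x n int matrices, computed on the host.
// Assumes the values are small enough that the int sums do not overflow.
std::vector<int> matmul_reference(const std::vector<int>& a,
                                  const std::vector<int>& b,
                                  std::size_t n) {
    std::vector<int> c(n * n, 0);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t i = 0; i < n; ++i) {
            const int a_val = a[row * n + i];
            for (std::size_t col = 0; col < n; ++col) {
                c[row * n + col] += a_val * b[i * n + col];
            }
        }
    }
    return c;
}

// Index of the first element where `actual` differs from `expected`.
// Returns the number of compared elements when all of them match.
std::size_t first_mismatch(const std::vector<int>& expected,
                           const std::vector<int>& actual) {
    const std::size_t common = std::min(expected.size(), actual.size());
    for (std::size_t k = 0; k < common; ++k) {
        if (expected[k] != actual[k]) return k;
    }
    return common;
}
```

Dividing the returned index by n gives the row of the first bad element and the remainder gives its column, which can show whether errors start at a particular row or tile boundary.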

and i think i do not necessarily agree with your calculation - i would think it is more in the region of 8 or 12 GB to transfer; perhaps i am mistaken

matrix_size * matrix_size * 32bit / 8 / 1024 / 1024 = 4GB


This is 4MB not 4GB:

matrix_size * matrix_size * 32bit / 8

1024*1024 = 1M

This is 4:

matrix_size * matrix_size * 32bit / 8 / 1024 / 1024

Not sure how you are coming up with 4GB
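
The unit arithmetic above can be checked with a few lines of host-only C++. This is only a sketch: matrix_bytes and to_mib are hypothetical helper names, and 4 bytes per element is assumed from the "32bit" in the formula.

```cpp
#include <cstdint>

// Bytes occupied by one n x n matrix with `bytes_per_element` bytes per element.
constexpr std::uint64_t matrix_bytes(std::uint64_t n, std::uint64_t bytes_per_element) {
    return n * n * bytes_per_element;
}

// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// A 1024 x 1024 matrix of 32-bit elements is 4,194,304 bytes, i.e. 4 MiB.
static_assert(matrix_bytes(1024, 4) == 4194304, "one matrix is 4 MiB, not 4 GiB");
```

Dividing by 1024 twice converts bytes to MiB; a third division by 1024 would be needed to reach GiB.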

Yeah, thanks man. Just after I posted my topic I realized my calculation above was wrong. But I still cannot get a correct bandwidth figure. I think the bandwidth for matrix multiplication should be calculated like this:

The data is read from global memory twice, so the number of reads should be 2n^3, right? And the result is written back to global memory once, which is n^2. So there are 2n^3+n^2 global memory reads and writes in total, right?

So I calculated 2*n^3+n^2 bits / sec, but the result is still over the standard bandwidth.

no, 2*n^3 is not correct. not sure that bits/sec plays into this anywhere either.

An nxn matrix has n^2 elements, not n^3 (perhaps you are thinking of computations - that is a separate subject from bandwidth).

Reading two nxn matrices requires 2*n^2 reads of the element type. For a float that would be 4 bytes per element.

Since we have two reads and one write, the bandwidth would be 3*n^2 * 4bytes / 0.07s = 12582912bytes/0.07s = 179755885 bytes/s or approximately 179MB/s

This is well under your max theoretical bandwidth of 192GB/s (approximately 1,000 times smaller.)

If your element type is double, you would just double the above numbers.

Although this methodology may be lacking in several respects, the fact that your required bandwidth is so much lower than the theoretical available bandwidth suggests that your problem may not be bandwidth bound, and indeed we do not expect matrix multiply problems to be bandwidth bound.
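
The estimate above can be reproduced with a small host-only C++ sketch. The names min_traffic_bytes and effective_bandwidth are hypothetical, and the model is the same simplified one used in this reply: each input matrix is read once and the result is written once.

```cpp
#include <cstdint>

// Minimum global-memory traffic for C = A * B with n x n matrices:
// read A once, read B once, write C once.
constexpr std::uint64_t min_traffic_bytes(std::uint64_t n, std::uint64_t bytes_per_element) {
    return 3 * n * n * bytes_per_element;
}

// Effective bandwidth in bytes per second for a measured kernel time in seconds.
constexpr double effective_bandwidth(std::uint64_t bytes, double seconds) {
    return static_cast<double>(bytes) / seconds;
}

// For n = 1024 and 4-byte elements the traffic is 12,582,912 bytes.
static_assert(min_traffic_bytes(1024, 4) == 12582912, "3 * n^2 * 4 bytes");
```

With a time of 0.07s this gives roughly 1.8e8 bytes/s, and with 0.007s roughly 1.8e9 bytes/s, both far below 192GB/s.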

Thanks. Is there any way my bandwidth could exceed the theoretical bandwidth?

I just re-measured the execution times: 0.019s for the matrix multiplication without shared memory, and 0.007s with shared memory.

The version without shared memory is very simple, like the one in most CUDA tutorials:
__global__ void MatrixMul(int* d_a, int* d_b, int* d_c)
{
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;

	// Skip threads outside the matrix (valid indices are 0 .. MATRIX_SIZE - 1).
	if( row >= MATRIX_SIZE || col >= MATRIX_SIZE ) return;

	int idx = row * MATRIX_SIZE + col;
	int sum = 0;

	// Dot product of row `row` of d_a with column `col` of d_b.
	for(int i = 0; i != MATRIX_SIZE; i++){
		sum += d_a[row * MATRIX_SIZE + i] * d_b[i * MATRIX_SIZE + col];
	}

	d_c[idx] = sum;
}


And the version that uses shared memory is as follows:

__device__ int GetElement(const int* d_a, int row, int col)
{
	return d_a[row * MATRIX_SIZE + col];
}

__device__ void SetElements(int* d_a, int row, int col, int value)
{
	d_a[row * MATRIX_SIZE + col] = value;
}

// Pointer to the top-left element of the BLOCK_SIZE x BLOCK_SIZE tile at (row, col).
__device__ int* GetSubMatrix(int* d_a, int row, int col)
{
	int* subA = 0;

	subA = &d_a[MATRIX_SIZE * BLOCK_SIZE * row + BLOCK_SIZE * col];

	return subA;
}

// Use shared memory
__global__ void MatrixMulSharedMem(int* d_a, int* d_b, int* d_c)
{
	int blockRow = blockIdx.y;
	int blockCol = blockIdx.x;

	int* subC = GetSubMatrix(d_c, blockRow, blockCol);

	int value = 0;

	int row = threadIdx.y;
	int col = threadIdx.x;

	for(int i = 0; i != (MATRIX_SIZE / BLOCK_SIZE); i++){
		int* subA = GetSubMatrix(d_a, blockRow, i);
		int* subB = GetSubMatrix(d_b, i, blockCol);

		__shared__ int A_sub[BLOCK_SIZE][BLOCK_SIZE];
		__shared__ int B_sub[BLOCK_SIZE][BLOCK_SIZE];

		A_sub[row][col] = GetElement(subA, row, col);
		B_sub[row][col] = GetElement(subB, row, col);

		// Wait until the whole tile is loaded before any thread reads it.
		__syncthreads();

		for(int j = 0; j != BLOCK_SIZE; j++)
			value += A_sub[row][j] * B_sub[j][col];

		// Wait until every thread is done with this tile before it is overwritten.
		__syncthreads();
	}

	SetElements(subC, row, col, value);
}
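
As a rough illustration of why the shared-memory version is faster, the global loads each kernel issues can be counted from the loops above. This is a host-only C++ sketch with hypothetical names (naive_loads, tiled_loads); it assumes BLOCK_SIZE is 32, taken from the 32 x 32 threads per block mentioned earlier, and that MATRIX_SIZE is a multiple of BLOCK_SIZE.

```cpp
#include <cstdint>

// Loads issued by the kernel without shared memory: n * n threads, each
// reading one element of d_a and one of d_b in each of n loop iterations.
constexpr std::uint64_t naive_loads(std::uint64_t n) {
    return 2 * n * n * n;
}

// Loads issued by the shared-memory kernel: n * n threads, each reading one
// element of d_a and one of d_b per tile, over n / block tiles.
constexpr std::uint64_t tiled_loads(std::uint64_t n, std::uint64_t block) {
    return 2 * n * n * (n / block);
}

// With n = 1024 and block = 32, tiling issues 32 times fewer global loads.
static_assert(naive_loads(1024) == 32 * tiled_loads(1024, 32), "factor of BLOCK_SIZE");
```

These are counts of load requests, not bytes that actually cross the memory bus. Caching and coalescing decide how much of that reaches DRAM, so dividing these counts by the kernel time does not give a figure that is comparable with the card's rated bandwidth.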


I don’t know if your programs are correct or not, but your computed bandwidth does not exceed the theoretical bandwidth. Your recomputed times don’t change this fact. You might be achieving 1.79GB/s instead of 179MB/s with the recomputed time of 0.007s, but that still isn’t close to your theoretical bandwidth of 192GB/s.