Why does my program's bandwidth exceed the theoretical bandwidth?

Hello guys:

I just started getting into the details of CUDA programming. I wrote a 1024 x 1024 matrix multiplication where each block has 32 x 32 threads. I ran the program (Win32 mode) as a Release build in Visual Studio 2012 Ultimate, and it reports about 7 ms for my kernel function (I used shared memory). My device has compute capability 3.0 (EVGA GeForce GTX 680 Classified 4G). So I'm calculating the bandwidth this way:
matrix_size * matrix_size * 32bit / 8 / 1024 / 1024 = 4GB, then 4GB / 0.07s = 528GB/s. However, I checked the whitepaper for my card and it lists only 192.25GB/s. Did I do something wrong? BTW, I synchronized after the kernel function and used cudaEvent to measure the execution time.
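
Roughly, my timing code follows this pattern (a simplified sketch, not my exact code; grid, block and the device pointers are set up elsewhere):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MatrixMulSharedMem<<<grid, block>>>(d_a, d_b, d_c);   // kernel being timed
cudaEventRecord(stop);

cudaEventSynchronize(stop);                           // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);               // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);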

Any ideas?

Thank you so much!

“Win32 Mode”

this is something i would flag in the back of my head, particularly when things do not work as planned

regardless, the underlying assumption is indeed that the memory copies actually took place
perhaps catch errors prior to and after the kernel launch to verify that the kernel indeed launched
thereafter, look at the output, and note whether there are any patterns in it that may suggest what is going wrong
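
for example, something along these lines (a generic sketch of the usual error-checking pattern, not your actual code; printf needs <cstdio>):

cudaError_t err = cudaGetLastError();                 // clear/report any earlier error
if (err != cudaSuccess) printf("pre-launch error: %s\n", cudaGetErrorString(err));

MatrixMul<<<grid, block>>>(d_a, d_b, d_c);

err = cudaGetLastError();                             // launch/configuration errors
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();                        // errors during kernel execution
if (err != cudaSuccess) printf("execution error: %s\n", cudaGetErrorString(err));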

and i think i do not necessarily agree with your calculation - i would think it is more in the region of 8 or 12 GB to transfer; perhaps i am mistaken

matrix_size * matrix_size * 32bit / 8 / 1024 / 1024 = 4GB

??

This is 4MB not 4GB:

matrix_size * matrix_size * 32bit / 8

1024*1024 = 1M

And this is 4, i.e. 4MB:

matrix_size * matrix_size * 32bit / 8 / 1024 / 1024

Not sure how you are coming up with 4GB

Yeah, thanks man. Just after I posted my topic I realized I had made a wrong calculation, but I still can't get the bandwidth calculation right. I think the data volume for matrix multiplication should be:

There are 2 reads of the data from global memory, so the read volume should be 2n^3, right? And 1 write of the data back to global memory, which is n^2. So that's 2n^3 + n^2 reads/writes of global memory, right?

And then I just calculate (2*n^3 + n^2) bits / sec, but the result still exceeds the theoretical bandwidth.

No, 2*n^3 is not correct. I'm not sure that bits/sec plays into this anywhere either.

An nxn matrix has n^2 elements, not n^3 (perhaps you are thinking of the number of computations - that is a separate subject from bandwidth).

Reading two nxn matrices requires 2*n^2 reads of the element type. For a float that would be 4 bytes per element.

Since we have two reads and one write, the bandwidth would be 3*n^2 * 4 bytes / 0.07s = 12,582,912 bytes / 0.07s = 179,755,885 bytes/s, or approximately 179MB/s.

This is well under your max theoretical bandwidth of 192GB/s (approximately 1,000 times smaller).

If your element type is double, you would just double the above numbers.

Although this methodology may be lacking in several respects, the fact that your required bandwidth is so much lower than the theoretical available bandwidth suggests that your problem may not be bandwidth bound, and indeed we do not expect matrix multiply problems to be bandwidth bound.
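
To make the arithmetic concrete, here is a small stand-alone sketch of the effective-bandwidth calculation (the 3*n^2 data volume and the 0.07s time are just the numbers from above, not measurements):

#include <stdio.h>

int main(void)
{
    const double n = 1024.0;                                      // matrix dimension
    const double bytes_per_element = 4.0;                         // int or float
    const double bytes_moved = 3.0 * n * n * bytes_per_element;   // 2 reads + 1 write
    const double seconds = 0.07;                                  // kernel time used above

    // effective bandwidth = bytes moved / elapsed time
    printf("effective bandwidth: %.1f MB/s\n", bytes_moved / seconds / 1e6);
    return 0;
}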

Thanks. Is there anything that could make my bandwidth exceed the theoretical bandwidth?

I just re-measured the execution times: 0.019s for the matrix multiplication without shared memory, and 0.007s for the version with shared memory.

The non-shared-memory version is very simple, like in most CUDA tutorials:
__global__ void MatrixMul(int* d_a, int* d_b, int* d_c)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = row * MATRIX_SIZE + col;
    int sum = 0;

    if( row >= MATRIX_SIZE || col >= MATRIX_SIZE ) return;   // bounds check (>= rather than >)

    for(int i = 0; i != MATRIX_SIZE; i++){
        sum += d_a[row * MATRIX_SIZE + i] * d_b[i * MATRIX_SIZE + col];
    }

    d_c[idx] = sum;
}

And the shared-memory version is as follows:

__device__ int GetElement(const int* d_a, int row, int col)
{
    return d_a[row * MATRIX_SIZE + col];
}

__device__ void SetElements(int* d_a, int row, int col, int value)
{
    d_a[row * MATRIX_SIZE + col] = value;
}

__device__ int* GetSubMatrix(int* d_a, int row, int col)
{
    // pointer to the top-left element of the BLOCK_SIZE x BLOCK_SIZE tile at (row, col)
    return &d_a[MATRIX_SIZE * BLOCK_SIZE * row + BLOCK_SIZE * col];
}

// Use shared memory
__global__ void MatrixMulSharedMem(int* d_a, int* d_b, int* d_c)
{
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // output tile this block is responsible for
    int* subC = GetSubMatrix(d_c, blockRow, blockCol);

    int value = 0;

    int row = threadIdx.y;
    int col = threadIdx.x;

    for(int i = 0; i != (MATRIX_SIZE / BLOCK_SIZE); i++)
    {
        int* subA = GetSubMatrix(d_a, blockRow, i);
        int* subB = GetSubMatrix(d_b, i, blockCol);

        __shared__ int A_sub[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ int B_sub[BLOCK_SIZE][BLOCK_SIZE];

        // each thread loads one element of each tile into shared memory
        A_sub[row][col] = GetElement(subA, row, col);
        B_sub[row][col] = GetElement(subB, row, col);

        __syncthreads();   // wait until the tiles are fully loaded

        for(int j = 0; j != BLOCK_SIZE; j++)
        {
            value += A_sub[row][j] * B_sub[j][col];
        }

        __syncthreads();   // make sure the tiles are done being read before the next iteration overwrites them
    }

    SetElements(subC, row, col, value);
}
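
For completeness, I launch the kernels roughly like this (simplified; allocation and host-to-device copies are omitted):

dim3 block(BLOCK_SIZE, BLOCK_SIZE);                              // 32 x 32 threads per block
dim3 grid(MATRIX_SIZE / BLOCK_SIZE, MATRIX_SIZE / BLOCK_SIZE);   // 1024 / 32 = 32 blocks per dimension

MatrixMul<<<grid, block>>>(d_a, d_b, d_c);                       // non-shared-memory version
MatrixMulSharedMem<<<grid, block>>>(d_a, d_b, d_c);              // shared-memory version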

I don’t know if your programs are correct or not, but your computed bandwidth does not exceed the theoretical bandwidth. Your recomputed times don’t change this fact. You might be achieving 1.79GB/s instead of 179MB/s with the recomputed time of 0.007s, but that still isn’t close to your theoretical bandwidth of 192GB/s.