Effective Bandwidth Problem

X-FrEaK · March 22, 2011, 9:58pm

Hi all.

I have a CUDA kernel that multiplies two matrices which Width and Height are multiples of the blocksize i am using.

The Nvidia Quadro Fx 3800 I am using has a theoretical bandwidth of 50 Gb/s and I am having some strange results(Effective Bandwidth larger than Theoretical Bandwidth)

here are some results:

With Blocksize 2

[10][10] * [10][10] → BW=0,02 Gb/s

[1000][1000]*[1000][1000] → BW=69,4 Gb/s

With Blocksize 64

[1000][1000] * [1000][1000] → BW=486,4 Gb/s

[10000][10000] * [10000][10000] → BW= 45072,12 Gb/s

I took the effective bandwidth formula from the Nvidia Best Practices Guide(I have simplified it but its equivalent(unless there is a stupid mistake)). I think the kernel is fine as its very similar(if not equal) to some Nvidia Lectures I read and also because its working properly(afaik).

#define blocksize 64

#define HM (10000) 

#define WM (10000) 

#define WN (10000)

#define HN WM 

#define WP WN   

#define HP HM  

#define PTH WM

#define PTW HM

__global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)

   	{

	__shared__ float MS[blocksize][blocksize];

	__shared__ float NS[blocksize][blocksize];

	int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;

	int rowM=ty+by*blocksize;

	int colN=tx+bx*blocksize;

	int Pvalue=0;

	for(int m=0; m< uWM/blocksize;m++){

	    MS[ty][tx]=M[rowM*uWM+(m*blocksize+tx)];

	    NS[ty][tx]=M[colN + uWN*(m*blocksize+ty)];

	__syncthreads();

	    for(int k=0;k<blocksize;k++)

        	Pvalue+=MS[ty][k]*NS[k][tx];

	P[rowM*WP+colN]=Pvalue;

	}

	}

int main(){

	cudaEvent_t evstart, evstop;

	cudaEventCreate(&evstart);

	cudaEventCreate(&evstop);

	float*M=(float*)malloc(sizeof(float)*HM*WM);

	float*N=(float*)malloc(sizeof(float)*HN*WN);

	for(int i=0;i<WM*HM;i++)

	    M[i]=(float)i;

	for(int i=0;i<WN*HN;i++)

	    N[i]=(float)i;

	float*P=(float*)malloc(sizeof(float)*HP*WP);

	float *Md,*Nd,*Pd;

	cudaMalloc((void**)&Md,HM*WM*sizeof(float));

	cudaMalloc((void**)&Nd,HN*WN*sizeof(float));

	cudaMalloc((void**)&Pd,HP*WP*sizeof(float));

	cudaMemcpy(Md,M,HM*WM*sizeof(float),cudaMemcpyHostToDevice);

	cudaMemcpy(Nd,N,HN*WN*sizeof(float),cudaMemcpyHostToDevice);

	dim3 dimBlock(blocksize,blocksize);//(tile_width , tile_width);

	dim3 dimGrid(WN/dimBlock.x,HM/dimBlock.y);//(width/tile_width , width/tile_witdh);

	cudaEventRecord(evstart,0);

	nonsquare<<<dimGrid,dimBlock>>>(Md,Nd,Pd,WM,WN);

	cudaEventRecord(evstop,0);

	cudaEventSynchronize(evstop);

	float time;

	cudaEventElapsedTime(&time,evstart,evstop);

	cudaMemcpy(P,Pd,WP*HP*sizeof(float),cudaMemcpyDeviceToHost);

	cudaFree(Md);

	cudaFree(Nd);

	cudaFree(Pd);

	printf("\ntime spent:%f",time);

	float Bandwidth=(HM*WM*4+WN*HN*4+HP*WP*4)/(time*1000000);

	printf("\nEffective Bandwidth:%f Gb/s\n",Bandwidth);

    	}

How does the effective bandwidth surpasses the theoretical?

I would really appreciate your help as my life depends on this(almost litteraly), thanks in advance!

tmurray · March 22, 2011, 10:36pm

you can’t launch a 64x64 block, you’re timing nothing

X-FrEaK · March 22, 2011, 10:50pm

Jesus…I was so blind that i could go to 512x512…So the maximum of a 2D block is 16x16 then…I am really sorry, i feel kind of dumb External Image.

That problem aside, there arent any problems with the bandwidth formula right?

X-FrEaK · March 23, 2011, 9:35am

tmurray, sorry but although what you said is true, the effective bandwidth for blocksize two in the 1000x1000 example is 69 Gb/s, which is larger than the theoretical 50 Gb/s

avidday · March 23, 2011, 10:17am

Your bandwidth formula is most probably wrong. I would guess compiler optimization will greatly reduce the number of writes done down to 1 per thread.

X-FrEaK · March 23, 2011, 11:19am

Hum, i rechecked the formula and i think its good, but i might be thinking wrong.

So the program does a matrix matrix multiplication, my bandwidth formula is then:

Bytes read= MWidthMHeight4(float) + NWidthNHeight4(float)
Bytes written= PWidthPHeight4(float)

And then i divide by “time*100000”(which contains the 10^9 bytes to Gb factor and the 10-Â³ milliseconds to seconds factor)

Is something wrong with this line of thought?

avidday · March 23, 2011, 11:55am

There is nothing wrong with that line of thought. I am questioning whether your kernel code actually does what your bandwidth formula assumes it does.

X-FrEaK · March 23, 2011, 12:08pm

Sorry if this is completely stupid but:

I have tried a lot of different matrices to do the multiplication, and the results are fine.(I dont know if this implies that the kernel is doing what the bandwidth formula assumes. Also I’m sorry for my insecurity, but i had very little preparation before i started working with CUDA and before i start implementing the ART and OSEM algorithms i have to make sure im measuring the bandwidth right)

avidday · March 23, 2011, 1:02pm

I am not saying there is necessarily anything wrong - I am just very skeptical that the code you posted performs as you say it does. Looking at the kernel again:

__global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)

{

    __shared__ float MS[blocksize][blocksize];

    __shared__ float NS[blocksize][blocksize];

int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;

    int rowM=ty+by*blocksize;

    int colN=tx+bx*blocksize;

    int Pvalue=0;

for(int m=0; m< uWM/blocksize;m++){

        MS[ty][tx]=M[rowM*uWM+(m*blocksize+tx)];

        NS[ty][tx]=M[colN + uWN*(m*blocksize+ty)];

        __syncthreads();

        for(int k=0;k<blocksize;k++)

            Pvalue+=MS[ty][k]*NS[k][tx];

        P[rowM*WP+colN]=Pvalue;

    }

}

If I am not mistaken, each thread will do 2 * (uWM/blocksize) 32 bit global memory reads, and (uWM/blocksize) 32 bit global memory writes. So that is 12 * (uWM/blocksize) bytes per thread. For the [1000x1000][1000x1000] case you quote the 69.2 Gb/s result for, you launch the kernel with uWM = WM = 1000 and blocksize = 2, which gives 24e3 bytes per block. Your code shows that the grid will be 500x500 for blocksize=2, so the total memory throughput for the kernel is 24e3 * 250000 = 6e9 bytes. Your bandwidth formula is assuming 12 * 1000 * 1000 = 12e6 bytes of throughput. So you are reporting more than the theoretical bandwidth of the card, despite underestimating the true memory transaction size of the kernel by a factor of 500.

Unless I have misinterpreted something, there is something fundamentally wrong with this picture.

EDIT: Out by a factor of 2 in a couple of places. Deleted one spurious bit.

X-FrEaK · March 23, 2011, 2:56pm

I read your (great) post very carefully, and from what i understand you are saying that i am reporting an excessive bandwidth, but it should report something even higher. That is strange as the numerator of the bandwidth formula should decrease in order for the bandwidth to decrease to “real” levels. But yes i’ve confirmed your math and it is completely right.

I checked my kernel with the Programming Massively Parallel Processors book from David Kirk, and it was missing one __syncthreads() after the second for, also the Pvalue should be float, not int. Anyway, after those changes, the code is the same as the book and the Bandwidth still has very high values.

So if the kernel is right the only thing i can remember that can influence the result is the measured time, however i think i’ve placed all the CudaEvents right.

avidday · March 23, 2011, 3:06pm

You have zero error checking in your host code. It is quite likely that the kernel is not running at all (or certainly not to completion) and your timing is correct, which leads to a completely meaningless result. I certainly wouldn’t call the kernel “right” either, certainly not with the execution parameters you are using.

X-FrEaK · March 23, 2011, 4:02pm

Finally i have found the problem…

This is how i am compiling: “nvcc -arch sm_20 nonsquare.mat”. I am using sm_20 to “unlock” the 48K Shared memory(i found it somewhere on the internet), but when i compile using that command, the resulting matrix is not written(when i print it, it returns the last matrix that was well calculated), if i compile without using it i have the right result.

I will post some results:

blocksize 2 and 88 matrices: 0,002 Gb/s
blocksize 2 and 4040 matrices: 0,13 Gb/s
blocksize 2 and 20002000 matrices: it looks like the program crashes
blocksize 16 and 20002000 matrices: 0,23 Gb/s
blocksize 16 and 160160 matrices: 1,13 Gb/s
blocksize 16 and 800800 matrices: 0,7 Gb/s

So now I have real results but really bad ones. After the 2000x2000 test the computer just got very slow, so i dont know if the remaining tests were affected(probably yes). Anyway why is the “-arch sm_20” “damaging” my program?

avidday · March 23, 2011, 4:10pm

Your card is a compute 1.3 device. Compiling with arch=sm_20 means you are compiling for the Fermi architecture, which won’t run on your card at all. If your code had some error checking in it, you would have instantly seen this (as well as the nonsense block size problem) and wasted a lot less of your and everyone else’s time.

X-FrEaK · March 23, 2011, 4:23pm

I am really sorry for wasting your time =/ but as i said i am not a programmer(during my engineering graduation i have only had one C language course) so i dont know how to check for errors very well. So concluding, i have only access to 16 KB of shared memory External Image

Topic		Replies	Views
Why my program bandwidth exceeds the standard bandwidth? CUDA Programming and Performance	6	1028	April 3, 2015
Bandwidth calculation Newbie question... CUDA Programming and Performance	10	5422	August 1, 2008
Copy performance on kernel CUDA Programming and Performance	3	3166	December 20, 2007
Maximum bandwith? CUDA Programming and Performance	4	4444	April 16, 2008
Performance doubts CUDA Programming and Performance	0	411	March 16, 2011
Matrix multiplication ERRORS & few thoughts on CUDA Basic programming errors need correction CUDA Programming and Performance	14	13340	January 24, 2009
Is my bandwidth calculation right? bandwidth CUDA Programming and Performance	3	1475	November 13, 2009
Speed-up and bandwidth CUDA Programming and Performance	12	9842	May 4, 2008
global memory bandwidth problem CUDA Programming and Performance	4	1427	March 2, 2010
How to get peak rate with simple opeartion Question about performance optimization CUDA Programming and Performance	17	13668	June 2, 2008

Effective Bandwidth Problem

Related topics