Effective Bandwidth Problem

Hi all.

I have a CUDA kernel that multiplies two matrices which Width and Height are multiples of the blocksize i am using.

The Nvidia Quadro Fx 3800 I am using has a theoretical bandwidth of 50 Gb/s and I am having some strange results(Effective Bandwidth larger than Theoretical Bandwidth)

here are some results:

With Blocksize 2

[10][10] * [10][10] → BW=0,02 Gb/s

[1000][1000]*[1000][1000] → BW=69,4 Gb/s

With Blocksize 64

[1000][1000] * [1000][1000] → BW=486,4 Gb/s

[10000][10000] * [10000][10000] → BW= 45072,12 Gb/s

I took the effective bandwidth formula from the Nvidia Best Practices Guide(I have simplified it but its equivalent(unless there is a stupid mistake)). I think the kernel is fine as its very similar(if not equal) to some Nvidia Lectures I read and also because its working properly(afaik).

#define blocksize 64

#define HM (10000) 

#define WM (10000) 

#define WN (10000)

#define HN WM 

#define WP WN   

#define HP HM  

#define PTH WM

#define PTW HM

__global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)

   	{

	__shared__ float MS[blocksize][blocksize];

	__shared__ float NS[blocksize][blocksize];

	int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;

	int rowM=ty+by*blocksize;

	int colN=tx+bx*blocksize;

	int Pvalue=0;

	for(int m=0; m< uWM/blocksize;m++){

	    MS[ty][tx]=M[rowM*uWM+(m*blocksize+tx)];

	    NS[ty][tx]=M[colN + uWN*(m*blocksize+ty)];

	__syncthreads();

	    for(int k=0;k<blocksize;k++)

        	Pvalue+=MS[ty][k]*NS[k][tx];

	P[rowM*WP+colN]=Pvalue;

	}

	}

int main(){

	cudaEvent_t evstart, evstop;

	cudaEventCreate(&evstart);

	cudaEventCreate(&evstop);

	float*M=(float*)malloc(sizeof(float)*HM*WM);

	float*N=(float*)malloc(sizeof(float)*HN*WN);

	for(int i=0;i<WM*HM;i++)

	    M[i]=(float)i;

	for(int i=0;i<WN*HN;i++)

	    N[i]=(float)i;

	float*P=(float*)malloc(sizeof(float)*HP*WP);

	float *Md,*Nd,*Pd;

	cudaMalloc((void**)&Md,HM*WM*sizeof(float));

	cudaMalloc((void**)&Nd,HN*WN*sizeof(float));

	cudaMalloc((void**)&Pd,HP*WP*sizeof(float));

	cudaMemcpy(Md,M,HM*WM*sizeof(float),cudaMemcpyHostToDevice);

	cudaMemcpy(Nd,N,HN*WN*sizeof(float),cudaMemcpyHostToDevice);

	dim3 dimBlock(blocksize,blocksize);//(tile_width , tile_width);

	dim3 dimGrid(WN/dimBlock.x,HM/dimBlock.y);//(width/tile_width , width/tile_witdh);

	cudaEventRecord(evstart,0);

	nonsquare<<<dimGrid,dimBlock>>>(Md,Nd,Pd,WM,WN);

	cudaEventRecord(evstop,0);

	cudaEventSynchronize(evstop);

	float time;

	cudaEventElapsedTime(&time,evstart,evstop);

	cudaMemcpy(P,Pd,WP*HP*sizeof(float),cudaMemcpyDeviceToHost);

	cudaFree(Md);

	cudaFree(Nd);

	cudaFree(Pd);

	printf("\ntime spent:%f",time);

	float Bandwidth=(HM*WM*4+WN*HN*4+HP*WP*4)/(time*1000000);

	printf("\nEffective Bandwidth:%f Gb/s\n",Bandwidth);

    	}

How does the effective bandwidth surpasses the theoretical?

I would really appreciate your help as my life depends on this(almost litteraly), thanks in advance!

you can’t launch a 64x64 block, you’re timing nothing

Jesus…I was so blind that i could go to 512x512…So the maximum of a 2D block is 16x16 then…I am really sorry, i feel kind of dumb External Image.

That problem aside, there arent any problems with the bandwidth formula right?

tmurray, sorry but although what you said is true, the effective bandwidth for blocksize two in the 1000x1000 example is 69 Gb/s, which is larger than the theoretical 50 Gb/s

Your bandwidth formula is most probably wrong. I would guess compiler optimization will greatly reduce the number of writes done down to 1 per thread.

Hum, i rechecked the formula and i think its good, but i might be thinking wrong.

So the program does a matrix matrix multiplication, my bandwidth formula is then:

Bytes read= MWidthMHeight4(float) + NWidthNHeight4(float)
Bytes written= PWidthPHeight4(float)

And then i divide by “time*100000”(which contains the 10^9 bytes to Gb factor and the 10-³ milliseconds to seconds factor)

Is something wrong with this line of thought?

There is nothing wrong with that line of thought. I am questioning whether your kernel code actually does what your bandwidth formula assumes it does.

Sorry if this is completely stupid but:

I have tried a lot of different matrices to do the multiplication, and the results are fine.(I dont know if this implies that the kernel is doing what the bandwidth formula assumes. Also I’m sorry for my insecurity, but i had very little preparation before i started working with CUDA and before i start implementing the ART and OSEM algorithms i have to make sure im measuring the bandwidth right)

I am not saying there is necessarily anything wrong - I am just very skeptical that the code you posted performs as you say it does. Looking at the kernel again:

__global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)

{

    __shared__ float MS[blocksize][blocksize];

    __shared__ float NS[blocksize][blocksize];

int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;

    int rowM=ty+by*blocksize;

    int colN=tx+bx*blocksize;

    int Pvalue=0;

for(int m=0; m< uWM/blocksize;m++){

        MS[ty][tx]=M[rowM*uWM+(m*blocksize+tx)];

        NS[ty][tx]=M[colN + uWN*(m*blocksize+ty)];

        __syncthreads();

        for(int k=0;k<blocksize;k++)

            Pvalue+=MS[ty][k]*NS[k][tx];

        P[rowM*WP+colN]=Pvalue;

    }

}

If I am not mistaken, each thread will do 2 * (uWM/blocksize) 32 bit global memory reads, and (uWM/blocksize) 32 bit global memory writes. So that is 12 * (uWM/blocksize) bytes per thread. For the [1000x1000][1000x1000] case you quote the 69.2 Gb/s result for, you launch the kernel with uWM = WM = 1000 and blocksize = 2, which gives 24e3 bytes per block. Your code shows that the grid will be 500x500 for blocksize=2, so the total memory throughput for the kernel is 24e3 * 250000 = 6e9 bytes. Your bandwidth formula is assuming 12 * 1000 * 1000 = 12e6 bytes of throughput. So you are reporting more than the theoretical bandwidth of the card, despite underestimating the true memory transaction size of the kernel by a factor of 500.

Unless I have misinterpreted something, there is something fundamentally wrong with this picture.

EDIT: Out by a factor of 2 in a couple of places. Deleted one spurious bit.

I read your (great) post very carefully, and from what i understand you are saying that i am reporting an excessive bandwidth, but it should report something even higher. That is strange as the numerator of the bandwidth formula should decrease in order for the bandwidth to decrease to “real” levels. But yes i’ve confirmed your math and it is completely right.

I checked my kernel with the Programming Massively Parallel Processors book from David Kirk, and it was missing one __syncthreads() after the second for, also the Pvalue should be float, not int. Anyway, after those changes, the code is the same as the book and the Bandwidth still has very high values.

So if the kernel is right the only thing i can remember that can influence the result is the measured time, however i think i’ve placed all the CudaEvents right.

You have zero error checking in your host code. It is quite likely that the kernel is not running at all (or certainly not to completion) and your timing is correct, which leads to a completely meaningless result. I certainly wouldn’t call the kernel “right” either, certainly not with the execution parameters you are using.

Finally i have found the problem…

This is how i am compiling: “nvcc -arch sm_20 nonsquare.mat”. I am using sm_20 to “unlock” the 48K Shared memory(i found it somewhere on the internet), but when i compile using that command, the resulting matrix is not written(when i print it, it returns the last matrix that was well calculated), if i compile without using it i have the right result.

I will post some results:

blocksize 2 and 88 matrices: 0,002 Gb/s
blocksize 2 and 40
40 matrices: 0,13 Gb/s
blocksize 2 and 20002000 matrices: it looks like the program crashes
blocksize 16 and 2000
2000 matrices: 0,23 Gb/s
blocksize 16 and 160160 matrices: 1,13 Gb/s
blocksize 16 and 800
800 matrices: 0,7 Gb/s

So now I have real results but really bad ones. After the 2000x2000 test the computer just got very slow, so i dont know if the remaining tests were affected(probably yes). Anyway why is the “-arch sm_20” “damaging” my program?

Your card is a compute 1.3 device. Compiling with arch=sm_20 means you are compiling for the Fermi architecture, which won’t run on your card at all. If your code had some error checking in it, you would have instantly seen this (as well as the nonsense block size problem) and wasted a lot less of your and everyone else’s time.

I am really sorry for wasting your time =/ but as i said i am not a programmer(during my engineering graduation i have only had one C language course) so i dont know how to check for errors very well. So concluding, i have only access to 16 KB of shared memory External Image