Device Memory Bandwidth

You7878 · March 23, 2010, 5:15pm

I have two different kernels. The first just performs a copy. The second performs a copy plus a division. The measured bandwidth of the second kernel seems to be higher. How is that possible?

For the first kernel I got 58723 MB/s (the official figure is 57.6 GB/s). For the second kernel I got 80744 MB/s. Device: 8800GT.

[codebox]

// Copy kernel: each thread copies one float from eli1 to out (eli2 is unused).
extern "C" __global__ void TestFunctionGPU1(float *eli1, float *eli2, float *out, uint size)
{
    uint tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) out[tid] = eli1[tid];
}

[/codebox]

[codebox]

// Copy + division kernel: each thread reads one float from eli1 and one from eli2.
extern "C" __global__ void TestFunctionGPU1(float *eli1, float *eli2, float *out, uint size)
{
    uint tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) out[tid] = eli1[tid] / eli2[tid];
}

[/codebox]

gshi · March 23, 2010, 5:33pm

If your number is higher than the theoretical peak, then there is something wrong with your measurement.

Did you synchronize (cudaThreadSynchronize()) after your kernel before measuring the time?


You7878 · March 23, 2010, 7:06pm

I did the same in both cases.

[codebox]

DateTime start = DateTime.Now;
int numIterations = 1000;

for (int i = 0; i < numIterations; i++)
{
    cuda.Launch(function, (int)(size + BlockSize - 1) / BlockSize, 1);
}

cuda.SynchronizeContext();

float Time = (float)(DateTime.Now - start).TotalMilliseconds;
Console.WriteLine("Bandwidth: {0} Mb/s\n", size * sizeof(int) * Time / numIterations);

[/codebox]

eelsen · March 23, 2010, 8:27pm


This is a guess, since I’m not sure what cuda.Launch is, but: Are you launching with 1 thread per block? Did you mean to use BlockSize as the number of threads instead of 1?

You7878 · March 23, 2010, 8:31pm

BlockSize is the number of threads per block. It is defined as 256.

eelsen · March 24, 2010, 12:28am

Yeah, I realize that. But the way you are calling the cuda.Launch function, it looks like you are passing 1 as the number of threads, when you should be passing BlockSize.

You7878 · March 24, 2010, 6:14am

But I have a thread block of size 256 x 1. I do not use a 2D block here.

You7878 · March 24, 2010, 9:27am

Well, the problem was with the formula: size * sizeof(int) * Time / numIterations
It should be: 0.000001 * numIterations * size * sizeof(int) / Time (with Time in milliseconds, this gives GB/s).

Now I get the following results: 60 GB/s for both kernels on a GTX 275.
Why is that so far from the theoretical peak (127 GB/s) or from the figure reported by the bandwidth test (105000 MB/s)?

Skybuck · March 29, 2015, 3:17am

The GPU RAM is probably bottlenecked by the GPU processor ;)
