Memory bandwidth

Hi,

I have a question about memory bandwidth when using CUDA. I wrote a simple test (see the code below) and got the following results: copying 256MB to the device takes about 490-500 ms, which works out to roughly 500MB/s of bandwidth. AFAIK, the peak bandwidth of the PCI-E 16x bus is about 5GB/s. In section 5.1 I found that in Beta version 0.8, the maximum observed bandwidth between system memory and device memory is 2 GB/s. Am I doing something wrong?

#include <stdio.h>
#include <cutil.h>

extern "C" void runTest(short* data, int width, int height);

#define DATA_WIDTH	1024*256
#define DATA_HEIGHT	1024

int main(int argc, char** argv)
{
	short *data = new short[DATA_WIDTH * DATA_HEIGHT];

	runTest(data, DATA_WIDTH, DATA_HEIGHT);

	delete[] data;
	CUT_EXIT(argc, argv);
}

extern "C" void runTest(short* data, int width, int height)
{
	CUT_CHECK_DEVICE();

	unsigned int hTimer;
	const unsigned int mem_size = sizeof(short) * width * height;

	short* d_data;
	CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, mem_size));

	cutCreateTimer(&hTimer);
	cutStartTimer(hTimer);
	CUDA_SAFE_CALL(cudaMemcpy(d_data, data, mem_size, cudaMemcpyHostToDevice));
	cutStopTimer(hTimer);

	CUT_CHECK_ERROR("Kernel execution failed");
	CUDA_SAFE_CALL(cudaMemcpy(data, d_data, mem_size, cudaMemcpyDeviceToHost));
	CUDA_SAFE_CALL(cudaFree(d_data));

	printf("gpu time: %f\n", cutGetTimerValue(hTimer));
}

The size of a short is 2 bytes, so your buffer is actually 512MB and you're getting about 1GB/s. Have you tried other array sizes (128MB, 512MB)? It'd be interesting to see what happens with those. You could also see whether cudaMalloc2D and cudaMemcpy2D give you any better performance.

Paulius

P.S. You may also want to check out the bandwidth testing Alex Tutubalin has carried out (http://blog.lexa.ru/2007/03/08/nvidia_8800gtx_propusknaja_sposobnost__pamjati_pri_ispol_zovanii_cuda.html). His tests aren't the same as yours, though - he's testing reads from global into shared memory.

The transfer time is also dependent on your CPU/motherboard.
Your code on my machine (Opteron 250, RHEL4 32-bit) runs in:

gpu time: 379.431000

Massimiliano

Yes, I made a mistake; the bandwidth is about 1GB/s. Thanks for the advice, I'll try running my test with different data sizes.

Massimiliano, thanks. I ran my test on a Pentium D 820 with a motherboard based on the i945 chipset. But anyway, memory bandwidth in my case is half of the maximum observed. Are there any dependencies on data type or size that could cause such poor results?

So, can anybody post their own results here? :)

Athlon64 X2 4600+ w/ 8800 GTX, Ubuntu 6.06
gpu time: 374.054993
(Edit: as per Mark’s suggestion, I should note this is an NForce4 motherboard)

gpu time: 388.231995
PentiumD 3GHz, 8800 GTX, openSuSE 10.2

Peter

Note you should also be comparing based on your motherboard chipset. Recent nForce chipsets will probably perform better than others.

Mark

I was also doing my own tests, when I found this. I’m using a Core 2 Duo 2.4GHz with a NV80 card from EVGA and an Asus P5LD2-VM (Intel 945G). I am running SUSE 10.1 with kernel 2.6.16.27-0.9-smp.

I got 357.86ms for the test, which means about 1.4GB/sec. Not that much, considering the 4 GB/s in each direction specified by the standard … I wonder where the limitation is …

Strange things came up when I started playing a little with the code. First, the speed seems to depend on whether or not the data is initialized. Second, host → device is faster than device → host.

Not initialized (as in the test posted above):
(host → device): 356.212006 = 1.403658GB/s
(device → host): 958.921021 = 0.521419GB/s

Initialized (all to 1.0 or something):
(host → device): 396.584991 = 1.260764GB/s
(device → host): 506.686005 = 0.986804GB/s

Why would that be?

I also computed the latency:
(host → device): 3209.617920 = 12.24us
(device → host): 3114.749023 = 11.885us

Has anyone else computed it? Of course, it's not as important as the bandwidth, but … it would be nice to know.

Regards,
Serban

The code:

#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>
#include <string.h>

#define DATA_WIDTH 1024*256
#define DATA_HEIGHT 1024
#define VALUE 1.0
#define FORMAT short

void runTest_bandwidth(FORMAT* h_data, int width, int height)
{
CUT_CHECK_DEVICE();
unsigned int hTimer;
const unsigned int mem_size = sizeof(FORMAT) * width * height;

// Allocate on device
FORMAT* d_data;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, mem_size));

// Run and time
cutCreateTimer(&hTimer);
cutStartTimer(hTimer);
CUDA_SAFE_CALL(cudaMemcpy(d_data, h_data, mem_size, cudaMemcpyHostToDevice));
cutStopTimer(hTimer);
printf("Bandwidth test (host -> device): %f = %fGB/s\n", cutGetTimerValue(hTimer), 1000.0 * mem_size / (1024.0 * 1024.0 * 1024.0 * cutGetTimerValue(hTimer)));

// Time again in the other direction
CUT_CHECK_ERROR("Kernel execution failed");
cutCreateTimer(&hTimer);
cutStartTimer(hTimer);
CUDA_SAFE_CALL(cudaMemcpy(h_data, d_data, mem_size, cudaMemcpyDeviceToHost));
cutStopTimer(hTimer);
printf("Bandwidth test (device -> host): %f = %fGB/s\n", cutGetTimerValue(hTimer), 1000.0 * mem_size / (1024.0 * 1024.0 * 1024.0 * cutGetTimerValue(hTimer)));

// Free the device buffer once, outside the timed region
CUDA_SAFE_CALL(cudaFree(d_data));
// Check
double sum = 0;
for (long i = 0; i < width * height; i++)
sum += h_data[i];
printf("CRC: %f\n", sum - VALUE * width * height);
}

void runTest_latency(FORMAT* h_data, int width)
{
CUT_CHECK_DEVICE();
unsigned int hTimer;

// Allocate on device
const unsigned int mem_size = sizeof(FORMAT) * 1;
FORMAT* d_data;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, mem_size));

// Run and time
cutCreateTimer(&hTimer);
cutStartTimer(hTimer);
for (long i = 0; i < width; i++)
  CUDA_SAFE_CALL(cudaMemcpy(d_data, h_data, mem_size, cudaMemcpyHostToDevice));
cutStopTimer(hTimer);
printf("Latency test (host -> device): %f = %fus\n", cutGetTimerValue(hTimer), 1000 * cutGetTimerValue(hTimer) / (width));

cutCreateTimer(&hTimer);
cutStartTimer(hTimer);
for (long i = 0; i < width; i++)
  CUDA_SAFE_CALL(cudaMemcpy(h_data, d_data, mem_size, cudaMemcpyDeviceToHost));
cutStopTimer(hTimer);
printf("Latency test (device -> host): %f = %fus\n", cutGetTimerValue(hTimer), 1000 * cutGetTimerValue(hTimer) / (width));


CUDA_SAFE_CALL(cudaFree(d_data));

}

int main(int argc, char** argv)
{
// Declare and allocate
FORMAT *h_vector;
h_vector = new FORMAT[DATA_WIDTH * DATA_HEIGHT];

// Init
for (long i = 0; i < DATA_WIDTH * DATA_HEIGHT; i++)
h_vector[i] = VALUE;

// Run tests
runTest_bandwidth(h_vector, DATA_WIDTH, DATA_HEIGHT);
runTest_latency(h_vector, DATA_WIDTH);

delete[] h_vector;
CUT_EXIT(argc, argv);
}

In the new examples (part of the SDK 0.8.1), there is a bandwidth test sample that demonstrates how to achieve over 3GB/second between the host and GPU using page-locked memory allocation.
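For anyone who doesn't have the SDK handy, the page-locked path boils down to allocating the host buffer with cudaMallocHost() instead of malloc/new. A minimal sketch (not the SDK sample itself; it reuses the cutil timer helpers from the code above and needs a CUDA-capable machine to run):

```c
#include <stdio.h>
#include <cutil.h>

// Minimal sketch of a pinned-memory host-to-device copy, assuming the
// 0.8.1-era CUDA runtime API (cudaMallocHost/cudaFreeHost).
void pinnedCopyTest(unsigned int mem_size)
{
	unsigned char *h_data, *d_data;
	unsigned int hTimer;

	// Page-locked host allocation: the driver can DMA straight from it,
	// skipping the staging copy a pageable buffer would need.
	CUDA_SAFE_CALL(cudaMallocHost((void**) &h_data, mem_size));
	CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, mem_size));

	cutCreateTimer(&hTimer);
	cutStartTimer(hTimer);
	CUDA_SAFE_CALL(cudaMemcpy(d_data, h_data, mem_size, cudaMemcpyHostToDevice));
	cutStopTimer(hTimer);

	printf("pinned host -> device: %f MB/s\n",
	       (mem_size / (1024.0 * 1024.0)) / (cutGetTimerValue(hTimer) / 1000.0));

	CUDA_SAFE_CALL(cudaFree(d_data));
	CUDA_SAFE_CALL(cudaFreeHost(h_data));
}
```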

Massimiliano

I tried the new “pinned” method and now I get about twice what I got before:

host → device = 2551.5

device → host = 1874.0

device → device = 10458.3

Thanks for the tip

Serban

That still looks a bit low. This could be due to your motherboard chipset and/or CPU.

The device->device bandwidth will be improved in the next CUDA Toolkit release.

Mark

For your info: I see for pageable mem

H->D 1334
D->H 1134
D->D 9564

and for pinned mem

H->D 3172
D->H 3227
D->D 9564

3.0 GHz P4 HT, mainboard NVIDIA CK804 chipset, 8800 GTX

Peter

Since the program is the one from the samples, it should be the chipset. There's both the difference in bandwidth and the fact that host2device is quite a bit faster than device2host. I shall try with another MB and see if anything changes.

Thanks,

Serban

I updated one of my own test kernels to use pinned memory like the NVIDIA sample code, and I ran it on our two multi-GPU CUDA test machines. Both are using the Asus P5N32-E SLI motherboards with Intel QX6700 (Intel Core 2 Quad) CPUs, one has 3 GPUs in it, and the other currently has two. (the only difference between them is the second machine doesn’t have a big enough PSU to run 3 GPUs yet…)

The pinned memory makes a HUGE difference in the performance of the GPU in the first 16x slot. The 8x slot is clearly limiting the performance of the pinned memory version. The second 16x slot doesn’t perform as well as the first one in these motherboards for some reason, but does show some benefit from the use of the pinned memory. I just ran these tests once each and didn’t bother with averaging numerous runs or varying block sizes yet, but they give you a good ballpark figure.

My tests were done on 256MB buffers.

As you can see below, both machines behave the same way performance-wise:

Machine 1 results:

CUDA device ID 0 (GeForce 8800GTX) is in the first 16x slot, it got these results:

  host to GPU copy bandwidth: 1619.50MB/sec, 1.58 seconds total

  PINNED host to GPU copy bandwidth: 3981.21MB/sec, 0.64 seconds total

  GPU to host copy bandwidth: 1597.27MB/sec, 1.60 seconds total

  PINNED GPU to host copy bandwidth: 3906.01MB/sec, 0.66 seconds total

CUDA device ID 1 (GeForce 8800GTX) is in an 8x slot, it got these results:

  host to GPU copy bandwidth: 1526.20MB/sec, 1.68 seconds total

  PINNED host to GPU copy bandwidth: 1590.57MB/sec, 1.61 seconds total

  GPU to host copy bandwidth: 1552.65MB/sec, 1.65 seconds total

  PINNED GPU to host copy bandwidth: 1637.62MB/sec, 1.56 seconds total

CUDA device ID 2 (GeForce 8800GTX) is in the second 16x slot, it got these results:

  host to GPU copy bandwidth: 1577.39MB/sec, 1.62 seconds total

  PINNED host to GPU copy bandwidth: 2162.57MB/sec, 1.18 seconds total

  GPU to host copy bandwidth: 1601.93MB/sec, 1.60 seconds total

  PINNED GPU to host copy bandwidth: 2101.01MB/sec, 1.22 seconds total

Machine 2:

CUDA device ID 0 (GeForce 8800GTX) is in the first 16x slot, it got these results:

  host to GPU copy bandwidth: 1681.38MB/sec, 1.52 seconds total

  PINNED host to GPU copy bandwidth: 3957.77MB/sec, 0.65 seconds total

  GPU to host copy bandwidth: 1621.61MB/sec, 1.58 seconds total

  PINNED GPU to host copy bandwidth: 3717.91MB/sec, 0.69 seconds total

CUDA device ID 1 (GeForce 8800GTX) is in an 8x slot, it got these results:

  host to GPU copy bandwidth: 1666.19MB/sec, 1.54 seconds total

  PINNED host to GPU copy bandwidth: 2162.73MB/sec, 1.18 seconds total

  GPU to host copy bandwidth: 1648.24MB/sec, 1.55 seconds total

  PINNED GPU to host copy bandwidth: 2102.59MB/sec, 1.22 seconds total

I’m curious what other people are seeing on their test machines, and what hardware they are using…

Cheers,

John Stone

Interesting results! Thanks John!

On my setup, ASUS Striker Extreme (680i SLI) + E6300 (no OC) + 8800GTX (in the 3rd slot, no display) + 2GB Transcend DDR2 800 (CL5) + Windows XP, I get these results (well… frankly speaking, they're not really impressive…):
quick mode, pinned memory
H->D: 2159.5
D->H: 2097.8
D->D: 8730.8

quick mode, pageable memory
H->D: 1574.2
D->H: 1629.1
D->D: 8729.5

Will CUDA on Linux perform better than CUDA on XP?

On the other hand, I'm actually also planning for dual or even triple 8800GTX… from John's results, it seems the 8x link doesn't hurt performance that much in the current release, since all the numbers are far from the theoretical peak anyway.

I was wondering if anyone has ever tried the 8800GTX on one of those server MBs… like the TYAN S2915, which has 2x16 + 2x8… that is, if we could remove the giant cooler on the 8800GTX and modify the plate a little bit, we could have up to four 8800GTX cards on a single MB!!! (OK, I know I'd need to combine two 1000W PSUs in one machine.)

In that case, it would be very, very, very IMPRESSIVE… provided the performance of the 8800GTX on these server boards is not far from that on desktop boards.

Any idea?

I forgot to mention that my tests were all run on Linux (RHEL4 update 4).
My best performance results with pinned memory buffers are very close to the peak performance that PCIe x16 slots are supposed to achieve in each direction (4 gigabytes per second), if I've correctly understood the PCIe hardware specs. I have no clue why the performance of the second x16 slot is so much lower than the first; perhaps one of our NVIDIA experts can shed light on how the PCIe bus layout might affect these things. I'm very curious to hear more performance numbers for machines with multiple GPUs, and whether other people observe differing performance levels for theoretically equivalent “x16” slots like I'm seeing.

Cheers,
John Stone

I’m curious if anyone else has tested memory bandwidth on multiple GPU hardware configurations and found a motherboard (or perhaps a Linux kernel?) that yields full speed on multiple slots? Using page-locked memory allocations, our test machines based on Asus P5N32-E SLI motherboards reach almost 4GB/sec for the primary PCIe x16 slot, but the other x16 slot doesn’t perform anywhere near as well at just over 2GB/sec, just a little over what we got for the x8 slot.

John

Is that with both x16 slots occupied? What is the bandwidth if you only have one card in the system in the second x16 slot?