Lower than expected bandwidth on C2050

I just received my new bright and shiny Fermi C2050 last week and was
impressed with the performance gains over our existing GTX 260s and C1060s
in both single and especially double precision. While going through the
examples in the SDK, I noticed that the device->device bandwidth as reported
by the bandwidthTest in the CUDA SDK is rather low. The bandwidthTest is
reporting ~77GB/sec with ECC enabled and ~86GB/sec with ECC disabled. The
specifications on NVIDIA’s site show a peak theoretical bandwidth of
144GB/sec, and while I don’t expect to see all of that, I expected at least
120GB/second from this card. Are there known performance issues with the
current latest drivers and the new Teslas? I’m using driver 195.36.15 on
Centos 5.4 x86_64. Does anyone with a Fermi have any bandwidth numbers to report?
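
For reference, the kind of number bandwidthTest reports can be approximated with a few lines of CUDA (a minimal sketch of a timed device-to-device cudaMemcpy, not the SDK's actual implementation; the 32 MB buffer size and iteration count are just assumptions):

[codebox]
// Rough device-to-device bandwidth check: time repeated cudaMemcpy calls
// with CUDA events and count each copy as one read plus one write.
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 * 1024 * 1024;   // assumed buffer size
    const int    iters = 100;                // assumed iteration count

    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each device-to-device copy reads and writes the buffer once.
    double gbps = (2.0 * bytes * iters) / (ms * 1.0e6);
    printf("Device to device: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
[/codebox]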

The internal memcpys are still not fully optimized for GF100.

Btw, what is your speed up?

For kernels that are computationally bound (in DP), I’m seeing a speedup of around 4.

Is this still going on? I have been seeing some performance issues on my cards as well.

On the C2050 I have ECC enabled.

If I run bandwidthTest from 3.1 sdk on C2050 I get:

Device to Device Bandwidth, 1 Device(s)

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 79301.8

Whereas on C1060 I get:

Device to Device Bandwidth, 1 Device(s)

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 73639.6

Neither of these numbers seems correct; shouldn’t they be closer to the theoretical peaks? Both devices are in the same machine, which actually holds 4 GPUs:

System: Velocity Micro D5400XS, 1-socket quad-core Xeon, 4 x 2800 MHz, 16 GB RAM, Red Hat Linux 5.0, kernel 2.6.18-8.el5, 64-bit system

GPUs: 1 x Quadro FX 370, 1 x Tesla C1060, 2 x Tesla C2050 (Fermi)
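
Since the box has several GPUs, it is worth double-checking which device index the tests are actually hitting; a quick enumeration with the runtime API (my own sketch, nothing SDK-specific) is:

[codebox]
// List all CUDA devices so the right index is passed to the test programs.
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %.0f MB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024.0 * 1024.0));
    }
    return 0;
}
[/codebox]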

Any reason for this? Shouldn’t it be possible to just launch a kernel that does the memcpy, pulling the implementation directly from something already optimized, like thrust?

I did a few tests comparing thrust::copy, a CUDA kernel, and cudaMemcpy to copy values; I’ll attach the code I used below. I get between 75 and 98 GB/s. I do have ECC enabled, so that could be hurting performance here. Interestingly, the plain double copy kernel is faster than cudaMemcpy.

My copy kernel is:

[codebox]template <typename Value>
__global__ void copy_kernel(const unsigned int N,
                            const Value * src, Value * dst)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        dst[i] = src[i];
    }
}
[/codebox]

Size: 76.29394531 MB using M = 1024^2 (10,000,000 double-precision numbers).

Transfer size using M = 1000^2: 2 x 10M x 8 bytes = 160 MB (1 read and 1 write for each double-precision number; for the other types I match the total size by varying the element count).

The GB/s figures below use G = 1000^3.

[codebox]
Operation        time (ms)   GB/s
thrust::copy     1.9131      83.63
cudaMemcpy       1.912       83.68

Simple CUDA kernel copying arrays of different data types:
double           1.6469      97.15
double2          1.7633      90.73
float4           1.7619      90.81
float            2.1492      74.45
[/codebox]
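
As a sanity check on the unit conventions (my own arithmetic, not part of the benchmark output), the conversion from a measured time to the GB/s column is just:

[codebox]
// bytes moved = 1 read + 1 write of 10,000,000 doubles = 2 x 10M x 8 B = 160 MB (M = 1000^2)
// GB/s (G = 1000^3) = bytes / (time_ms * 1e6)
// e.g. the thrust::copy row: 160e6 / (1.9131 * 1e6) = 83.6 GB/s
double gb_per_s(double bytes, double time_ms)
{
    return bytes / (time_ms * 1.0e6);
}
[/codebox]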

I used the performance timer from the cusp library (http://code.google.com/p/cusp-library/), and the thrust::copy call from Thrust.

I am a bit surprised that the simple copy kernel with 64-bit values is the fastest. What is the explanation for this behaviour? Any ideas?

I compiled with:

nvcc -O3 -o copy -arch=sm_20 vector_permute.cu

I also tried disabling the L1 cache, but that didn’t change the performance.

vector_copy.cu (3.26 KB)

I updated the copy program so the only dependency is now Thrust. There was a bug in the old one: it launched too many threads for the larger data types. I would be interested to see what numbers other people get. I’m using a C2050 with ECC enabled, compiled with nvcc 3.1 on a 64-bit machine using:

nvcc -O3 -o copy -arch=sm_20 vector_copy.cu

The test is the same as before: transferring a fixed-size chunk of data using thrust::copy, cudaMemcpy, and a simple CUDA kernel that copies arrays of different data types (float, float2, double, etc.) with the same overall memory size.

My results are:

rostrup@teralab-25:/remote/t3dev1/rostrup/projects/thrust_egs/permutation % ./copy 0 8000000

Running on Device 0: “Tesla C2050”

Time in ms:

N thrust memcpy float float2 float4 double double2

8000000 1.5267 1.5240 1.7131 1.3206 1.2740 1.3228 1.2762

Transfer rate in GB/s (1 Read and 1 Write):

MB thrust memcpy float float2 float4 double double2

128 83.8 84.0 74.7 96.9 100.5 96.8 100.3

So the maximum I get is 100 GB/s, and that is when loading and storing 128-bit values (float4 or double2).

I am wondering a few things:

    [*]whether I have a configuration problem on my card, which is why I don’t get closer to peak (144 GB/s); a quick device-property check is sketched after this list

    [*]how much ECC cuts down on bandwidth (I don’t have access to my machine, so I can’t enable/disable ECC easily)

    [*]whether there are any other optimizations I could do to bring the benchmark closer to peak
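
A minimal sketch of such a check (my own code, assuming your toolkit’s cudaDeviceProp exposes the ECCEnabled, memoryClockRate and memoryBusWidth fields; the factor of 2 in the peak estimate is the usual DDR assumption):

[codebox]
// Report ECC state and an estimated theoretical peak bandwidth for one device.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char ** argv)
{
    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // memoryClockRate is in kHz, memoryBusWidth in bits; x2 for DDR transfers.
    double peak_gb_s = 2.0 * prop.memoryClockRate * 1000.0
                       * (prop.memoryBusWidth / 8.0) / 1.0e9;

    printf("Device %d: %s\n", dev, prop.name);
    printf("ECC enabled: %s\n", prop.ECCEnabled ? "yes" : "no");
    printf("Estimated peak bandwidth: %.1f GB/s\n", peak_gb_s);
    return 0;
}
[/codebox]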

Thanks

Scott

I’m having trouble with the attachment uploader:

[indent]Upload failed. Please ask the administrator to check the settings and permissions[/indent]

so I’m pasting the file in a codebox, sorry:

[codebox]//Uses the thrust containers and algorithms to copy data
//Time performance to estimate GB/s read and write
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/copy.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

//From Nvidia spmv
void set_device(int dev)
{
    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("\nRunning on Device %d: \"%s\"\n", dev, deviceProp.name);
}

//**************************************************
//cusp/performance/timer.h
/*
 * Copyright 2008-2009 NVIDIA Corporation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
//#pragma once

// A simple timer class based on CUDA events
#include <cuda.h>

class timer
{
    cudaEvent_t start;
    cudaEvent_t end;

public:
    timer()
    {
        cudaEventCreate(&start);
        cudaEventCreate(&end);
        cudaEventRecord(start,0);
    }

    ~timer()
    {
        cudaEventDestroy(start);
        cudaEventDestroy(end);
    }

    float milliseconds_elapsed()
    {
        float elapsed_time;
        cudaEventRecord(end, 0);
        cudaEventSynchronize(end);
        cudaEventElapsedTime(&elapsed_time, start, end);
        return elapsed_time;
    }

    float seconds_elapsed()
    {
        return milliseconds_elapsed() / 1000.0;
    }
};
//**************************************************

double random_double()
{
    return (rand()/(RAND_MAX+1.0));
}

template <typename Value>
__global__ void copy_kernel(const unsigned int N,
                            const Value * src, Value * dst)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        dst[i] = src[i];
    }
}

//Time is in ms
//MB/ms -> GB/s when using SI units
//size is in Bytes
double gb_s(const unsigned int size, double time)
{
    return ((double)size/1000000.0)/time;
}

int main(int argc, char ** argv)
{
    //Get a device id and vector length from the command line
    if(argc!=3)
    {
        printf("Usage: ./copy device vec_length\n");
        exit(1);
    }
    int device = atoi(argv[1]);
    set_device(device);
    unsigned int N = atoi(argv[2]);
    srand(time(NULL));

    //Declare two vectors of random doubles on the host
    thrust::host_vector<double> hx(N);
    thrust::host_vector<double> hy(N);
    thrust::generate(hx.begin(),hx.end(),random_double);
    thrust::generate(hy.begin(),hy.end(),random_double);

    //Copy them to the device
    thrust::device_vector<double> dx(hx.begin(),hx.end());
    thrust::device_vector<double> dy(hy.begin(),hy.end());

    //Warmup
    timer one_copy;
    thrust::copy(dx.begin(),dx.end(),dy.begin());
    double warm_up = one_copy.milliseconds_elapsed();
    //printf("Warm Up Time = %.3f ms\n",warm_up);

    //Timing thrust::copy
    int num_iterations = 10;
    timer copy_timer;
    for(int i=0;i<num_iterations;i++)
        thrust::copy(dx.begin(),dx.end(),dy.begin());
    double copy_time = copy_timer.milliseconds_elapsed()/(double)num_iterations;

    //Timing copy kernel
    //Maximum threads per block: 1024 on sm_20, 512 on sm_13
    const int NUM_THREADS = 1024;
    int num_blocks = (N+NUM_THREADS-1)/NUM_THREADS;
    double * dx_raw = thrust::raw_pointer_cast(&dx[0]);
    double * dy_raw = thrust::raw_pointer_cast(&dy[0]);

    //Set L1 affinity
    //cudaFuncSetCacheConfig(scatter_kernel<int,double>,cudaFuncCachePreferL1);
    //cudaFuncSetCacheConfig(gather_kernel<int,double>,cudaFuncCachePreferL1);

    timer copy_double_timer;
    for(int i=0;i<num_iterations;i++)
        copy_kernel<double><<<num_blocks,NUM_THREADS>>>(N,dx_raw,dy_raw);
    double copy_double_time = copy_double_timer.milliseconds_elapsed()/(double)num_iterations;

    float * fx_raw = (float*)dx_raw;
    float * fy_raw = (float*)dy_raw;
    num_blocks = (2*N+NUM_THREADS-1)/NUM_THREADS;
    timer copy_float_timer;
    for(int i=0;i<num_iterations;i++)
        copy_kernel<float><<<num_blocks,NUM_THREADS>>>(2*N,fx_raw,fy_raw);
    double copy_float_time = copy_float_timer.milliseconds_elapsed()/(double)num_iterations;

    float2 * fx2_raw = (float2*)dx_raw;
    float2 * fy2_raw = (float2*)dy_raw;
    num_blocks = (N+NUM_THREADS-1)/NUM_THREADS;
    timer copy_float2_timer;
    for(int i=0;i<num_iterations;i++)
        copy_kernel<float2><<<num_blocks,NUM_THREADS>>>(N,fx2_raw,fy2_raw);
    double copy_float2_time = copy_float2_timer.milliseconds_elapsed()/(double)num_iterations;

    double2 * dx2_raw = (double2*)dx_raw;
    double2 * dy2_raw = (double2*)dy_raw;
    num_blocks = (N/2+NUM_THREADS-1)/NUM_THREADS;
    timer copy_double2_timer;
    for(int i=0;i<num_iterations;i++)
        copy_kernel<double2><<<num_blocks,NUM_THREADS>>>(N/2,dx2_raw,dy2_raw);
    double copy_double2_time = copy_double2_timer.milliseconds_elapsed()/(double)num_iterations;

    float4 * dx4_raw = (float4*)dx_raw;
    float4 * dy4_raw = (float4*)dy_raw;
    num_blocks = (N/2+NUM_THREADS-1)/NUM_THREADS;
    timer copy_float4_timer;
    for(int i=0;i<num_iterations;i++)
        copy_kernel<float4><<<num_blocks,NUM_THREADS>>>(N/2,dx4_raw,dy4_raw);
    double copy_float4_time = copy_float4_timer.milliseconds_elapsed()/(double)num_iterations;

    timer copy_memcpy_timer;
    for(int i=0;i<num_iterations;++i)
        cudaMemcpy(dx_raw,dy_raw,sizeof(double)*N,cudaMemcpyDeviceToDevice);
    double copy_memcpy_time = copy_memcpy_timer.milliseconds_elapsed()/(double)num_iterations;

    //Print times in ms
    printf("Time in ms:\n");
    printf("N thrust memcpy float float2 float4 double double2\n");
    printf("%d %.4f %.4f %.4f %.4f %.4f %.4f %.4f\n\n",N,copy_time,copy_memcpy_time,
           copy_float_time,copy_float2_time,copy_float4_time,
           copy_double_time,copy_double2_time);

    //Print GB/s in SI units
    printf("Transfer rate in GB/s (1 Read and 1 Write):\n");
    printf("10^6 B thrust memcpy float float2 float4 double double2\n");
    unsigned int size = 2*N*sizeof(double);
    printf("%d %.1f %.1f %.1f %.1f %.1f %.1f %.1f\n",size/1000000,
           gb_s(size,copy_time),gb_s(size,copy_memcpy_time),
           gb_s(size,copy_float_time),gb_s(size,copy_float2_time),
           gb_s(size,copy_float4_time),gb_s(size,copy_double_time),
           gb_s(size,copy_double2_time));
}
[/codebox]

Frankly, the bandwidth is disappointing! Not only is the theoretical improvement over the GTX 280 so-so (384 instead of 512 bit width, so no wonder), the C2050 with its cranked-down clocks is a total miss. My factory-overclocked GTX 280s give only somewhat slower transfer rates than my GTX 480, and a little better than the C2050… terrible for a wide class of bandwidth-bound applications.

Buy a GTX 480, not a C2050.

Previous experience tells me that you will see better performance if you use fewer threads (maybe ~64) and have each thread take care of many memory transfers…

So try something like:

[codebox]
// e.g. 16384 elements per block AND 64 threads per block
#pragma unroll
for(int i = 0; i < 16384; i += 64)
{
    dst[threadIdx.x + blockIdx.x*16384 + i] = src[threadIdx.x + blockIdx.x*16384 + i];
}
[/codebox]

I posted some reduction code before that achieved over 90% of peak bandwidth on Fermi cards using a similar approach.
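
Wrapped into a complete kernel, the pattern above might look like the sketch below (my own version of the idea; THREADS_PER_BLOCK, ELEMS_PER_BLOCK and the launch line are assumptions, and N is assumed to be a multiple of ELEMS_PER_BLOCK for simplicity):

[codebox]
// Each block copies a contiguous chunk of ELEMS_PER_BLOCK elements,
// with only 64 threads per block striding through the chunk.
#define THREADS_PER_BLOCK 64
#define ELEMS_PER_BLOCK   16384

template <typename Value>
__global__ void chunked_copy_kernel(const Value * src, Value * dst)
{
    const unsigned int base = blockIdx.x * ELEMS_PER_BLOCK + threadIdx.x;

    #pragma unroll
    for (int i = 0; i < ELEMS_PER_BLOCK; i += THREADS_PER_BLOCK)
    {
        dst[base + i] = src[base + i];
    }
}

// Launch (N must be a multiple of ELEMS_PER_BLOCK in this sketch):
//   int num_blocks = N / ELEMS_PER_BLOCK;
//   chunked_copy_kernel<double><<<num_blocks, THREADS_PER_BLOCK>>>(src, dst);
[/codebox]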
