Tesla C2050 slower than GeForce 8800?

Hi, fairly new to CUDA programming. I have a Linux box with 2 cards, a GeForce 8800 and a new Tesla C2050.

I ran the exact same code (the vectorAdd example), timed with the Linux time command, but the Tesla ran slower than the GeForce. Anyone have any idea why? Anybody have the same experience? One thing I noticed while grabbing the device properties is that the CUDA ver for the Tesla card is 2.0, while I'm running the CUDA 3.0 SDK… wonder if this has anything to do with it?

tia

Device Name - Tesla C2050 Vector addition
PASSED
Done
0.018u 1.365s 0:01.61 85.0% 0+0k 0+0io 0pf+0w

Device Name - GeForce 8800 Ultra Vector addition
PASSED
Done
0.012u 1.228s 0:01.46 84.2% 0+0k 0+0io 0pf+0w

Here are the device properties:
Device Name - Tesla C2050


Total Global Memory - 2751936 KB
Shared memory available per block - 48 KB
Number of registers per thread block - 32768
Warp size in threads - 32
Memory Pitch - 2147483647 bytes
Maximum threads per block - 1024
Maximum Thread Dimension (block) - 1024 1024 64
Maximum Thread Dimension (grid) - 65535 65535 1
Total constant memory - 65536 bytes
CUDA ver - 2.0
Clock rate - 1147000 KHz
Texture Alignment - 512 bytes
Device Overlap - Allowed
Number of Multi processors - 14

Device Name - GeForce 8800 Ultra


Total Global Memory - 785728 KB
Shared memory available per block - 16 KB
Number of registers per thread block - 8192
Warp size in threads - 32
Memory Pitch - 2147483647 bytes
Maximum threads per block - 512
Maximum Thread Dimension (block) - 512 512 64
Maximum Thread Dimension (grid) - 65535 65535 1
Total constant memory - 65536 bytes
CUDA ver - 1.0
Clock rate - 1512000 KHz
Texture Alignment - 256 bytes
Device Overlap - Not Allowed
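(For reference, a dump like the one above typically comes from cudaGetDeviceProperties. The "CUDA ver" lines report the device's compute capability, 2.0 for the C2050 and 1.0 for the 8800 Ultra; that is a hardware feature level, not the toolkit version, so seeing 2.0 while running the CUDA 3.0 SDK is normal. A minimal sketch, with the printf format purely illustrative:)

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major/prop.minor is the compute capability ("CUDA ver" above):
        // a hardware feature level, independent of the installed SDK version.
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}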

Try an example that does more expensive work and is not limited by PCIe speed.
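For what it's worth, timing the whole binary with the shell's time command also includes process startup, context creation, file I/O and the PCIe copies, which dwarf a trivial kernel like vectorAdd. A rough sketch of timing just the kernel with CUDA events, using a deliberately compute-heavy kernel (kernel and sizes are purely illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: far more arithmetic per element than vectorAdd,
// so it is compute-bound rather than PCIe- or bandwidth-bound.
__global__ void heavyKernel(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];
        for (int k = 0; k < 1000; ++k)
            x = x * 1.0000001f + 0.0000001f;
        b[i] = x;
    }
}

int main(void)
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc((void**)&d_a, n * sizeof(float));
    cudaMalloc((void**)&d_b, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    heavyKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // kernel time only, in milliseconds
    printf("kernel time: %.3f ms\n", ms);
    return 0;
}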

Are both cards connected via PCIe 2.0 x16? If not, it may be worth exchanging them so the fastest device also sits in the fastest slot.

Thanks. They are both PCIe 2.0 x16, but it looks like the device mapping went screwy, so we yanked the GeForce, rebooted, and that took care of it. However, the timed runs are still slower than with the GeForce. We're thinking the bottleneck is in the bus and file I/O.

I'm working on running a more computationally intensive example.

I am also noticing a 5-6 times slowdown for the exact same code (a MEX wrapper that calls the SDK's [i]DCT on an image). Performing a DCT with a GTX 275 would take 0.2544 seconds (all included: MATLAB -> MEX -> CUDA -> MEX -> MATLAB). With the exact same code and image, the C2050 now takes 1.6 seconds on average (!!).

The image is 2048x1536 converted to grey-scale.

I’ll check with the boss but I’d be glad to share the source-code with anyone interested.

Lastly, we might want to start a thread for such C2050 "slowdowns", as I suspect we'll see quite a few pop up as the hardware gets delivered…

Hi everybody,
I'm actually facing the same problem between a Tesla C2050 (448 cores) and a GTX 260 (128 cores), even though the C2050 is faster, with a speedup of only 1.2.
Has anyone found a solution for that problem?
I thought it might be the error control (ECC), which (I guess) should be disabled.
thanks for any answer,

kind regards

That roughly amounts to the ratio of memory bandwidths (roughly 144 GB/s on the C2050 versus about 112 GB/s on a GTX 260, and the C2050's effective bandwidth drops further with ECC enabled). If your kernel is memory-bandwidth limited, this is the expected result.

hi tera,
Thanks very much for your answer.

In fact, in my program I've considered only one kernel, where I have to update one matrix on the GPU (no communication with the host) with simple computation between shared memory and global memory.
This is why I expected a speedup of 2 and maybe more, since Tesla cards, I think, are dedicated to scientific computing.
I tried to increase the size of the matrix but I still get the same speedup.

kind regards
Mouh

Hi,

I am relatively new to CUDA development. I have some issues with global memory speed (I am using an NVIDIA Tesla C2050). I am achieving low read speed from global memory, about 15-20 GB/s (both according to my own metrics and the CUDA profiler). My accesses are coalesced and aligned to 32 bytes, and I am wondering what else might be an issue (any configuration issues?).

Thanks,
Eva
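Not an answer, but a minimal sketch of the kind of sanity check I'd run first: a plain coalesced copy kernel timed with CUDA events (sizes purely illustrative). On a C2050 this usually reports on the order of 100 GB/s (somewhat less with ECC on), so 15-20 GB/s would point at the kernel's access pattern or configuration rather than the card.

#include <cstdio>
#include <cuda_runtime.h>

// Fully coalesced read and write: each warp touches consecutive 4-byte words.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

int main(void)
{
    const int n = 1 << 24;                       // 64 MB in, 64 MB out
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb = 2.0 * n * sizeof(float) / 1e9;   // bytes read plus bytes written
    printf("effective bandwidth: %.1f GB/s\n", gb / (ms / 1e3));
    return 0;
}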

  1. make sure ECC is OFF. (NVIDIA CONTROL PANEL)
  2. make sure your 2050 is in TCC mode (use: nvidia-smi.exe --gpu=[2050_GPU_NUMBER] --driver-model=1)

Is TCC mode the same as ECC mode?

I also have a case where a Tesla C2050 is slower than a non-Fermi card (a GTX 295).

So far, I think, I have isolated the problem to PCI transfers. This is especially galling since the Tesla is in a server box and bandwidthTest says it's more than twice as fast as the PCI in my old PC (with the GTX 295).

A typical transfer is two arrays of 5780 bytes. Both int. Both pinned.

The C2050 performance log says

I think memtransferhostmemtype=1 confirms CUDA is using nonpaged “pinned” memory.

If the PCI was running flat out, each transfer would take 3.51 microseconds.

So I think the performance log is as expected. (Maybe the cputime is trying to tell me something?)

(The figures for the 295 GTX are:

)

cutilSafeCall( cudaThreadSynchronize() );           // make sure the GPU is idle first
cutilCheckError( cutResetTimer(hTimer) );
cutilCheckError( cutStartTimer(hTimer) );

// Two small host-to-device copies from pinned buffers.
cutilSafeCall( cudaMemcpy(d_I, I, len*sizeof(int), cudaMemcpyHostToDevice) );
cutilSafeCall( cudaMemcpy(d_y, Y, len*sizeof(int), cudaMemcpyHostToDevice) );

cutilSafeCall( cudaThreadSynchronize() );           // wait for the copies to finish
cutilCheckError( cutStopTimer(hTimer) );
const double gpuTimeUp = cutGetTimerValue(hTimer);  // host-side wall-clock time, in ms

However, using the above host code to measure the time as seen by the host gives gpuTimeUp = 0.262 milliseconds when run on the Fermi system, but 0.026 ms when run on the PC (GTX 295) system.

Both systems are running 64-bit CentOS (slightly different revs):

Linux 2.6.18-164.15.1.el5 (Tesla)

Linux 2.6.18-194.32.1.el5 (295)

In summary, (I think) the Fermi system is running slower because an average I/O across the PCI takes ten times as long, even though the C2050 is attached (according to bandwidthTest) via a faster PCI.

Any help or guidance would be most welcome.

Bill

No.

ECC stands for "Error Correction Code", which basically means that each memory access is wrapped by logic that makes sure the data is valid and corrects it if it is not (for example due to errors caused by heating of the memory chip). Naturally, this wrapping incurs a performance hit.

TCC stands for "Tesla Compute Cluster". In this mode the device is not attached to a display, and at least under Windows 7 this means it does not have to work under the WDDM (Windows Display Driver Model), which has various limitations.

I believe both ECC and TCC can be configured from the nvidia-smi tool (run nvidia-smi /? for more info).
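If it helps, both settings can also be read back from within a program via cudaGetDeviceProperties; the ECCEnabled and tccDriver fields exist in recent toolkits (older runtimes may lack them). A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // device 0, for illustration
    printf("%s: ECC %s, TCC driver %s\n",
           prop.name,
           prop.ECCEnabled ? "on" : "off",
           prop.tccDriver  ? "yes" : "no");
    return 0;
}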

Another cause for slowdown could be the migration to the Fermi architecture.

For example: if you have __mul24 calls in your code, these will take about 16x more time on the Fermi architecture, since they are translated to multiple instructions instead of one as was the case before Fermi.
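Purely as an illustration (the kernels are toy examples), index arithmetic that was hand-tuned with __mul24 for compute 1.x parts can simply use the ordinary 32-bit multiply on Fermi, where that is the native fast path:

// Tuned for compute 1.x: the 24-bit multiply was the fast path there,
// but on Fermi (compute 2.x) __mul24 is emulated with several instructions.
__global__ void oldStyle(float *data)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    data[i] *= 2.0f;
}

// On Fermi the plain 32-bit integer multiply is native and full speed.
__global__ void fermiFriendly(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}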

Check out the "CUDA C Best Practices Guide" for more information.

good luck!

I’m currently looking at transfers over the PCI.

The kernel times are a bit faster with the Tesla.

The two systems have different drivers: 256.40 on the PC and 160.19.26 on the Tesla server.

Still hunting…

Will do. Nothing leaps out yet.

Maybe I have to give each kernel launch more work to do, but that means restructuring the code (again :-(

Thank you.

Bill

I had a look at a related thread on the official NVIDIA forums.
It suggests the problem might lie with X11.
In the Fermi/server box there are two Tesla C2050s and a Quadro NVS 290.
I think none of them are connected to X11.
Avidday suggested the lack of X11 might be the problem, and running nvidia-smi -l -i 10 to keep the driver active. However, no luck with this yet.
Bill
PS: having nvidia-smi -l -i 10 running in the background does speed up nvidia-smi -a (3.5 secs → 0.010 s), but it does not speed up my program.

A link I found seems to suggest the problem is PCI transfer latency, and that it may be fixed by NVIDIA driver 270.18.
Does anyone have any experience of driver 270 with CUDA?
There seem to be multiple versions of 270.nn - maybe just go for the most recent?
Many thanks
Bill

I'm afraid I have not yet tried driver 270.18.

However, the application has been recoded to reduce the number of kernel calls from 12225 to 225 (still doing the same work, just in bigger chunks), and now the C2050 is 60% faster.
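For anyone hitting the same thing, a rough sketch of the kind of change this describes (names and the per-element work are purely illustrative): instead of launching once per small chunk, each launch covers a whole batch of chunks, so the fixed launch overhead is paid far fewer times.

// Before: one launch per chunk -> thousands of launches, each paying
// launch/driver latency (which was the dominant cost here).
// for (int c = 0; c < numChunks; ++c)
//     processChunk<<<blocks, threads>>>(d_data + c * chunkLen, chunkLen);

// After: one launch processes a whole batch of chunks.
__global__ void processBatch(float *data, int chunkLen, int chunksPerLaunch)
{
    int chunk = blockIdx.y;                              // one chunk per block row
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (chunk < chunksPerLaunch && i < chunkLen)
        data[chunk * chunkLen + i] *= 2.0f;              // stand-in for the real work
}

// Host side: far fewer launches for the same total work, e.g.
// dim3 grid((chunkLen + 255) / 256, chunksPerLaunch);
// processBatch<<<grid, 256>>>(d_data, chunkLen, chunksPerLaunch);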

Thank you

Bill

Dr. W. B. Langdon
Department of Computer Science, University College London
Gower Street, London WC1E 6BT, UK
http://www.cs.ucl.ac.uk/staff/W.Langdon/
Debugging CUDA: http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/debug_cuda.pdf