Performance issues with a new Tesla card - not getting best performance? Ways to check?

Sorry for the beginner-type question here, but we recently started working with a company
using a Tesla S1070-500 (we had previously designed our work around a Quadro FX 5600).

We noticed a few things and were hoping you could answer the following questions
(we are using CUDA 2.2):

  • Do we need to update our drivers/nvcc to 2.3? Do we get a big performance boost there?
  • We notice the very first data transfer to the Tesla takes an abnormally long 3000 ms or so.
    Is this a result of the drivers? Of using a double-sided card and not initializing it properly?
    Is the card in an error mode that we don't know how to detect? This is the biggest issue we're having;
    every subsequent data transfer to or from the GPU takes about what we'd expect it to take.
  • Is there an easy check to make sure the card is still working (that we didn't send it into an error mode
    or crash it altogether)?
  • Our code currently runs fine if we use float, but the numbers seem to become 0 if we switch to double.
    Isn't this card supposed to have double precision, and if so, has anyone encountered this kind of behavior/bug before?
  • Finally, if we build with -shared and have to use -Xcompiler -fPIC on our machine as well, will our .so have any linking problems?
    (I found a thread on this elsewhere and will examine it as well, but figured I'd just post all my questions in one thread;
    I didn't find the answers to these others through search, so please excuse me if the answers already exist.) Also, if it helps:
    can one use nvcc on a machine that doesn't have the card simply to compile the .cu sources with a large number of other C libraries
    and files, then transfer the executable to the machine with the Tesla card?

Thanks very much in advance - we've been struggling to find these answers (even by searching on here).

I've seen my code get a ~20% boost going from 2.0 to 2.2, and another small boost moving to 2.3.

That's probably the setting up of the context.

I usually just run one of the SDK samples to check that everything is fine and the GPU is working properly.
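Besides the SDK samples, a minimal sanity check can just query the runtime directly; if the card is in a bad state, the first API call usually returns an error. This is only a sketch (the file and variable names are my own, not from the thread):

```cuda
// sanity_check.cu -- build with: nvcc -o sanity_check sanity_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // A failure here usually means a driver or hardware problem.
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

An S1070 should report four devices of compute capability 1.3; anything else suggests the card or driver is not healthy.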

Are you sure you're using the appropriate -arch sm_13 compiler flag? Obviously doubles work on the GPU :)
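For reference, here is a minimal sketch of the kind of test that exposes the flag issue (kernel and file names are my own). Without -arch sm_13, older nvcc versions demote double to float for pre-1.3 targets, which can produce wrong results:

```cuda
// double_test.cu
// Build for compute capability 1.3 (required for native doubles on the S1070):
//   nvcc -arch sm_13 -o double_test double_test.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(double *x, double factor)
{
    x[threadIdx.x] *= factor;
}

int main()
{
    double h[4] = {1.0, 2.0, 3.0, 4.0};
    double *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    scale<<<1, 4>>>(d, 0.5);   // halve every element on the device

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%f %f %f %f\n", h[0], h[1], h[2], h[3]);
    cudaFree(d);
    return 0;
}
```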


I have noticed this as well, and I believe it is due to the driver/runtime being initialized for the first time. I only have experience running on a headless Linux machine, so results may vary depending on your setup.

Dr. Dobb's has some information that may be useful.

Even before the memcpy, you must have called cudaMalloc() - so your cudaMalloc() must have taken the initialization hit.

In any case, the advertised initial delay is only about 20 ms or so… Even if you multiply by all 4 GPUs in question, it is about 80 ms…

How much data are you copying?

Try using pinned memory and let the card DMA the data… that will considerably speed up the memcpy.
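As a sketch of what that looks like (buffer names and the 400,000-element size are taken from elsewhere in this thread; the rest is my own illustration): allocate the host buffer with cudaMallocHost() instead of malloc(), and the same cudaMemcpy() call will then use DMA from the page-locked buffer.

```cuda
// pinned_copy.cu -- pinned-memory host-to-device copy
#include <cuda_runtime.h>

int main()
{
    const size_t n = 400000;
    float *h_pinned, *d_data;

    // Page-locked (pinned) host allocation: cannot be paged out,
    // so the GPU can DMA directly from it.
    cudaMallocHost(&h_pinned, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    // This copy is typically much faster than one from pageable memory.
    cudaMemcpy(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    return 0;
}
```

Note that pinned memory is a limited resource; pinning very large buffers can degrade overall system performance.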

@Adam Simpson - Thanks for the link, very helpful

Ok - I will update my version soon

Good to know most seem to think this is context setup.

This is currently what I do as well

No I was not! -arch sm_13 solved it, thanks, somehow I missed that!

You are correct, my apologies - the first cudaMalloc call is probably causing the issue (I simply have a timer wrapped around both).
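One way to separate the one-time context cost from the copy itself is to force context creation first and time the two stages independently. This is only a sketch under that assumption (the cudaFree(0) idiom is a common way to trigger context initialization up front; the timer helper is my own):

```cuda
// init_timing.cu -- separate context-creation time from memcpy time
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

static double seconds()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main()
{
    double t0 = seconds();
    cudaFree(0);                       // pays the one-time context setup cost
    double t1 = seconds();

    float h[500] = {0};
    float *d;
    cudaMalloc(&d, sizeof(h));

    double t2 = seconds();
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    double t3 = seconds();

    printf("context init: %.3f s, memcpy: %.3f s\n", t1 - t0, t3 - t2);
    cudaFree(d);
    return 0;
}
```

If the large delay follows the cudaFree(0) rather than the memcpy, it is context setup and not a transfer problem.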

The amount of data varies and the delay stays roughly the same.

We are talking about an array of floats that has ~500 values vs. an array of floats that has about 400,000 values (or doubles in a different version).

If it is in fact context setup, that is fine; as others have said, they experience delays too, and this is OK with us. But if they are really seeing delays of only 80-100 ms and we are well over 1000 ms, that makes me think there is another problem? (I've adjusted my previous estimate of 3000 ms down to roughly 1500-2000 ms - still considerably larger.)

Pinned memory and DMA are not things I'd thought of - this is a great suggestion, thanks.

Thanks to everyone for their help!

If you run nvidia-smi in a loop, the driver stays initialized and the startup time will be faster.