Sorry for the beginner-type question here, but we recently started working with a company
using a Tesla S1070-500 (we had previously developed our work on a Quadro FX 5600).
We were noticing a few things and were hoping you could answer the following questions.
We're currently on CUDA 2.2 -
do we need to update our drivers/nvcc to 2.3? Would we get a big performance boost from that?
We notice the very first data transfer to the Tesla runs abnormally long, around 3000 ms.
Is this a result of the drivers? Is it a result of using a double-sided card and not initializing it properly?
Is the card in an error mode that we don't know how to detect? This is the biggest issue we're having;
every subsequent data transfer to or from the GPU takes about what we'd expect it to take.
Is there an easy check to make sure the card is still working (that we didn't send it into an error mode or crash it
altogether)?
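Not from the original poster, but a minimal sketch of such a check using the CUDA runtime API: enumerate devices, print device 0's properties, and read back the runtime's current error state with cudaGetLastError. This assumes the CUDA toolkit is installed and a device is attached; compile with nvcc.

```cuda
// Sanity-check sketch: is the device visible and is the runtime in an error state?
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d CUDA device(s) found\n", count);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    // After any kernel launch or memcpy, cudaGetLastError reports the
    // most recent error recorded by the runtime.
    err = cudaGetLastError();
    printf("current error state: %s\n", cudaGetErrorString(err));
    return 0;
}
```

If the card had genuinely crashed, the device query or the first runtime call would typically return an error here rather than succeeding silently.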
Our code currently runs fine if we use float, but values come back as 0 if we switch to double.
Isn't this card supposed to support double precision, and if so, has anyone encountered this kind of behavior/bug before?
Finally, since we build with -shared and have to pass -Xcompiler -fPIC on our machine as well, will our .so have any linking problems?
(I found a thread on this elsewhere and will examine it as well, but figured I'd just post all my questions in one thread;
I didn't find answers to these other ones through search, so please excuse me if the answers already exist.) And if it helps:
can one use nvcc on a machine that doesn't have the card, simply to compile the .cu sources along with a large number of other C libraries
and files, and then transfer the executable to the machine with the Tesla card?
Thanks very much in advance - we’ve been struggling to find these answers (even from searching on here).
I have noticed this as well and I believe it is due to the driver/runtime being initialized for the first time. I only have experience running on a headless Linux machine, so results may vary depending on your setup.
Good to know that most seem to think this is context setup.
This is currently what I do as well
No, I was not! -arch sm_13 solved it; thanks, somehow I missed that!
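For anyone hitting the same doubles-become-zero symptom: double precision requires compute capability 1.3 hardware, and with the 2.x toolchain nvcc targets sm_10 by default, silently demoting double to float unless you ask for sm_13. A sketch of the compile line (file name is illustrative):

```shell
# Build for compute capability 1.3 so double precision is actually used;
# without this flag, nvcc 2.x compiles for sm_10 and demotes doubles to float.
nvcc -arch=sm_13 -O2 -c kernel.cu -o kernel.o
```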
You are correct, my apologies: the first cudaMalloc call is probably causing the issue (I simply have a timer wrapped around both).
The amount of data varies and the delay stays roughly the same.
We are talking about an array of floats with ~500 values vs. an array of floats with about 400,000 values (or doubles in a different version).
If it is in fact context setup, that is fine; as others have said, they experience delays too, and this is OK with us. But if they are really seeing delays of only 80-100 ms and we are well over 1000 ms, that makes me think there is another problem. (I've adjusted my previous estimate of 3000 ms to roughly 1500-2000 ms, still considerably larger.)
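One way to separate context-creation cost from transfer cost, sketched below under the assumption that the first runtime call triggers context initialization (array size is illustrative, error checking omitted for brevity):

```cuda
// Sketch: force context creation before timing, so the measured
// cudaMemcpy reflects only the transfer, not one-time initialization.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // cudaFree(0) is a common idiom to trigger context initialization
    // up front; the one-time delay lands here instead of in the timing.
    cudaFree(0);

    const size_t n = 400000;
    float *h = (float *)malloc(n * sizeof(float));
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("transfer took %.3f ms\n", ms);

    cudaFree(d);
    free(h);
    return 0;
}
```

If the timed copy drops to the expected tens of milliseconds after the warm-up call, the 1500-2000 ms was context setup rather than a hardware problem.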
Pinned memory and DMA are not things I'd thought of; this is a great suggestion, thanks.
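For reference, a minimal sketch of the pinned-memory idea (sizes illustrative, error checking omitted): allocate page-locked host memory with cudaMallocHost so the runtime can DMA directly from it, which typically speeds up host-to-device copies compared to pageable malloc'd memory.

```cuda
// Sketch: use page-locked (pinned) host memory for faster DMA transfers.
#include <cuda_runtime.h>

int main() {
    const size_t n = 400000;

    // Page-locked host allocation; copies from this buffer can use DMA
    // directly instead of staging through a pageable-memory copy.
    float *h_pinned = 0;
    cudaMallocHost((void **)&h_pinned, n * sizeof(float));

    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d);
    cudaFreeHost(h_pinned);   // pinned memory must be freed with cudaFreeHost
    return 0;
}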