I have 2 GTX280s doing some math calculations. When I run code on one of the GTX280s I get certain results. When I run the exact same code on the other GTX280 I get different values for a small number of results ( < 50 out of 35000). I was wondering if anyone else has experienced something similar.
I found I got different results on my same card when I overclocked it. But it wasn’t a hardware error! It was a race condition in my global memory writes, and different clock speeds made the race winners change.
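For anyone wondering what that kind of race looks like, here is a minimal sketch (kernel and variable names are made up, not from my actual code): many threads do a non-atomic read-modify-write on the same global word, so which write "wins" depends on timing, and timing shifts with clock speed.

```cuda
// Sketch of a global-memory race (illustrative, made-up kernel).
// Two threads whose keys hash to the same bin overwrite each other's
// update; which write lands last depends on scheduling and clocks.
__global__ void racy_histogram(const int *keys, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        bins[keys[i]]++;  // read-modify-write, not atomic: racy

    // atomicAdd(&bins[keys[i]], 1) instead of ++ makes the result
    // deterministic regardless of clock speed.
}
```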
You could diagnose your problem by comparing outputs vs the emulator, then cutting down your example to be as simple as possible. It certainly could be a bad board, but it’ll take some experiments to tell.
As an aside, it’d be cool if someone made a memory and execution tester in CUDA… kind of like memtest86 and Prime95 are used for CPUs.
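A bare-bones version of that idea is easy to sketch (this is just an illustration of the concept, nothing like a real memtest86): write an address-derived pattern into a buffer, read it back, and count mismatches.

```cuda
// Toy sketch of a CUDA memory tester (illustration only).
// Pass 1 writes a pattern derived from each cell's index;
// pass 2 reads it back and counts cells that come back wrong.
__global__ void write_pattern(unsigned int *buf, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = (unsigned int)i ^ 0xA5A5A5A5u;  // address-dependent pattern
}

__global__ void check_pattern(const unsigned int *buf, size_t n, int *errors)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && buf[i] != ((unsigned int)i ^ 0xA5A5A5A5u))
        atomicAdd(errors, 1);  // count every mismatched cell
}
```

A real tester would cycle through several patterns (all-zeros, all-ones, walking bits) and stress different access strides, but the skeleton is the same.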
If you are looking at any commercial angle, you should always go for Tesla! They are the ones certified for computation! (NVIDIA guys, correct me if I am wrong here.)
And yes, it would be a great idea to have a memtest tool! If someone already has one, please pass it on!
There should be a way to mmap the whole graphics memory into a Windows application (except the part currently used for the display) and test it. It would be slow, but it would definitely be helpful!
That might have something to do with it... the two cards are running at different clock speeds. However, when I changed threads per block from 512 to 256, the results became consistent and correct across both cards. Weird.
Have you tried comparing the outputs of your two cards with the output of the "emu" device? Since that path is far more thoroughly tested than your cards (obviously), you could use it to estimate the error rate.
The problem with emu mode is that math operations are done in double precision, as far as I can tell. I've found it quite rare for emu mode to return the same result as running on the card.
It won't, if you actually use single-precision arithmetic in your code, e.g., by appending f to your constants. There's no reason (short of MADs causing different results) that the card and emulation should return different results. If they do return different results and you ever compile with -arch sm_13, your performance will tank because the card will probably start using DP.
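To illustrate the point about the f suffix (a made-up kernel, just for demonstration): without it, the constant is a double, so the whole expression can be evaluated in double precision once you compile for an arch that has DP.

```cuda
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // 0.1 is a double constant: with -arch sm_13 this promotes the
    // multiply to double precision on the card (slow, and it rounds
    // differently than single precision).
    data[i] = data[i] * 0.1;

    // 0.1f keeps the whole expression in single precision.
    data[i] = data[i] * 0.1f;
}
```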
Yeah, I was wrong. Besides the MAD issue, here are some other reasons why execution on the card and -deviceemu will differ:
- If x87 registers are used to store intermediate computations, those computations happen in double precision, regardless of whether the variables are single precision.
- FDIV and FSQRT on the GPU are not IEEE-754 compliant for single precision.
- Transcendentals return different values (some transcendental functions map to double-precision versions in deviceemu).
- Toolchain differences result in different orders of operations (thanks to different optimizations), which can change results.
Most of these should have occurred to me, and I feel dumb for missing them... oh well.
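As a footnote on the MAD issue: the compiler is free to contract a multiply followed by an add into a single MAD, which rounds the intermediate product differently than two separately rounded IEEE operations. If you want to compare against emulation, the __fmul_rn/__fadd_rn intrinsics forbid that contraction (made-up kernel, just to show the pattern):

```cuda
__global__ void compare_mad(const float *a, const float *b,
                            const float *c, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // The compiler may fuse this into one MAD instruction, whose
    // truncated intermediate product can differ from emulation in
    // the last bits.
    float fused = a[i] * b[i] + c[i];

    // __fmul_rn / __fadd_rn are never contracted into a MAD, so this
    // behaves like two separately rounded single-precision operations.
    float separate = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);

    out[i] = fused - separate;  // frequently nonzero in the last bit
}
```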
The most important point of all: deviceemu does NOT emulate the hardware fully!! For example, in deviceemu there is no concept of warps! That can change a result to a great extent!
None of my code will yield correct results in deviceemu, not because of precision problems but because of the incorrect emulation!
This is the most important point regarding device emulation; everything else comes second!
Consider a single warp doing global_mem++. On the hardware, the global memory value will be increased by only 1. In deviceemu, it will be increased by 32, and so on...
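In code, that warp example looks like this (made-up kernel name, launched as a single warp of 32 threads): on hardware the threads of the warp read the same value in lockstep and overwrite each other, while deviceemu runs the threads one after another, so the same source gives two different answers.

```cuda
__global__ void warp_increment(int *g)
{
    // Launched as <<<1, 32>>>: on hardware all 32 threads of the warp
    // read *g at the same time, see the same old value, and each write
    // back old + 1, so *g goes up by only 1. deviceemu executes the
    // threads sequentially, so the very same line increases *g by 32.
    (*g)++;

    // atomicAdd(g, 1) would yield old + 32 on hardware and in emulation
    // alike, which is why atomics are the fix for this class of bug.
}
```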
I am having a similar problem. The only difference is that I run the same code on the same host and device, but each time I run it, it gives me different results: sometimes the correct one and sometimes a wrong one.
The code does some computations on a matrix. When I choose a small input matrix (up to 127), it gives me the right answer, but with larger matrices like 255 it gives different results.
Can anyone please help me?