P1000 vs T1000, same driver, different behaviour

I have programmed an iterative solver in CUDA using the BiCGSTAB and PBiCGSTAB methods. Given a dataset, this code will sometimes converge, and sometimes it won’t. Different behaviour between runs, same dataset. And when it does converge, the time and number of iterations the solver needs to converge vary drastically (±50% on both time and iterations).

This holds true when running on a P1000 and a 3070 Ti (Mobile), but NOT on a T1000 (8GB), which always requires the EXACT same number of iterations to find a solution. When it does converge on the other two GPUs, the solutions are the same for all three GPUs. The T1000 and the P1000 even run on the same driver. The T1000 we are using does not have ECC (I believe some support that feature).

There are no random values in the code, and I see perfect reproducibility across all platforms when running my CG numerical methods, which tells me it has something to do with the numerical instabilities inherent to the BiCGSTAB method. What I don’t understand is why this affects some GPUs but not others.

Can someone explain to me why this is happening? I understand that BiCGSTAB is not guaranteed to converge, but why would it converge on one run and not the next? I also understand CUDA does not guarantee reproducibility, but why then is it fully reproducible (even across two different PCs) on the T1000? Does the T1000’s architecture offer different features which reduce numerical errors? I have been unable to find anything like that in any publicly available datasheets.

Some possibilities:

  1. a race condition in your code
  2. use of atomics or anything else susceptible to varying ordering
  3. (of course, defects in CUDA, the GPU, or other system issues are always a possibility)

I never heard that statement until now. Perhaps you mean things like atomics, or anything else where CUDA does not guarantee execution order, such as the order of execution of threads, warps, or threadblocks. If such ordering is important, you either need to redesign to an algorithm that is independent of that ordering, or else provide explicit ordering in your code.

I do not use atomics in these methods. I either use built-in NVIDIA functions, or kernels which use entirely independent memory addresses where no atomic operations occur. Order of operations shouldn’t be an issue with this method.
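For illustration (names made up, not my actual code), the independent-address kernels I mean look something like this axpy-style update, where each thread reads and writes only its own element:

```cpp
// Hypothetical example of an elementwise update: every thread touches only index i,
// so no two threads ever write the same address and no atomics are needed.
__global__ void axpy(float a, const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];   // result for element i does not depend on any other thread
}
```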

I just went to find the statement about reproducibility in the documentation. I see now I misread it, and the statement was about reproducibility across toolkit versions. Does that mean CUDA does guarantee reproducibility?

And do you have any information on what might be unique about the T1000?

When multiple atomicAdd() statements are run by separate threads, the atomicAdd itself provides no guarantee of ordering amongst the threads doing the atomicAdd. That is, suppose a float atomic location in memory contains the value 0. Now suppose there are 1 million threads that each want to do an atomicAdd of 1 to that location, and one thread that wants to do an atomicAdd of 32 million to that location.

The atomicAdd guarantees that only one thread at a time will do its atomic update (add). It does not guarantee the order of operations (which thread goes first, or second, or third…) among those threads. So if all the threads that want to add 1 happen to run first, the final value in the atomic location will be 33 million. However, if the thread that wants to add 32 million goes first, the final value in that atomic location will be 32 million, because once the accumulator holds 32 million, adding 1 in float has no effect: 32,000,001 is not representable in single precision and rounds back to 32,000,000.

This is due to characteristics of floating-point arithmetic. It is not unique or specific to CUDA, at least as far as the dependence on order of operations goes.

When you have multiple atomic updates to the same location (at the same time), CUDA provides no guarantees of the order in which those updates will be performed.
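To make the scenario above concrete, here is a minimal sketch (hypothetical kernel, error checking omitted) that you can run several times; the printed result will land somewhere between roughly 32,000,000 and 33,000,000 depending on how many of the small adds happen to execute before the large one:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One float accumulator: 1,000,000 threads add 1.0f, one thread adds 32,000,000.0f.
// atomicAdd guarantees atomicity, not ordering, and once the accumulator holds
// ~32 million, further +1.0f additions are lost to rounding.
__global__ void orderDependentSum(float *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = (i == 0) ? 32000000.0f : 1.0f;
        atomicAdd(acc, v);
    }
}

int main()
{
    const int n = 1000001;
    float h = 0.0f, *d;
    cudaMalloc(&d, sizeof(float));
    cudaMemcpy(d, &h, sizeof(float), cudaMemcpyHostToDevice);
    orderDependentSum<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("result = %f\n", h);   // varies run to run with the scheduling order
    cudaFree(d);
    return 0;
}
```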

Such a blanket statement is not something I would address directly. We’ve already seen that code, and probably other factors, matter. I would say that if we look at the results of a single thread in CUDA, only, I would generally expect reproducibility, run-to-run, for the same initial conditions, barring obvious disqualifiers such as using uninitialized memory or invoking UB in some other fashion.

Well, you’re assuming it’s unique because you have compared it to two other GPUs (from what I can see here). It might not be unique in a larger test. Anyway, what makes the T1000 obviously different for things like order of execution of threads, warps, and blocks (which can lead to a different order of operations for atomics, for example) is the fact that the number of SMs (and probably other related factors, like maximum threads per SM) in a T1000 is different than in many other GPUs, like the P1000.
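If you want to see that hardware difference directly, the standard device-properties query reports it; something like this (a minimal sketch) prints the SM count and the per-SM thread limit:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("%s: %d SMs, max %d threads per SM\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```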

I just checked my code, and it turns out I no longer use the atomic functions. The only atomic functions which could be left in my code are those in the native-CUDA functions, but I imagine those are written by far more intelligent people than I to avoid race conditions. Can the race conditions you mentioned with the floating-point arithmetic occur within the native-CUDA functions as well?

I wouldn’t be able to pinpoint what may be happening in your code. Just commenting based on my experience.

I understand, I revised my reply. Can you tell me if the race conditions you mentioned with the floating-point arithmetic can occur within the native-CUDA functions as well? Or will my code be “safe”, so to speak, if I restrict myself to the built-in functions?

Race conditions occur in the context of multiple threads executing. Some native CUDA functions involve multiple threads (e.g. cooperative groups), some do not (e.g. the CUDA Math API). I wouldn’t try to answer a question like that in a vacuum. I have seen people pull functions out of CUDA implementation details that are clearly not documented or intended for general use; I don’t know what you are referring to exactly.

I would not expect the possibility of a race condition solely from use of a math api function. Nor would I expect the possibility of race conditions from proper use of cooperative groups, up to the warp level, at least.
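As an illustration of what I mean by proper use up to the warp level, here is a sketch (hypothetical kernel, CUDA 11+ for cg::reduce, block size assumed to be a multiple of 32) of a warp-wide reduction with cooperative groups; every lane of the tile participates, nothing outside the tile is shared, and the combination pattern within the tile is fixed, so I would not expect a race condition or ordering ambiguity here:

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// Each warp sums 32 consecutive elements; lane 0 of each warp writes one result.
__global__ void warpSums(const float *in, float *out, int n)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    float s = cg::reduce(warp, v, cg::plus<float>());   // warp-wide sum

    if (warp.thread_rank() == 0)
        out[i / 32] = s;
}
```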

If so, the most likely cause of the discrepancies observed is a race condition, use of uninitialized data, or an out-of-bounds access somewhere in the code, most likely your code. This includes both host and device code. I would suggest use of compute-sanitizer as a first-line check of the device code, and valgrind or a similar tool for the host code. Note that these tools cannot find all possible issues, but they can find a good portion of them. You may need to instrument your code, stop at the first discrepancy observed, and work your way backwards from there.

There is a non-zero chance that there are minor differences in device function intrinsics, such as __sqrtf(), __expf(), __sinf(), between GPU architectures. To my knowledge NVIDIA only specifies a particular accuracy level for these, but makes no specific claim of bit-wise identical results across GPU architectures. However, I would consider the likelihood of such differences tiny; by observation, the functionality of the special function unit of the GPU appears to be carried forward unchanged between GPU generations.

There is also a non-zero chance that the architecture-specific optimizer in the CUDA compiler re-associates floating-point expressions differently for different GPU architectures. Since floating-point arithmetic is not associative, this could cause numerical results to differ. The CUDA compiler is conservative when it comes to evaluating floating-point expressions: pretty much the only re-association it applies is to merge FMUL and dependent FADD into FMA (fused multiply-add). I cannot recall a case of merging differences between GPU architectures (say for a*b+c*d, where two different merges are possible), so I consider the likelihood tiny. FMA-merging can be turned off by programmers with the command line switch --fmad=false. Note that this can adversely affect performance and overall accuracy of computations.
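For example, for the expression shape mentioned above (a sketch only), the compiler may contract either product into the FMA, and the resulting rounding can differ in the last bit from the unfused evaluation produced with --fmad=false:

```cpp
__global__ void sumOfProducts(const float *a, const float *b,
                              const float *c, const float *d, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // With FMA contraction this may compile to fmaf(a[i],b[i], c[i]*d[i]) or
        // fmaf(c[i],d[i], a[i]*b[i]); the fused product is not rounded separately,
        // so the two choices (and the unfused --fmad=false version) can differ
        // in the least significant bit.
        out[i] = a[i] * b[i] + c[i] * d[i];
    }
}
```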

In the absence of bugs and of atomics, and when sticking to the same hardware with the same toolchain and the same compilation switches for the same source code, floating-point results in CUDA code should be entirely reproducible.

compute-sanitizer also has a number of sub-tools that are useful for this sort of investigation, including synccheck, racecheck, and initcheck. Note that racecheck is not a universal race condition “finder”; it identifies race conditions associated with hazardous use of shared memory.

CUDA debugging is the topic of unit 12 of this online tutorial series.
