I must admit I’d never heard of these memory consistency models before now (I’m guessing they apply more to distributed programming than to parallel programming), but I’d say CUDA is a mix of a weak memory consistency model and a sequential consistency model, depending on which type of memory you’re referring to: shared, constant, texture, and global memory all behave differently, have different rules, and in some cases may or may not be guaranteed to be visible to other threads, blocks, or kernels depending on the circumstances.
Honestly though, it sounds like you haven’t read the CUDA Programming Guide, which will answer all of your questions, and more.
Thanks for your answer. I have been reading the CUDA Programming Guide and could not find an answer; that is why I decided to post this topic.
Maybe I should have given an example:
Imagine you have 2 variables x and y that are initially 0, then:
Thread1 executes: x=1; a=y; (in that order)
Thread2 executes: y=1; b=x; (in that order)
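This can be sketched as a CUDA kernel (a minimal sketch: the variable names match the example above, both threads are assumed to run in the same kernel launch, and the variables live in global memory, declared volatile so the compiler does not cache them in registers):

```cuda
// Classic "store buffering" litmus test, written as a CUDA kernel.
// x and y are shared flags; a and b record what each thread observed.
__device__ volatile int x = 0, y = 0;
__device__ int a = 0, b = 0;

__global__ void litmus(void)
{
    if (threadIdx.x == 0) {          // Thread1
        x = 1;                       // store first...
        a = y;                       // ...then load the other flag
    } else if (threadIdx.x == 1) {   // Thread2
        y = 1;
        b = x;
    }
}
```

Whether a run can end with a == 0 and b == 0 is exactly the question: under sequential consistency it cannot, under a weak model it can.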
In a sequentially consistent memory model, after you execute the code it would be impossible to find that a and b are both equal to 0. You may get one of the following after the execution:

a=1, b=0
a=0, b=1
a=1, b=1

The last case would mean that Thread1 and Thread2 executed the code at exactly the same time (completely in parallel).
This would be a strong memory consistency model.
On the other hand, there are some systems where it is possible to find that a and b are both equal to 0. This can happen because the compiler may reorder the instructions (a=y; x=1; in Thread1, for example) as an optimization, or because the hardware reorders the memory operations to speed up execution. Either way, this would be a weak memory consistency model.
My question is: which model does the NVIDIA GPU / CUDA support, strong or weak?
I believe this is the reason for __threadfence(). I think in general the model is weak, but __threadfence() can be used to enforce ordering when you need it.
So for example
Thread1 executes: x=1; __threadfence(); a=y;
Thread2 executes: y=1; __threadfence(); b=x;
Then a=y must occur after x=1 has been flushed to memory and is visible to other threads, and likewise b=x must occur after y=1 is visible to other threads. There is no guarantee as to which will occur first, but at least one must write before the other reads, assuming appropriate use of volatile.
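A sketch of that fenced version (same assumed layout as the earlier example: volatile global-memory flags, both threads in one kernel launch):

```cuda
// Same litmus test, but with __threadfence() between the store and
// the load. The fence orders this thread's memory operations: the
// store before the fence must be visible device-wide before any
// memory operation issued after the fence.
__device__ volatile int x = 0, y = 0;
__device__ int a = 0, b = 0;

__global__ void litmus_fenced(void)
{
    if (threadIdx.x == 0) {          // Thread1
        x = 1;
        __threadfence();             // make x=1 visible to all threads
        a = y;
    } else if (threadIdx.x == 1) {   // Thread2
        y = 1;
        __threadfence();             // make y=1 visible to all threads
        b = x;
    }
}
```

The fence only constrains ordering within each thread; it does not pick a winner, which is why either thread may still observe the other's store or miss it, but not both miss.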
I believe that CUDA has strong consistency within a given warp, and weak consistency overall. Thus, if Thread1 and Thread2 are in the same warp, then a and b will both be equal to 1. If the threads are in different warps, all bets are off.
Of course, there’s no guarantee that the compiler won’t mess it up in either case.