Hi everyone.
I’m new to this forum and this is my first post. I’m a C / x86 programmer of 20 years, and I’ve read as much of the NVIDIA CUDA documentation as I could get my hands on, particularly the PTX ISA 2.1 documentation. It’s very exciting stuff.
However, having scoured the documentation, I still find myself with some fairly basic questions, most, if not all, of which have to do with what the docs call “Parallel Synchronization and Communication”. Two questions stand out at the moment:
1. Can the PTX “bar.red.op.pred” instruction specify a thread count of zero? I ask because, if that is legal, would it then behave like the “bar.red” equivalent of the “bar.arrive” instruction?
2. More generally, I wish to find a way for a kernel to suspend its own execution until a global memory variable has reached a certain value. Has anyone ever needed to do this, and if so, how could it be done? Does anyone know if this is even possible?
In terms of C, I’m looking for a way to encode a “void Wait_For_GE( unsigned *Global, unsigned Val )” function that will suspend execution of the thread until *Global >= Val, as in the following C pseudo-code:
#define N 4096
unsigned Ready[ N ]; // <== global to all threads…
void Calc_Var( unsigned n, unsigned m ) // <== a callable function…
{
    Wait_For_GE( &Ready[ n ], m ); // <== would suspend execution until Ready[ n ] >= m…
    Do_Some_Stuff( n, m );
    ++Ready[ n ]; // <== this could also include extra sync code to support Wait_For_GE()…
}
void Kernel_Func( ushort ThreadID )
{
    ushort n = 0, m = ThreadID;
    do
    {
        Calc_Something( n, m );
        Calc_Var( n++, m++ );
    }
    while ( m < N );
}
I realize the above is not in the correct CUDA C/C++ format - it’s more just a conceptual way of posing the question. Obviously, the “Kernel_Func” would be implemented as a 1D CUDA Kernel thread which, in this case, would (hopefully) have N (4096) instantiations.
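To pin down the semantics I’m after, here is roughly how the wait and the increment could be written on the CPU side in C11. This is only a sketch under my own assumptions: Signal_Done is a name made up here for an atomic version of the “++Ready[ n ]” step, and nothing below claims that a spin loop like this is safe, or even possible, inside a CUDA kernel. That is exactly what I’m asking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define N 4096

/* Shared progress counters, one per n (zero-initialized as a static array). */
static atomic_uint Ready[N];

/* Spin until *global >= val. The acquire load pairs with the release
   increment in Signal_Done(), so work done before the increment is
   visible to the waiter once this function returns. */
static void Wait_For_GE(atomic_uint *global, unsigned val)
{
    assert(global != NULL);
    while (atomic_load_explicit(global, memory_order_acquire) < val) {
        /* busy-wait; a real CPU version would pause or yield here */
    }
}

/* Hypothetical helper: the "++Ready[n]" step done atomically.
   Returns the counter's new value. */
static unsigned Signal_Done(atomic_uint *global)
{
    assert(global != NULL);
    return atomic_fetch_add_explicit(global, 1u, memory_order_release) + 1u;
}
```

In this model, Calc_Var() would call Wait_For_GE( &Ready[ n ], m ) before Do_Some_Stuff() and Signal_Done( &Ready[ n ] ) after it.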
My current thinking is as follows. Assume a local predicate variable “P” in function Calc_Var(). A “bar.red.or.pred P, 0, 0, 1” instruction could be inserted immediately after the “++Ready[ n ]” (assuming that’s even possible; see question 1), which would signify an “arrival” at “barrier” 0. The Wait_For_GE() function could then loop on a “bar.red.or.pred P, 0, WARP_SZ, 0” instruction, breaking out of the loop only when P is true (i.e. “++Ready[ n ]” has been executed) AND Ready[ n ] >= m.
Originally, my idea was to have the Wait_For_GE() function looping the “bar.red.or.pred” instruction with a thread count of one. But the PTX docs specifically say:
“Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.”
So the thread count must be WARP_SZ.
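The arithmetic behind that constraint can be sketched in plain C. The helper names are hypothetical, and WARP_SZ = 32 is an assumption (it matches current NVIDIA hardware as far as I know). Note that the check below treats 0 as a multiple purely as arithmetic; it says nothing about whether the instruction actually accepts a zero count.

```c
#include <assert.h>

/* Assumption: 32 threads per warp. */
#define WARP_SZ 32u

/* Hypothetical helper: round a requested thread count up to the
   next multiple of WARP_SZ (a count of 1 becomes 32). */
static unsigned Round_Up_To_Warp(unsigned threads)
{
    return (threads + WARP_SZ - 1u) / WARP_SZ * WARP_SZ;
}

/* Hypothetical helper: nonzero if the count is a multiple of WARP_SZ.
   Purely arithmetic; does not decide whether 0 is a legal operand. */
static int Is_Warp_Multiple(unsigned threads)
{
    return threads % WARP_SZ == 0u;
}
```

So a requested count of one thread would have to be rounded up to a full warp before it could be used as the barrier’s thread count.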
Realistically, though, I doubt the above idea has much chance of working, so I’m looking for a better way to do this.
Any ideas, anyone?