Is __syncthreads() needed? When will divergent threads in the same warp re-sync?

Can I avoid synchronising all threads when threads in the same warp diverge? For example:

    if((threadid & 31) == 0) {
        // do stuff, including setting shared data, e.g.
        shared[threadid/32] = calculated_value;
    }
    // all threads now try to use shared[threadid/32]

Note this is a different index for each warp. Do we need to explicitly tell all threads to wait until all copies of shared have been updated? Or will the hardware force divergent threads to re-sync before any of them try to use their warp's index into the array shared?

                        Thank you

                            Bill

    Dr. W. B. Langdon,
    Department of Computer Science,
    University College London
    Gower Street, London WC1E 6BT, UK
    http://www.cs.ucl.ac.uk/staff/W.Langdon/

CIGPU 2012 http://www.cs.ucl.ac.uk/staff/W.Langdon/cigpu
EvoPAR 2012 http://www.cs.ucl.ac.uk/staff/W.Langdon/evopar
EuroGP 2012 30 Nov
RNAnet http://bioinformatics.essex.ac.uk/users/wlangdon/rnanet/
A Field Guide to Genetic Programming
http://www.gp-field-guide.org.uk/
GP EM http://www.springer.com/10710
GP Bibliography http://www.cs.bham.ac.uk/~wbl/biblio/

Yes you can.

Note that you will need to declare shared as volatile though, at least on Fermi.
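A minimal sketch of the pattern from the question with the volatile qualifier, assuming one value per warp and a block size that is a multiple of 32 (compute() and WARPS_PER_BLOCK are made-up names for illustration):

```cuda
#define WARPS_PER_BLOCK 8          // hypothetical: launched with blockDim.x == 256

__global__ void per_warp(int *out)
{
    // volatile forces the compiler to actually re-read shared memory,
    // rather than keep a possibly stale copy of the value in a register
    __shared__ volatile int shared[WARPS_PER_BLOCK];

    if ((threadIdx.x & 31) == 0)                  // lane 0 of each warp
        shared[threadIdx.x / 32] = compute();     // hypothetical per-warp value

    // No __syncthreads(): each thread reads only its OWN warp's slot, which
    // lane 0 of that same warp has (on current hardware) already written.
    out[threadIdx.x] = shared[threadIdx.x / 32];
}
```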

As others have said, you can use warp-level synchronization, but with one big caveat: it is not part of the programming model, so your code may not be compliant in the future. If you upgrade to a newer version of CUDA, there is a possibility that this code will break. Keep note of that; it might save you a lot of work later.

Warp-level synchronization is documented in the CUDA Programming Guide (though I had to hunt to find it). In Section 5.4.3, they state “Because a warp executes one common instruction at a time, threads within a warp are implicitly synchronized and this can sometimes be used to omit __syncthreads() for better performance.” I think it is safe to assume that this is not just an accidental feature.
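This is exactly the trick used in the final stage of NVIDIA's classic parallel reduction sample: once only one warp's worth of threads remains active, the remaining steps drop __syncthreads() entirely. A sketch along those lines (the volatile qualifier is essential):

```cuda
// Last 32-element stage of a shared-memory block reduction, relying on the
// implicit warp synchronization described in Sec. 5.4.3. Without volatile
// the compiler could cache sdata[tid] in a register across the steps.
__device__ void warpReduce(volatile float *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// called from the reduction kernel as:
//     if (tid < 32) warpReduce(sdata, tid);
```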

However, the size of a warp is variable, in principle. Although all current CUDA devices use 32, Ocelot can use different warp sizes when executing on non-CUDA devices, and future CUDA devices might also have different sized warps. (There should be an over-under betting pool at SC11 on which direction NVIDIA will change the warp size. I can imagine arguments for both 16 and 64.)

32 seems like the ideal warp size to me. It is most convenient for intra-warp instructions like __ballot(), which returns a bit for each thread in a single 32-bit register. It makes sense to me to keep the warp size equal to the register width.
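For instance, counting how many threads in a warp satisfy a predicate takes just __ballot() plus a population count (a sketch for compute capability 2.x; the predicate itself is made up):

```cuda
__global__ void count_active(const int *in, int *out)
{
    int pred = in[threadIdx.x] > 0;            // hypothetical per-thread predicate
    unsigned int mask = __ballot(pred);        // bit i set iff lane i's pred != 0
    if ((threadIdx.x & 31) == 0)               // lane 0 writes the warp's count
        out[threadIdx.x / 32] = __popc(mask);
}
```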

I would have thought so too, but I assumed that when they slipped the warpSize constant into CUDA 2.0 (or was it earlier?), they were preparing for some future change in their internal roadmap. Or maybe they just wanted to encourage people to be explicit in case CUDA was ported to architectures with narrower SIMD capabilities…

A subtle point that worries me is whether the compiler will have to enforce/respect that. Here’s an example:

__global__ void inc()
{
    __shared__ volatile int s;   // __shared__ must be declared inside the kernel

    for(int i = 0; i < 2; i++) {
        bool p1 = threadIdx.x == 0;
        bool p2 = i == 0;
        if((p1 && p2) || (!p1 && !p2)) s++;
        // __automatic_sync_within_warp();
    }
}

int main()
{
    inc<<<1, 32>>>();
}

If the warp sync happens as indicated, s will be incremented by more than 1 (though by exactly how much is undefined).

HOWEVER, suppose the compiler optimizes the code by unrolling the loop and then recognizing that for any given thread the increment happens EXACTLY ONCE. In that case, it may replace the entire unrolled loop and conditional with an unconditional increment, so the warp NEVER diverges. Then s would be incremented by EXACTLY 1 and no more.

This sort of thing worries me and makes it difficult to rely on this feature.

For now, cuobjdump gives the following machine instructions for this code:

code for sm_20
                Function : _Z3incv
        /*0000*/  /*0x00005de428004404*/  MOV R1, c [0x1] [0x100];
        /*0008*/  /*0x84001c042c000000*/  S2R R0, SR_Tid_X;
        /*0010*/  /*0xfc01dc03190e0000*/  ISETP.EQ.U32.AND P0, pt, R0, RZ, pt;
        /*0018*/  /*0x6000a1e740000000*/  @!P0 BRA.U 0x38;
        /*0020*/  /*0x03f08085c1000000*/  @P0 LDS R2, [0x0];
        /*0028*/  /*0x042080034800c000*/  @P0 IADD R2, R2, 0x1;
        /*0030*/  /*0x03f08085c9000000*/  @P0 STS [0x0], R2;
        /*0038*/  /*0xfc01dc03190e0000*/  ISETP.EQ.U32.AND P0, pt, R0, RZ, pt;
        /*0040*/  /*0x000001e780000000*/  @P0 EXIT;
        /*0048*/  /*0x03f01c85c1000000*/  LDS R0, [0x0];
        /*0050*/  /*0x04001c034800c000*/  IADD R0, R0, 0x1;
        /*0058*/  /*0x03f01c85c9000000*/  STS [0x0], R0;
        /*0060*/  /*0x00001de780000000*/  EXIT;

By the way, I’m not crazy! I’m asking this question because, like others, I need this warp-sync feature to work and to be documented clearly.

I guess one might insert some redundancy into the loop using asm volatile("…" ::: "memory"), hence giving the compiler an artificial sync point.
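Something along these lines, I believe (a sketch only: the empty asm statement with a "memory" clobber is a compiler-level barrier, telling the compiler memory may have changed at that point; it does NOT synchronize anything at run time):

```cuda
for (int i = 0; i < 2; i++) {
    bool p1 = threadIdx.x == 0;
    bool p2 = i == 0;
    if ((p1 && p2) || (!p1 && !p2)) s++;
    // The "memory" clobber stops the compiler from merging or moving the
    // two iterations' loads/stores of s across this point.
    asm volatile("" ::: "memory");
}
```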