predication in inline PTX

Is it possible to have a predicated load to shared memory?
Is it possible to do it from C?
If not, how would one code it up using inline PTX?

Why can’t you do this in C? An “if (flag) x = shared[i];” should do what you want.

If you really want to use PTX, you should create (and use) all predicates within a PTX assembly clause.

Example of declaring and using a predicate:

Alternatively, if it’s possible, you can create a boolean flag and enclose your PTX assembly in an “if(flag)” clause. Then doublecheck the generated PTX/SASS.

Why is it important that the load is predicated? Last I checked, load instructions including texture loads are predicateable. You can create predicated load instructions in PTX inline assembly.

The compiler will frequently use predication as an optimization, for example by if-conversion. So you might want to look at a simple approach first, and check the generated machine code with cuobjdump --dump-sass:

if (condition) local_variable = shared_memory[index];

Depending on code context, predication may not be indicated for best performance, and the compiler may use some other code idiom.

From what I understand, an instruction that is predicated “false” still issues. For a non-memory instruction the execution proceeds normally but register write-back is inhibited. For a memory instruction execution proceeds until the address computation stage and the instruction is discarded at that point, so a load/store that is predicated false does not create an out-of-bounds exception should the associated address be out-of-bounds.

[Later:] Here is an example where the compiler generates predicated shared memory loads and stores.

__global__ void kernel (float a, float b, float *res)
    __shared__ float t;
    float r;

    if (threadIdx.x == 0) {
        t = 4.0f * a;
    if (threadIdx.x != 0) {
        r = t;
    } else {
        r = 5.0f * b;
    *res = r;

The generated machine code, using compilation with nvcc from CUDA 6.5 with default settings is:

code for sm_20
                Function : _Z6kernelffPf
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];            /* 0x2800440400005de4 */
        /*0008*/         S2R R0, SR_TID.X;                 /* 0x2c00000084001c04 */
        /*0010*/         MOV R4, c[0x0][0x28];             /* 0x28004000a0011de4 */
        /*0018*/         MOV R5, c[0x0][0x2c];             /* 0x28004000b0015de4 */
        /*0020*/         ISETP.EQ.AND P0, PT, R0, RZ, PT;  /* 0x190e0000fc01dc23 */
        /*0028*/    @!P0 LDS R0, [RZ];                     /* 0xc100000003f02085 */
        /*0030*/    @!P0 BRA.U 0x50;                       /* 0x400000006000a1e7 */
        /*0038*/     @P0 MOV R0, c[0x0][0x20];             /* 0x28004000800001e4 */
        /*0040*/     @P0 FMUL R0, R0, 4;                   /* 0x5800d02000000000 */
        /*0048*/     @P0 STS [RZ], R0;                     /* 0xc900000003f00085 */
        /*0050*/         MOV R2, c[0x0][0x24];             /* 0x2800400090009de4 */
        /*0058*/         FMUL R2, R2, 5;                   /* 0x5800d02800209c00 */
        /*0060*/         SEL R0, R2, R0, P0;               /* 0x2000000000201c04 */
        /*0068*/         ST.E [R4], R0;                    /* 0x9400000000401c85 */
        /*0070*/         EXIT;                             /* 0x8000000000001de7 */

Note the predicated LDS (load shared) instruction at address 0x28 and the predicated STS (store shared) instruction at address 0x48.