Is it possible to overlap memory access and computation inside the same kernel?

I’m trying to optimize kernels that use some values from global memory, and I have code with the following idea:


__global__ void k_test_overlap(float* data)
{
    // getting values
    float f1 = data[0];
    float f2 = data[1]; // first version

    //=================================
    // long computation code using only f1
    //=================================

    #pragma unroll
    for (int i = 0; i < 32; i++)
    {
        f1 = sinf(f1 + 1.0f);
    }

    data[0] = f1;
    // float f2 = data[1]; // second version

    //=================================
    // long computation code using only f2
    //=================================

    #pragma unroll
    for (int i = 0; i < 32; i++)
    {
        f2 = sinf(f2 + 1.0f);
    }

    data[1] = f2;
}

I want to cache the values long before the computation in the first version, and right before the computation in the second. My question is: can the calculation using f1 be overlapped with the transfer of f2 from global memory? Can the write of f1 to global memory be overlapped with the calculation for f2? Does this idea also work when transferring data between shared memory, registers, and global memory?

You might be imagining the C++ code you write is an accurate representation of what the machine will do at a detailed level, in time.

It is not.

The C++ code goes through several phases of optimizing compilation, and the compiler understands that the underlying machine is a latency-hiding machine that likes to “overlap” copy and compute. It likes to get loads from global memory “in flight” as early as possible, because those operations are “fire and forget” and they often create a dependency later in your code. So the example you have shown, with the load of f2 moved “up” in your code, is something the compiler may do even if you don’t. Here is a more detailed example of this type of behavior. Spending a lot of time trying to order C++ code for this objective may not be the best use of your time. The compiler is often able to discover such things, and may make the reordering if it believes there could be a performance benefit.
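For illustration, here is a minimal sketch (my own hypothetical kernel, with build commands that are just one way to check, not anything from your code) in which the second load is written late in the source, yet the compiler is free to hoist the corresponding LDG above the first loop in the generated SASS:

// Hypothetical kernel: f2's load is written after the first loop, but the
// compiler may issue the LDG for it before that loop in the SASS.
__global__ void k_late_load(float* data)
{
    float f1 = data[0];

    #pragma unroll
    for (int i = 0; i < 32; i++)
        f1 = sinf(f1 + 1.0f);

    float f2 = data[1];   // in SASS, this LDG often appears near the top

    #pragma unroll
    for (int i = 0; i < 32; i++)
        f2 = sinf(f2 + 1.0f);

    data[0] = f1;
    data[1] = f2;
}

// One way to inspect the result (illustrative commands):
//   nvcc -arch=sm_80 -cubin k.cu -o k.cubin
//   cuobjdump -sass k.cubin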

In Ampere architecture GPUs (technically: Volta and forward), an additional facility is provided that allows programmers to explicitly request asynchronous loading of shared memory from global memory. You can read about it here. This explicitly tells the compiler and the machine: “this load of shared memory can be overlapped with subsequent code, up until the point in time at which I declare a synchronization”.
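Here is a minimal sketch of that pattern using the cooperative groups memcpy_async interface, which is one of the documented ways to express it in CUDA 11+ (the kernel name, block size, and indexing are my own illustrative assumptions). On Ampere the copy can map to the hardware cp.async path; on older supported architectures it falls back to an ordinary synchronous copy:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: start a block-wide global->shared copy, do independent
// work while it is (potentially) in flight, then synchronize before using it.
__global__ void k_async_overlap(const float* __restrict__ in, float* __restrict__ out)
{
    __shared__ float s[256];                 // assumes blockDim.x == 256
    cg::thread_block block = cg::this_thread_block();
    int g = blockIdx.x * blockDim.x + threadIdx.x;

    // "Fire and forget": kick off the asynchronous global->shared copy.
    cg::memcpy_async(block, s, &in[blockIdx.x * blockDim.x], sizeof(float) * 256);

    // Independent computation that never touches s[] can overlap the copy.
    float f = (float)threadIdx.x;
    for (int i = 0; i < 32; i++)
        f = sinf(f + 1.0f);

    // The declared synchronization point: wait for the copy to complete.
    cg::wait(block);

    out[g] = s[threadIdx.x] + f;
}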

If you really want to be confident in your understanding of the ordering of operations, it’s necessary to study the SASS code. It’s difficult to make trustworthy statements based on the C++ source code alone.

I understand that the compiler and optimizer will make their own adjustments to my code and change its order. I probably did not formulate the question clearly enough. What I meant is: within a single thread of a kernel, is it possible to copy/transfer data in parallel with other calculations that do not depend on those data?

Yes, that happens automatically in a CUDA GPU (because nearly all operations have latency). You might have a sequence like this:

SASS:                               C++:
1 LDG   R0, [R2]
2 FMUL  R6, R4, R5                   c = a*b;
3                                    int w = g_w[threadIdx.x];

The C++ code in line 3 caused the compiler to schedule SASS instruction #1. The C++ code in line 2 caused the compiler to schedule SASS instruction #2 (presumably those variables were already loaded into registers).

The load operation associated with SASS line 1/C++ line 3 happens “in parallel” (i.e., at the same time) with the multiply in line 2. The load issued in cycle 1 has a latency of ~100 cycles, so it is still in flight while the multiply issued in cycle 2 executes. The multiply may have a latency of ~6 cycles, so register R6 could not be used in a new operation until perhaps cycle 8, whereas register R0, the target of the load, could not be used until cycle 101. (The latencies are approximate; those values are just for the sake of discussion.)
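As a compilable reference point, the fragment above could come from something like this (the kernel wrapper, parameters, and stores are assumptions I have added so the code is self-contained and not eliminated by the optimizer):

// Hypothetical kernel: a and b arrive as kernel parameters (already in
// registers), so the FMUL has no load dependency and can execute while the
// LDG for g_w is still in flight.
__global__ void k_latency_hiding(const int* g_w, float a, float b,
                                 float* out_c, int* out_w)
{
    float c = a * b;             // FMUL: short (~6 cycle) latency
    int w = g_w[threadIdx.x];    // LDG: long (~100 cycle) latency, typically hoisted

    out_c[threadIdx.x] = c;      // c is ready long before w
    out_w[threadIdx.x] = w;      // first use of w waits for the load to land
}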

If you are asking whether there is a way to do this explicitly in a single thread (i.e., identified as such at the source code level), the only method I know of is the aforementioned async shared memory load that is “new” in Ampere.


At the SASS level, if you use “cuobjdump” to get a listing, entries between curly braces indicate dual issued instructions:

        /*0268*/         {         BFE.U32 R15, R4.reuse, 0x810 ;                   /* 0x380000008107040f */
        /*0270*/                   LDS.U.U16 R30, [0x230]         }
                                                                                    /* 0xef4a10002307ff1e */
        /*0278*/         {         BFE.U32 R0, R4.reuse, 0x808 ;                    /* 0x3800000080870400 */
                                                                                    /* 0x001c4800fe0007f4 */
        /*0288*/                   SSY 0x2270         }
                                                                                    /* 0xe2900001fe000000 */
        /*0290*/         {         LOP32I.AND R3, R4, 0xff ;                        /* 0x040000000ff70403 */
        /*0298*/                   LDC.U8 R59, c[0x3][R0]         }

At least this is the case with Pascal and CUDA 10.2.
