Can one warp be doing one thing while another warp does something else?

Hello again,

Is it good practice to have something like this?

if (threadIdx.x < 32) {
  <Do task A>
}
if (threadIdx.x >= 32 && threadIdx.x < 64) {
  <Do task B>
}
__syncthreads();

<Do task C>

I have a situation where task A and task B will be similar in length (both require one coalesced global memory read, then a few mults or adds on the data, and a store into shared). Would it be helpful to have them on separate warps? (Both warps will then participate equally in the final task C.)

Cheers!

Yes, warp specialization is a known strategy for GPU development. (google “gpu warp specialization”) Warp specialization by itself does not introduce any warp divergence, since the decision boundaries are at warp boundaries.

Whether or not it can be called a “good practice” depends on what you are trying to do, and whether your work packetizes well along warps or not.
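For reference, here is a minimal sketch (my illustration, not code from either post) of the warp-aligned version of the original question; since threadIdx.x / warpSize is uniform within a warp, neither warp diverges internally:

int warpId = threadIdx.x / warpSize;   // warpSize == 32, constant across each warp

if (warpId == 0) {
  // <Do task A>  e.g. one coalesced global read, a little arithmetic, a store to shared
}
else if (warpId == 1) {
  // <Do task B>
}
__syncthreads();

// <Do task C>  all warps participate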

Thanks, txbob! My specific application of warp specialization is simply to read different arrays from global into shared. My feeling is that the bandwidth from global to the SM is more than what 32 threads can read, even if their reads coalesce. Not sure if 240 threads would saturate the pipe, but I figure since I’ve got 768 threads in the block I’ll give it a shot. It may indeed be a SLIGHT improvement to do the warp specialization rather than have threads 0-239 handle the batch loading, but I’d have to do more timings to be sure. Here’s the code:

    int atomid = (threadIdx.x & BATCH_BITS_MASK);
    int pos1 = pos + atomid;

    // Read coordinates and charges.  This attempts to make use of warp specialization.
    if ((pos1 < cSim.atoms) && (atomid < BATCHSIZE)) {
      if ((threadIdx.x >> BATCH_BITS) == 0) {
        PMEFloat charge = cSim.pAtomChargeSP[pos1];
        PMEFloat fx = cSim.pFractX[pos1];
        crdq[              atomid] = fx;
        crdq[3*BATCHSIZE + atomid] = charge;
      }
      if ((threadIdx.x >> BATCH_BITS) == 1) {
        PMEFloat fy = cSim.pFractY[pos1];
        crdq[  BATCHSIZE + atomid] = fy;
      }
      if ((threadIdx.x >> BATCH_BITS) == 2) {
        PMEFloat fz = cSim.pFractZ[pos1];
        crdq[2*BATCHSIZE + atomid] = fz;
      }
    }
    __syncthreads();

    <Do task C>

In this case, BATCHSIZE is 240, BATCH_BITS_MASK is 255, and BATCH_BITS is, as you might expect, 8.
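For reference, my reconstruction of those constants (the #define form is an assumption; only the values come from the post):

#define BATCHSIZE        240                       // atoms handled per batch
#define BATCH_BITS         8                       // each 256-thread group is one specialization
#define BATCH_BITS_MASK  ((1 << BATCH_BITS) - 1)   // 255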

There’s a couple things that jump out at me.

  1. If you’re gunning for every last ounce of performance, I would prefer integer add/subtract over integer shift, all other things being equal. (for many architectures, add/subtract has 2x higher throughput). There’s other integer add usage here (index calculation) so I’m not 100% sure what is best - its hard to evaluate integer pipe pressure without looking at the SASS, but it might be that this idiom for warp selection:
int atomid = (threadIdx.x & BATCH_BITS_MASK);
int pos1 = pos + atomid;
int batch_sel = threadIdx.x - atomid;
// Read coordinates and charges.  This attempts to make use of warp specialization.
if ((pos1 < cSim.atoms) && (atomid < BATCHSIZE)) {
  if (batch_sel == 0) {
    ...
  }
  if (batch_sel == 0x100) {
    ...
is faster. Or you could even try a bitwise AND (e.g. batch_sel = threadIdx.x & 0x300;) to use different pipes (which are also higher throughput than shift), if integer add pipe pressure is an issue - see the sketch after point 2 below.

  2. It seems to me like you might have a race condition here. Don't you have two different warps writing to the locations crdq[240…255] (as one example)? Doesn't the order of writes matter for correctness?
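For concreteness, here is point 1's bitwise-AND selector spelled out (a sketch of mine, assuming the 768-thread / BATCH_BITS == 8 layout; the 0x300 mask isolates the two bits that distinguish the three 256-thread groups):

int batch_sel = threadIdx.x & 0x300;   // 0x000, 0x100, or 0x200 - logic pipe, no shift or add
if (batch_sel == 0x000) {
  // X coordinates + charges
}
if (batch_sel == 0x100) {
  // Y coordinates
}
if (batch_sel == 0x200) {
  // Z coordinates
}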

Point 1.) is interesting, and I had been wondering whether the concept of “integer evaluation pipes” versus “floating point evaluation pipes” (or other types of operations) had anything to it that I could take advantage of. I did take the integer shift out of the three conditionals, doing it once and then re-using the result, as per your example. However, there shouldn’t be any race condition here, because each of the three batch_sel groups is writing to a different segment of crdq. The arrangement is as follows:

[ 240 Cartesian X locations ][ 240 Cart. Y locs ][ 240 Cart. Z locs ][ 240 charge values ]

Before going into the load, (atomid < BATCHSIZE) establishes that only threads 0…239, 256…495, and 512…751 are active, so there’s no writing to anything smelling of 240…255. The reason for 240 (and not 255) is that I do a bunch of calculations on 240 atoms, then have warps work on one atom at a time. The loading of crdq is the smallest part of the calculation (which, thanks to warp specialization, still uses threads at pretty high occupancy), the calculations that happen based on the data in crdq are perhaps 30% of the work, and what then happens based on those results is the lion’s share of the kernel.

I could have opted for BATCHSIZE = 256, but then I’d have 24 warps trying to divide up 256 atoms, which leaves eight warps idle on the 11th pass, for an average of 94% occupancy. I opted for 94% occupancy in the second stage of the kernel in exchange for 100% occupancy in the final stage.
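To make that trade-off concrete, here is a sketch (my own; the loop shape is assumed, not lifted from pmemd.cuda) of the stage where each warp works on one atom at a time:

int warpId = threadIdx.x >> 5;             // 0..23 with 768 threads per block
int nWarps = blockDim.x >> 5;              // 24 warps
for (int atom = warpId; atom < BATCHSIZE; atom += nWarps) {
  // BATCHSIZE = 240 = 24 * 10: every warp handles exactly 10 atoms, so no warp
  // idles on the final pass (with 256 atoms, 8 of the 24 warps would sit out an
  // 11th pass).
  // ... per-atom work based on the data in crdq ...
}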

I am, as you have inferred, going for every last ounce of performance here. The reason is that if I can nail this, we can change gears with how the whole program runs. It’s pmemd.cuda, as njuffa and I have been discussing elsewhere, so if you’re familiar with the code, I’d be happy to share more details. Norbert also mentioned on the other thread that larger numbers of smaller blocks tend to be better. I can shift BATCHSIZE down to 60 with 192 threads and BATCH_BITS == 6 and see how that works. Maybe worth a try?

  1. Your checks will be faster as “if (threadIdx.x < 256) … else if (threadIdx.x < 512) …” - such a comparison is a single operation whose result goes directly to a predicate register.

  2. SASS has two ways to handle divergent code. The first is predication: the code is always executed, but its results are completely ignored for threads whose predicate is false. With predication you spend cycles performing the operation itself, but you probably don’t use cache bandwidth.

The second way is conditional jumps around the code not used by the current warp. AFAIR, nvcc uses jumps for code sequences longer than ~5 operations. This way you don’t spend time on unused code, but each warp performs extra operations - the jumps themselves and the resync at the end of the divergent code. Moreover, jumps break sequential instruction streaming from the L1I cache, so they may cause extra delays unless you have enough other warps to cover them.

If your code will not diverge, it’s advantageous to keep the branches as similar as possible - e.g. you can compute base and data inside the “if” and then perform a generic “crdq[base + atomid] = data” outside it.

With extra optimizations it might look like this:

PMEFloat *data;
int       base;

if (threadIdx.x < 256) {
  crdq[3*BATCHSIZE + atomid] = cSim.pAtomChargeSP[pos1];
  data = cSim.pFractX;
  base = 0;
}
else if (threadIdx.x < 512) {
  data = cSim.pFractY;
  base = BATCHSIZE;
}
else {
  data = cSim.pFractZ;
  base = 2*BATCHSIZE;
}

crdq[base + atomid] = data[pos1];

The next topic is the speed of the load/store operations themselves. This depends on the GPU generation, so I will talk about Maxwell/Pascal. Each SM here has 4 dispatchers. Each dispatcher has a 32-wide ALU and an 8-wide LD/ST engine. So each cycle a dispatcher can start an ALU operation, but it can start an LD/ST operation only once per 4 cycles. AFAIK, the bandwidth of shared memory and L1 cache access is limited only by this rate. So copying data from L1$ to shared memory requires 8 GPU cycles. I.e., simultaneously with such a copy, up to 8 ALU operations can be performed.
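Spelling that arithmetic out (my reading of the figures above): 4 dispatchers × 32 ALU lanes = 128 ALU lanes per cycle per SM, versus 4 × 8 = 32 LD/ST lanes per cycle, so a full 32-thread warp needs 32 / 8 = 4 cycles to issue a single LD/ST instruction. An L1$ → shared copy is one load plus one store per thread, hence roughly 4 + 4 = 8 issue cycles on the LD/ST pipe, during which the same dispatcher could issue up to 8 ALU warp-instructions.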

But the latencies are pretty high - AFAIR, the minimal latency of a shmem/L1$ access is 30-50 cycles. In general, a GPU operation’s latency determines when the next operation on its result can be started. I.e., if you only load data into a register, the thread will continue execution until that register is used in another operation. When you perform a “shmem = global” copy, this means a delay of 4 cycles to issue the load into a register, then 30-50 cycles until the data arrives, then 4 cycles for the store into shmem. If this is followed by syncthreads(), we may need to wait another 30-50 cycles for the data to be stored - I don’t know exactly (maybe txbob, njuffa or greg can clarify that?).
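Adding those figures up (taking the numbers above at face value): 4 cycles to issue the load + 30-50 cycles for the data to arrive + 4 cycles to issue the store ≈ 38-58 cycles before the shared-memory store is even issued, plus possibly another 30-50 cycles before a following syncthreads() can see the data - consistent with the “at least 40 cycles” figure in the next paragraph.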

So the code we are discussing will take at least 40 cycles to execute, with only a few LD/ST and especially ALU operations performed. And here extra thread blocks can come to the rescue! While this thread block is stuck waiting for data to move, threads in another thread block can perform the calculation part of their code. When the second thread block starts moving data, the first one can start performing calculations. That’s why it’s crucial for this style of kernel (load data into shmem → syncthreads → perform calculations) to have at least 2 thread blocks per SM, and maybe even more.
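One knob for that (a sketch of my own, not from the thread; the numbers assume the 768-thread blocks discussed above and the 2048-threads-per-SM limit of Maxwell/Pascal, and the shared memory per block must of course also fit twice) is __launch_bounds__, which asks the compiler to keep register usage low enough that at least two blocks can be resident per SM:

// Hypothetical kernel name and parameters; the point is the __launch_bounds__ qualifier.
__global__ void __launch_bounds__(768, 2)   // <= 768 threads per block, aim for >= 2 blocks per SM
batchKernel(const float *in, float *out)
{
  // load into shared -> __syncthreads() -> calculations, as discussed above
}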

Of course, the performance increase as a function of the number of thread blocks depends on how much time you spend moving data. If it’s only 10% of the overall time, a second block will increase speed by about 9% and a third block by less than 1%. OTOH, if data movement takes 50% of the overall time, adding more than two blocks can add another 10-20% of performance.

But there is another way to increase performance - delayed moving. If you have enough spare registers, you can overlap loading into register set #1 with storing from register set #2, and swap their roles on the next iteration:


store set #2
load set #1
syncthreads()
calculations()
store set #1
load set #2
syncthreads()
calculations()

Going further, you should schedule the stores as far as possible from the next syncthreads() in order to give them more time to finish, so even better code:


store set #1 - point 1
load set #2
calculations()
syncthreads()
store set #2 - point 2
load set #1
calculations() using data stored at point 1
syncthreads()

Of course, this requires doubling the shared memory usage, since you need separate places for the data stored at points 1 and 2. You may select your strategy (increasing the number of thread blocks, using two register sets, using two shared memory buffers) depending on what most limits your kernel: registers, shared memory, or the number of threads.
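To tie the pieces together, here is a minimal sketch of that double-buffered “delayed moving” pattern (entirely my own illustration with made-up names - TILE, compute(), pipelined() - not code from pmemd.cuda; the epilogue for the last tiles and all bounds checking are omitted):

#define TILE 256   // one element per thread per tile; launch with TILE threads per block

__device__ void compute(const float *tile)
{
  // placeholder for "calculations()" on one shared-memory tile
}

__global__ void pipelined(const float *src, int numTiles)   // assumes numTiles >= 2
{
  __shared__ float buf[2][TILE];                 // two shared-memory "places"
  float reg0, reg1;                              // two register "sets" (one value each here)

  reg0 = src[0 * TILE + threadIdx.x];            // load set #1
  buf[0][threadIdx.x] = reg0;                    // store set #1
  reg1 = src[1 * TILE + threadIdx.x];            // load set #2
  __syncthreads();                               // buf[0] now visible to the whole block

  for (int t = 0; t + 3 < numTiles; t += 2) {
    buf[1][threadIdx.x] = reg1;                  // store set #2 - "point 2"
    reg0 = src[(t + 2) * TILE + threadIdx.x];    // load set #1 for a later tile
    compute(buf[0]);                             // calculations() on previously synced data
    __syncthreads();                             // now buf[1] is visible

    buf[0][threadIdx.x] = reg0;                  // store set #1 - next "point 1"
    reg1 = src[(t + 3) * TILE + threadIdx.x];    // load set #2 for a later tile
    compute(buf[1]);                             // calculations() on data stored at "point 2"
    __syncthreads();
  }
}

Each __syncthreads() sits after the calculations on the previously published buffer, so the store issued at the top of each half-iteration has the whole compute phase to complete before the barrier - which is exactly the scheduling idea described above.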