Controlling context switching in CUDA

We have many (say, 100) warps, which should perform in a loop two sub-routines. A() which does not need shared memory use, and B() which does. However, there is not enough shared memory for all of the warps together, but for e.g. 10 at a time. so we wanted to do something like
Loop {
A();
Check semaphores in a loop. If available {
Close semaphore. Put stuff from registers into shared memory.
B();
Open semaphore.
} else {
Mark this warp as “On hold” for some time.
}
}
How can I mark a warp as on hold? Can I in some way direct the context switching/issuing pool? Maybe a global memory request?

Assuming you are on a compute capability 2.0 or higher device, and the number of warp groups is at most 15 (you probably don’t have 100 warps because no current hardware supports running that many warps concurrently on one multiprocessor):

const int warp = (threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z)) / warpSize;
const int group = warp/10;
const int ngroups = 10;

if (group==1)
    asm volatile("bar.arrive 1, 20*32;");

Loop {
    A();
    asm volatile("bar.sync %0, 20*32;" :: "r"(group+1));
    // Put stuff from registers into shared memory.
    B();
    asm volatile("bar.arrive %0, 20*32;" :: "r"((group+1) % ngroups + 1));
}

DISCLAIMER: written in browser, totally untested.

Thanks a lot. Still, we have several questions:
(Indeed, Computing capability 2.x)

1 In our case we do not need to wait for all threads to complete. It is enough that there is free buffer space in the shared memory, i.e. that less than 10 other warps are currently executing B(). Thus we don’t understand the use of bar.sync.
We would like to partition the shared memory into 10 parts, and whenever one of the (15, or 48, or 100) warps gets to step B(), he’ll choose one of these 10 that is free (using, probably, 10 flags somewhere), execute B() and free this area.
If your solution does solve that, it would help us to get some more explanations:

2 Can’t we use 48 warps (Maximum number of resident warps per multiprocessor, CUDA - Wikipedia )?
3 Why did you choose 2032? rather than 1532?
4 Line 5, shouldn’t be group==0?

Many thanks.

  1. The solution I posted uses a fixed grouping of warps, and assigns the entire shared memory to a group of 10 warps in a round-robin manner. It is a compromise between absolute flexibility and the effort necessary to achieve it.

    The completely flexible solution where any warp may grab any buffer as soon as it is free is also possible, although it’s likely to be outperformed by a less flexible solution.

    Here is code that however makes assumptions about undefined behavior when more threads than expected participate in a barrier (we assume the sleeping warps are still woken up), and that is inefficient as it consumes extra time when multiple warps are waiting for a buffer:
    [s]

    __shared__ unsigned int available_buffers = 0x3ff; /* as many bits set as buffers are available */
    unsigned int buffer_mask;
    
    Loop {
        A();
        do {
            asm volatile("bar.sync 1, 2*32;");
            int buffer = __ffs(available_buffers);
            if (lane==0)
                buffer_mask = atomicAnd(&available_buffers, ~(1<<buffer));
        } while (__all(buffer_mask & (1<<buffer) == 0));
        // Put stuff from registers into shared memory buffer #buffer.
        B(buffer);
        if (lane==0)
            atomicOr(&available_buffers, 1<<buffer);
        asm volatile("bar.arrive 1, 2*32;");
    }
    

    DISCLAIMER: again written in browser and totally untested.[/s] See the fully worked example below instead.

  2. Yes, you can use the maximum number of warps, e.g. 48 on compute capability 2.x devices. Just not 100 yet.
  3. 20*32 is the product of the number of warps participating in a barrier (20) and the warp size (32). The number of participating warps is 20 as in Nvidia's implementation it includes both the 10 warps waking up the next group and the 10 warps that are woken up.
  4. The group in this case can be any group but not group 0, as according to the documentation for bar.sync it is undefined whether a warp can wake up itself (although I'd guess it probably works).
  5. and
  6. %0 and "r"(argument) are explained in the PTX inline assembly documentation

What you can do is to turn off 9 out of 10 warps in a thread block. Then you’ll have all the space you reserved available for the warps you didn’t turn off.

Using a similar pseudo-code:

Loop {
    A();
    for( int i = 0; i < 10; i++ )
    {
        if( (warp_id%10) == i )
        {
            Put stuff from registers into shared memory.
            B();
        }
        barrier();
    }
}

Doesn’t sound entirely right to me. Can you put A and B into different kernels?

That is what my first suggestion was about - just without involving the other warps in the barrier, so they could do useful work in the meantime.

In general I agree though - from my experience these constructs usually are not worth adopting, as they divert too many processor cycles from useful work to administrative tasks.

Hi Tera
Thank you for your answers, it is really not likely that we would have got so far by ourselves. We’ll explore both alternatives.
Still some questions:

  1. I guess in line 8 the atomicAnd should be ~(1<<buffer), (11110111) and in line 12 the ~ should be removed (00001000).
  2. I guess the “do” part (lines 5-9) should be by the first thread of the warp only (inside if(threadIdx.x==0))? as otherwise it’ll lock the system (B works parallely using a full warp).
  3. Line 13 seems incomplete? what does %0 refer to? What is the use of barriers here, isn’t the management by available_buffers enough?
  4. Where can I get the details of Nvidia’s implementation of waking up warps? Is it documented somewhere? The documentation for bar.sync and bar.arrive in “PARALLEL THREAD EXECUTION ISA VERSION 3.1” (p. 160) is really not as complete as your use or it. The “waking up” idea is not mentioned at all. would 1*32 at least sub-optimally work?
  5. The “using inline PTX assembly in cuda” doc seems to have disappeared from the Nvidia site and is only found at http://wenku.baidu.com/view/8de46dc658f5f61fb7366615.html### .

Hi vvolkov
Unfortunately, not. We simulate a time course of a system (neuron cells where each warp represent a cell and each thread a part of the cell). In A() we independently compute each part
of the cell, while in B we integrate the parts together, so it really has to go ABABABABAB. A() writes to the registers in an ILP manner (Thanks vvolkov!!) and B uses the shared memory to integrate all the parts using a diffusion matrix. We would like B to integrate only warps that All the parts (threads) have been calculated - hence warps that all threads already executed A()

Using_Inline_PTX_Assembly_In_CUDA.pdf is included in cuda 5.0.

What do you need to know about the implementation of waking up threads? For all we know, there is no context switch on bar.sync or __syncthreads() - the switch may happen only when using the recently introduced dynamic parallelism or in debugging, if I understand correctly. Otherwise I see only two possible implementation of the thread block synchronization: (i) bar.sync is replayed until the sufficient number of warps execute the corresponding bar.arrive instructions - that would be similar to replaying memory instructions, and (ii) warp stalls on bar.sync, i.e. no further instructions are issued from it, again, until, again, all those bar.arrive-s are executed - this is similar to the stalls on register dependencies. I always thought it is (ii), but now when you ask I am less sure…

I still feel a bit confused why you can’t split it in two kernels and run the outer loop on the CPU. How about you flush the result of A() not into registers but into global memory, and then pick up this data from global memory in B() and use all shared memory you want to integrate?

Hi Roybens,

yes, your remarks 1. to 3. are correct. I’m sorry I silently corrected the mistakes in the code while there was no further reply in the thread, not knowing you were already testing out the buggy code.

Regarding 4., I’ve only used the PTX isa 3.1 specs. A number of 132 participating threads will not work though. It means the current warp is the only warp participating in the barrier, and will reduce the barrier instructions to just NOPs (no operation). 232 is chosen to pair one sleeping warp with one other to wake it up. Any higher number of warps participating in one barrier would need additional safeguarding so that no warps are left behind sleeping on their barriers and never woken up.

Note how my code is written to be fully functional without the barriers. They are just there to hopefully turn some busy-waiting into sleeping, knowing that often they will pair up wrongly so that to warps that are both supposed to sleep wake up each other.

In general the barriers provided by the PTX instruction set are not well suited for waiting on a specific event, as unfortunately the number of participating warps needs to be known before the first warps waits on the barrier.

The effectiveness of the barriers can be improved by pairing them up nicely. I’ve attached a fully worked example to this post.
The relevant portion of the code is this:

__device__ int get_buffer(void)
{
    unsigned int tid = threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z);
    unsigned int lane = tid % warpSize;
    int buffer, success = 0;

    do {
        buffer = __ffs(available_buffers) - 1; // note that all threads of the warp will read the same value!
        if (lane==0) {
            success = atomicClearBit(&available_buffers, buffer);
            if (success && buffer >= NBUFFERS) {
                // no free buffers, we got a barrier to sleep on instead:
                unsigned int barrier = buffer - NBUFFERS;
                asm volatile("bar.sync %0, 2*32;" :: "r"(barrier));
            }
        }
    } while (__all(success == 0) || buffer >= NBUFFERS);
    return buffer;
}

__device__ void release_buffer(int buffer)
{
    unsigned int tid = threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z);
    unsigned int lane = tid % warpSize;
    int success = 0;

    if (lane==0) {
        // put buffer back into available pool:
        atomicSetBit(&available_buffers, buffer);
        // and wake another warp (if any warps are sleeping):
        do {
            int barrier = __ffs(~(available_buffers >> NBUFFERS)) - 1;
            if (barrier >= NBARRIERS) {
                // no warps sleeping
                break;
            }
            success = atomicSetBit(&available_buffers, barrier + NBUFFERS);
            if (success) {
                // Small race condition:
                // An new warp might already have tried sleeping on the barrier, waking both the old and the
                // new warp in the process.  The bar.arrive will then pair up with the next bar.arrive instead
                // that releases the buffer to also reset the barrier:
                asm volatile("bar.arrive %0, 2*32;" :: "r"(barrier));
            }
        } while (success == 0);
    }
}


Loop {
    A();
    int buffer = get_buffer();
    B(buffer);
    release_buffer(buffer);
}

Note how again the code would work correctly without the barriers, they are just there to prevent busy-waiting. It’s also worth mentioning the code logic could be simplified considerably if shader assembly could be used, as it allows to define arbitrary atomic operations in shared memory.

With the attached code, the number of warps is limited to the number of buffers available plus 16, as PTX only knows 16 distinct barriers. Also no guarantee of fairness is made.

Contact me if you need more than 26 warps or fairness between warps.
barrier.cu (4.06 KB)

It appears attachments are once more not working. Here is the complete worked example as inline code instead:

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf(#x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

#define NBUFFERS 10
#define NBARRIERS 16 // maximum number possible because PTX (as of version 3.1) only has 16 barriers

// time related routines:

__device__ void wait(unsigned int ticks)
{
	unsigned int t = clock();
	unsigned int cycles = ticks * 100000;

	while (clock() - t < cycles)
		;
}

// dummy "work" routines:

__device__ void A()
{
	unsigned int tid = threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z);
	unsigned int warp = tid / warpSize;
	unsigned int lane = tid % warpSize;

	if (lane == 0)
		printf("%8u  warp %2u begin A\n", (unsigned int)clock(), warp);
	wait(10);
	if (lane == 0)
		printf("%8u  warp %2u end   A\n", (unsigned int)clock(), warp);
}

__device__ void B(unsigned int buffer)
{
	unsigned int tid = threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z);
	unsigned int warp = tid / warpSize;
	unsigned int lane = tid % warpSize;

	if (lane == 0)
		printf("%8u  warp %2u begin B(%u) %s ^\n", (unsigned int)clock(), warp, buffer,
		       "                              " + 30 - 3*buffer);
	wait(1);
	if (lane == 0)
		printf("%8u  warp %2u end   B(%u) %s v\n", (unsigned int)clock(), warp, buffer,
		       "                              " + 30 - 3*buffer);
}

#define Loop for (int loop=0; loop<10; loop++)

// aux functions for readability:

__device__ int atomicClearBit(unsigned int* address, unsigned int bit)
{
	unsigned int mask = 1 << bit;
	unsigned int before = atomicAnd(address, ~mask) & mask;
	int success = (before != 0);
	return success;
}

__device__ int atomicSetBit(unsigned int* address, unsigned int bit)
{
	unsigned int mask = 1 << bit;
	unsigned int before = atomicOr(address, mask) & mask;
	int success = (before == 0);
	return success;
}

// locking functions:

__shared__ unsigned int available_buffers;

__device__ void init(void)
{
	unsigned int tid = threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z);

	if (tid==0)
		available_buffers = (1 << (NBUFFERS + NBARRIERS)) - 1; /* as many bits set as buffers and barriers are available */
	__syncthreads();
}

__device__ int get_buffer(void)
{
	unsigned int tid = threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z);
	unsigned int lane = tid % warpSize;
	int buffer, success = 0;

	do {
		buffer = __ffs(available_buffers) - 1; // note that all threads of the warp will read the same value!
		if (lane==0) {
			success = atomicClearBit(&available_buffers, buffer);
			if (success && buffer >= NBUFFERS) {
				// no free buffers, we got a barrier to sleep on instead:
				unsigned int barrier = buffer - NBUFFERS;
				asm volatile("bar.sync %0, 2*32;" :: "r"(barrier));
			}
		}
	} while (__all(success == 0) || buffer >= NBUFFERS);
	return buffer;
}

__device__ void release_buffer(int buffer)
{
	unsigned int tid = threadIdx.x + blockIdx.x * (threadIdx.y + blockIdx.y * threadIdx.z);
	unsigned int lane = tid % warpSize;
	int success = 0;

	if (lane==0) {
		// put buffer back into available pool:
		atomicSetBit(&available_buffers, buffer);
		// and wake another warp (if any warps are sleeping):
		do {
			int barrier = __ffs(~(available_buffers >> NBUFFERS)) - 1;
			if (barrier >= NBARRIERS) {
				// no warps sleeping
				break;
			}
			success = atomicSetBit(&available_buffers, barrier + NBUFFERS);
			if (success) {
				// Small race condition:
				// An new warp might already have tried sleeping on the barrier, waking both the old and the
				// new warp in the process.  The bar.arrive will then pair up with the next bar.arrive instead
				// that releases the buffer to also reset the barrier:
				asm volatile("bar.arrive %0, 2*32;" :: "r"(barrier));
			}
		} while (success == 0);
	}
}

// usage example:

__global__ void barrier_test(void)
{
	init();

	Loop {
		A();
		int buffer = get_buffer();
		B(buffer);
		release_buffer(buffer);
	}
}

int main(int argc, const char * argv[])
{
	barrier_test<<< 1, (NBUFFERS + NBARRIERS) * 32 >>>();
	CUDA_CALL(cudaDeviceSynchronize());

    return 0;
}

Oh, wow.

That looks like a concept that I could really use for my current project.

Right now I use busy-waiting and spinlocks. Instead I could be sending warps to sleep when all shared memory buffers are in use.

Oh, I’ve hit a snag. Is there a way to formulate the get_buffer() without the warp vote function _all() ?
I am targeting compute 1.1 (using global atomics for the available_buffers mask)

Why is warp vote needed at all if only lane 0 (the first thread in each warp) ever gets to set the success variable?

After some heavy thinking, here’s my attempt at removing the warp vote:

__shared__ int warpsuccess[NBUFFERS + NBARRIERS];
    do {
        buffer = __ffs(available_buffers) - 1; // note that all threads of the warp will read the same value!
        if (lane==0) {
            warpsuccess[tid / warpSize] = atomicClearBit((unsigned int*)&available_buffers, buffer);
            if (warpsuccess[tid / warpSize] && buffer >= NBUFFERS) {
                // no free buffers, we got a barrier to sleep on instead:
                unsigned int barrier = buffer - NBUFFERS;
                asm volatile("bar.sync %0, 2*32;" :: "r"(barrier));
            }
        }
    } while (warpsuccess[tid / warpSize] == 0 || buffer >= NBUFFERS);

More snags: bar.arrive requires sm_20. Ouch.

Why does bar.sync take an argument even for sm_1x, if there is no bar.arrive to match it? hmm…

I don’t think you’ll get this working on compute capability 1.x devices, sorry. Main problem, as you noticed, is that bar.sync always involves all warps.
Maybe with some reverse engineering of undocumented behavior someone can come up with a clever use of the 16 different barriers available even on CC 1.x devices.

I used the warp vote (instead of just switching off all other threads) so that the other threads could receive the “buffer” value without having to pass it through shared memory (or a shuffle instruction, only available on CC 3.x).

Well I changed my compilation to sm_20 for the time being. Having integrated everything, unfortunately it blows up at some point. I’ll do some debugging and hopefully it’s just some minor issue or oversight on my part.

UPDATE:
Okay, I have this running now on a Kepler device (compute 3.0), targeting sm_20. The issue why it crashed were some shared memory alignment restrictions that exist in the unified pointer addressing scheme on sm_20, that were perfectly legal under sm_11.

But my goodness, did this experiment degrade my application’s performance! Spinlocks were way faster than what I am currently seeing with bar.sync / bar.arrive

This aligns with my findings. So far I’ve not come across any problem that could profit from these operations. There always was a faster way of doing it, using less tricks.