This is a fork out of the Wishlist thread to avoid clutter there. I won’t do an Nvidia on you and keep you in the dark. If you have not seen anything like this before you are unlikely to guess. The secret to performance is to use setjmp/longjmp to do a thread context switch in __syncthreads(). No system calls and very few instructions; in fact a very basic __syncthreads() is shown below for a single one-dimensional block. This is the only extra code running during an emulation, so your code basically runs full tilt on a single core. It is a bit of a brain bender to set up the jmp_bufs in the launcher in a relatively portable way. Done properly, it sets up the stack linkage so that gdb does not notice. Still a good assignment.
void
__syncthreads(void)
{
    State *state = g_state;
    Thread *th = state->run;

    if (!SETJMP(th->context))           // Save context; returns nonzero when resumed
    {
        if (++th == state->end)         // Advance to the next thread, wrapping
        {
            th = state->thread;
        }
        state->run = th;
        threadIdx.x = th - state->thread;
        LONGJMP(th->context, 1);        // Dispatch
    }
}
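For reference, the snippet above presupposes roughly the following supporting declarations. These are my guesses from the code, not the original definitions; SETJMP/LONGJMP are presumably mapped to _setjmp/_longjmp so the signal mask is not saved (that save is a system call on many platforms, which would defeat the whole point):

```c
#include <setjmp.h>

/* Hypothetical declarations implied by the __syncthreads() snippet. */
#define SETJMP(buf)      _setjmp(buf)    /* no signal-mask save: no syscall */
#define LONGJMP(buf, v)  _longjmp(buf, v)

typedef struct Thread {
    jmp_buf context;      /* saved register state of this emulated thread */
} Thread;

typedef struct State {
    Thread *thread;       /* first thread in the block */
    Thread *end;          /* one past the last thread */
    Thread *run;          /* the thread currently executing */
} State;

State *g_state;           /* set up by the kernel launcher before dispatch */
```

Note that `threadIdx.x = th - state->thread` then falls out naturally: the thread index is just the offset of the running slot into the thread array.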
Similar technology is used in large open source projects such as Apache and Firefox for internal threading - the state threads library. All you need for a basic emulator is a syncthreads and a kernel launcher to get going. From the documentation it appears Nvidia used a supervisory thread in their emulator; I found it unnecessary. They used pthreads, which introduces a lot of overhead. I am sure you will see a much faster Nvidia emulator very soon.
I started at the other end - syncthreads, kernel launch, then the host interface. The driver API is easier to emulate. While it is not possible to do everything that <<<>>> can do without compiler customisations, wrapping it up in a macro and limiting the flexibility of passed parameters gets you running, albeit without type checking of kernel parameters. If you want to run more than one block in parallel (one does not have to in order to satisfy the spec), then all shared names need defining to name[physBlockID] for emulation (switch on CUDACC). A bit messy, however tolerable if your code is otherwise unemulatable, or even just for an almost immediate return from make.
Because this style of thread is not preemptible, everything is atomic, which helps a lot - every stretch of code between __syncthreads() calls is a critical section. Atomic intrinsics are a doddle.
I do encourage everyone to roll their own, as it really helps you understand the architecture and gives you a platform to experiment with architectural extensions. For those who like to see what is going on, and to know what can be inferred from observed symptoms, your own emulator will be useful. I expect this little gem will have use well into the future, as SIMD cores are likely to appear in general-purpose CPUs, with block warps scheduled like, and alongside, a conventional CPU thread by the one OS kernel.
Not having the time to support an open source emulator, I was hoping someone would pick up and run with it… Perhaps Nvidia will open theirs.
Eric