Fast DIy device emulation Introductory howto

This is a fork out of the Wishlist thread to avoid clutter there. I won’t do an Nvidia on you and keep you in the dark. If you have not seen anything like this before you are unlikely to guess. The secret to performance is to use setjmp/longjmp to do a thread context switch in __syncthreads(). No system calls and very few instructions, infact a very basic __syncthreads() is shown below for one single dimensional block. This is the only extra code running during an emulation so your code basically runs full tilt on a single core. It is a bit of a brain bender to set up the jmp_bufs in the launcher in a relatively portable way. Done properly will set up stack linkage so that gdb does not notice. Still a good assignment.

void

__syncthreads(void)

{

    State *state = g_state;

    Thread *th = state->run;

   if (!SETJMP(th->context))                   // Save context

    {

        if (++th == state->end)

        {

            th = state->thread;

        }

        state->run = th;

        threadIdx.x = th - state->thread;

        LONGJMP(th->context, 1);                // Dispatch

    }

}

Similar technology is used in large open source projects such as Apache and Firefox for internal threading - the state threads library. All you need for a basic emulator is a syncthreads and a kernel launcher to get going. From the documentation it appears Nvidia used a supervisory thread in their emulator, however I found it unnecessary. They used pthreads which introduces a lot of overhead. I am sure you will see a much faster Nvidia emulator very soon.

I started at the other end - syncthreads, kernel launch then the host interface. The driver API is easier to emulate. While it is not possible to do everything that <<<>>> can do without compiler customisations, wrapping it up in a macro and limiting the flexibility of passed parameters gets you running without type checking of kernel parameters. If you want to run more than one block in parallel (one does not have to to satisfy the spec) then all shared names need defining to name[physBlockID] for emulation (switch on CUDACC). Bit messy, however tolerable if your code is otherwise unemulatable, or even just for an almost immediate return from make.

Because these style of threads are not preemptable, everything is atomic which helps a lot - every thread between __syncthreads() is a critical section. Atomic intrinsics are a dodle.

I do encourage everyone to roll their own as it really helps understand the architecture and gives you a platform to experiment with architectural extensions. For those who like to be able to see what is going on to know what can be inferred by observed symptoms, your own emulator will be useful. I expect that this little gem will have use into the future as a SIMD cores are likely to appear in general purpose CPUs and have block warps scheduled like, and alongside, a current CPU thread by the one OS kernel.

Not having the time to support an open source emulator, I was hoping someone would pick up and run with it… Perhaps Nvida will open theirs <>

Eric

#include <stdio.h>

#include <setjmp.h>

#define __shared__

#define __device__

#define __global__

typedef struct CUfunc_st *CUfunction;

typedef struct{

	char a[16384];

}bigthingy;

typedef struct{

	int x,y,z;

}dim3;

struct CUfunc_st{

	void(*f)(bigthingy);//not very portable

	int x,y,z;

	int sz;

	bigthingy para;//parameters can't be larger than shared memory

};

typedef struct Thread_st{

	struct Thread_st *cdr;

	dim3 idx;

	jmp_buf ctx;

}Thread;

Thread *_thread0,*_thread1;

Thread _pool[512];

dim3 threadIdx;

dim3 blockIdx;

void __syncthreads(void){

   Thread *th = _thread1;

  if (!setjmp(th->ctx)){

       th=th->cdr;

       if (!th){

           th = _thread0;

       }

       _thread1 = th;

       threadIdx = th->idx;

       longjmp(th->ctx, 1);                // Dispatch

   }

}

void _launch(CUfunction f,Thread *th){

	if(!th)longjmp(_thread0->ctx,1);else{

  if(!setjmp(th->ctx)){

  	//if a CUDA kernel uses THIS much, it won't likely compile on GPU

  	void *p=alloca(65536);

  	_launch(f,th->cdr);

  }else{

  	f->f(f->para);

  	__syncthreads();

  }

	}

}

int cuFuncSetBlockShape(CUfunction f,int x,int y,int z){

	f->x=x;f->y=y;f->z=z;

	return 0;

}

int cuFuncSetSharedSize(CUfunction f,int s){

	f->sz=s;

	return 0;

}

int cuParamSetv(CUfunction f,int ofs,void *p,int sz){memcpy(f->para.a+ofs,p,sz);return 0;}

int cuParamSeti(CUfunction f,int ofs,int v){cuParamSetv(f,ofs,&v,sizeof(v));return 0;}

int cuParamSetf(CUfunction f,int ofs,float v){cuParamSetv(f,ofs,&v,sizeof(v));return 0;}

int cuParamSetSize(CUfunction _1,int _2){return 0;}

int cuLaunchGrid(CUfunction f,int gx,int gy){

	int i,j,k,n;

	blockIdx.z=0;

	n=0;

	for(i=0;i<f->z;i++)

  for(j=0;j<f->y;j++)

  	for(k=0;k<f->x;k++){

    _pool[n].idx.x=k;

    _pool[n].idx.y=j;

    _pool[n].idx.z=i;

    _pool[n].cdr=_pool+n+1;

    n++;

  	}

	_pool[n-1].cdr=NULL;

	for(blockIdx.y=0;blockIdx.y<gy;blockIdx.y++){

  for(blockIdx.x=0;blockIdx.x<gx;blockIdx.x++){

  	_thread1=_thread0=_pool;

  	_launch(f,_thread0);

  }

	}

	return 0;

}

int sh[4096];

void callcuda(void *a,int gx,int bx,int ssz,...){

	struct CUfunc_st ff;

	CUfunction f=&ff;

	*(void**)&f->f=a;

	cuFuncSetBlockShape(f,bx,1,1);

	cuFuncSetSharedSize(f,ssz);

	cuParamSetv(f,0,&ssz+1,sizeof(bigthingy));

	cuLaunchGrid(f,gx,1);

}

__global__ void test(char *fmt){

	extern __shared__ int sh[];

	int x;

	int tid=threadIdx.x;

	printf(fmt,blockIdx.x,tid);

	if(tid==7)puts("");

	sh[tid]=1;

	__syncthreads();

	if(tid<4)sh[tid]+=sh[tid+4];

	__syncthreads();

	if(tid<2)sh[tid]+=sh[tid+2];

	__syncthreads();

	if(tid<1)sh[tid]+=sh[tid+1];

	__syncthreads();

	printf("%d",sh[0]);

	if(tid==7)puts("");

}

int main(){

	int junk[65536];

	return _main();

}

#define main _main

int main(){

	callcuda((void*)&test,8,8,0,"B%d.T%d   ");

	return 0;

}

Starting at the other end is indeed a good idea. Now I got the assignment done. Being taught C++, I keep forgetting about setjmp/longjmp in C.

I can incorporate your __syncthreads into my code right? I’d write a thanks in the comments.

The CPU emulation path is a debugging path. We only run one thread a time to keep threads in sync and simple to debug. It is not meant to be a performance path.
Dealing with printf’s and breakpoints with multiple threads running at the same time can be quite confusing.

The CPU emulation path is a debugging path. We only run one thread a time to keep threads in sync and simple to debug. It is not meant to be a performance path.

We know that, but our point is: an hour per frame hardly makes a useful debug path.

Dealing with printf and breakpoints is indeed confusing, but it’s at least better than nothing. After all, this setjmp thingy is not hard to implement. If one of you guys could spend a day to do it, we’d be really grateful.

So you guys, is there a ready-made piece of code that people can just take and use to do faster debugging? It would be much appreciated, plus some instructions. It’d also be nice if single-threadedness was an option (both single and multi have their own clear advantages and uses).

Anyway, way to go making those nvidia engineers look a bit silly!

Mfatica, it’s easy to draw up some goals on a whiteboard and say “performance doesn’t matter for debugging” but it doesn’t make you right.

Can’t you reduce the resolution? You’re very unlucky indeed if your bug only shows up at full 1080p.

The problem is that resolution doesn’t even matter. The entire hour is spent on a big preprocessing that is error prone but resolution independent:(

The thing is, not likely:(

If one could spend another day on the code I posted, we would have emulated most of the driver API. But that won’t help people who use <<< and >>> much.

If one also happens to use my macro processor in the other thread, debugging could be relatively easily achieved (I have a macro to use driver API in a slightly shorter syntax). If anyone ever intend to use that, I’ll write a manual.

I’ll try to include this emulation in my nvcc replacement when I have time.

Sure it’s now in the public domain - a credit would be polite. I have seen weaker patents.

Yes and this technique is identical, single running thread keeps everything simple. Just toss out pthreads.

I did a quick benchmark against the current working emulator using 32 threads to give you an idea of what can be saved. My first measurement mentioned in Wishlist was against a broken emulator… the only real difference is speed of __syncthreads() - above is 1.2us (warp size 1 and no race condition detection), with proper warp size mapping, multidimensional and race checking I get 2.9us compared to a whopping 47.6us for Nvidia __syncthreads() in v1.0 emulation. I stopped using the Nvidia emulator before even turning on any hardware.

Eric

ed: Before anyone jumps up & down those times quoted were for 32 __syncthreads() calls in 1 warp.

What’s race checking? If it what it sounds like?

Seriously, guys, look into making it a full package. For “<<<”, just make an awk script or something.

A better interface to the driver API also sounds great. The runtime api is showing serious holes when it has to deal with many kernels.

External Image External Image External Image

i just ported my code to the driver api and im getting the wrong results, for exactly the same kernel, and as far as i can tell the same inputs (though if it is so then there is no other place for some thing to go wrong) i tried using the above code to run my code in emulation mode but it didn’t work for me. Has any body manged to get this thing working for a more complicated case ?

Thanks

-e