Using thread ID to determine what functions a thread performs

Excuse the title, I wasn’t entirely sure how to describe this. I am currently working on a project that involves tracking fast ions in a fusion plasma. The problem is that I don’t have enough particles to use the full potential of the GPU with only 1 thread per particle. So I looked into a kernel that diverges between threads but not within a warp. The idea is that several threads can do separate pieces of work for the same particle, making better use of shared memory.

So here is the idea: use the thread’s ID to determine what function the thread will perform within the kernel. I then tested whether this is faster than dividing the problem into separate kernel launches, or just launching a “normal” kernel in which a single thread does all 4 operations.
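The reason this can avoid divergent warps is that threads in a block are numbered with x varying fastest, so when blockDim.x is a multiple of the warp size every warp sees a single threadIdx.y. Here is a minimal sketch that checks this for a given block shape. The names `warpRoleCheck` and `countMixedWarps` are hypothetical and are not part of my test code, and error checking is left out.

```cuda
#include <cuda_runtime.h>

// Hypothetical check: a warp is flagged if any of its lanes has a
// threadIdx.y different from that of the warp's first lane.
__global__ void warpRoleCheck(int* mixed)
{
    // Linear thread index within the block; x varies fastest.
    unsigned int tid    = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int warp   = tid / warpSize;
    // threadIdx.y of lane 0 of this warp.
    unsigned int firstY = (warp * warpSize) / blockDim.x;
    if (threadIdx.y != firstY)
        mixed[warp] = 1;   // every writer stores the same value
}

// Launches one (bx, by) block and returns how many of its warps
// contain more than one threadIdx.y value.
int countMixedWarps(int bx, int by)
{
    int nWarps = (bx * by + 31) / 32;   // assumes a warp size of 32
    int* d_mixed = 0;
    cudaMalloc(&d_mixed, nWarps * sizeof(int));
    cudaMemset(d_mixed, 0, nWarps * sizeof(int));

    warpRoleCheck<<<1, dim3(bx, by)>>>(d_mixed);

    int* h_mixed = new int[nWarps];
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_mixed, d_mixed, nWarps * sizeof(int), cudaMemcpyDeviceToHost);

    int count = 0;
    for (int i = 0; i < nWarps; ++i)
        count += h_mixed[i];

    delete[] h_mixed;
    cudaFree(d_mixed);
    return count;
}
```

With a 64x4 block, `countMixedWarps(64, 4)` should report no mixed warps, while a 16x4 block packs two values of threadIdx.y into each warp, so a switch on threadIdx.y would diverge there.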

What I found out was kind of interesting:

[spoiler][codebox]Functor Kernel took 200.9 ms
Switch Kernel took 205.1 ms
If Kernel took 201.0 ms
Straight Kernel took 296.3 ms
4 Kernel Test took 424.8 ms
[/codebox][/spoiler]

The fastest method (the “Functor Kernel” above) used a switch statement on threadIdx.y to pick a functor, and then called a device function with that functor object.

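[spoiler][codebox]__global__
void FunctorArraytest(double* A1,double* B1,double* C,int n)
{
	unsigned int gidx = blockIdx.x*blockDim.x+threadIdx.x;
	unsigned int idx = threadIdx.x;
	unsigned int idy = threadIdx.y;

	// Shared staging buffers, one slot per threadIdx.x.
	__shared__ double A[BLOCK_SIZE+1];
	__shared__ double B[BLOCK_SIZE+1];

	// Keep the bounds check as a flag so that every thread in the block
	// still reaches __syncthreads(), even in a partially filled block.
	bool active = (gidx < n);

	if (active)
	{
		// Rows 0 and 1 load the inputs; rows 2 and 3 have nothing to load.
		switch (idy)
		{
		case 0:
			A[idx] = A1[gidx];
			break;
		case 1:
			B[idx] = B1[gidx];
			break;
		default:
			break;
		}
	}

	__syncthreads();

	if (active)
	{
		// Each row applies one functor and writes one of the 4 results.
		switch (idy)
		{
		case 0:
			vectorop(A[idx],B[idx],&C[4*gidx],Add());
			break;
		case 1:
			vectorop(A[idx],B[idx],&C[4*gidx+1],Sub());
			break;
		case 2:
			vectorop(A[idx],B[idx],&C[4*gidx+2],Multiply());
			break;
		case 3:
			vectorop(A[idx],B[idx],&C[4*gidx+3],Divide());
			break;
		default:
			break;
		}
	}
}[/codebox][/spoiler]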
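The kernel calls `vectorop` and four functors that only appear in the attachment. For anyone reading without the file, here is a minimal sketch of what they could look like. The bodies of `Add`, `Sub`, `Multiply`, `Divide` and `vectorop` are my assumption of the simplest form that fits the calls above, and `roleKernel` and `runRoleKernel` are hypothetical names for a cut-down version without shared memory or error checking.

```cuda
#include <cuda_runtime.h>

// Assumed functor definitions: one binary operation each.
struct Add      { __device__ double operator()(double a, double b) const { return a + b; } };
struct Sub      { __device__ double operator()(double a, double b) const { return a - b; } };
struct Multiply { __device__ double operator()(double a, double b) const { return a * b; } };
struct Divide   { __device__ double operator()(double a, double b) const { return a / b; } };

// Assumed shape of vectorop: apply the functor and store one result.
template <typename Op>
__device__ void vectorop(double a, double b, double* out, Op op)
{
    *out = op(a, b);
}

// Cut-down role dispatch: threadIdx.y selects the functor.
// There is no barrier in this kernel, so the early return is safe.
__global__ void roleKernel(const double* a, const double* b, double* c, int n)
{
    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (gidx >= n) return;

    switch (threadIdx.y)
    {
    case 0: vectorop(a[gidx], b[gidx], &c[4 * gidx],     Add());      break;
    case 1: vectorop(a[gidx], b[gidx], &c[4 * gidx + 1], Sub());      break;
    case 2: vectorop(a[gidx], b[gidx], &c[4 * gidx + 2], Multiply()); break;
    case 3: vectorop(a[gidx], b[gidx], &c[4 * gidx + 3], Divide());   break;
    default: break;
    }
}

// Host helper: copies n inputs to the device, launches 64x4 blocks,
// and copies the 4*n results back into h_c.
void runRoleKernel(const double* h_a, const double* h_b, double* h_c, int n)
{
    double *d_a = 0, *d_b = 0, *d_c = 0;
    size_t inBytes  = n * sizeof(double);
    size_t outBytes = 4 * inBytes;
    cudaMalloc(&d_a, inBytes);
    cudaMalloc(&d_b, inBytes);
    cudaMalloc(&d_c, outBytes);
    cudaMemcpy(d_a, h_a, inBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, inBytes, cudaMemcpyHostToDevice);

    dim3 block(64, 4);                        // x: element, y: operation
    dim3 grid((n + block.x - 1) / block.x);
    roleKernel<<<grid, block>>>(d_a, d_b, d_c, n);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_c, d_c, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```

Because the functor type is a template parameter, each case of the switch compiles to its own inlined operation, which may be part of why this version keeps up with the plain IF and SWITCH kernels.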

Putting the operation directly under an IF or SWITCH statement performed marginally slower than the functor kernel, but was still faster than the brute force kernel (the “Straight Kernel”), and more than twice as fast as the 4 kernel test.

I can understand why the 4 kernel test was so much slower: it has to launch 4 separate kernels and load the data into shared memory multiple times.

Something else I tried was creating an array of functors and using threadIdx.y as an index into that array, much like a function pointer table. I tried several workarounds, but could not get any of them to work. (Working function pointers would make this much easier.)
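For what it’s worth, the CUDA programming guide lists function pointers to `__device__` functions as supported in device code for compute capability 2.x and higher, and the GTX470 is a 2.0 part, so a table indexed by threadIdx.y may be workable when building for that architecture. Below is a minimal, untimed sketch of that idea. All names here (`BinaryOp`, `opAdd`, `pointerTableKernel`, `runPointerTable`) are hypothetical, error checking is left out, and I have not compared its speed against the switch version.

```cuda
#include <cuda_runtime.h>

// Pointer to a device function taking two doubles.
typedef double (*BinaryOp)(double, double);

__device__ double opAdd(double a, double b) { return a + b; }
__device__ double opSub(double a, double b) { return a - b; }
__device__ double opMul(double a, double b) { return a * b; }
__device__ double opDiv(double a, double b) { return a / b; }

__global__ void pointerTableKernel(const double* a, const double* b,
                                   double* c, int n)
{
    // The table is built in device code, so it holds device addresses.
    const BinaryOp ops[4] = { opAdd, opSub, opMul, opDiv };

    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (gidx < n && threadIdx.y < 4)
        c[4 * gidx + threadIdx.y] = ops[threadIdx.y](a[gidx], b[gidx]);
}

// Host helper: copies n inputs to the device, launches 64x4 blocks,
// and copies the 4*n results back into h_c.
void runPointerTable(const double* h_a, const double* h_b, double* h_c, int n)
{
    double *d_a = 0, *d_b = 0, *d_c = 0;
    size_t inBytes  = n * sizeof(double);
    size_t outBytes = 4 * inBytes;
    cudaMalloc(&d_a, inBytes);
    cudaMalloc(&d_b, inBytes);
    cudaMalloc(&d_c, outBytes);
    cudaMemcpy(d_a, h_a, inBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, inBytes, cudaMemcpyHostToDevice);

    dim3 block(64, 4);                        // x: element, y: operation
    dim3 grid((n + block.x - 1) / block.x);
    pointerTableKernel<<<grid, block>>>(d_a, d_b, d_c, n);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_c, d_c, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```

The table replaces the second switch with a single indexed call. Whether the compiler turns that into indirect calls or back into branches is something I would check in the generated code before drawing any conclusions about speed.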

I’ve attached my test code. The code was written to run on a GTX470, so your results may vary.

I might just be blowing smoke and missing something, but so far it looks like this might be a decent way to break certain problems down even further.

[attachment=23257:functortest.cu]