Is dynamic code generation possible?

Hi all, I’m new to CUDA and this is my first post here. I have a somewhat unique situation and am unsure if it’s solvable with CUDA.

I have an algorithm that has many conditionals in it, for example:

void Func()
{
	if (condition1)
		CallOtherFunc1();
	else if (condition2)
		CallOtherFunc2();
	.
	.
	.
	else if (condition100)
		CallOtherFunc100();
}

From everything I’ve read and seen, having this type of code in a GPU kernel is disastrous for two reasons:

  1. The GPU handles branching poorly, especially when threads in a warp diverge.
  2. The kernel must contain the code for every function it might possibly call, which bloats the code and drives register usage up toward the worst case across all branches.

Also, I tried writing a small test using classes with virtual functions, only to find it was TWO HUNDRED times slower, so that’s out of the question.

The good part is that I know what these conditionals will be ahead of time and they’ll remain fixed during the execution of the kernel.

So my question is: is dynamic kernel generation, compilation, and execution possible at runtime? I would like to do something like this:

//Evaluate all conditionals and grab only the functions needed.
//Package only needed functions into a kernel
//Compile
//ExecuteCompactFunction<<<123, 123>>>()

Also, if such a thing is possible, does the target machine have to install the nvcc compiler? Or is this capability native to any CUDA executable program?

The closest thing to what I’ve described is GLSL, where you can compose a string of your code at runtime, then hand it to GLSL to compile and run, which happens nearly instantaneously. I’m aiming to get the same thing in CUDA.

Any input is much appreciated, thanks.

Given that you know the configuration parameters ahead of time, and that they are constant across a kernel, have you looked into template classes? I have personally used that approach for kernels with a handful of parameters, resulting in tens of different kernels, invoking the desired kernel through a function pointer. This approach probably scales to a couple hundred kernels before compilation becomes annoyingly slow.

Dynamic code generation is possible. There are CUDA users who have written their own PTX code generators, which they kick off at run time to build custom kernels that they load through CUDA’s fat binary loader API. Other CUDA users even build dynamically from CUDA source code by invoking nvcc through the appropriate system APIs, then loading the resulting code. Obviously such mechanisms can get fairly involved, so I would suggest first trying the template approach, which is almost trivial to realize (basically, kernel arguments are turned into template parameters).

Thanks for the reply. Sadly, the template approach will not work.

I’ve been reading since my post and here’s what I think the best route would be:

-The functions, 1 - 100, are fixed and won’t change, so I should compile them into PTX output.
-Copy the output back into my source as an array of strings, one entry for each function.
-Dynamically generate the code at runtime by concatenating the strings for whichever functions were selected, and compiling them all into a new function.
-Call the newly created function.

Does that sound reasonable?

Another option is to create a “subroutine threaded” interpreter that executes CUDA functions.

Fermi+Kepler support indirect calls which means you can call pointers to functions.

A working example can be found here: https://gist.github.com/4243479

The organization of such an approach is:

  1. Define a "dictionary" of all the functions you might possibly want to execute. A simple implementation will require their prototypes to be identical.
  2. Define an empty constant "program" array that holds indexes into the dictionary.
  3. At run-time initialize this array with the program you want to be executed.

Your entire kernel is this:

__global__
LAUNCH_BOUNDS
void start(int* data)
{
  unsigned int ip = 0; // "instruction pointer" into the constant program[] array

  while (true)         // the loop is escaped by a dictionary function that exits the kernel
    (*dictionary[program[ip++]])(data);
}

Note in the example that one of the defined functions is “cudaExit()” which exits the kernel, escaping the infinite loop.

The neat thing is that the kernel PTX looks like this:

.visible .entry _Z5startPi(
	.param .u32 _Z5startPi_param_0
)
{
	.reg .s32 	%r<11>;

	ld.param.u32 	%r3, [_Z5startPi_param_0];
	mov.u32 	%r10, program;
	// inline asm
	prototype_0 : .callprototype ()_ (.param .b32 _);
	// inline asm

BB4_1:
	.loc 2 71 1
	ld.const.u32 	%r5, [%r10];
	shl.b32 	%r6, %r5, 2;
	mov.u32 	%r7, dictionary;
	add.s32 	%r8, %r7, %r6;
	ld.const.u32 	%r9, [%r8];
	// Callseq Start 0
	{
	.reg .b32 temp_param_reg;
	.param .b32 param0;
	st.param.b32	[param0+0], %r3;
	.loc 2 71 1
	call 
	%r9, 
	(
	param0
	)
	, prototype_0;
	}
	// Callseq End 0
	add.s32 	%r10, %r10, 4;
	bra.uni 	BB4_1;
}

Some background on “threaded code” is here.

I’ve been wanting to try this on CUDA for a long time. :)

The SASS is fun to look at too.

LDC : Load from Constant
BRX : Branch to Relative Indexed Address
PRET : Pre-Return Relative Address (?)
RET : Return

Line 12 (“BRA 0x18”) sends you back to line 5.

Everything after line 12 is the functions in your “dictionary”.

code for sm_30
		Function : _Z5startPi
	/*0008*/     /*0x10005de428004001*/ 	MOV R1, c [0x0] [0x44];
	/*0010*/     /*0xfc001de428000000*/ 	MOV R0, RZ;
	/*0018*/     /*0x00009c8614000c00*/ 	LDC R2, c [0x3] [R0];
	/*0020*/     /*0x08209c036000c000*/ 	SHL R2, R2, 0x2;
	/*0028*/     /*0x00011de428004005*/ 	MOV R4, c [0x0] [0x140];
	/*0030*/     /*0x00209c8614000c10*/ 	LDC R2, c [0x3] [R2+0x400];
	/*0038*/     /*0x4001000778000000*/ 	PRET 0x50;
	/*0048*/     /*0xc0201de74803fffe*/ 	BRX R2 0x0;
	/*0050*/     /*0x10001c034800c000*/ 	IADD R0, R0, 0x4;
	/*0058*/     /*0xe0001de74003fffe*/ 	BRA 0x18;
	/*0060*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;
	/*0068*/     /*0x1020dc4340000000*/ 	ISCADD R3, R2, R4, 0x2;
	/*0070*/     /*0x00309c8580000000*/ 	LD R2, [R3];
	/*0078*/     /*0x28209ca35000c000*/ 	IMUL R2, R2, 0xa;
	/*0088*/     /*0x00309c8590000000*/ 	ST [R3], R2;
	/*0090*/     /*0x00001de790000000*/ 	RET;
	/*0098*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;
	/*00a0*/     /*0x1020dc4340000000*/ 	ISCADD R3, R2, R4, 0x2;
	/*00a8*/     /*0x00309c8580000000*/ 	LD R2, [R3];
	/*00b0*/     /*0x3c209c034800c000*/ 	IADD R2, R2, 0xf;
	/*00b8*/     /*0x00309c8590000000*/ 	ST [R3], R2;
	/*00c8*/     /*0x00001de790000000*/ 	RET;
	/*00d0*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;
	/*00d8*/     /*0x1400dde218000000*/ 	MOV32I R3, 0x5;
	/*00e0*/     /*0x10209c4340000000*/ 	ISCADD R2, R2, R4, 0x2;
	/*00e8*/     /*0x0020dc8590000000*/ 	ST [R2], R3;
	/*00f0*/     /*0x00001de790000000*/ 	RET;
	/*00f8*/     /*0x00001de780000000*/ 	EXIT;
	/*0100*/     /*0xe0001de74003ffff*/ 	BRA 0x100;
	/*0108*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0110*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0118*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0120*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0128*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0130*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0138*/     /*0x00001de440000000*/ 	NOP CC.T;
		...........................

Wow allanmac, that’s very clever. Thanks for the snippets and the link.

One concern of mine is efficiency: is there any performance hit for using function pointers?

Forgive my ignorance of the PTX/SASS code, I’m not sure what it means. Were you posting it to illustrate that using this solution is efficient?

Thanks.

If your subroutines/functions are really tiny – like a few instructions – then this is probably not a great fit.

Also, if your conditionals are precomputed then I suspect that passing in a dense version of your conditionals (e.g. 64 bits) on kernel launch might be a good and simple solution to your problem. Just test the appropriate bit and execute the subroutine if true. Skipping over code is pretty fast. I, of course, don’t know what your performance goals are. :)

By the way you wrote your question, I think readers are going to assume that every thread in your kernel will be using the same set of conditionals. Hopefully that’s the case.

Divergent branching in a warp is where performance gets hammered. If every thread in a warp is branching in the same direction then you probably don’t have anything to worry about.

Yes, the snippets are just meant to illustrate how terse the inner loop winds up being in PTX/SASS. :)

Also note the example I gave is really primitive, and you could probably shrink the dispatch logic even smaller by implementing true threaded code (if the compiler allows it). The point of the example is really to show that indirect calling on sm_20/21/30/35 has probably not yet been fully explored (or abused) by CUDA hackers. Not sure anyone will need to, either. :)

Thanks for the additional info. I will keep all of that in mind.

I’m curious why you think it might not be a great fit for small functions though? Do you think the overhead of the function pointer would take longer than executing the functions themselves?

For my case, the functions are pretty small and involve just a few lines of math each.

I plan on abusing this implementation =)

If your functions are only a few instructions and the dispatch logic is a few instructions per call then the overhead suddenly seems rather significant (even if it were further optimized down to a few instructions).

If it’s really small functions and performance is critical then code generation or strategic template’ing as @njuffa noted starts to make a lot of sense.

I’m sure others will post additional alternatives.

Do you only ever intend one function to be called per kernel invocation? Or is there a loop around Func inside your kernel such that more than one of the functions may be called on each kernel invocation?

If only one branch will be taken per kernel, then the template approach should be semi-practical since you will only have a maximum of 100 possible branches and not 2^100. You could decide before you launch the kernel what the first true condition is.

@eelsen, I was wondering the same thing about the wording of the original example. If the conditionals are truly exclusive (if/else-if…) then it’s only a modest 100 separate kernels or functions. But later in @du’s original question it’s implied that multiple conditionals can evaluate to true. Not sure which is right.

If the interpretation that @du needs to be able to run multiple conditionals is correct, then I think a simple switch statement may be enough.

The arguments against were:

  1. branching - if it is indeed the case that all threads are taking the same path, then this isn’t actually a problem.
  2. Register pressure - I’ve done some experiments and it looks to me like the register usage is the maximum of all the branches not their sum. If your branches are all short and relatively similar, I don’t think this is a big deal. If one of the branches uses 20 more registers than any of the others then this objection does matter.

Hey guys, sorry for not being clear. My case is where up to 12 of them can be true, so templating wouldn’t work.

Good call on the register pressure eelsen, it’s the max and not the sum.

I will try it with branching and with function pointers to see which runs the fastest. It’ll be a month or so before I do it (trying to lay some design groundwork early), but I’ll report back here with my findings when I’m done.

Thanks.

Just for kicks I tweaked the code to shrink the inner loop.

This:

__global__
LAUNCH_BOUNDS
void start(int* const data)
{
  unsigned int pc = 0;

  while (true)
    deviceProgram[pc++](data);
}

is now accomplished in lines 5-10. Line 6 can probably be ignored. So a total of 5 SASS instructions:

code for sm_30
		Function : _Z5startPi
	/*0008*/     /*0x10005de428004001*/ 	MOV R1, c [0x0] [0x44];
	/*0010*/     /*0xfc001de428000000*/ 	MOV R0, RZ;
	/*0018*/     /*0x00009c8614000c00*/ 	LDC R2, c [0x3] [R0];
	/*0020*/     /*0x00011de428004005*/ 	MOV R4, c [0x0] [0x140];
	/*0028*/     /*0x2001000778000000*/ 	PRET 0x38;
	/*0030*/     /*0x20201de74803ffff*/ 	BRX R2 0x0;
	/*0038*/     /*0x10001c034800c000*/ 	IADD R0, R0, 0x4;
	/*0048*/     /*0x20001de74003ffff*/ 	BRA 0x18;
	/*0050*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;
	/*0058*/     /*0x1020dc4340000000*/ 	ISCADD R3, R2, R4, 0x2;
	/*0060*/     /*0x00309c8580000000*/ 	LD R2, [R3];
	/*0068*/     /*0x28209ca35000c000*/ 	IMUL R2, R2, 0xa;
	/*0070*/     /*0x00309c8590000000*/ 	ST [R3], R2;
	/*0078*/     /*0x00001de790000000*/ 	RET;
	/*0088*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;
	/*0090*/     /*0x1020dc4340000000*/ 	ISCADD R3, R2, R4, 0x2;
	/*0098*/     /*0x00309c8580000000*/ 	LD R2, [R3];
	/*00a0*/     /*0x3c209c034800c000*/ 	IADD R2, R2, 0xf;
	/*00a8*/     /*0x00309c8590000000*/ 	ST [R3], R2;
	/*00b0*/     /*0x00001de790000000*/ 	RET;
	/*00b8*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;
	/*00c8*/     /*0x1400dde218000000*/ 	MOV32I R3, 0x5;
	/*00d0*/     /*0x10209c4340000000*/ 	ISCADD R2, R2, R4, 0x2;
	/*00d8*/     /*0x0020dc8590000000*/ 	ST [R2], R3;
	/*00e0*/     /*0x00001de790000000*/ 	RET;
	/*00e8*/     /*0x00001de780000000*/ 	EXIT;
	/*00f0*/     /*0xe0001de74003ffff*/ 	BRA 0xf0;
	/*00f8*/     /*0x00001de440000000*/ 	NOP CC.T;
		...........................

The gist@github was also updated.

The generated ISA for both the if/else-if and a switch statement shows that the compiler optimized them into a jump table using the BRX instruction as well, so I don’t imagine there would be a large difference in performance between that approach and allanmac’s (as long as the number of registers is fairly uniform per branch).

So I’ve done some time testing and am noticing serious performance problems in the most bizarre places.

Is anyone else seeing that function pointers can sometimes execute about 4 times slower than a straight function call?

I will post a code example tomorrow.

Here is a condensed version of what I’m working with to illustrate the point. The problem is explained in the comments in kernel.cu. Moving one line of code slows execution down by 350%. I’ve diffed the ptx files for the slow and fast builds, and there are many more differences than one would expect for such a small change. It seems like the compiler is trying to be smart with some inlining, although I don’t really know how to read ptx. Can you guys try the same and see what you get?

I am running VS2010, sm_20, 32-bit build, on Windows 7 x64. I’d prefer not to post a ton of code, but it won’t let me attach a project file as a zip, so here are all the files one by one. The main file of interest is kernel.cu. The core of the problem is:

Based on how we increment our count of variations in CuFlameXform::AddVariation(), it changes the execution speed of CuFlameXform::Apply(). I’m utterly stumped, so any help would be much appreciated. Thanks.

Timing.h

#pragma once

#include <Windows.h>
#include <stdio.h>

class Timing
{
public:
	Timing();
	void Tic();
	double Toc();
	double Toc(char* str);
	double BeginTime();
	double EndTime();
	static double Freq();
	static int NumOfProcessors();
 
private:
	static inline void Init();
 
	LARGE_INTEGER m_BeginTime;
	LARGE_INTEGER m_EndTime;
	static bool m_TimingInit;
	static int m_NumOfProcessors;
	static LARGE_INTEGER m_Freq;
};

Timing.cpp:

#include "Timing.h"

bool Timing::m_TimingInit = false;
int Timing::m_NumOfProcessors;
LARGE_INTEGER Timing::m_Freq;

Timing::Timing()
{
	Init();
	Tic();
}
 
void Timing::Tic()
{
	QueryPerformanceCounter(&m_BeginTime);
}
 
double Timing::Toc()
{
	return Toc(NULL);
}
 
double Timing::Toc(char* str)
{
	QueryPerformanceCounter(&m_EndTime);
	double calcTime = double(m_EndTime.QuadPart - m_BeginTime.QuadPart) * 1000.0 / double(m_Freq.QuadPart);
 
	if (str != NULL)
		printf("%s processing time: %f ms\n", str, calcTime);
 
	return calcTime;
}
 
double Timing::BeginTime() { return (double)m_BeginTime.QuadPart; }
double Timing::EndTime() { return (double)m_EndTime.QuadPart; }
 
double Timing::Freq()
{
	Init();
	return (double)m_Freq.QuadPart;
}
 
int Timing::NumOfProcessors()
{
	Init();
	return m_NumOfProcessors;
}
 
void Timing::Init()
{
	if (!m_TimingInit)
	{
		SYSTEM_INFO sysinfo;
 
		QueryPerformanceFrequency(&m_Freq);
		GetSystemInfo(&sysinfo);
		m_NumOfProcessors = sysinfo.dwNumberOfProcessors;
		m_TimingInit = true;
	}
}

kernel.cu:

#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#define EPS			 (1e-10)

struct CuFlamePoint
{
	float m_X;
	float m_Y;
	float m_ColorX;
};

struct CuFlameIteratorHelper
{
	float m_TransX, m_TransY;
	float m_PrecalcSumSquares;
	float m_PrecalcSqrtSumSquares;
	float m_PrecalcSina;
	float m_PrecalcCosa;
	float m_PrecalcAtanxy;
	float m_PrecalcAtanyx;
};

typedef void (*VariationFunc)(float weight, CuFlameIteratorHelper& helper, CuFlamePoint* outPoint);

__device__ void Linear(float weight, CuFlameIteratorHelper& helper, CuFlamePoint* outPoint)
{
	outPoint->m_X += weight * helper.m_TransX;
	outPoint->m_Y += weight * helper.m_TransY;
}

__device__ void Sinusoidal(float weight, CuFlameIteratorHelper& helper, CuFlamePoint* outPoint)
{
	outPoint->m_X += weight * sin(helper.m_TransX);
	outPoint->m_Y += weight * sin(helper.m_TransY);
}

__device__ void Spherical(float weight, CuFlameIteratorHelper& helper, CuFlamePoint* outPoint)
{
	float r2 = weight / (helper.m_PrecalcSumSquares + EPS);

	outPoint->m_X += r2 * helper.m_TransX;
	outPoint->m_Y += r2 * helper.m_TransY;
}

#define MAX_VARS_PER_XFORM 8

class CuFlameXform
{
public:
	__device__ CuFlameXform()
	{
		Init();
	}

	__device__ CuFlameXform(float colorX, float a, float b, float c, float d, float e, float f)
	{
		Init();
		m_ColorX = colorX;
		
		m_A = a;
		m_B = b;
		m_C = c;
		m_D = d;
		m_E = e;
		m_F = f;
		m_ColorSpeed = 0.5f;
		m_ColorSpeedCache = m_ColorSpeed * m_ColorX;
		m_OneMinusColorCache = float(1.0) - m_ColorSpeed;
	}

	__device__ ~CuFlameXform()
	{
		Init();
	}

	__device__ void Init()
	{
		m_VariationCount = 0;
		memset(m_Variations, 0, MAX_VARS_PER_XFORM * sizeof(VariationFunc));
		memset(m_VariationWeights, 0, MAX_VARS_PER_XFORM * sizeof(float));
	}

	__device__ void Apply(CuFlamePoint* inPoint, CuFlamePoint* outPoint)
	{
		CuFlameIteratorHelper iterHelper;

		outPoint->m_ColorX = m_ColorSpeedCache + (m_OneMinusColorCache * inPoint->m_ColorX);
		iterHelper.m_TransX = (m_A * inPoint->m_X) + (m_B * inPoint->m_Y) + m_C;
		iterHelper.m_TransY = (m_D * inPoint->m_X) + (m_E * inPoint->m_Y) + m_F;
		Precalc(iterHelper);
		outPoint->m_X = outPoint->m_Y = 0;

		//We know we added 3 variations for this test, so just hard code calls to the first three just for demo.
		//for (unsigned int i = 0; i < m_VariationCount; i++)//Or, doing this also makes it 3.5 times slower, regardless of where we increment m_VariationCount below.
		for (unsigned int i = 0; i < 3; i++)//This will be 3.5 times faster if we increment m_VariationCount outside of the if block below.
			m_Variations[i](m_VariationWeights[i], iterHelper, outPoint);
	}

	__device__ void AddVariation(VariationFunc variation, float weight)
	{
		if (m_VariationCount < MAX_VARS_PER_XFORM)
		{
			m_Variations[m_VariationCount] = variation;
			m_VariationWeights[m_VariationCount] = weight;
			
			//m_VariationCount++;//Putting this here makes the call of the function pointers 3.5 times slower.
		}

		//For some reason, pulling this out of the above conditional makes calling a func ptr roughly the same speed as calling a regular function.
		//It should make no difference, since we always execute this line anyway: we only add 3 variations in our test, which is always
		//below the limit of 8.
		m_VariationCount++;
	}

	__device__ void ClearVariations()
	{
		m_VariationCount = 0;
		memset(m_Variations, 0, MAX_VARS_PER_XFORM * sizeof(VariationFunc));
		memset(m_VariationWeights, 0, MAX_VARS_PER_XFORM * sizeof(float));
	}
	
	float m_A, m_B, m_C, m_D, m_E, m_F;
	float m_ColorX;
	float m_ColorSpeed;
	float m_ColorSpeedCache;
	float m_OneMinusColorCache;

private:
	__device__ void Precalc(CuFlameIteratorHelper& iteratorHelper)
	{
		iteratorHelper.m_PrecalcSumSquares = (iteratorHelper.m_TransX * iteratorHelper.m_TransX) + (iteratorHelper.m_TransY * iteratorHelper.m_TransY);
		iteratorHelper.m_PrecalcSqrtSumSquares = sqrt(iteratorHelper.m_PrecalcSumSquares);
		iteratorHelper.m_PrecalcSina = iteratorHelper.m_TransX / iteratorHelper.m_PrecalcSqrtSumSquares;
		iteratorHelper.m_PrecalcCosa = iteratorHelper.m_TransY / iteratorHelper.m_PrecalcSqrtSumSquares;
		iteratorHelper.m_PrecalcAtanxy = atan2(iteratorHelper.m_TransX, iteratorHelper.m_TransY);
		iteratorHelper.m_PrecalcAtanyx = atan2(iteratorHelper.m_TransY, iteratorHelper.m_TransX);
	}

	VariationFunc m_Variations[MAX_VARS_PER_XFORM];
	float m_VariationWeights[MAX_VARS_PER_XFORM];
	unsigned int m_VariationCount;
};

__global__ void RunIters(unsigned int count)
{
	CuFlamePoint firstPoint, secondPoint;
	CuFlameXform theXform(1, -0.681206f, -0.0779465f, 0.20769f, 0.755065f, -0.0416126f, -0.262334f);
	theXform.AddVariation(Linear, 0.2f);
	theXform.AddVariation(Sinusoidal, 0.3f);
	theXform.AddVariation(Spherical, 0.5f);
	firstPoint.m_X = 0.3f;
	firstPoint.m_Y = -0.8f;
	firstPoint.m_ColorX = 0.44f;
	
	for (unsigned int round = 0; round < count; round++)
	{
		theXform.Apply(&firstPoint, &secondPoint);
		firstPoint = secondPoint;
	}
}

extern "C"
void CudaCallWrapper(unsigned int i)
{
	RunIters<<<16, 16>>>(i);
	cudaDeviceSynchronize();
}

Main.cpp:

// includes, system
#include <iostream>
#include <stdlib.h>

// Required to include CUDA vector types
#include <cuda_runtime.h>
#include <vector_types.h>

#include "Timing.h"

extern "C" void CudaCallWrapper(unsigned int i);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
	Timing t;
	CudaCallWrapper(5);
	t.Toc("CudaCallWrapper(5)");
	t.Tic();
	CudaCallWrapper(100);
	t.Toc("CudaCallWrapper(100)");
	t.Tic();
	CudaCallWrapper(500);
	t.Toc("CudaCallWrapper(500)");
	t.Tic();
	CudaCallWrapper(100000);
	t.Toc("CudaCallWrapper(100000)");

	return 0;
}

I’ve done extensive benchmarking and here are the results for anyone interested:

-Virtual functions and function pointers are laughably, embarrassingly slow. They are so slow, that I believe they should be excluded from the CUDA compiler altogether. Anything that has 300% more overhead simply has no place in GPU programming. If a person can accept something 300% slower than a normal function call, they should just stick to using the CPU.

-For performance critical code, only runtime code generation will work. Anything else simply doesn’t run fast enough.

I will begin researching runtime code generation/compilation.

I’ve looked high and low, and it looks like this is not possible without writing the whole program in Python and using PyCUDA, which is not an option for me.

I’m going to begin researching OpenCL.

Request for NVIDIA: Please add this to a future version of CUDA. Thanks.

Just as an FYI, device function calls that can be resolved at compile time are pretty much guaranteed to be inlined by the CUDA compiler. What I suspect you are benchmarking is the difference between inlined “function calls” and non-inline, real function calls. And you are right, the difference is huge…

Early versions of the CUDA compiler actually had no way to issue a function call in the traditional sense (with a stack and all that), which prevented support for virtual functions, function pointers, and recursion. Now those features have been added, but the cost compared to an inline function is quite high. It’s hard to reduce that cost without trading away some of the GPU throughput.

PyCUDA (which I use a lot) doesn’t do anything particularly magical for dynamic code generation. It simply dumps the CUDA C source into a temp file, calls NVCC to generate a .cubin, and then loads it automatically. As a result, PyCUDA requires the CUDA toolkit to be installed. To speed things up, PyCUDA also uses a .cubin cache. You could do the same thing in your own program as long as you get to assume that NVCC is present.

I do agree it would be great to have a CUDA C compiler available in the CUDA driver API, just like the OpenCL compiler is available. The current CUDA driver API provides a PTX compiler if you can dynamically generate low-level code. There is also a separately available LLVM-based compiler infrastructure that you can get here if you are a registered developer:

https://developer.nvidia.com/cuda-llvm-compiler