Is there a function polymorphism trick in device code?

Hi everyone,

I am working on a CUDA ray tracer for non-linear ray tracing. The propagation of light rays is described by physical quantities based on different types of physical models called “metrics”.

The calculation of the needed quantities is described in a class “metric”, which provides methods that are the same for every physical model and methods that are specific to a single model.

The natural way to implement this behaviour would be a base class providing the member functions that are the same for every metric, and specialized classes derived from it that reimplement the virtual member functions which differ between metrics.

Since I am using an NVIDIA GPU with compute capability 1.3, neither function pointers nor inheritance/polymorphism is supported in device code (as far as I know ;-)

My solution was to use one class with a type attribute to decide which specialized function must be used for the calculation of the physical quantities:

// type of the metric/physical model
typedef enum
{
   enum_metric_1,
   enum_metric_2,
} enum_metric_type;

// calculation of physical quantities related to one metric/physical model
__host__ __device__ static void metric_1_calcData( double * data );
__host__ __device__ static void metric_2_calcData( double * data );

// metric class
class metric
{
   private :

   enum_metric_type   mType;
   double             mCommonData[4];
   double             mSpecificData[4];

   public :

   __host__ __device__ void calcCommonData()
   {
      // do some stuff that is the same for every metric
      mCommonData[0] = .....
   }

   __host__ __device__ void calcSpecificData()
   {
      // calculation depends on the type of the metric
      if( mType == enum_metric_1 )
         // call the function which is related to metric_1
         metric_1_calcData( mSpecificData );
      else if( mType == enum_metric_2 )
         // call the function which is related to metric_2
         metric_2_calcData( mSpecificData );
   }
};

Though this works well for me, I’m not very happy with this solution.

The problem I experienced with this if-branch solution is an increasing compilation time and executable size for larger numbers of metrics and large specialized functions ( metric_x_calcData() ). Because nvcc inlines all function code (including the multiple if-branches) I get huge kernels. In my project I ran into compiler errors concerning the optimization limit, which could only be resolved by the nvcc flag “--opencc-options -OPT:Olimit=0” or by commenting out some of the if-branches.

Though I think there is no other solution except compiling separate kernels for every metric, I would be happy to know if there’s some guru out there who solved the same kind of problem in a more elegant way …

kind regards

Use a jump to go to a relative address inside the same kernel. Use shared memory or registers to pass the parameters.

Ohh, thanks for your reply,

but I didn’t get it; could you be more specific?

I have never heard of “jump” and “relative addresses” in CUDA kernels (until now I have only used the high-level runtime API). Are you talking about assembler instructions, or is it possible to take addresses of device functions?

(compute capability 1.3 / SM 1.3 !!! )

If I instantiate a metric object in device code, its members like mSpecificData should be located in the multiprocessor’s registers.

What do you mean by “use registers to pass the parameters”?

Sorry, I appreciate your help, but I have no idea what you have in mind.

Could you write a minimal example or post a link to some explanation of what you meant?

kind regards

Sorry, it’s my bad… I wasn’t perfectly sure of what CC 1.3 was capable of.

I can’t think of any good way to eliminate the ifs, though to make it faster you might want to use a binary search over the metric types.
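Something like this, as a sketch only; I’ve assumed two extra metric types (enum_metric_3, enum_metric_4 and their calcData functions) just to make the binary split visible:

// sketch: binary dispatch over four assumed metric types,
// at most two comparisons instead of a linear if/else-if chain
__host__ __device__ void calcSpecificData()
{
   if( mType <= enum_metric_2 )   // lower or upper half of the type range?
   {
      if( mType == enum_metric_1 ) metric_1_calcData( mSpecificData );
      else                         metric_2_calcData( mSpecificData );
   }
   else
   {
      if( mType == enum_metric_3 ) metric_3_calcData( mSpecificData );   // assumed
      else                         metric_4_calcData( mSpecificData );   // assumed
   }
}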

Are you saying that all the metric-specific functions get inlined every time you call calcSpecificData? If that’s the case, it would be a problem of the CUDA C compiler. Using PTX directly (using call) would solve this problem.

Despite the fact that CC1.3 does not support indirect branching/call at all, I’m still not convinced that eliminating all the ifs is impossible. However, the method I’m thinking of is certainly not elegant.

Elimination of all the ifs is certainly possible using a switch statement. It’s then up to the compiler to generate the best code possible.
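For example, the calcSpecificData() member from the first post could be rewritten like this (just a sketch of the idea, reusing the names from that post):

// drop-in replacement for metric::calcSpecificData() from the first post
__host__ __device__ void calcSpecificData()
{
   // one switch instead of the chain of if/else-if branches;
   // the compiler decides how to lower the dispatch
   switch( mType )
   {
      case enum_metric_1 : metric_1_calcData( mSpecificData ); break;
      case enum_metric_2 : metric_2_calcData( mSpecificData ); break;
   }
}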

Nice one tera…

True, but as far as I can tell the current release of the compiler doesn’t do this, and you get a chain of comparisons/branches.

You can also do the optimization by hand using a table of function pointers in local memory and then using the switch condition as the index into the table.
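A minimal sketch of that idea, reusing the names from the first post (note that device function pointers need compute capability 2.0 or higher, so this would not help on the 1.3 hardware discussed here):

typedef void (*calcFunc)( double * );

__device__ void calcSpecificDataViaTable( enum_metric_type mType, double * data )
{
   // small local table of device function pointers, indexed by the metric type
   // (with runtime indexing it will typically live in local memory)
   calcFunc table[] = { metric_1_calcData, metric_2_calcData };
   table[mType]( data );
}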

Thanks for all your answers!

…the code itself isn’t that slow; my problem is the long compiling/assembling time (up to several minutes, which overloads the compiler optimization) and the distributed code … I want other users to be able to write their own metrics, and the class structure should be as simple as possible …

I think that’s the case: my project consists of 5 kernels, of which only one makes intensive calls to the specific functions. If I comment out the code in this kernel I get an executable size of 500 KB; if not, the size reaches 5 MB. I think this supports the assumption of strict inlining. But you’re right, the compiler should normally emit the functions once in the assembler code and jump to that code to execute it … but modifying the assembler code would be too much effort given my knowledge and the project’s needs.

I was brainstorming for a long time too, and thought of using different file types and template functions, but in the end the problem stays the same: nvcc does not know in advance what type of metric will be passed to the kernel, and since function pointers are not supported the compiler has to prepare all the specific code at the point of the function call.

The only possibility I see to overcome my problem is to use preprocessor directives, compile the kernel for the different metrics into different .cubin modules, and load the right module via the driver API at the point of the kernel call, where the actual type of the metric is known …
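A rough sketch of how the host side could look with the driver API (module and kernel names are placeholders; context setup, parameter setup and error checking are omitted):

#include <cuda.h>

// assumes the two cubins were pre-built, e.g.
//    nvcc -arch=sm_13 -cubin -DMETRIC=1 kernel.cu -o metric_1.cubin
void launchForMetric( enum_metric_type type )
{
   const char * moduleName = ( type == enum_metric_1 ) ? "metric_1.cubin"
                                                       : "metric_2.cubin";

   CUmodule   module;
   CUfunction kernel;

   cuModuleLoad( &module, moduleName );                        // load the matching module
   cuModuleGetFunction( &kernel, module, "rayTraceKernel" );   // placeholder kernel name

   // ... set the kernel parameters and launch via the driver API ...

   cuModuleUnload( module );
}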

Hawhaw … for a short time I thought about that, because I’m often in the situation of using if/else-if constructs or switches … in the end it’s just semantics, isn’t it??

Be aware of compute capability 1.3: function pointers would solve my problem, but I don’t think nvcc accepts them in any scope!

If your code grows as much as that, you are probably using the functions in a number of places. In that case you might be better off using a template to actually build different kernels for the different metrics. While that would increase the code size even more, each individual kernel would be smaller, which would likely help the optimizer. You also save the evaluation of the chains of conditionals at runtime, and it would make it easier for users to plug in custom metrics.

Of course this only works if the metric is the only user-selectable function. Otherwise the number of possible kernels would just explode.
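A minimal sketch of that suggestion (the functor and kernel names are only illustrative): every metric becomes a type, the kernel is a template, and a single host-side switch picks the instantiation once per launch:

// one functor type per metric; only one of them ends up in each kernel
struct Metric1 { __device__ static void calcData( double * d ) { /* ... */ } };
struct Metric2 { __device__ static void calcData( double * d ) { /* ... */ } };

template<class MetricT>
__global__ void traceKernel( double * data )
{
   MetricT::calcData( data );   // resolved at compile time, no runtime branches
}

// host side: the runtime selection happens once, outside the kernel
void launch( enum_metric_type type, double * d_data, dim3 grid, dim3 block )
{
   switch( type )
   {
      case enum_metric_1 : traceKernel<Metric1><<<grid, block>>>( d_data ); break;
      case enum_metric_2 : traceKernel<Metric2><<<grid, block>>>( d_data ); break;
   }
}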

But there is no call stack and no assembler instruction for indirect jumps on 1.x devices. So you could only achieve that through self-modifying code.

That would be a good option, similar to the template method. Should be a lot easier in 4.0 where you can mix runtime and driver API.

Well on the CPU this should generate a jump table with any decent compiler. Of course that’s not possible on 1.x devices, but I would still expect the compiler to generate a binary tree. Although I haven’t checked myself and Gregory’s comment makes me less optimistic now.

If you have a large number of different metrics in the same environment you will need to do a lot of different kernel launches, which would involve some overhead.

But of course, if at any one time your code only needs to deal with one type of metric, then I think the pre-processing method would be the best. I hope this is the case for you.

Btw, adding the call instruction to your PTX code should be very easy. (From what the PTX manual says about the call instruction, I think there is a call register that stores the instruction address following the call, and the ret instruction jumps back to the address stored in that register. So despite the lack of a call stack, a function call of depth 1 should still be supported.)
To add the calls in your PTX, you just need to compile all your metric-specific functions into individual PTX functions and replace all the function calls in your CUDA C code with __asm("call funcx;");. All of this is easy with nvcc. But still, when you want to add new metric types you always have to recompile things. There is a way to modify your cubin so that you don’t have to recompile everything; the time for this modification should not be longer than a second.

Yepp … nice suggestion, I already started working on that but got stuck in confusion and further problems :-(

The problem in my case is that templates only support static (compile-time) polymorphism and no dynamic (runtime) polymorphism.

I can declare metric template classes using functor classes:

// functor classes for specific metrics
class metric1
{
     public :
     __host__ __device__ bool calcSpecific ( ... );
};

class metric2
{
     public :
     __host__ __device__ bool calcSpecific ( ... );
};

// Metric template class
template<class MetricType>
class Metric
{
     private :
     MetricType     mType;

     public :
     __host__ __device__ bool calcSpecific ( ... )
     {
         return mType.calcSpecific( ... );
     }
};

// now I can construct a specific metric class by
Metric<metric1> Metric1;

And the ray tracing kernels could be declared as template functions depending on a specific metric.

But now I have no base class which allows me to instantiate an object whose type is not known beforehand.

I am planning to set up all parameters using the Lua scripting language, and I think there is no solution to the following C++-like approach:

// type must be fixed/constant - variable template parameters not supported

Metric<which Type !?!!> * CurrMetric = (which return value !?!?) LUA_getMetric();

This really drives me crazy; I’m totally puzzled after hours of braining over a way to get through all that …

Still the only solution I see … is mixing of runtime API and driver API calls not possible in CUDA 3.2?? … I thought I saw an example in the SDK.

Yes, the whole process is bound to one metric until I switch to another …

Thanks for that one … though I think it will end up causing me further problems, I’ll keep it in mind.

The troubles I see are:

  • Is it possible to assemble *.cu to *.ptx with no kernels (only device functions)? Then I would be able to replace all calls to the specific functions calcSpecific(…); with __asm("call 'ok, who has the name ;-) '; ");

  • How can I include the pre-assembled PTX in the PTX of the main code?

thanks guys !

1st qn: compile it into PTX for sm_20. That way all the device functions get emitted as separate .visible functions. Then you just have to change the .target.

2nd qn: You’ll have to build a cubin and modify it directly

The cubin has to be built in this way:

  1. only one global kernel, in which all the functions get called // so that the PTX code does not get omitted
  2. at the very beginning of the kernel, use a conditional jump to the end of the kernel; make sure the conditional jump is always taken
  3. at the end of the kernel, insert your main kernel code

Of course, the above three steps are done in PTX directly. To get the PTX code for your main kernel you just put in the asm("call funcx;"); where applicable and build it into a PTX file.

Of course, the above mentions nothing about pre-compilation. To do that, you’ll build the above cubin first and modify the cubin directly every time you have a change in your main kernel.

Put plenty of definite and different conditional jumps to a ret instruction at the end of your main kernel to reserve some space for it, and build your main kernel separately into a cubin…
Sorry, I’m running out of time. I still don’t know whether sm1x exposes the functions as symbols or not… anyway, you’ll need asmcuda to do that. I’ll come back and explain more if you really need it…

Hi guys!

Thanks for your help, but I think I found a solution to my problem using template classes.

For everyone who is still interested, I wrote a detailed minimal example.

pros :

  • By using a template class Class<T> one can derive specialized classes Class<1>, Class<2>, … which inherit attributes and methods from the main template and get [virtual]-like behavior by specialization/reimplementation of methods for the specialized classes.

  • Inspired by “loop unrolling with template classes”, I got it working to have a base class Class<0> in host code which is able to invoke the methods of the specialized classes depending on a private attribute unsigned int mType.

  • The kernels can now be implemented as template functions for a fixed type T.

  • A user can derive new classes by simply increasing a define (NUM_TYPES in my case) and implementing the specialized methods.

contras :

  • Complex and harder to read code.

  • No compiling of functions with extern linkage (template specializations have to be included/inlined).

  • Every OtherClass that needs a pointer to Class<T> needs to be a template class OtherClass<T> itself (in host code one can always use a base class pointer Class<0> * ptr, but for classes used inside a kernel this restriction applies).

This is a rather long “minimal” example, so you have to unroll it ;-)

Metric.h :

#define _h__ __host__
#define _hd_ __host__ __device__

// Template argument "MetricType"
typedef unsigned int MetricType;

   #define Base  0
   #define Type1 1
   #define Type2 2

#define NUM_TYPES 3

// Use type "Base" by default
template<MetricType T = Base>
class Metric
{
   public : //......... Constructor

   _hd_                 Metric();
   _h__                 Metric( MetricType Type );

   public : //......... Public Methods

   _h__            void setType( MetricType Type );
   _hd_      MetricType getType();
   _hd_    const char * getName();

   private : //........ Specialized Methods

   _hd_    const char * SpecializedName();

   private : //........ Private Attributes

   MetricType           mType;
};

// we have to include all template definitions in the
// header because they are only compiled as needed !!
#include "Metric.inl"

The file Metric.inl is only a placeholder for including the various definitions:

#include "Metric.h"

// Metric<Base> definitions

#include "MetricBase.inl"

// Add custom specializations here

#include "MetricType1.inl"

#include "MetricType2.inl"

The specialization/reimplementation for the metric types Type1 and Type2 follows straightforwardly (here the inline qualifier is needed to avoid a “multiple definitions” linker error when dealing with several object files):

MetricType1.inl :

template<>
inline __host__ __device__ Metric<Type1>::Metric()
{
   mType = Type1;
}

template<>
inline __host__ __device__ const char * Metric<Type1>::SpecializedName()
{
   return "Metric1 [reimplemented]";
}

MetricType2.inl :

template<>
inline __host__ __device__ Metric<Type2>::Metric()
{
   mType = Type2;
}

template<>
inline __host__ __device__ const char * Metric<Type2>::SpecializedName()
{
   return "Metric2 [reimplemented]";
}

The basic magic happens in the definitions of the public methods. These are the same for every Metric<T> instance. For [virtual]-like behavior, loop unrolling gives access to the (private) specialized methods depending on the type stored in mType (see Metric<T>::getName()): the template is incremented from Metric<T> to Metric<T+1> until T matches mType (template arguments must be compile-time constants - NO variables!!).

MetricBase.inl :

// BASE: Standard constructor with no arguments
template<MetricType T>
__host__ __device__ Metric<T>::Metric()
{
   mType = T;
}

// BASE: Typeset constructor
template<MetricType T>
__host__ Metric<T>::Metric( MetricType Type )
{
   mType = Type;
}

// BASE: Set metric type
template<MetricType T>
__host__ void Metric<T>::setType( MetricType Type )
{
   mType = Type;
}

// BASE: Get metric type
template<MetricType T>
__host__ __device__ MetricType Metric<T>::getType()
{
   return mType;
}

/* ----------------------------------------------------- *
 * BASE : Get metric name
 * ----------------------------------------------------- */
template<MetricType T>
__host__ __device__ const char * Metric<T>::getName()
{
   #if defined(__CUDACC__)

      // DEVICE : We will only use specialized class instances
      // Metric<MetricType T> [with T > 0] in device code !!
      return SpecializedName();

   #else // !defined(__CUDACC__)

      // HOST : In host code we will always use a base instance
      // Metric<Base> [T = 0] and use recursive loop unrolling over the
      // template class argument T :
      if( T < mType )
      {
         // cast the "this"-pointer to a template of argument T+1
         // and do a recursive call of getName() ...
         Metric<T+1> * NextMetric = (Metric<T+1> *)this;
         return NextMetric->getName();
      }
      else
      {
         // ... until T is the desired type (T == mType).
         // now return the specialization !
         return SpecializedName();
      }

   #endif
}

// The compiler assumes mType to have an arbitrary value in the type
// range of MetricType (unsigned int). To prevent the compiler from
// infinite recursion ( Metric<T>::getName() has to be compiled
// separately for every T ), we have to stop it at the end of the valid
// metric types ( NUM_TYPES ).
template<>
inline __host__ const char * Metric< NUM_TYPES >::getName()
{
   // if this specialization was reached something went wrong
   return "error in unrolling metric types";
}

// BASE: SpecializedName() default implementation
template<MetricType T>
__host__ __device__ const char * Metric<T>::SpecializedName()
{
   // this function must be reimplemented in specialized templates
   return "Base [to be reimplemented]";
}

Assuming we implemented a kernel

template<MetricType T> __global__ void Kernel( Metric<T> * MetricT )

we can call the right version by applying the same strategy used in Metric<T>::getName().

device.cu

#include <cstdlib>      // exit(), EXIT_FAILURE
#include "Metric.h"

// forward declaration
template<MetricType T> void KernelUnroller( Metric<T> * MetricT );

// this wrapper calls the kernel from host code
extern "C" void KernelWrapper( Metric<Base> * MetricBase )
{
   KernelUnroller( MetricBase );
}

template<MetricType T>
void KernelUnroller( Metric<T> * MetricT )
{
   if ( T < MetricT->getType() )
   {
      // Template does not have the right type ...
      // increment the type and recurse !!
      Metric<T+1> * nextMetric = (Metric<T+1> *) MetricT;
      KernelUnroller( nextMetric );
   }
   else if( T == MetricT->getType() )
   {
      // Template has the desired type ... call the template kernel
      // ( TODO: MetricT has to be converted to a device pointer )
      Kernel<<<GRID_SIZE,BLOCK_SIZE>>>( MetricT );
   }
   else
   {
      // For safety ... if the type exceeds the defined types, exit(fatal) !
      exit( EXIT_FAILURE );
   }
}

// stop the compile-time recursion at NUM_TYPES (otherwise KernelUnroller<T+1>
// would be instantiated without bound), analogous to the
// Metric<NUM_TYPES>::getName() specialization above
template<>
void KernelUnroller( Metric<NUM_TYPES> * MetricT )
{
   exit( EXIT_FAILURE );
}

This means major modifications to my project, but I think I’ll reimplement everything using the shown strategy. It has been proven to work in host code and tested with a fixed type T on the device. The time needed to compile the problematic kernel was reduced from 3 minutes with the if-branches to 1:30 with the template-unrolling method.

If you have suggestions for improvement or questions, feel free to ask.


yours, spy !

[quote name=‘spy’ date=‘30 May 2011 - 01:57 PM’ timestamp=‘1306760260’ post=‘1244308’]

Hi guys!

Thanks for your help, but I think I found a solution to my problem using template classes.

…

[/quote]

Great idea! Very valuable and extendable :>