Device function pointers: Is it possible to use them in a useful way?

I noticed that it seems to be impossible to use device function pointers in a way that makes sense for C++ users. If I don’t make the function declarations global, but put them into a class or struct, I get runtime errors (example below). Is something like the code below possible at all with the CUDA compiler?
If the example below is really impossible, what is the current benefit of device function pointers, besides making the code longer?

class layout:

typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu)    (float, float, float);

class DistFunction {
public:
	// Assignment of the function pointers in device space. Gets called by the ctor!
	void DeviceAssign();
public:
	DistFunction(char *, pDistanceFu, pDecayFu, pDecayFu);
	
	char *name;
	pDistanceFu distance;
	pDecayFu rad_decay;
	pDecayFu lrate_decay;
};

class implementation:

DistFunction::DistFunction(char *cstr, pDistanceFu dist, pDecayFu rad_dec, pDecayFu lrate_dec) : 
	name(cstr), 
	distance(dist), 
	rad_decay(rad_dec), 
	lrate_decay(lrate_dec) 
{
	DeviceAssign();
}

void DistFunction::DeviceAssign() {
	pDistanceFu hDistance; 
	pDecayFu hRadDecay; 
	pDecayFu hLRateDecay; 
	
	cudaMemcpyFromSymbol(&hDistance, distance, sizeof(pDistanceFu) );
	cudaMemcpyFromSymbol(&hRadDecay, rad_decay, sizeof(pDecayFu) );
	cudaMemcpyFromSymbol(&hLRateDecay, lrate_decay, sizeof(pDecayFu) );
	
	distance = hDistance;
	rad_decay = hRadDecay;
	lrate_decay = hLRateDecay;
}

class initialization:

#ifdef __CUDACC__
	__host__ __device__
#endif
inline static float
fcn_rad_decay (float sigma0, float T, float lambda) {
	return std::floor(sigma0*exp(-T/lambda) + 0.5f);
}

// and so on ..

DistFunction DistFunctions::fcn_bubble = DistFunction(
	(char*)"bubble",
	fcn_bubble_nhood,
	fcn_rad_decay,
	fcn_lrate_decay);

If you want to provide a short, complete example that I can compile as-is, I’ll take a look.

To use cudaMemcpyFromSymbol as in this example:

cudaMemcpyFromSymbol(&hDistance, distance, sizeof(pDistanceFu) );

distance had better be a proper device symbol, and the name of a function decorated with __device__ does not qualify for that. The relevant parameter in your snippet seems to be fcn_bubble_nhood, but I can’t tell what that is exactly.

This SO answer may be of interest:

http://stackoverflow.com/questions/31057870/passing-host-function-as-a-function-pointer-in-global-or-device-function/31058123#31058123

since it links to a number of SO questions and answers about using function pointers. In particular, the last one linked in the comments:

http://stackoverflow.com/questions/34879789/thrust-transform-throws-error-bulk-kernel-by-value-an-illegal-memory-access-w

offers two relatively concise methods for grabbing a device function address for use as a function pointer. Even though it has thrust in view, the two suggested methods for “capture” of a device function address are equally applicable to ordinary CUDA C/C++ codes.
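For reference, the first of those capture methods amounts to initializing a __device__ function-pointer variable at compile time, then copying its contents back to the host with cudaMemcpyFromSymbol. A minimal sketch (function and variable names here are illustrative, not taken from the linked answer):

```cuda
typedef float (*pDecayFu)(float, float, float);

__host__ __device__ float my_decay(float s0, float T, float lambda) {
    return s0 * expf(-T / lambda);
}

// A __device__ variable holds the device-side address of the function;
// device code can take that address at initialization, host code cannot.
__device__ pDecayFu d_my_decay = my_decay;

__global__ void apply(float *out, pDecayFu f, float T, float lambda) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = f(out[i], T, lambda);
}

int main() {
    pDecayFu h_my_decay;
    // Copy the device-side function address into a host variable...
    cudaMemcpyFromSymbol(&h_my_decay, d_my_decay, sizeof(pDecayFu));
    // ...so it can be passed by value into a kernel, where it is valid.
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    apply<<<1, 256>>>(d_out, h_my_decay, 1.f, 100.f);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

The key point is that d_my_decay is a proper device symbol, whereas the bare name my_decay is not.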

I can’t tell from your question, but by using device function pointers I assume you’re trying to make it easier to change the behavior of your kernel at runtime or, maybe, at link time?

My experiments with pointers to functions in CUDA haven’t been very encouraging. A simple switch seems to generate better performance. There was a discussion here a long time ago.

Does anyone have a good example of pointers to CUDA device functions resulting in a performance or complexity “win”?

Another alternative is linking device functions to your kernel skeleton: CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing.

It’s only semi-related to your question, but have you looked at C++11 lambda support in CUDA 7 and 8?

CUDA 8 Features Revealed and CUDA 7 and Beyond.

Lambdas might partially achieve your goal and possibly simplify your code.
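For illustration, with the experimental extended lambda support (nvcc --expt-extended-lambda, experimental in CUDA 7.5 and featured in CUDA 8) a __device__ lambda can stand in for a hand-written functor in a thrust call. A rough sketch, using the decay formula from your code:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>

int main() {
    thrust::device_vector<float> in(4096, 0.5f), out(4096);
    float fCycle = 1.f, fCycles = 100.f;
    // The extended __device__ lambda captures the decay parameters by
    // value; no function pointer or separate functor struct is needed.
    thrust::transform(in.begin(), in.end(), out.begin(),
        [=] __device__ (float lrate) {
            return lrate * expf(-fCycle / fCycles);
        });
    return 0;
}
```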

A function pointer breaks the ability of the CUDA compiler to optimize across call boundaries. Such constructs cannot be inlined, obviously. Therefore other approaches that allow the compiler to aggressively inline will likely produce faster code in some cases (perhaps many/most cases).

It’s hard to imagine how function pointers could ever be a performance win, regardless of architecture. But perhaps I need to be educated.

BTW OP has a cross-posting here which has slightly different code and other comments:

http://stackoverflow.com/questions/36953612/assignement-of-device-function-pointers-in-cuda-from-host-function-pointers

(such as the point about using CUDA runtime API in constructors for objects that may be at global scope? – perhaps that is the crux of the issue)

Yup, I understand all that but I was seeing some performance-killing spills – beyond what was expected.

Ah, if performance matters it might be simpler for the OP to just put the various funcs in a switch and pass flags to the kernel:

enum fcn_distance_type { bubble_nhood, gaussian_nhood, cutgaussian_nhood, mexican_nhood, epanechicov_nhood };
enum fcn_decay_type    { rad_decay, lrate_decay };
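A minimal sketch of that switch dispatch, reusing two decay functions from the code above (the dispatcher name is made up):

```cpp
#include <cmath>

enum fcn_decay_type { rad_decay, lrate_decay };

#ifdef __CUDACC__
__host__ __device__
#endif
inline static float fcn_rad_decay(float sigma0, float T, float lambda) {
    return std::floor(sigma0 * std::exp(-T / lambda) + 0.5f);
}

#ifdef __CUDACC__
__host__ __device__
#endif
inline static float fcn_lrate_decay(float sigma0, float T, float lambda) {
    return sigma0 * std::exp(-T / lambda);
}

// The flag selects the function; every branch is visible to the compiler,
// so each call can be inlined (unlike an opaque function-pointer call).
#ifdef __CUDACC__
__host__ __device__
#endif
inline static float decay_dispatch(fcn_decay_type t,
                                   float sigma0, float T, float lambda) {
    switch (t) {
        case rad_decay:   return fcn_rad_decay(sigma0, T, lambda);
        case lrate_decay: return fcn_lrate_decay(sigma0, T, lambda);
    }
    return 0.f;
}
```

A kernel then just takes a fcn_decay_type argument instead of a function pointer.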

Wouldn’t a non-inlined “opaque” function call pretty much imply that all registers currently in use by the thread would have to be effectively “spilled”?

What other options would the compiler have?

If all the device functions were to operate on a known fixed-size, stationary, absolutely-known-location block of registers then calling them at runtime should be a push/pop of the PC and nothing more.

If it were hand-coded SASS I could probably get it to work.

The switch dispatch works fine though.

I’m undoubtedly asking too much from the compiler and language. Perhaps a CUDA extension with syntax similar to this could enforce this behavior. Some sort of file scoped name of a region of the register file that all device functions can operate on.

Actually, I don’t care about performance, as setting the function pointer takes place in the host code and my library is already hundreds of times faster than a comparable CPU implementation.

distance had better be a proper device symbol, and the name of a function decorated with __device__ does not qualify for that. The relevant parameter in your snippet seems to be fcn_bubble_nhood, but I can’t tell what that is exactly.

This seems to be exactly what I need. I have a structure with function pointers. I cannot use function references to create CUDA device function pointers in, e.g., the ctor of this class. I will post a compiling demo here soon…

I think it is a compiler issue only … Is there any chance NVIDIA can solve this?

Here guys: https://github.com/dgrat/cudaFPdemo

This is a minimal demo, which breaks at run time; maybe someone smarter than me can make a better demo of function pointers :) And btw, I don’t care about performance, as my library is already hundreds of times faster and can process a 4K picture on the ms scale.

A few points that have been stated already above or in various links:

  1. The address of a device function, or a device symbol, cannot be taken in host code. Therefore the address of a device function must first be transferred to a device symbol variable, and then the contents of that device symbol variable can be transferred to a host copy/version of that variable. Note that this seems to also preclude the passing of a device symbol by-value as a function argument.

  2. CUDA API routines such as cudaMemcpyFromSymbol cannot be safely called in constructors (or destructors) of objects at global scope. This is due to CUDA lazy initialization.

With the above provisos in mind, one possible approach to allow for your code to compile and run without error is as follows:

$ cat Functions.h
/*
#-------------------------------------------------------------------------------
# Copyright (c) 2012 Daniel <dgrat> Frenzel.
# All rights reserved. This program and the accompanying materials
# are made available under the terms of the GNU Lesser Public License v2.1
# which accompanies this distribution, and is available at
# http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
#
# Contributors:
#     Daniel <dgrat> Frenzel - initial API and implementation
#-------------------------------------------------------------------------------
*/

#ifndef TRANSFERFUNCTIONS_H_
#define TRANSFERFUNCTIONS_H_

#ifndef SWIG
#include <cmath>
#include <stdio.h>
#include <string.h>
#endif

#define PI    3.14159265358979323846f



typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);


//////////////////////////////////////////////////////////////////////////////////////////////
/*
 * Distance functions for self organizing maps
 */
//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_bubble_nhood (float dist, float sigmaT) {
        if(dist < sigmaT)
                return 1.f;
        else return 0.f;
}

//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_gaussian_nhood (float dist, float sigmaT) {
        return exp(-pow(dist, 2.f)/(2.f*pow(sigmaT, 2.f)));
}
#ifdef __CUDACC__
       __device__
#endif
float (*d_fcn_gaussian_nhood)(float dist, float sigmaT) = fcn_gaussian_nhood;
//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_cutgaussian_nhood (float dist, float sigmaT) {
        if(dist < sigmaT)
                return exp(-pow(dist, 2.f)/(2.f*pow(sigmaT, 2.f)));
        else return 0.f;
}


//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_mexican_nhood (float dist, float sigmaT) {
        return  2.f/(sqrt(3.f * sigmaT) * pow(PI, 0.25f) ) *
                (1.f-pow(dist, 2.f) / pow(sigmaT, 2.f) ) *
                fcn_gaussian_nhood(dist, sigmaT);
}


//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_epanechicov_nhood (float dist, float sigmaT) {
        float fVal = 1 - pow(dist/sigmaT, 2.f);
        if(fVal > 0)
                return fVal;
        else return 0.f;
}

//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_rad_decay (float sigma0, float T, float lambda) {
        return std::floor(sigma0*exp(-T/lambda) + 0.5f);
}
#ifdef __CUDACC__
       __device__
#endif
float (*d_fcn_rad_decay)(float sigma0, float T, float lambda) = fcn_rad_decay;

//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_lrate_decay (float sigma0, float T, float lambda) {
        return sigma0*exp(-T/lambda);
}
#ifdef __CUDACC__
       __device__
#endif
float (*d_fcn_lrate_decay)(float sigma0, float T, float lambda) = fcn_lrate_decay;

/**
 * @class DistFunction
 * @brief Represents a neighborhood and decay function.
 * Consists of a distance and a decay function.
 * Normally just the neighborhood function is free to be changed.
 */
class DistFunction;
typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);
typedef float (DistFunction::*pmDistanceFu) (float, float);
typedef float (DistFunction::*pmDecayFu) (float, float, float);


class DistFunction {
private:
        pDistanceFu hDist;
        pDecayFu hRadDecay;
        pDecayFu hLRateDecay;

public:
        DistFunction(char *, pDistanceFu, pDecayFu, pDecayFu);
        void Assign();

        char *name;
        pDistanceFu distance;
        pDecayFu rad_decay;
        pDecayFu lrate_decay;
};

void test();

#endif /* TRANSFERFUNCTIONS_H_ */
$ cat Functions.cu
//#include <iostream>
#include "Functions.h"
#include <iostream>
#include <thrust/extrema.h>
#include <thrust/distance.h>
#include <thrust/device_vector.h>


DistFunction::DistFunction(char *cstr, pDistanceFu dist, pDecayFu rad, pDecayFu lrate) : name(cstr), distance(dist), rad_decay(rad), lrate_decay(lrate) {
}

void DistFunction::Assign() {
        pDistanceFu hDist;
        pDecayFu hRadDecay;
        pDecayFu hLRateDecay;

        cudaMemcpyFromSymbol(&hDist, distance, sizeof(pDistanceFu) );
        cudaMemcpyFromSymbol(&hRadDecay, rad_decay, sizeof(pDecayFu) );
        cudaMemcpyFromSymbol(&hLRateDecay, lrate_decay, sizeof(pDecayFu) );

        distance = hDist;
        rad_decay = hRadDecay;
        lrate_decay = hLRateDecay;
}


struct sm20lrate_decay_functor {
        float fCycle;
        float fCycles;
        DistFunction m_pfunc;

        sm20lrate_decay_functor(const DistFunction &pfunc, float cycle, float cycles) : m_pfunc(pfunc), fCycle(cycle), fCycles(cycles) {}

        __host__ __device__
        float operator()(float lrate) {
                return (m_pfunc.lrate_decay)(lrate, fCycle, fCycles);
        }
};

void test() {
        unsigned int iWidth     = 4096;
        thrust::device_vector<float> dvLearningRate(iWidth, 0.f);
        thrust::device_vector<float> dvLRate(iWidth, 0.f);
        float (*h_fcn_gaussian_nhood)(float dist, float sigmaT);
        float (*h_fcn_rad_decay)(float sigma0, float T, float lambda);
        float (*h_fcn_lrate_decay)(float sigma0, float T, float lambda);
        cudaMemcpyFromSymbol(&h_fcn_gaussian_nhood, d_fcn_gaussian_nhood, sizeof(pDistanceFu) );
        cudaMemcpyFromSymbol(&h_fcn_rad_decay, d_fcn_rad_decay, sizeof(pDecayFu) );
        cudaMemcpyFromSymbol(&h_fcn_lrate_decay, d_fcn_lrate_decay, sizeof(pDecayFu) );


DistFunction fcn_gaussian = DistFunction(
        (char*)"gaussian",
        h_fcn_gaussian_nhood,
        h_fcn_rad_decay,
        h_fcn_lrate_decay
);

        thrust::transform( dvLRate.begin(),
                dvLRate.end(),
                dvLearningRate.begin(),
                sm20lrate_decay_functor(fcn_gaussian, 1, 100) );
}

$ nvcc -o f Functions.cu main.cpp -std=c++11
$ cuda-memcheck ./f
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

This is roughly following the first approach outlined here:

http://stackoverflow.com/questions/34879789/thrust-transform-throws-error-bulk-kernel-by-value-an-illegal-memory-access-w

I believe it should be possible to use the second approach (templating) as well, although I haven’t worked through an application of that to your code.

I’m not sure I entirely understand your style/usability complaints, so I can’t comment on whether or not such modifications meet your sense of decorum. I suspect your objections revolve around passing a device symbol as a function argument.

Thanks for the help, your solution 1) is actually what I do atm. Nevertheless, you can already see that this solution is not very nice for me, because my goal was to make my code simpler and more flexible.
If I cannot create device function pointers in a ctor, or use normal function pointers for creating them,
it now looks to me like CUDA device function pointers are entirely useless. Correct me if I am wrong, but it looks to me like one could replace them IN ANY CASE with normal device functions.

Maybe the template approach can work, thanks for the link :)

Perhaps you’ll like this version better, since we can pass device symbols by reference for use in the constructor:

$ cat Functions.h
/*
#-------------------------------------------------------------------------------
# Copyright (c) 2012 Daniel <dgrat> Frenzel.
# All rights reserved. This program and the accompanying materials
# are made available under the terms of the GNU Lesser Public License v2.1
# which accompanies this distribution, and is available at
# http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
#
# Contributors:
#     Daniel <dgrat> Frenzel - initial API and implementation
#-------------------------------------------------------------------------------
*/

#ifndef TRANSFERFUNCTIONS_H_
#define TRANSFERFUNCTIONS_H_

#ifndef SWIG
#include <cmath>
#include <stdio.h>
#include <string.h>
#endif

#define PI    3.14159265358979323846f



typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);


//////////////////////////////////////////////////////////////////////////////////////////////
/*
 * Distance functions for self organizing maps
 */
//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_bubble_nhood (float dist, float sigmaT) {
        if(dist < sigmaT)
                return 1.f;
        else return 0.f;
}

//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_gaussian_nhood (float dist, float sigmaT) {
        return exp(-pow(dist, 2.f)/(2.f*pow(sigmaT, 2.f)));
}
#ifdef __CUDACC__
       __device__
#endif
float (*d_fcn_gaussian_nhood)(float dist, float sigmaT) = fcn_gaussian_nhood;
//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_cutgaussian_nhood (float dist, float sigmaT) {
        if(dist < sigmaT)
                return exp(-pow(dist, 2.f)/(2.f*pow(sigmaT, 2.f)));
        else return 0.f;
}


//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_mexican_nhood (float dist, float sigmaT) {
        return  2.f/(sqrt(3.f * sigmaT) * pow(PI, 0.25f) ) *
                (1.f-pow(dist, 2.f) / pow(sigmaT, 2.f) ) *
                fcn_gaussian_nhood(dist, sigmaT);
}


//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_epanechicov_nhood (float dist, float sigmaT) {
        float fVal = 1 - pow(dist/sigmaT, 2.f);
        if(fVal > 0)
                return fVal;
        else return 0.f;
}

//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_rad_decay (float sigma0, float T, float lambda) {
        return std::floor(sigma0*exp(-T/lambda) + 0.5f);
}
#ifdef __CUDACC__
       __device__
#endif
float (*d_fcn_rad_decay)(float sigma0, float T, float lambda) = fcn_rad_decay;

//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_lrate_decay (float sigma0, float T, float lambda) {
        return sigma0*exp(-T/lambda);
}
#ifdef __CUDACC__
       __device__
#endif
float (*d_fcn_lrate_decay)(float sigma0, float T, float lambda) = fcn_lrate_decay;

/**
 * @class DistFunction
 * @brief Represents a neighborhood and decay function.
 * Consists of a distance and a decay function.
 * Normally just the neighborhood function is free to be changed.
 */
class DistFunction;
typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);
typedef float (DistFunction::*pmDistanceFu) (float, float);
typedef float (DistFunction::*pmDecayFu) (float, float, float);


class DistFunction {
private:
        pDistanceFu hDist;
        pDecayFu hRadDecay;
        pDecayFu hLRateDecay;

public:
        DistFunction(char *, pDistanceFu &, pDecayFu &, pDecayFu &);
        void Assign();

        char *name;
        pDistanceFu distance;
        pDecayFu rad_decay;
        pDecayFu lrate_decay;
};

void test();

#endif /* TRANSFERFUNCTIONS_H_ */
$ cat Functions.cu
//#include <iostream>
#include "Functions.h"
#include <iostream>
#include <thrust/extrema.h>
#include <thrust/distance.h>
#include <thrust/device_vector.h>


DistFunction::DistFunction(char *cstr, pDistanceFu &dist, pDecayFu &rad, pDecayFu &lrate) : name(cstr){
        cudaMemcpyFromSymbol(&distance, dist, sizeof(pDistanceFu) );
        cudaMemcpyFromSymbol(&rad_decay, rad, sizeof(pDecayFu) );
        cudaMemcpyFromSymbol(&lrate_decay, lrate, sizeof(pDecayFu) );
}

void DistFunction::Assign() {
        pDistanceFu hDist;
        pDecayFu hRadDecay;
        pDecayFu hLRateDecay;

        cudaMemcpyFromSymbol(&hDist, distance, sizeof(pDistanceFu) );
        cudaMemcpyFromSymbol(&hRadDecay, rad_decay, sizeof(pDecayFu) );
        cudaMemcpyFromSymbol(&hLRateDecay, lrate_decay, sizeof(pDecayFu) );

        distance = hDist;
        rad_decay = hRadDecay;
        lrate_decay = hLRateDecay;
}
/*
DistFunction fcn_gaussian = DistFunction(
        (char*)"gaussian",
        fcn_gaussian_nhood,
        fcn_rad_decay,
        fcn_lrate_decay
);
*/


struct sm20lrate_decay_functor {
        float fCycle;
        float fCycles;
        DistFunction m_pfunc;

        sm20lrate_decay_functor(const DistFunction &pfunc, float cycle, float cycles) : m_pfunc(pfunc), fCycle(cycle), fCycles(cycles) {}

        __host__ __device__
        float operator()(float lrate) {
                return (m_pfunc.lrate_decay)(lrate, fCycle, fCycles);
        }
};

void test() {
        unsigned int iWidth     = 4096;
        thrust::device_vector<float> dvLearningRate(iWidth, 0.f);
        thrust::device_vector<float> dvLRate(iWidth, 0.f);


DistFunction fcn_gaussian = DistFunction(
        (char*)"gaussian",
        d_fcn_gaussian_nhood,
        d_fcn_rad_decay,
        d_fcn_lrate_decay
);

        thrust::transform( dvLRate.begin(),
                dvLRate.end(),
                dvLearningRate.begin(),
                sm20lrate_decay_functor(fcn_gaussian, 1, 100) );
}

$ nvcc -o f Functions.cu main.cpp -std=c++11
$ cuda-memcheck ./f
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

This still requires the constructor to be run at main scope rather than global scope, and it requires a mechanical addition for each desired function to provide a device variable for storage of the function pointer.

Thank you guys. I learned that everything must be known at compile time. Thus, device function pointers are a lie of NVIDIA to trick us developers :D Check this template magic below. I will try and check whether it works…
Let’s thank the compiler gods that templates are Turing complete and that we don’t use OpenCL (joke, but entirely true)

typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);

template <pDistanceFu Dist, pDecayFu Rad, pDecayFu LRate>
class DistFunction {	
public:
	DistFunction() {}
	DistFunction(const char *cstr) : name(cstr) {};
	
	const char *name;
	
	#ifdef __CUDACC__
		__host__ __device__
	#endif
	static float distance(float a, float b) { return Dist(a,b); };
	#ifdef __CUDACC__
		__host__ __device__
	#endif
	static float rad_decay(float a, float b, float c) { return Rad(a,b,c); };
	#ifdef __CUDACC__
		__host__ __device__
	#endif
	static float lrate_decay(float a, float b, float c) { return LRate(a,b,c); };
};

The trick is that I define my functors as type definitions from a template class.
Everything is known at compile time. I completely delete this function pointer lie.

template <class F>
struct functor {
  float fCycle;
  float fCycles;

  functor(float cycle, float cycles) : fCycle(cycle), fCycles(cycles) {}

  __host__ __device__
  float operator()(float lrate) {
    return F::lrate_decay(lrate, fCycle, fCycles);
  }
};

typedef DistFunction<fcn_gaussian_nhood,fcn_rad_decay,fcn_lrate_decay> gaussian;
void test() {
	functor<gaussian> test(0,1);
}
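For completeness, the templated functor should drop into the earlier thrust::transform call the same way the function-pointer version did. A rough sketch, assuming the DistFunction template and functor struct defined above are in scope:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>

void run_decay() {
    thrust::device_vector<float> dvLRate(4096, 0.f);
    thrust::device_vector<float> dvLearningRate(4096, 0.f);
    // Dispatch is resolved at compile time through the template
    // parameter, so the compiler can inline lrate_decay into the kernel.
    thrust::transform(dvLRate.begin(), dvLRate.end(),
                      dvLearningRate.begin(),
                      functor<gaussian>(1, 100));
}
```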

@Nvidia
Will it ever be implemented properly, or is it impossible?

@dgrat
You can also define once whether the code is compiled with nvcc or gcc/g++, instead of always declaring #ifdef before each function, following this example:

#ifdef __CUDACC__
#define CUDA_CALLABLE_MEMBER __host__ __device__
#else
#define CUDA_CALLABLE_MEMBER
#endif

class Nos{
private:
    int min, max;

public:
    CUDA_CALLABLE_MEMBER Nos(){
        min = 0;
        max = 0;
    }
};

I believe you about function pointers breaking CUDA compiler optimization. I find myself using the block reduction pattern given here quite often, but need to change the warp reduction device sub-function. To that end, I templated the block reduction like so:

template <class T> 
using UnaryWarpReduction = T(*)(T);// e.g. WarpReduceSum

template <class T>
__inline__ __device__ T BlockReduce(T val, UnaryWarpReduction<T> unaryFcn) {
	static __shared__ T shared[32];
	int lane = threadIdx.x % warpSize;
	int wid = threadIdx.x / warpSize;

	val = unaryFcn(val);

	if (lane == 0)
		shared[wid] = val;

	__syncthreads();

	val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;

	if (wid == 0)
		val = unaryFcn(val);

	return val;
}

Which is then used in a kernel like so:

x = BlockReduce<float>(x, &WarpReduceSum);

What would be your recommended approach to replace/swap the device warp reduction sub-functions? Lambdas? Is the CUDA compiler able to optimize this example?