Why not just use a loop and skip the fancy templates?
Efficiency should be the same: since VECTOR_SIZE is known at compile time, the CUDA compiler will unroll the loop anyway.
BTW there’s nothing wrong with using templates (I use metaprogramming in CUDA for a couple applications) but here it just feels like you’re complicating a simple initialization unnecessarily.
Not sure, since I’m a newbie, but I think your problem happens because you are using recursion and the compiler may not be able to handle it, since it tries to inline every call to a __device__ function. You can try __noinline__ to check whether this is the case.
My real problem is more complicated; I just simplified it to something that illustrates my question. I know that in this situation I could use a simple loop, but I would like to pass my shared data as an argument to another function, at least to understand CUDA better and at best to reduce processing time. Thank you for your help!
I did what you asked in a correct main.cu (already tested), and when I ran the compilation (I am on Windows with Visual Studio 2005) I got this:
I tested everything you suggested, but it doesn’t work very well: sometimes the compiler tells me that not all paths return a value (because of the return inside your if), or, worse, the compiler never finishes at all, so…
However, I did get something that works:
template<int i> __device__ uchar setSmem( uchar* sMat, int tx, int ty )
{ return sMat[i] = tx + ty + setSmem<i-1>( sMat, tx, ty ); }

template<> __device__ uchar setSmem<-1>( uchar* sMat, int tx, int ty )
{ return 0; }