__shared__ data as an argument of a template?

Hello!

I would like to initialize a:

  __shared__ int vector[VECTOR_SIZE];

declared in a __global__ function, and set it with:

  setVector<VECTOR_SIZE-1>( vector );

through a recursive __device__ template function defined like this:

 template<int i> __device__ int setVector( int* sharedVector )
 {
     sharedVector[i] = i+1;
     return setVector<i-1>( sharedVector );
 }

 template<> __device__ int setVector<-1>( int* sharedVector )
 {
     return 0;
 }
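
Just to give the full context, a minimal kernel using this could look like the following (the kernel name, the VECTOR_SIZE value and the copy-out at the end are only for the example, and it assumes one block of VECTOR_SIZE threads):

 #define VECTOR_SIZE 10   // example value

 __global__ void fillKernel( int* out )
 {
     __shared__ int vector[VECTOR_SIZE];

     if (threadIdx.x == 0)
         setVector<VECTOR_SIZE-1>( vector );   // fills vector[0..VECTOR_SIZE-1] with 1..VECTOR_SIZE
     __syncthreads();

     out[threadIdx.x] = vector[threadIdx.x];   // copy out just to check the result
 }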

I have read several “old” topics about problems like this, but I didn’t find a solution (even with volatile).

If someone can help me solve it, that would be very nice :rolleyes:

Why not just use a loop and skip the fancy templates?
Efficiency will be the same, since the CUDA compiler will certainly unroll the loop, given that VECTOR_SIZE is known at compile time.
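
For example, something along these lines should give the same result (untested sketch; this goes inside the __global__ function, and the guard on threadIdx.x is just one way to avoid every thread writing the same values):

 __shared__ int vector[VECTOR_SIZE];

 if (threadIdx.x == 0) {
     #pragma unroll
     for (int i = 0; i < VECTOR_SIZE; ++i)
         vector[i] = i + 1;   // same effect as setVector<VECTOR_SIZE-1>( vector )
 }
 __syncthreads();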

BTW, there’s nothing wrong with using templates (I use metaprogramming in CUDA for a couple of applications), but here it just feels like you’re complicating a simple initialization unnecessarily.

Not sure, since I’m a newbie, but I think your problem happens because you are using recursion and maybe the compiler cannot handle it, since it tries to inline every call to a __device__ function. You can try to use __noinline__ to check whether this is true or not.
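
For example (untested; just adding the qualifier to your original functions):

 template<int i> __device__ __noinline__ int setVector( int* sharedVector )
 {
     sharedVector[i] = i+1;
     return setVector<i-1>( sharedVector );
 }

 template<> __device__ __noinline__ int setVector<-1>( int* sharedVector )
 {
     return 0;
 }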

Hi SPWorley and thank you for your reply ;)

My real problem is more complicated; I have just simplified the situation down to something that illustrates my question. So I know that for this kind of situation I could use a simple loop, but I would really like to pass my shared data as an argument to another function, at the very least for a better understanding of CUDA, and ideally to reduce processing time. Thank you for your help!

Hi Noel Lopes, my recursive function works very well with other kinds of arguments, as long as the recursion depth stays low (VECTOR_SIZE of 10, for example).

And when I test a simple function that just sets the shared data, like:

__device__ void setToi( int *sharedVector, int index, int a )
{
    sharedVector[index] = a;
}

it works very well, so the problem could be the recursion, but how do I resolve it?

Can you try this (I don’t have CUDA installed on this computer to test it myself):

template<int i> __device__ int setVector( int* sharedVector )
{
    if (i == 0) {
        return 0;
    }
    if (i > 0) {
        sharedVector[i] = i+1;
        return setVector<i-1>( sharedVector );
    }
}

Let me know if it works.

I did what you asked in a correct, already-tested main.cu, and when I ran the compilation (I am under Windows with Visual Studio 2005) I got this:

1>------ Build started: Project: Application_exe, Configuration: Release Win32 ------
1>Compiling…
1>main.cu

and nothing more; the compilation keeps running but never ends.

I tried this:

__shared__ int m[500];

setVector<500>( m );
setVector<15>( m );
setVector<1>( m );

No results :wacko:

Actually the correct code should be (sorry about the initial code):

template<int i> __device__ int setVector( int* sharedVector )
{
    if (i < 0) {
        return 0;
    }
    if (i >= 0) {
        sharedVector[i] = i+1;
        return setVector<i-1>( sharedVector );
    }
}

but I don’t think this will solve it (you can try it, however). Another thing you can do is this:

template<int i> __device__ int setVector( int* sharedVector )
{
    if (i < 0) {
        return 0;
    }
    if (i >= 0) {
        sharedVector[i] = i+1;
        return setVectorAux<i-1>( sharedVector );
    }
}

template<int i> __device__ int setVectorAux( int* sharedVector )
{
    if (i < 0) {
        return 0;
    }
    if (i >= 0) {
        sharedVector[i] = i+1;
        return setVector<i-1>( sharedVector );
    }
}

And that’s it (I’m out of ideas). Hope this works.

Hi,

I tested everything you suggested, but it doesn’t work very well: sometimes the compiler tells me that not every path returns a value (since your return is inside an if), and other times the compiler simply never finishes its work, so…

However, I did get something that works:

template<int i> __device__ uchar setSmem( uchar* sMat, int tx, int ty )
{
    return sMat[i] = tx + ty + setSmem<i-1>( sMat, tx, ty );
}

template<> __device__ uchar setSmem<-1>( uchar* sMat, int tx, int ty )
{
    return 0;
}

And I call it with this:

#define matsize 50

__shared__ uchar sharedMat[matsize];
__syncthreads();
RI[ index ] = setSmem<matsize-1>( sharedMat, tx, ty );

That works perfectly as long as matsize is less than 779.

In fact, when matsize reaches 779, the compiler tells me:
1>nvopencc ERROR: C:\Program Files\CUDA\bin/…/open64/lib//be.exe returned non-zero status -1073741819

and even though my code can still be executed, the results are those I got with my last good value of matsize (still in memory).

Does anybody have a theoretical explanation of this limitation?

Thank you for everything, Noel Lopes!