I have a requirement to do a 1024-value 1D convolution, so I started from the texture-based convolution sample in the SDK.
The template-based loop unrolling cannot do a 1024-way unroll: the compiler throws "error: excessive recursion at instantiation of function" when instantiating convolutionRow<1024>.
It seems it can only unroll up to around 200 (found by trial and error).
Is there any way to do it, or any other suggestions?
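For reference, the unroller in the SDK sample is a recursive template along these lines (a sketch from memory, not the exact SDK source; texSrc, c_Kernel and KERNEL_RADIUS are declared roughly as in the sample):

    // Each loop iteration becomes one template instantiation, so
    // convolutionRow<1024> needs over a thousand nested instantiations
    // and runs into the compiler's template recursion depth limit.
    #define KERNEL_RADIUS 1024

    texture<float, 2, cudaReadModeElementType> texSrc;    // source image
    __constant__ float c_Kernel[2 * KERNEL_RADIUS + 1];   // filter taps

    template <int i>
    __device__ float convolutionRow(float x, float y)
    {
        return tex2D(texSrc, x + (float)(KERNEL_RADIUS - i), y) * c_Kernel[i]
             + convolutionRow<i - 1>(x, y);   // recurse for the next tap
    }

    template <>                               // recursion terminator
    __device__ float convolutionRow<-1>(float, float)
    {
        return 0.0f;
    }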
Well, yes, the compiler has a limit on template recursion depth.
One solution may be to split the unroll into two nested unrollers:
// assuming C++0x lambdas were available in CUDA...
UnrollerP<8>::step([](int i) {
    UnrollerP<128>::step([i](int j) {   // inner lambda must capture i
        int index = i * 128 + j;
        // great stuff with index
    });
});
There is another problem: kernel size may not exceed 2 KB. With this massive amount of unrolling you might even get a slowdown if the unrolled loop does not fit into the instruction cache.
Normally you don't have to unroll the whole loop anyway; that's why there is the partial Unroller.
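To make the partial Unroller idea concrete: CUDA has no lambdas yet, so the same pattern can be written with a functor. UnrollerP is not something from the SDK; the following is only a sketch of what such a helper could look like (the names Unroll, Body, partialUnrolledLoop and the chunk size are made up for illustration):

    // Unroll<N> expands N calls to the functor at compile time; a runtime
    // outer loop then walks the full range in chunks of N, so only a small,
    // fixed amount of code is generated no matter how long the loop is.
    template <int N>
    struct Unroll
    {
        template <typename F>
        __device__ static void step(F f, int base)
        {
            Unroll<N - 1>::step(f, base);   // emit the first N-1 iterations
            f(base + N - 1);                // emit iteration base + N - 1
        }
    };

    template <>
    struct Unroll<0>
    {
        template <typename F>
        __device__ static void step(F, int) {}   // recursion terminator
    };

    // Functor standing in for the lambda body ("great stuff with index").
    struct Body
    {
        __device__ void operator()(int index) const
        {
            // ... per-element work with index goes here ...
        }
    };

    // Process `total` iterations, unrolling CHUNK of them at a time.
    // Assumes total is a multiple of CHUNK.
    template <int CHUNK>
    __device__ void partialUnrolledLoop(int total)
    {
        Body body;
        for (int base = 0; base < total; base += CHUNK)   // runtime outer loop
            Unroll<CHUNK>::step(body, base);              // compile-time inner unroll
    }

Calling partialUnrolledLoop<8>(1024) unrolls 8 iterations at a time, so the template recursion depth is only 8 and the generated code stays small.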
If you look at the convolution code in the SDK sample, there is only one line of actual work and the rest is just the loop, whose overhead alone roughly doubles the kernel's total instruction count.

for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
    sum += tex2D(texSrc, x + (float)k, y) * c_Kernel[KERNEL_RADIUS - k];

This is the loop to be unrolled (KERNEL_RADIUS is 1024).
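For completeness, a lower-effort alternative to the template machinery is to keep the plain loop and ask nvcc to unroll it by a fixed factor with #pragma unroll; whether and how far the compiler honors the hint is up to nvcc, and the factor 8 and the helper name convolveRow below are arbitrary choices, not taken from the SDK sample:

    #define KERNEL_RADIUS 1024

    texture<float, 2, cudaReadModeElementType> texSrc;    // source image texture
    __constant__ float c_Kernel[2 * KERNEL_RADIUS + 1];   // convolution kernel taps

    __device__ float convolveRow(float x, float y)
    {
        float sum = 0.0f;
        // Unroll by a fixed factor instead of fully: this keeps code size
        // bounded while still removing most of the loop overhead.
        #pragma unroll 8
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
            sum += tex2D(texSrc, x + (float)k, y) * c_Kernel[KERNEL_RADIUS - k];
        return sum;
    }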