One thing I have recently learned (and I’ve been doing this for 5+ years) is that if you’re doing an operation on a float and it doesn’t rely on neighbor values, you almost always get faster times by operating on multiple values per thread, for a slight increase in register usage. If your register usage isn’t high already (which is definitely the case here), recast the input pointer to float4* and compute the four values at once, sometimes with the overloaded math functions from cutil_math.h, which you should replicate into your own structure and tweak as needed (for example as below, or with complex types); don’t rely on nVidia to keep supporting it, ’cause, well, it’s already not supported by them. You may also need to adjust the launch call and the endpoint thread checks; my assumption here is that you can guarantee a width that is a multiple of four, or the endpoint checking gets a little ugly.
Note the device function in this case isn’t ‘overloaded’; it just makes the code below a little cleaner. You could certainly forgo that and hard-code it right in the kernel function. You could also merge several of the kernel lines together, but in the end the compiler will essentially do this for you anyway.
I still don’t understand why this helps as much as it does. I understand amortizing the function overhead and index management into fewer thread blocks and fetch calls, but I assumed four 32-bit fetches were the same as one 128-bit fetch. Profiling suggests that isn’t the case.
This has worked well for me, but only on ‘simple’ arithmetic (muls/adds). The additional ops of transcendental functions (like you have here) may not help you as much, but if you’re I/O limited (which you most likely are), it’s worth a try.
Tried to highlight changes in bold. Haven’t checked for accuracy; caveat lector.
inline __device__ float4 tanhf4(float4 input) {
    return make_float4(tanhf(input.x), tanhf(input.y), tanhf(input.z), tanhf(input.w));
}
Kernel:
__global__ void DevFinalizeFinalLayerOutput(float* c, unsigned int width, unsigned int height)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float4* newptr = (float4*)c;                 // reinterpret the buffer as packed float4s
    while ((unsigned int)(tid << 2) < width * height)  // each float4 index covers four floats
    {
        float4 oldval = newptr[tid];
        newptr[tid] = tanhf4(oldval);
        tid += blockDim.x * gridDim.x;           // grid-stride loop
    }
}
Host:
int FinalizeFinalLayerOutput(float* dev, unsigned int width, unsigned int height)
{
    const unsigned int noOfThreads = 128;
    const unsigned int maxBlocks = 128;
    unsigned int quarterElems = (width * height) >> 2;  // one thread handles four floats
    unsigned int noOfBlocks = (quarterElems + noOfThreads - 1) / noOfThreads;
    if (noOfBlocks > maxBlocks)
        noOfBlocks = maxBlocks;
    DevFinalizeFinalLayerOutput<<<noOfBlocks, noOfThreads>>>(dev, width, height);
    return 0;
}