The kernel calculates a square mean with a moving window along each column (the signal can be assumed to be a matrix of size NumSamples (width) x NumSweeps (height)).
extern "C" __global__ void HeapCalc(float2 *signal, float *di_signal, int window_size, int NumSamples, int NumSweeps)
{
    // NumSamples is split into blocks, so that each thread calculates a separate column
    int sample = blockIdx.x * blockDim.x + threadIdx.x;

    // need to get the square mean for the first window_size elements
    float Di = 0;
    for (int sweep = 0; sweep < window_size; sweep++)
    {
        Di += (signal[sweep * NumSamples + sample].x * signal[sweep * NumSamples + sample].x + signal[sweep * NumSamples + sample].y * signal[sweep * NumSamples + sample].y);
    }
    di_signal[sample] = Di;

    // main cycle (just add the new element, remove the oldest one and store the result)
    for (int sweep = window_size; sweep < NumSweeps; sweep++)
    {
        Di += (signal[sweep * NumSamples + sample].x * signal[sweep * NumSamples + sample].x + signal[sweep * NumSamples + sample].y * signal[sweep * NumSamples + sample].y);
        Di -= (signal[(sweep - window_size) * NumSamples + sample].x * signal[(sweep - window_size) * NumSamples + sample].x + signal[(sweep - window_size) * NumSamples + sample].y * signal[(sweep - window_size) * NumSamples + sample].y);
        di_signal[(sweep - window_size + 1) * NumSamples + sample] = Di;
    }
}
Is this kernel optimal, or are there other tricks for calculating window functions?
Only profiling will tell, and only you can do that, but my first reaction is: not even close. If you want to see what a reasonably efficient array column summation kernel looks like (which is roughly what this is), then this might be of interest. The windowing function is obviously a bit different from simple summation, but not so much that you shouldn't be able to see how to adapt the idea to your needs.
You have managed to coalesce the memory reads, which is half of the battle for sure. What is the filter window size? If it is really large, then an in-shared-memory reduction is probably the right way to go; if it is relatively small, then you should be able to keep the complete window data in a shared memory buffer and just "push" new data onto that shared memory window "cache". The SDK includes a high-order finite difference kernel which does roughly the same thing that you can have a look at too. Both approaches would let you compute the final result in a single kernel, without requiring a second pass.
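To make the "window cache" idea concrete, here is a minimal sketch of how it might look for this particular kernel, assuming dynamic shared memory and one cached square per (window slot, thread). The kernel name HeapCalcSmem, the bounds check and the launch configuration in the comment are my own choices, not part of the original code or the SDK example:

extern "C" __global__ void HeapCalcSmem(const float2 *signal, float *di_signal, int window_size, int NumSamples, int NumSweeps)
{
    // blockDim.x * window_size floats of dynamic shared memory, passed at launch:
    // HeapCalcSmem<<<grid, block, block.x * window_size * sizeof(float)>>>(...)
    extern __shared__ float win[];

    int sample = blockIdx.x * blockDim.x + threadIdx.x;
    if (sample >= NumSamples) return;

    float Di = 0.0f;

    // prime the window: read each element once and cache its squared magnitude;
    // the layout win[slot * blockDim.x + threadIdx.x] keeps neighbouring threads
    // in different shared memory banks
    for (int sweep = 0; sweep < window_size; sweep++)
    {
        float2 v = signal[sweep * NumSamples + sample];
        float sq = v.x * v.x + v.y * v.y;
        win[sweep * blockDim.x + threadIdx.x] = sq;
        Di += sq;
    }
    di_signal[sample] = Di;

    // slide the window: the outgoing value comes from the shared memory cache,
    // so each global element is read from gmem exactly once
    for (int sweep = window_size; sweep < NumSweeps; sweep++)
    {
        int slot = sweep % window_size;   // circular buffer position
        float2 v = signal[sweep * NumSamples + sample];
        float sq = v.x * v.x + v.y * v.y;
        Di += sq - win[slot * blockDim.x + threadIdx.x];
        win[slot * blockDim.x + threadIdx.x] = sq;
        di_signal[(sweep - window_size + 1) * NumSamples + sample] = Di;
    }
}

With this per-thread layout no __syncthreads() is needed, because each thread only ever touches its own slots; whether the saved second global read outweighs the shared memory cost is exactly what profiling has to show.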
I've got it :) The reason to split NumSamples into several kernel executions is to have more shared memory per block. But I do not understand the point of using shared memory here. Does shared memory always speed up a kernel, even if the gmem operations are already coalesced? Or is it better to separate the work, like: load to shared mem, compute, load to shared mem, compute, ...? I thought it did not actually matter as long as there are enough warps, since the scheduler will interleave the calculations and gmem operations in the best way.
The idea of using shared memory is to allow each data point to be read from global memory only once. You read into a shared memory buffer in a coalesced fashion, then perform the calculations reading from shared memory rather than global memory.
The compiler proved to be smart enough and I got no difference in execution time, but readability is better, thank you :) Actually I thought I was reading each element from gmem only once. Now I see that two reads of each element take place :) I will try to use shared memory (but I am not sure it will help, since there are only two reads). Then let me ask your question another way: if the gmem operations are coalesced, with no overhead reads, can the use of shared memory still help?
No. Shared memory is really only useful for helping coalesce non-uniform memory access patterns, or where the same memory read gets repeated multiple times. If your code has perfectly coalesced memory access and only needs a given global memory value once, then you won't get any improvement from using shared memory.
float2 fValue = signal[sweep * NumSamples + sample];
float2 fValueI = signal[(sweep - window_size) * NumSamples + sample];
Di += fValue.x + fValue.y; // where the new fValue.x == fValue.x * fValue.x (the old ones...)
Di -= fValueI.x + fValueI.y; // same as above
This might save you FLOPs on the GPU; it is reasonable only if it fits your kernel's logic and the change on the CPU/host side does not result in a further time penalty.
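For what it's worth, a hypothetical host-side counterpart of that suggestion could look like the sketch below: the components are squared in place before the upload, so that fValue.x / fValue.y already hold the squares inside the kernel. The names (UploadSquaredComponents, h_signal, d_signal) are illustrative, not from this thread, and the extra host loop is precisely the penalty mentioned above.

#include <cuda_runtime.h>

void UploadSquaredComponents(float2 *h_signal, float2 *d_signal, int NumSamples, int NumSweeps)
{
    size_t n = (size_t)NumSamples * NumSweeps;
    for (size_t i = 0; i < n; ++i)
    {
        h_signal[i].x *= h_signal[i].x;   // fValue.x now holds Re^2
        h_signal[i].y *= h_signal[i].y;   // fValue.y now holds Im^2
    }
    cudaMemcpy(d_signal, h_signal, n * sizeof(float2), cudaMemcpyHostToDevice);
}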
I have an overlap of "second order" only, that is: I read each value twice, and only from one thread (once when adding a new element to the sum and once when removing the old one).
I am thinking about using shared mem here, but I am puzzled how to use it in this case and whether it will reduce the execution time.
My card has 16384 bytes of smem available per block, so (if I am correct) I have 16384 / #blocks (8192 / 512 = 16) = 1024 bytes per block. Is that so?
Probably having smem of a size equal to the window size is enough.
What improvement, in percent, can I expect here if I eliminate the second read? What are your expectations?
What does "Local Store" mean in the profiler? There is not enough info in the Profiler Help. In my case I have about 12% of local stores. I thought "Local Store" happens when there are not enough registers, but it would be strange for this kernel to run out of registers.
Unfortunately I cannot use Signal.x + Signal.y as the norm, because I am calculating amplitudes of a complex signal.
Not exactly. The shared memory limit is per multiprocessor, so it depends on how many blocks are running concurrently. With 512 threads per block, you will only get 1 active block per multiprocessor (there is a limit of 768 active threads on your hardware), so you could have all 16 kb per block if you wanted. If you launch 256 threads per block, you could have 3 active blocks (768 threads total), in which case you would want to limit your shared memory to 16384 / 3 bytes per block.
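If it helps, those limits can also be read from the device rather than hard-coded; here is a small sketch (mine, not from the thread) that prints the relevant properties for device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    std::printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("registers per block:     %d\n", prop.regsPerBlock);
    std::printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    std::printf("multiprocessor count:    %d\n", prop.multiProcessorCount);
    return 0;
}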
A strange thing happened when I compared the results:
NumSamples = 8192
NumSweeps = 1024
single run:
0.012 (no smem) vs 0.013 (smem)
100 runs (without memcpy, just a "for" loop with the kernel call):
0.37 vs 0.29
So which one is faster?
So, as I understand it, I have 16384 / (32 threads * 16 (window size) * 4 bytes (float)) = 8 active blocks in the new version. It looks like a small number, but it is probably enough :)
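Just to spell out where that figure would come from: if the new version declares its window cache statically, say with 32 threads per block and a compile-time window of 16 (my guess at the layout, not something posted here), the 2048-byte footprint and the 16384 / 2048 = 8 follow directly from the declaration:

#define THREADS_PER_BLOCK 32
#define WINDOW_SIZE       16

// A guess at the declaration behind those numbers: one cached square per
// (window slot, thread).
__global__ void WindowCacheFootprint(float *out)
{
    // sizeof(win) = 16 * 32 * 4 bytes = 2048 bytes of shared memory per block, so
    // shared memory alone allows 16384 / 2048 = 8 concurrently active blocks per
    // multiprocessor (other limits, such as active threads, still apply)
    __shared__ float win[WINDOW_SIZE][THREADS_PER_BLOCK];
    win[0][threadIdx.x] = (float)threadIdx.x;
    out[threadIdx.x] = win[0][threadIdx.x];
}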