Is This Kernel Optimal?

The kernel calculates a square mean over a moving window along each column (the signal can be assumed to be a matrix of size NumSamples (width) x NumSweeps (height)).

extern "C" __global__ void HeapCalc(float2 *signal, float *di_signal, int window_size, int NumSamples, int NumSweeps)
{
    // NumSamples is split into blocks, so each thread calculates a separate column
    int sample = blockIdx.x * blockDim.x + threadIdx.x;

    // need to get the square mean for the first window_size elements
    float Di = 0;
    for (int sweep = 0; sweep < window_size; sweep++)
    {
        Di += (signal[sweep * NumSamples + sample].x * signal[sweep * NumSamples + sample].x + signal[sweep * NumSamples + sample].y * signal[sweep * NumSamples + sample].y);
    }
    di_signal[sample] = Di;

    // main cycle (just add the new element, remove the oldest element and store the result)
    for (int sweep = window_size; sweep < NumSweeps; sweep++)
    {
        Di += (signal[sweep * NumSamples + sample].x * signal[sweep * NumSamples + sample].x + signal[sweep * NumSamples + sample].y * signal[sweep * NumSamples + sample].y);
        Di -= (signal[(sweep - window_size) * NumSamples + sample].x * signal[(sweep - window_size) * NumSamples + sample].x + signal[(sweep - window_size) * NumSamples + sample].y * signal[(sweep - window_size) * NumSamples + sample].y);

        di_signal[(sweep - window_size + 1) * NumSamples + sample] = Di;
    }
}

Is this kernel optimal, or are there other tricks for calculating window functions?

Hi,

First, you should load the data into shared memory and not work directly with gmem.

Second - look at the reduction samples in the SDK; they'll come in handy sooner or later.

eyal

Only profiling will tell, and only you can do that, but my first reaction is not even close. If you want to see what a reasonably efficient array column summation kernel looks like (which is roughly what this is), then this might be of interest. The windowing function is obviously a bit different from simple summation, but not so much that you shouldn't be able to see how to adapt the idea to your needs.
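(The kernel that post links to is not reproduced here; as a rough sketch of what a coalesced column-summation kernel can look like - columnSum, width and height are illustrative names, not code from this thread:)

// Sketch: one thread per column, grid-stride over columns so the launch
// configuration does not depend on the matrix width; reads are coalesced
// because consecutive threads touch consecutive addresses in each row.
__global__ void columnSum(const float *in, float *out, int width, int height)
{
    for (int col = blockIdx.x * blockDim.x + threadIdx.x;
         col < width;
         col += gridDim.x * blockDim.x)
    {
        float sum = 0.0f;
        for (int row = 0; row < height; row++)
            sum += in[row * width + col];
        out[col] = sum;
    }
}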

Here is a profiler screenshot. [attachment=15735:HeapCalc.png]

You have managed to coalesce the memory reads, which is half of the battle for sure. What is the filter window size? If it is really large, then an in-shared-memory reduction is probably the right way to go; if it is relatively small, then you should be able to keep the complete window data in a shared memory buffer and just "push" new data onto the shared memory window "cache". The SDK includes a high-order finite difference kernel which does roughly the same thing that you can have a look at too. Both approaches would let you compute the final result in a single kernel without requiring the second pass.
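(For the large-window case, the in-shared-memory reduction pattern from the SDK samples looks roughly like the sketch below; the 256-thread block size and the names are assumptions for illustration, not code from this thread:)

// Sketch of the classic tree reduction in shared memory: each block collapses
// blockDim.x partial values down to a single sum.
__global__ void reduceBlock(const float *in, float *out, int n)
{
    __shared__ float sdata[256];                    // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1)    // tree reduction in smem
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];                 // one partial sum per block
}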

filter window size = 16
NumSamples = 8192
NumSweeps = 1024
Block Size = 512
#Blocks = 16
execution time (8800GT): 0.012 sec

You just need to break down the NumSamples into chunks (even if it means multiple kernel invocations at the start - that would be easier to code). Make sure you take into account the occupancy issues (with relation to shared memory) when you do so.

eyal

What do you mean by "break down the NumSamples into chunks"? NumSamples is split into 16 blocks (chunks), each block is 512 threads.

That depends on your code :)

But basically, I have a similar concept, where I need to process an unknown size of input data (100 samples to 3000, for example).

If you use a thread block of 256 threads, hence at least one vector of shared memory of 256 floats (that is why I said you also need to check occupancy), then one thread block of 256 threads will load 256 elements into shared memory, and then you can do the sliding window thing (with/without reduction or any other logic you need).

Since you have limited shared memory per block + occupancy considerations, you'll probably want to break the 100 or 3000 samples into chunks of 256 (the size of your thread block) and process 256 elements each time…

If this is what you already did - then great :)

eyal

I've got it :) The reason to split NumSamples into several kernel executions is to have more shared memory per block. But I do not understand the point of using shared memory here. Does shared memory always speed up a kernel, even if the gmem operations are coalesced? Or is it better to separate the work, like: store to shared mem, calculations, store to shared mem, calculations, …? I thought it actually does not matter if there are enough warps, since the scheduler will mix calculations and gmem operations in the best way.

The idea of using shared memory is to allow each data point to be read from global memory only once. You read into a shared memory buffer in a coalesced fashion, then perform the calculations, which read from shared memory rather than global memory.

Well, consider your code snippet:

for (int sweep = window_size; sweep < NumSweeps; sweep++)
{
    Di += (signal[sweep * NumSamples + sample].x * signal[sweep * NumSamples + sample].x + signal[sweep * NumSamples + sample].y * signal[sweep * NumSamples + sample].y);
    Di -= (signal[(sweep - window_size) * NumSamples + sample].x * signal[(sweep - window_size) * NumSamples + sample].x + signal[(sweep - window_size) * NumSamples + sample].y * signal[(sweep - window_size) * NumSamples + sample].y);
    di_signal[(sweep - window_size + 1) * NumSamples + sample] = Di;
}

First, you should probably use temporary float2 registers and not repeat the signal[…] access 4 or 8 times… If the compiler is not smart enough you'll actually access gmem far too many times. Something like this:

for (int sweep = window_size; sweep < NumSweeps; sweep++)
{
    float2 fValue = signal[sweep * NumSamples + sample];
    float2 fValueI = signal[(sweep - window_size) * NumSamples + sample];

    Di += (fValue.x * fValue.x + fValue.y * fValue.y);
    Di -= (fValueI.x * fValueI.x + fValueI.y * fValueI.y);

    di_signal[(sweep - window_size + 1) * NumSamples + sample] = Di;
}

Much more readable as well :)

As for the shared memory use, consider the (sweep - window_size) * NumSamples + sample indexing… If you have 256 threads that access nearby memory cells over and over again, as windowing would suggest, it will probably be faster to read the data once, put it in shared memory and use the data from shared memory over and over again.

Shared memory access like this will probably introduce bank conflicts, but if it saves you accesses to gmem it will be worth the trouble.
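One common way to soften such conflicts, assuming a per-thread window buffer in smem (the names below are only for illustration), is to pad the innermost dimension by one element:

__shared__ float window[THREADS_PER_BLOCK][WINDOW_SIZE + 1];
// With WINDOW_SIZE == 16, an unpadded row stride of 16 floats would make every
// thread of a half-warp hit the same bank when reading window[threadIdx.x][k];
// padding the stride to 17 puts consecutive threads into different banks.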

Hope that makes sense…

eyal

The compiler proved to be smart enough and I got no difference in execution time, but readability is better, thank you :) Actually, I thought I read each element from gmem only once. Now I see that 2 reads per element take place :) I will try to use shared memory (but I am not sure it will help, since there are only 2 reads). Then let me ask the question the other way: if the gmem operations are coalesced, with no redundant reads, can using shared memory help?

No. Shared memory is really only useful for helping coalesce non-uniform memory access patterns or where the same memory read gets repeated multiple times. If your code has perfectly coalesced memory access and only needs a given global memory value once, then you won't get any improvement from using shared memory.

Avidday is of course correct :) When using sliding windows it's likely that your accesses to memory will overlap at some point.

Do you know whether threads in different blocks/kernel invocations read the same data? If so, try to see how to read the data only once; the GPU bandwidth is very high, however it is still faster to read less data (and probably into smem if you can re-use it).

Also - maybe you can store the squared values in the input arrays. In your code you do this:

float2 fValue = signal[sweep * NumSamples + sample];
float2 fValueI = signal[(sweep - window_size) * NumSamples + sample];

Di += (fValue.x * fValue.x + fValue.y * fValue.y);
Di -= (fValueI.x * fValueI.x + fValueI.y * fValueI.y);

Can you change it to this:

float2 fValue = signal[sweep * NumSamples + sample];
float2 fValueI = signal[(sweep - window_size) * NumSamples + sample];

Di += fValue.x + fValue.y;   // where the new fValue.x == fValue.x * fValue.x (the old one...)
Di -= fValueI.x + fValueI.y; // same as above

This might save you FLOPs on the GPU - it is reasonable only if it fits your kernel's logic and the change on the CPU/host side does not result in a further time penalty.
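(For illustration, such a precompute pass could be as simple as the sketch below; the kernel name is made up, and it assumes the squared components may simply overwrite the original signal array:)

// Hypothetical one-off pass that replaces each component with its square, so
// the windowing kernel only needs fValue.x + fValue.y per element afterwards.
__global__ void SquareComponents(float2 *signal, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
    {
        float2 v = signal[i];
        signal[i] = make_float2(v.x * v.x, v.y * v.y);
    }
}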

eyal

I have an overlap of "second order" only, that is: I read each value twice, and only by one thread (once when adding the new element to the sum and once when removing the old one).
I am thinking about using shared mem here, but I am puzzled how to use it and whether it will shrink the execution time.
My card has 16384 bytes of smem available per block, so (if I am correct) I have (16384 / #blocks (8192/512 = 16)) = 1024 bytes per block. Is that so?

Probably having smem of a size equal to the window size is enough.

What improvement in percent can I get here if I eliminate the second read? What are your expectations?

What does "Local Store" mean in the profiler? There is not enough info in the Profiler Help. In my case I have about 12% local stores. I thought "Local Store" happens when there are not enough registers, but for this kernel it looks strange that registers would not be enough.

Unfortunately, I can not use Signal.x + Signal.y as the norm, because I am calculating amplitudes of a complex signal.

Can you elaborate further here? I don't understand what that means…

No, this is not correct. You have a bit less than the actual 16K (some is used for internal stuff), and the number of concurrent blocks is not the number of samples divided by 512 but is limited by resource usage - see the occupancy calculator to determine how many concurrent blocks you can have at most.

eyal

Not exactly. The shared memory limit is per multiprocessor, so it depends on how many blocks are running. With 512 threads per block, you will only get 1 active block per multiprocessor (there is a limit of 768 active threads on your hardware). So you could have all 16 KB per block if you wanted. If you launch 256 threads per block, you could have 3 active blocks (768 threads total), in which case you would want to limit your shared memory to 16384/3 bytes per block.

Here is a version with smem usage (I chose 32 threads per block because of the limit on smem size; 64 was also possible but gives the same time):

#define WINDOW_SIZE 16
#define NUM_THREADS_PER_BLOCK 32

extern "C" __global__ void HeapCalc(float2 *signal, float *di_signal, int NumSamples, int NumSweeps)
{
    __shared__ float windowmemory[NUM_THREADS_PER_BLOCK][WINDOW_SIZE + 1]; // +1 to eliminate bank conflicts

    int sample = blockIdx.x * blockDim.x + threadIdx.x;

    float Di = 0;

    for (int sweep = 0; sweep < WINDOW_SIZE; sweep++)
    {
        float2 fValueNew = signal[sweep * NumSamples + sample];
        float amplitude = (fValueNew.x * fValueNew.x + fValueNew.y * fValueNew.y);
        Di += amplitude;
        windowmemory[threadIdx.x][sweep] = amplitude;
    }

    di_signal[sample] = Di;

    int windowmemory_level = 0;

    for (int sweep = WINDOW_SIZE; sweep < NumSweeps; sweep++)
    {
        float2 fValueNew = signal[sweep * NumSamples + sample];
        float amplitude = (fValueNew.x * fValueNew.x + fValueNew.y * fValueNew.y);
        Di += amplitude;
        Di -= windowmemory[threadIdx.x][windowmemory_level];
        windowmemory[threadIdx.x][windowmemory_level] = amplitude;
        windowmemory_level++;
        if (windowmemory_level == WINDOW_SIZE) windowmemory_level = 0;
        di_signal[(sweep - WINDOW_SIZE + 1) * NumSamples + sample] = Di;
    }
}

The strange thing is when I compared the results:
NumSamples = 8192
NumSweeps = 1024

single run:
0.012(no smem) vs 0.013(smem)

100 runs (without memcpy, just a “for loop” with kernel call)
0.37 vs 0.29

So which one is faster?

So as I understand it, I have 8 (16384 / (32 (threads) * 16 (window size) * 4 (sizeof float))) active blocks in the new version. It looks like a small number, but it is probably enough :)

Here is the profiler screenshot.