Trying to understand CUDA application in signal processing research

Dear friends,

We are working with NVIDIA CUDA to implement an ECG-signal filter using wavelets. We are trying to understand a similar application described in a paper (http://ceur-ws.org/Vol-1543/p5.pdf) which describes the utilization of wavelet biorthogonal 3.7 and as far we understand, it does the work of both low and high pass filters. Authors talk about using it in a tree form:

“The method of multiresolution analysis described in [10] is characterized by a redundant set of samples. We now look at a different scheme where no such redundancy appears. This method is called subbang coding scheme, first widely used in voice compression. The idea behind this technique is based on the following methodology: a pair of filters are used where low pass and a high pass filter partecipate; we decompose a sequence X(n) into two subsequences at half rate, or half resolution, and this by means of orthogonal filter. This process can be iterated on either or both subsequences. In particular to obtain finer frequency resolution at lower frequencies, we iterate the scheme on the lower band only. Each iteration halves the width of the low band (in fact it increases its frequency resolution by two), but because of the subsampling by two, its time resolution is halved as well. Schematically, this is shown in paper’s figure 2.

An important feature of this discrete algorithm is its relatively low complexity. Actually, the following somewhat surprising result emerges: regardless the depth of the tree in 2, the complexity is linear in the number of input samples, by a constant factor which depends on the length of the filter [10].“

and in the code they do it as shows:

// kernel definition
__global__ void wavelet(float *sign, float *fil, float*y,
int m) {
int id = blockIdx.x * blockDim.x + threadIdx.x;
if (id>=sample) return;
int P = m*sample;
for (int i=0; i<filter; i++)
if ((id-i) > 0) y[P+id]+= sign[id-i]* fil[i];
else y[P+id]+= sign[i-id]* fil[i];
int div = (int) powf(2,m);
sign[id] = sign[id-(id%div)];
return;
}

We understand the function(kernel) but not the last two lines. The explanation says:
“As we can see by code 1, wavelet function takes as input not only the point-to-float variables sign, fil, y, but also another parameter, m, that is the number of current iteration. In fact, as we shall soon see, the purpose of this paper is adopting subband coding technique in a tree-structure, iterating the kernel a number of times equal to num iter in order to isolate step by step residuals and details of signal and get different levels of resolution. Therefore we achieve also the scaling procedure organizing step by step samples in groups of 2^m elements such that they assume the same value in each group.”
So the real question is, What is the purpose of sign[id]= sign[id-(id%div)];? what we have seen so far is that this will “erase” some elements of the signal and I think for this to be useful at the end the signal must be all the first element (or what is the point of replacing non-power of two positioned elements?) and to reach this num_iter must be log2(numSamples), but I’m not sure about it,. Another question is: they say “sample is the amount of randomic signal samples taken into account”, so in each iteration, sample must be 2^m? or the total number of samples?

Thank you in advance for any advice!

Héctor and Julian

I don’t have insights into the signal processing side here, did you consult related publications for a possible explanation? Have you tried contacting the authors of the paper? sign[id]= sign[id-(id%div)] forces 2**m consecutive values of sign to the same value. E.g. if m=2, div=4, then sign[1], sign[2], sign[3] are forced to the value of sign[0]; sign [5], sign[6], sign[7] are forced to the value of sign [4]; etc; etc

Does the sign array contain only values of either 1 or -1 as the name suggests? If so, I would investigate whether that array is actually needed, as doing away with it could cut down on memory traffic, which should help performance.

For maximum efficiency:

(1) Use the ‘restrict’ modifier on the pointer arguments to the kernel.

(2) powf() is way too big a hammer for constructing integer powers of two:

Suggestion: use (1 << m) instead.

(3) No expensive modulo operation is needed if all one wants to do is simply mask the m low-order bits of an index:

Suggestion: use [id & ~(div-1))]