I’ve been working on this project for a few weeks and have finally got it computing correctly; however, the speedup is terribly disappointing.
The core of the algorithm is an expensive computation: a kind of sliding inner product of two large arrays. This operation takes up most of the processing time.
Inputs are:
- signal - a 1D array of length sig_len (e.g. 8000 per second of signal at 8 kHz)
- dict - a 2D array of width kern_num and length max_kern_len (e.g. 64x500)

Output is saved into:
- res - a 2D array of width kern_num and length sig_len
The operation is as follows:
res(i,j) = sum{k}( signal(j+k) * dict(i,k) ) , for i = 1…kern_num, j = 1…sig_len and k = 1…max_kern_len
On the CPU this is implemented with three nested for loops, and it takes several seconds for any signal of reasonable size (i.e. it is slow!).
On the GPU, I split the work so that each cell of the "res" matrix is computed by a separate thread (the cells are essentially independent).
Cells in each column of the matrix (the kern_num dimension) are split among the threads of a block, and the columns are computed in separate blocks. Here I assume that kern_num is always smaller than the maximum number of threads per block. Since each cell of the res array is computed in parallel, I expected the speedup over the CPU to be linear in the number of processors on the GPU.
What I got was barely a 2x speedup on an 8600 GT (I know, I’m cheap). The function runs in 2.86 seconds on the CPU and 1.188 seconds on the GPU on the same data, and that is the time of the function itself, excluding start-up and uploading the data to the card. What am I doing wrong?
The implementation of the function in CUDA is as follows:
[codebox]
__global__ void dev_inner_product(float* signal, float* dict, int* kern_len,
                                  float* res, int sig_len, int kern_num,
                                  int max_kern_len, int offset)
{
    int pos = blockIdx.x + offset;  // signal position (column of res)
    int kernel = threadIdx.x;       // dictionary kernel (row of res)

    res += pos * kern_num + kernel;
    if (*res <= LZERO) return;      // we aren't allowed to touch this one

    signal += pos;
    dict += kernel * max_kern_len;

    int it = kern_len[kernel];
    if (it > (sig_len - pos))
    {   // kernel is longer than the remaining signal
        *res = LZERO;
        return;
    }

    *res = 0;
    while (it > 0)
    {
        (*res) += (*signal) * (*dict);
        signal++;
        dict++;
        it--;
    }
}
void gpu_inner_product(int off, int len)
{
    dev_inner_product<<<len, kern_num>>>(gpu_sig, gpu_dict, gpu_kern_len, gpu_res,
                                         sig_len, kern_num, max_kern_len, off);
    CUT_CHECK_ERROR("GPU KERNEL FAILED: inner_product");
    CUDA_SAFE_CALL(cudaThreadSynchronize());
}
[/codebox]