Speed improvement

I’ve been working on this project for a few weeks now and I’ve finally got it computing correctly; however, the speed increase is terribly disappointing.

The core of the algorithm is an expensive computation that does a sort of inner product of two large arrays. This operation is pretty slow and takes up most of the processing time.

Inputs are:

  • signal - a 1D array of length sig_len (e.g. 8000 per second of signal at 8 kHz)

  • dict - a 2D array of width kernel_num and length max_kernel_len (e.g. 64x500)

Output is saved into:

  • res - a 2D array of width kernel_num and length sig_len

The operation is as follows:

res(i,j) = sum{k}( signal(j+k) * dict(i,k) ),   for i = 1…kernel_num, j = 1…sig_len, k = 1…max_kernel_len

On the CPU this is implemented with three nested ‘for’ loops, and it takes several seconds for any signal of reasonable size (i.e. it is slow!).
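For reference, the three-loop CPU version looks roughly like this. This is a sketch, not my exact code: the function name `cpu_inner_product` is made up, `kern_len[i]` is assumed to hold the actual length of dictionary row i, and clipping the kernel at the end of the signal is an assumption.

```c
#include <stddef.h>

/* Naive CPU version: res(i,j) = sum_k signal(j+k) * dict(i,k).
   A sketch based on the description above, not the exact original code.
   kern_len[i] is assumed to hold the actual length of dictionary row i
   (at most max_kern_len); the kernel is clipped at the end of the signal. */
void cpu_inner_product(const float* signal, int sig_len,
                       const float* dict, const int* kern_len,
                       int kern_num, int max_kern_len,
                       float* res /* kern_num columns, sig_len rows */)
{
    for (int j = 0; j < sig_len; ++j) {          /* signal position */
        for (int i = 0; i < kern_num; ++i) {     /* dictionary row  */
            int len = kern_len[i];
            if (j + len > sig_len)
                len = sig_len - j;               /* clip at signal end */
            float acc = 0.0f;
            for (int k = 0; k < len; ++k)
                acc += signal[j + k] * dict[i * max_kern_len + k];
            res[j * kern_num + i] = acc;
        }
    }
}
```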

On the GPU, I split the work in such a way that each cell of the “res” matrix is calculated in a separate thread (the cells are essentially independent).

Cells in each column of the matrix (indexed by kernel number) are split among the threads, and the columns are calculated sequentially in separate kernel launches. Here I assume that kernel_num is always smaller than the maximum number of threads per block. Since each cell of the res array is calculated in parallel, I expected the speedup to be roughly linear in the number of processors on the GPU compared to the CPU.

What I got was barely a 2x speedup on an 8600 GT (I know, I’m cheap). The function runs in 2.86 seconds on the CPU and 1.188 seconds on the same data on the GPU. And this is the time of the function itself, excluding the run-time of starting up and uploading the data to the card. What am I doing wrong?

The implementation of the function in CUDA is as follows:


__global__ void dev_inner_product(float* signal, float* dict, int* kern_len, float* res, int sig_len, int kern_num, int max_kern_len, int offset)
{
	int pos = blockIdx.x;      // signal position j handled by this block
	int kernel = threadIdx.x;  // dictionary row i handled by this thread

	res += (offset + pos) * kern_num + kernel; // this thread's output cell
	if(*res <= LZERO) return;  // we aren't allowed to touch this one (LZERO defined elsewhere)

	int it = kern_len[kernel];
	if(offset + pos + it > sig_len)
		it = sig_len - offset - pos; // kernel is longer than signal: clip it

	signal += offset + pos;
	dict += kernel * max_kern_len;
	for(int k = 0; k < it; k++)
		(*res) += (*signal++) * (*dict++);
}

void gpu_inner_product(int off, int len)
{
	dev_inner_product<<<len, kern_num>>>(gpu_sig, gpu_dict, gpu_kern_len, gpu_res, sig_len, kern_num, max_kern_len, off);
}






My gut feeling is that you are more limited by I/O than floating-point performance. The theoretical bandwidth of an 8600 GT is 22 GB/sec, which is, very roughly speaking, probably about 2x or 3x the memory bandwidth of your CPU. It’s probably worth computing how many bytes you read and write from memory and seeing how that compares to the theoretical GPU memory peak. (Note that uncoalesced memory access will take you far below the theoretical figure.) For comparison, the theoretical peak memory bandwidth of the GTX 280 is 140 GB/sec.

Many CUDA kernels are actually limited by memory and not the number or speed of the stream processors, so it’s worth estimating that on paper when implementing an algorithm.

Are you doing G723?

Not really. But I am doing something speech related.

Thanks for your reply. You have good gut feelings, sir. By changing even one variable from global to shared memory I managed to achieve a slight improvement. Now I’ll go look for a better card (8800GTX or 9800GTX maybe?) and think about how to minimize global memory access…

I haven’t examined your code in detail, but it does look like you’re doing a lot of un-coalesced loads. Could you fit the “dict” array in constant memory?

I was thinking of that, but it seems like a tight fit. The array is usually 500x64 floating-point values. Also, I would need to fit a signal window of at least 500 floats in as well. I’ll have to think about how to split the calculation into several steps to minimize the memory access…

Do all nvidia cards have this limit of 16kB for “fast” memory?

And where can I read the definitions of “coalesced” and “uncoalesced”? The documentation is really hard to read because of the terminology. Nvidia should make a wiki instead of these PDFs, so they can link all the terms…

I currently use 8600gt :no: .

From experience I can say that if your kernel is compute bound, you can achieve a significant speedup (given you have coalesced memory access). I was able to reach 75% of theoretical GFLOPS (around 75 GFLOPS on an 8600 GT) for my algorithm. It had 25 flops per thread.

But if your application is memory bound, the performance increase may not be that significant. Still, you can improve a lot by trying to use fewer registers or less shared memory so that you get close to 100% occupancy (use the CUDA occupancy calculator). That way the hardware can best hide the latency, which can give a good performance boost.

Yes, all Nvidia cards to my knowledge (don’t know about the Teslas) have the 16 KB limit. But it’s not that bad if you have an intelligent algorithm.


Nittin arora

Seems like my algorithm is memory bound. In the first version I had two float operations (one multiply and one add) for every three memory operations (two reads and one write). I had completely disregarded this while designing the algorithm, so I have to rethink everything.

Do I understand this correctly, that memory access from several threads isn’t run in parallel? So if thread A wants to read address 1 and thread B wants to read address 2, they can’t both do it at the same time (unless it’s shared memory in different banks)?

No, “parallel reads” are in fact the definition of coalesced memory access. You will get optimal bandwidth if different threads read consecutive memory locations at the same time. The rules are more restrictive on pre-GT200 chips, and described in the Programming Guide. (Keep in mind that the memory bus is 128 bits on the 8600 GT and 512 bits on the GTX 280.) But you definitely want to be reading from several addresses to maximize memory bandwidth.

Now, if your reads are totally random, then they cannot be coalesced, and you will underutilize the memory bus.

WOW! I moved one array from global memory to shared and I got a huge improvement!

CPU: 2.8342 seconds
Old Method: 1.187 seconds (2.387xCPU)
New Method: 0.4468 seconds (6.34xCPU)

Now, if only I could shoot all the data to shared memory, I’ll be in business!

Thank you all for the advice!

Put your constant array (which dict probably is) in constant memory.

One row of the dict array has up to 400 float values. There are 64 rows. Isn’t the constant memory limited to 16k?

The programming guide has all the answers.
The constant memory is 64 KB, so it still won’t fit.

Looks like there is something to do with shared memory there too. (If I understand correctly) each thread will read the same elements from dict, but at different times, so you’re reading the same dict element multiple times. It could be cached in shared memory: when the thread block is done with a given dict tile, reload the shared memory with new dict values.

Well, I got my employer to buy me a better card, but in the meantime, can I get someone to run this simple experiment on some better hardware (e.g. a 9800GTX or GTX280)? I would be very grateful for 2-3 runs (each takes about 2-3 minutes). Just copy the output to a text file or post it directly here.

Thank you so much.
experiment.zip (304 KB)

Results from my 8800GT:

Initializing GPU…
Using device 0: GeForce 8800 GT
Loading signal…
Loading dictionary…
Running algorithm on CPU…
-running initial inner product…DONE! (1.656s) (1.64)(1.656)
DONE! Took 52.672 secs… (52.532)(52.5)
Running algorithm on GPU…
-allocating memory…DONE!
-uploading data…DONE!
-running initial inner product…DONE! (0.171s) (0.172)(0.188)
DONE! Took 19.094 secs…(18.89)(18.922)
Deinitailizing GPU…

With a GTX260 I’m getting 0.109s for the initial inner product and 18.469s for the loop.

I just wanted to thank you guys for help and close this thread.

I will keep working on my code, as the improvement still seems to be quite a bit below linear in the number of concurrent threads on the GPU.

Unfortunately, according to our NVidia contacts, the 16 KB shared-memory limitation will not change in the near future.