Speed improvement

I’ve been working on this project for a few weeks now and I’ve finally got it computing correctly; however, the speed increase is terribly disappointing.

The core of the algorithm is an expensive computation that does a sort of inner product of two large arrays. This operation is pretty slow and takes up most of the processing time.

Inputs are:

  • signal - a 1D array of length sig_len (e.g. 8000 per second of signal at 8 kHz)

  • dict - a 2D array of width kernel_num and length max_kernel_len (e.g. 64x500)

Output is saved into:

  • res - a 2D array of width kernel_num and length sig_len

The operation is as follows:

res(i,j) = sum{k}( signal(j+k) * dict(i,k) ),   for i = 1…kernel_num, j = 1…sig_len, k = 1…max_kernel_len

On the CPU this is implemented with three nested ‘for’ loops, and it takes several seconds for any signal of reasonable size (i.e. it is slow!).
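For reference, the three-loop CPU version looks roughly like this. This is a sketch, not my exact code: the function name `cpu_inner_product` is made up, `kern_len[i]` is assumed to hold the actual length of dictionary row i, and clipping the kernel at the end of the signal is an assumption.

```c
#include <stddef.h>

/* Naive CPU version: res(i,j) = sum_k signal(j+k) * dict(i,k).
   A sketch based on the description above, not the exact original code.
   kern_len[i] is assumed to hold the actual length of dictionary row i
   (at most max_kern_len); the kernel is clipped at the end of the signal. */
void cpu_inner_product(const float* signal, int sig_len,
                       const float* dict, const int* kern_len,
                       int kern_num, int max_kern_len,
                       float* res /* kern_num columns, sig_len rows */)
{
    for (int j = 0; j < sig_len; ++j) {          /* signal position */
        for (int i = 0; i < kern_num; ++i) {     /* dictionary row  */
            int len = kern_len[i];
            if (j + len > sig_len)
                len = sig_len - j;               /* clip at signal end */
            float acc = 0.0f;
            for (int k = 0; k < len; ++k)
                acc += signal[j + k] * dict[i * max_kern_len + k];
            res[j * kern_num + i] = acc;
        }
    }
}
```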

On the GPU, I split the work in such a way that each cell of the “res” matrix is calculated in a separate thread (the cells are essentially independent).

Cells in each column of the matrix (indexed by kernel number) are split among the threads, and the columns are calculated sequentially in separate kernel launches. Here I assume that kernel_num is always smaller than the maximum number of threads per block. Since each cell of the res array is calculated in parallel, I expected the speedup to be roughly linear in the number of processors on the GPU compared to the CPU.

What I got was barely a 2x speedup on an 8600 GT (I know, I’m cheap). The function runs in 2.86 seconds on the CPU and 1.188 seconds on the same data on the GPU. And this is the time of the function itself, excluding the run-time of starting up and uploading the data to the card. What am I doing wrong?

The implementation of the function in CUDA is as follows:


__global__ void dev_inner_product(float* signal, float* dict, int* kern_len, float* res, int sig_len, int kern_num, int max_kern_len, int offset)
{
	int pos = blockIdx.x;      // signal position j handled by this block
	int kernel = threadIdx.x;  // dictionary row i handled by this thread

	res += (offset + pos) * kern_num + kernel; // this thread's output cell
	if(*res <= LZERO) return;  // we aren't allowed to touch this one (LZERO defined elsewhere)

	int it = kern_len[kernel];
	if(offset + pos + it > sig_len)
		it = sig_len - offset - pos; // kernel is longer than signal: clip it

	signal += offset + pos;
	dict += kernel * max_kern_len;
	for(int k = 0; k < it; k++)
		(*res) += (*signal++) * (*dict++);
}

void gpu_inner_product(int off, int len)
{
	dev_inner_product<<<len, kern_num>>>(gpu_sig, gpu_dict, gpu_kern_len, gpu_res, sig_len, kern_num, max_kern_len, off);
}






My gut feeling is that you are more limited by I/O than floating-point performance. The theoretical bandwidth of an 8600 GT is 22 GB/sec, which is, very roughly speaking, probably about 2x or 3x the memory bandwidth of your CPU. It’s probably worth computing how many bytes you read and write from memory and seeing how that compares to the theoretical GPU memory peak. (Note that uncoalesced memory access will take you far below the theoretical figure.) For comparison, the theoretical peak memory bandwidth of the GTX 280 is 140 GB/sec.

Many CUDA kernels are actually limited by memory and not the number or speed of the stream processors, so it’s worth estimating that on paper when implementing an algorithm.

Are you doing G723?

Not really. But I am doing something speech related.

Thanks for your reply. You have good gut feelings, sir. By changing even one variable from global to shared memory I managed to achieve a slight improvement. Now I’ll go look for a better card (8800GTX or 9800GTX maybe?) and think about how to minimize global memory access…

I haven’t examined your code in detail, but it does look like you’re doing a lot of un-coalesced loads. Could you fit the “dict” array in constant memory?

I was thinking of that, but it seems like a tight fit. The array is usually 500x64 floating-point values. Also, I would need to fit a signal window of at least 500 floats in as well. I’ll have to think about how to split the calculation into several steps to minimize the memory access…

Do all nvidia cards have this limit of 16kB for “fast” memory?

And where can I read the definitions of “coalesced” and “uncoalesced”? The documentation is really hard to read because of the terminology. Nvidia should make a wiki instead of these PDFs, so they can link all the terms…

I currently use 8600gt :no: .

From experience I can say that if your kernel is compute bound, you can achieve a significant speedup (given you have coalesced memory access). I was able to reach 75% of theoretical GFLOPS (around 75 GFLOPS on an 8600 GT) for my algorithm. It had 25 flops per thread.

But if your application is memory bound, the performance increase may not be that significant. Still, you can improve a lot by trying to use fewer registers or less shared memory so that you get close to 100% occupancy (use the CUDA occupancy calculator). That way the hardware can best hide the latency, which can give a good performance boost.

Yes, all Nvidia cards to my knowledge (don’t know about the Teslas) have the 16 KB limit. But it’s not that bad if you have an intelligent algorithm.


Nittin arora

Seems like my algorithm is memory bound. In the first version I had two float operations (one multiply and one add) for every three memory operations (two reads and one write). I had completely disregarded this while designing the algorithm, so I have to rethink everything.

Do I understand this correctly, that memory access from several threads isn’t run in parallel? So if thread A wants to read address 1 and thread B wants to read address 2, they can’t both do it at the same time (unless it’s shared memory in different banks)?

No, “parallel reads” are in fact the definition of coalesced memory access. You will get optimal bandwidth if different threads read consecutive memory locations at the same time. The rules are more restrictive on pre-GT200 chips, and described in the Programming Guide. (Keep in mind that the memory bus is 128 bits on the 8600 GT and 512 bits on the GTX 280.) But you definitely want to be reading from several addresses to maximize memory bandwidth.

Now, if your reads are totally random, then they cannot be coalesced, and you will underutilize the memory bus.

WOW! I moved one array from global memory to shared and I got a huge improvement!

CPU: 2.8342 seconds
Old Method: 1.187 seconds (2.387xCPU)
New Method: 0.4468 seconds (6.34xCPU)

Now, if only I could shoot all the data to shared memory, I’ll be in business!

Thank you all for the advice!

Put your constant array (which dict probably is) in constant memory.

One row of the dict array has up to 400 float values. There are 64 rows. Isn’t the constant memory limited to 16k?

The programming guide has all the answers.
The constant memory is 64 KB, so it still won’t fit.

Looks like there is something to do with shared memory there too. (If I understand correctly) each thread will read the same elements from dict, but at different times, so you’re reading the same dict element multiple times. It could be cached in shared memory: when the thread block is done with a given dict tile, reload the shared memory with new dict values.

Well, I got my employer to buy me a better card, but in the meantime, can I get someone to run this simple experiment on some better hardware (e.g. a 9800GTX or GTX280)? I would be very grateful for 2-3 runs (each takes about 2-3 minutes). Just copy the output to a text file or post it directly here.

Thank you so much.
experiment.zip (304 KB)

Results from my 8800GT:

Initializing GPU…
Using device 0: GeForce 8800 GT
Loading signal…
Loading dictionary…
Running algorithm on CPU…
-running initial inner product…DONE! (1.656s) (1.64)(1.656)
DONE! Took 52.672 secs… (52.532)(52.5)
Running algorithm on GPU…
-allocating memory…DONE!
-uploading data…DONE!
-running initial inner product…DONE! (0.171s) (0.172)(0.188)
DONE! Took 19.094 secs…(18.89)(18.922)
Deinitailizing GPU…

With a GTX260 I’m getting 0.109s for the initial inner product and 18.469s for the loop.

I just wanted to thank you guys for help and close this thread.

I will keep working on my code, as the improvement still seems to be quite a bit below linear in the number of concurrent threads on the GPU.

Unfortunately, according to our NVidia contacts, the 16 KB shared-memory limitation will not change in the near future.