A (not so) hypothetical question

I am working on a port of an algorithm to CUDA. It is not arithmetically involved, but it can be parallelized on several levels. It also doesn't work on huge arrays, but rather on quite small ones. My (not so) hypothetical question for all you experienced CUDA programmers is: what performance can I expect compared to the CPU if the CUDA version of the algorithm employs only a single multiprocessor (say 64 threads), using only shared memory with nicely coalesced access? So the algorithm is memory bound and works on smaller chunks of data. I know you'll say I am stupid to even try to achieve anything by porting such an algorithm to CUDA. But considering the memory bandwidth of GPUs and, e.g., 16-fold parallelism, can I expect at least comparable performance (i.e. no real speedup, but at least not an order of magnitude slower run-time)?
To be honest, I already did some testing and the results are depressing, so what I need is a hint as to whether I am trying to achieve the impossible or merely doing something wrong in my implementation. If I knew there was a way to achieve comparable performance for 1:1 mapped algorithms on CPU and GPU, I'd have a good base to continue with further parallelization, which should then provide true speedups.
Thanks for any useful comments.

If you only have a single block and you are doing memory accesses with any frequency, there is no way you can hide memory latency effectively.

I think the best way to go for something like this would be to process multiple 'chunks' of small data at one time. I doubt you'll ever get 1:1 performance from a single GPU multiprocessor compared to a modern CPU, but the whole idea behind CUDA is that the speedup comes from completing tasks in a massively parallel fashion (e.g. if your algorithm runs 1/2 as fast on CUDA as on a standard CPU, but you can process 100 chunks of data at a time, the speedup is still 50x).

So there may be no way to use multiple multiprocessors on a single data chunk (if the chunks are small), but if you have a bunch of them to process, you should still consider porting to CUDA.
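
To make that concrete, here's a rough sketch of the kind of kernel I mean, where each block grabs one small chunk, stages it through shared memory, and processes it independently of the others (CHUNK_SIZE and process_chunk are placeholders, not your actual algorithm):

```cuda
#define CHUNK_SIZE 64   // one thread per element of a chunk

// Placeholder for the actual per-chunk algorithm, operating in shared memory.
__device__ void process_chunk(float *s_data)
{
    // ... whatever the real algorithm does ...
}

__global__ void process_many_chunks(float *g_data, int num_chunks)
{
    __shared__ float s_data[CHUNK_SIZE];

    int chunk = blockIdx.x;              // one block per chunk
    if (chunk >= num_chunks) return;

    // coalesced load of this block's chunk into shared memory
    s_data[threadIdx.x] = g_data[chunk * CHUNK_SIZE + threadIdx.x];
    __syncthreads();

    process_chunk(s_data);
    __syncthreads();

    // coalesced store of the result back to global memory
    g_data[chunk * CHUNK_SIZE + threadIdx.x] = s_data[threadIdx.x];
}

// launched with one block per chunk and CHUNK_SIZE threads per block:
//   process_many_chunks<<<num_chunks, CHUNK_SIZE>>>(d_data, num_chunks);
```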

That is my idea exactly, because the small arrays in my algorithm are "kind of" multiplying. The algorithm is originally recursive; the GPU version works by using a stack. So if I execute a few levels of recursion on the CPU, I get a set of smaller arrays that can be processed in parallel. The problem is that my current CUDA implementation is not merely half as fast as the CPU version (try 1/20, with no idea of how to improve it further), and I am having doubts. Can I expect superlinear speedup if I occupy all 16 multiprocessors, by hiding the memory latency that tmurray mentioned above?

Yes, you will get a better speedup if you can occupy all of the multiprocessors, but not if you just run enough blocks to fill them once. The memory latency gets hidden if you have, say, 2000 blocks, so that the card can start loading data for the next block while the previous one finishes computing. Also, if you have to make multiple calls to this kernel, look at the asynchronous memcpy methods, which can help hide some of the latency as well.
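
A rough sketch of what I mean by the asynchronous approach; my_kernel, the sizes, and the two-buffer scheme are placeholders, and note that the host buffers must be allocated with cudaMallocHost (page-locked) for the async copies to actually overlap:

```cuda
// Placeholder kernel standing in for the real per-batch work.
__global__ void my_kernel(float *d_data) { /* ... */ }

void run_batches(int num_batches, int N, int num_blocks, int threads_per_block)
{
    cudaStream_t stream[2];
    float *h_buf[2], *d_buf[2];

    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        // async copies require page-locked host memory
        cudaMallocHost((void **)&h_buf[i], N * sizeof(float));
        cudaMalloc((void **)&d_buf[i], N * sizeof(float));
    }

    for (int batch = 0; batch < num_batches; ++batch) {
        int s = batch % 2;   // ping-pong between two streams/buffers
        cudaMemcpyAsync(d_buf[s], h_buf[s], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        my_kernel<<<num_blocks, threads_per_block, 0, stream[s]>>>(d_buf[s]);
        cudaMemcpyAsync(h_buf[s], d_buf[s], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaThreadSynchronize();   // wait for all streams to drain
}
```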

Also, if you could elaborate a little more on your algorithm, perhaps someone would be able to suggest a better way for you to parallelize it. From the small bit of information in your last post, it sounds like you are doing a reduction-type algorithm. So perhaps instead of using a stack on the GPU (probably not too efficient), you could use some sort of tree-based recursion method (again, it’s hard to suggest anything without knowing more about what you’re trying to do).

Thanks for your post. Actually, my algorithm is a recursive tree traversal algorithm. Using a portion of shared memory as a stack, onto which the nodes of the tree are pushed/popped as the tree is traversed, is the only way I can think of to avoid recursion on the GPU. It sounds like you have a better idea, so I would very much like to hear it.
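
For reference, the traversal I have now looks roughly like this (a simplified sketch; I've used a small per-thread stack here instead of the shared-memory one, and Node, visit, and MAX_DEPTH are stand-ins for my actual data structures):

```cuda
#define MAX_DEPTH 32    // assumed bound on tree depth

struct Node {
    int left;      // index of left child, or -1 if none
    int right;     // index of right child, or -1 if none
    float value;
};

// Placeholder for whatever is done at each visited node.
__device__ void visit(const Node *n) { /* ... */ }

__device__ void traverse(const Node *tree, int root)
{
    int stack[MAX_DEPTH * 2];   // explicit stack replaces the call stack
    int top = 0;
    stack[top++] = root;        // push the root

    while (top > 0) {
        int n = stack[--top];   // pop the next node
        visit(&tree[n]);

        // push any children for later iterations
        if (tree[n].left  >= 0) stack[top++] = tree[n].left;
        if (tree[n].right >= 0) stack[top++] = tree[n].right;
    }
}
```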

Again, I guess it kind of depends on the specifics of your algorithm, but if you know the maximum depth of your tree and the maximum number of possible branches at each node, you could probably use something like the reduction code (see the SDK example, especially the accompanying documentation) to do the same thing without a 'stack' method. If you're doing some kind of search, you may be able to make a struct that holds the node index key and a numeric or boolean value for the search result (or use the top bit of the key field if it's numeric). Then you could do a reduction or scan (I think that's also an SDK example; if not, it's covered under the compact-array method) to get the matching values.
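
Something like this for the marking pass; the Node struct and the match test here are just stand-ins for whatever you're actually storing, and the compaction itself would come from the scan example:

```cuda
// One thread per node: test it and write a 0/1 flag. Feeding the flags
// into the SDK scan/compaction example then gathers the matching keys.
struct Node {
    float value;   // plus whatever else is stored per node
};

__global__ void mark_matches(const Node *nodes, int num_nodes,
                             float target, int *flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_nodes) return;

    flags[i] = (nodes[i].value == target) ? 1 : 0;   // 1 where the search hits
}
```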