Execute different kinds of code in one MP


I want to port a Genetic Programming algorithm to CUDA. I have several “species” of algorithms that should co-evolve (and have different goals); to find out the fitness of candidate algorithms, one of each of these is assigned to a test run. Several test runs are simulated with random data; then, the fitness of each algorithm can be evaluated.
The fittest algorithms are promoted; they are copied everywhere, and some mutations are introduced. The process is repeated.

My naive approach is to assign one test run to one multiprocessor. Within that multiprocessor lives one candidate from each of the species, and each streaming processor simulates one of those candidates.

Now, I have heard that executing conditionals can be bad, because in some cases both branches have to be executed. Is that also a problem in the long term? That is: if I have a branch at the beginning of my kernel that depends on the thread number (so that one thread can be species X and another thread can be species Y), will the code on both branches be executed indefinitely, or will this only happen for a few clock cycles?

Regardless of that: would you approach my basic plan differently?

Thanks in advance.

I’d assume you test the algorithms on a few different sets of data. If so, have each block (or at least each warp) run the same algorithm on a number of different data sets.

Conditionals where both branches are taken are indeed expensive: the two branches are executed one after the other, for their full duration, not just for a few cycles. You may want to read up on the chapters about the programming model and its hardware implementation in the Programming Guide.

Thank you! Unfortunately, that would be suboptimal. Within one test run, all of these species have to exchange data, so I would like that data exchange to be quick and the global bus to be uncrowded. Any suggestions?

Try to understand the fundamental programming model before worrying about the details.

You can have different species in one block, as long as every warp (32 threads) runs the same code path. Global communication always requires going through global memory in CUDA.

So one block can be divided across several multiprocessors? I read here that one MP in a Tesla card has 8 stream processors, which, as I understand it, means that one multiprocessor only has space for one warp. What I wanted to do is take advantage of the shared memory within one multiprocessor; because most of the data flow is localized within one test run, this should give me a speedup compared with all MPs having to use the same global memory bus. Is that possible/useful?

Don’t rely on random sources from the internet. Read the CUDA C Programming Guide from Nvidia instead, which is the definitive source. At least read the short chapters 2 and 4 before continuing to think about your implementation, because CUDA is quite different from what you are describing above.

With the current, official documentation, things are starting to become clear. Thanks.