Design-level question for a CUDA pattern matcher with respect to contexts

I am designing a pattern matcher using CUDA.

I have two designs in mind.

Basically there are around 1000 (dummy value) separate groups of patterns, each group containing its own set of patterns. The pattern matcher's initialization code takes each group from this list of 1000 groups and creates the structures that the pattern matcher needs for that particular group. These structures are passed to the kernel during execution and remain static for that group across kernel invocations. The only parameter that varies across invocations is the buffer to be matched, which has to be re-uploaded for each kernel launch.
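To make the setup concrete, here is a minimal sketch of what the per-group state might look like with the driver API. All names (`GroupState`, `initGroup`, `uploadBuffer`) are illustrative assumptions, not from the original post:

```cuda
#include <cuda.h>
#include <stddef.h>

/* Hypothetical per-group state: pattern structures are allocated and
   uploaded once at init; only the input buffer changes per invocation. */
typedef struct {
    CUdeviceptr d_patterns;   /* compiled pattern structures, static */
    size_t      patternBytes;
    CUdeviceptr d_buffer;     /* input buffer, rewritten before each launch */
    size_t      bufferCap;
} GroupState;

/* Called once per group during initialization. */
CUresult initGroup(GroupState *g, const void *patterns,
                   size_t patternBytes, size_t bufferCap)
{
    CUresult rc;
    if ((rc = cuMemAlloc(&g->d_patterns, patternBytes)) != CUDA_SUCCESS)
        return rc;
    if ((rc = cuMemcpyHtoD(g->d_patterns, patterns, patternBytes)) != CUDA_SUCCESS)
        return rc;
    g->patternBytes = patternBytes;
    g->bufferCap    = bufferCap;
    return cuMemAlloc(&g->d_buffer, bufferCap);
}

/* Called per search request: only the buffer is re-uploaded. */
CUresult uploadBuffer(GroupState *g, const void *buf, size_t len)
{
    return cuMemcpyHtoD(g->d_buffer, buf, len);
}
```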

What I am stuck on is whether I should use a separate context for each of the 1000 groups, or a single context for all of them. In my opinion, a context per group should make it easier to multi-thread the pattern matcher for multiple simultaneous requests, compared to a single context with a dispatcher that funnels all incoming requests to a single thread, since only one thread can attach to a context at a time.

Design 1:
I use a separate context for each of the 1000 groups. The advantage is that, in a multi-threaded application, the thread that calls Search() can push the context of the group being searched and carry out its operations independently. Multiple requests can then proceed concurrently, since threads don't have to wait on a single context.
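A rough sketch of what Design 1's Search() could look like, assuming one CUcontext per group. `GroupState`, `uploadBuffer()`, and `launchMatchKernel()` are hypothetical helpers, not names from the post:

```cuda
#include <cuda.h>
#include <stddef.h>

typedef struct GroupState GroupState;          /* per-group static structures */
CUresult uploadBuffer(GroupState *g, const void *buf, size_t len);
CUresult launchMatchKernel(GroupState *g);

CUresult Search(CUcontext groupCtx, GroupState *g, const void *buf, size_t len)
{
    /* Attach this group's context to the calling thread for the duration
       of the search; other threads can push other groups' contexts. */
    CUresult rc = cuCtxPushCurrent(groupCtx);
    if (rc != CUDA_SUCCESS) return rc;

    rc = uploadBuffer(g, buf, len);            /* only the buffer varies */
    if (rc == CUDA_SUCCESS) rc = launchMatchKernel(g);
    if (rc == CUDA_SUCCESS) rc = cuCtxSynchronize();

    cuCtxPopCurrent(NULL);                     /* detach before returning */
    return rc;
}
```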

Design 2:
A single context for all 1000 groups, with all search operations carried out by the single thread to which this context is attached. A dispatcher thread handles incoming search requests from other threads and queues them to this main Search() thread.
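Design 2 could be sketched as a producer/consumer queue: client threads enqueue requests, and a single worker thread that owns the one context drains the queue. The queue is shown with pthreads; `Request`, `searchWorker()`, and `runSearch()` are illustrative names:

```cuda
#include <cuda.h>
#include <pthread.h>
#include <stddef.h>

typedef struct Request {
    int             groupId;
    const void     *buf;
    size_t          len;
    struct Request *next;
} Request;

void runSearch(Request *r);                  /* hypothetical: upload + launch */

static Request        *head = NULL, *tail = NULL;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

/* Called by any client thread. */
void enqueue(Request *r)
{
    pthread_mutex_lock(&qlock);
    r->next = NULL;
    if (tail) tail->next = r; else head = r;
    tail = r;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

/* The only thread that ever touches the CUDA context. */
void *searchWorker(void *arg)
{
    CUcontext ctx = (CUcontext)arg;          /* the single shared context */
    cuCtxPushCurrent(ctx);
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (!head) pthread_cond_wait(&qcond, &qlock);
        Request *r = head;
        head = r->next;
        if (!head) tail = NULL;
        pthread_mutex_unlock(&qlock);
        runSearch(r);                        /* serialized GPU work */
    }
    return NULL;
}
```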

I would go for Design 1, but would creating 1000 contexts cause any issues? I will be writing code shortly to test how this scales.

Any help on this is appreciated. Thanks for taking time to read this.

  • bump (sorry for bumping it)

I understand that my question is slightly long, so in case the original is a bit vague, a quick question: would it be an issue to have around 1000 contexts in my application, or is there a limit on the number of contexts an application can have? Thanks in advance.

A thousand contexts sounds like a highly impractical idea, whether or not the driver could actually handle that many (I have no idea what the limit might be, if there is one). The problem is that context switching isn't free; on some platforms it seems rather expensive. Having that many contexts competing for resources on a single device simultaneously sounds like an ill-advised idea to me, but I have never tried anything as exotic as that.

By context switching, do you mean swapping the memory associated with a context, or making a context current to a host thread by pushing and popping it? Is this pushing and popping expensive? If I have 1000 contexts, the data stored in them wouldn't exceed a few MBs, maybe around 4 MB across all the contexts combined. And can I queue two simultaneous kernel executions from different host threads that are using two different contexts (will try this too)?

Your plan is broken for a million reasons that I am a little too lazy to explain thoroughly, but the gist of it is that contexts aren’t free and you won’t be able to allocate a thousand (and trying will probably make things break).

If you want to do this the right way, use the driver API's context migration calls and a single floating context.

Right. Thanks for the info tmurray and avidday.