Hi! I have a stream of text input, at most 10 MB, and I want to run a simple custom algorithm on it.
Is it useful to set up one thread per char (-> 10M threads), or would the management overhead be too large?
On a Windows system, creating 1000 threads makes you wait a few seconds. Is launching 10M threads in CUDA
also slower than launching, say, 50,000 threads, or is the thread setup done in parallel, too?
There’s virtually no overhead to using that many threads; 10M is nowhere near too many for CUDA. Unlike OS threads, CUDA threads are scheduled by the hardware in warps, so launching 10M threads costs essentially the same as launching 50,000.

However, with a 1 thread -> 1 char mapping the memory accesses will not coalesce well: accesses coalesce best when each thread reads a full word (32 bits). So if the algorithm is simple, I’d suggest using 2.5M threads, each working on a batch of 4 chars (use the built-in char4 type). If there’s a lot of computation to be done per char, you might instead launch 10M threads but have only 2.5M of them load char4 batches into shared memory, then have each of the 10M threads pick its individual char out of shared memory.
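A minimal sketch of both variants. The `process()` function is a hypothetical stand-in for your per-char algorithm (here it just uppercases ASCII), and for brevity the input length is assumed to be a multiple of 4 and the block size a multiple of 4:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical per-char operation standing in for your algorithm.
__device__ __forceinline__ char process(char c) {
    return (c >= 'a' && c <= 'z') ? c - 32 : c;
}

// Variant 1: one thread per char4. Each thread does a single 32-bit
// load, so a warp's 32 loads coalesce into full 128-byte segments.
__global__ void kernel_char4(const char4 *in, char4 *out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        char4 v = in[i];
        v.x = process(v.x);
        v.y = process(v.y);
        v.z = process(v.z);
        v.w = process(v.w);
        out[i] = v;
    }
}

// Variant 2: one thread per char. Only the first quarter of each block
// does coalesced char4 loads into shared memory; then every thread
// picks out its own char. Launch with blockDim.x bytes of shared mem.
__global__ void kernel_shared(const char4 *in, char *out, int n) {
    extern __shared__ char tile[];            // blockDim.x chars
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    if (tid < blockDim.x / 4) {
        int i4 = blockIdx.x * (blockDim.x / 4) + tid;
        reinterpret_cast<char4 *>(tile)[tid] = in[i4];
    }
    __syncthreads();
    if (gid < n)
        out[gid] = process(tile[tid]);
}

int main() {
    const int n = 1 << 20;                    // 1 MB of text, multiple of 4
    char *h = (char *)malloc(n);
    for (int i = 0; i < n; ++i) h[i] = 'a' + (i % 26);

    char *d_in, *d_out;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h, n, cudaMemcpyHostToDevice);

    int threads = 256;
    kernel_char4<<<(n / 4 + threads - 1) / threads, threads>>>(
        (const char4 *)d_in, (char4 *)d_out, n / 4);
    // The shared-memory variant would be launched like this instead:
    // kernel_shared<<<n / threads, threads, threads>>>(
    //     (const char4 *)d_in, d_out, n);

    cudaMemcpy(h, d_out, n, cudaMemcpyDeviceToHost);
    printf("%c%c%c\n", h[0], h[1], h[2]);
    cudaFree(d_in);
    cudaFree(d_out);
    free(h);
    return 0;
}
```

Variant 1 is usually the simpler and faster choice; variant 2 only pays off when the per-char work is heavy enough that you want the full 10M threads' worth of parallelism during the compute phase.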