How well do cudaMalloc and cudaHostRegister perform when called (somewhat) frequently from multiple threads?
The application has the following structure:
phase 1: loads a substantial data structure (several GB) to the GPU.
This structure is read-only going forward.
phase 2: several cpu threads, each of which:
01 loop several hundred times:
02 create a stream
03 allocate 3 pairs of associated memory blocks (~ 3MB each)
04 (so that's 3 cudaMalloc's and 3 malloc's)
05 pin the malloc'ed memory blocks (3 calls to cudaHostRegister)
06 loop a few hundred times doing
07 set up inputs (< 16 KB) and memcpy them host->dev
08 invoke kernel (~ 20 ms running time)
09 memcpyAsync the outputs dev->host (2 MB)
10 (a callback is used to process the output and recycle the blocks)
11 unpin and free memory (3 calls each to cudaHostUnregister, cudaFree and free)
12 destroy the stream
13 perform a CPU-intensive operation (takes about 1/8 of a second)
GPU usage is low (around 20%, according to the NVIDIA Visual Profiler).
CPU usage is only modest, < 75%.
So neither the CPU nor the GPU seems to be the bottleneck.
Adding more threads doesn’t affect the numbers, so there seems to be a bottleneck on a non-processor resource. My guess is that it’s the memory handling on lines 03, 04, 05 and 11. The Visual Profiler (which I’m new to) hasn’t been much help with this.
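One way to check that guess without relying on the profiler is to time the suspect calls directly on the host. The following is only a minimal sketch, assuming plain C++11 and nothing CUDA-specific: `CallTimer` and `timed` are hypothetical names, not part of any CUDA API. The wall-clock time it records includes any time the calling thread spends blocked inside the wrapped call.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper (not a CUDA API): accumulates the host-side wall-clock
// time spent inside a wrapped call. Safe to share between threads because
// both counters are atomic.
struct CallTimer {
    std::atomic<std::int64_t> total_ns{0};
    std::atomic<std::int64_t> calls{0};

    // Runs f(), adds its duration to the totals, and returns f()'s result.
    // The guard's destructor does the bookkeeping, so this also works when
    // f() returns void.
    template <typename F>
    auto timed(F&& f) -> decltype(f()) {
        using clock = std::chrono::steady_clock;
        struct Guard {
            CallTimer& timer;
            clock::time_point start;
            Guard(CallTimer& t, clock::time_point s) : timer(t), start(s) {}
            ~Guard() {
                auto elapsed = clock::now() - start;
                timer.total_ns += std::chrono::duration_cast<
                    std::chrono::nanoseconds>(elapsed).count();
                ++timer.calls;
            }
        } guard(*this, clock::now());
        return std::forward<F>(f)();
    }

    // Mean time per call in microseconds (0 if nothing was recorded).
    double mean_us() const {
        const std::int64_t n = calls.load();
        return n ? static_cast<double>(total_ns.load()) / 1000.0 / n : 0.0;
    }
};
```

With one shared `CallTimer` per call type, wrapping lines 03-05 and 11, for example `alloc_timer.timed([&] { return cudaMalloc(&dev_ptr, bytes); })` (where `alloc_timer`, `dev_ptr` and `bytes` are placeholder names), would show whether the mean cost per call grows as threads are added.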
How well do cudaMalloc and cudaHostRegister perform when called in this manner from multiple threads? Does cudaMalloc cause blockages? (maybe device syncs?) Is it worth pre-allocating and re-using these smallish blocks?
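To make the pre-allocation idea concrete, here is a rough sketch of a thread-safe pool of fixed-size blocks that are allocated once and then recycled. It is an illustration under assumptions, not a tested CUDA implementation: `BlockPool`, `acquire` and `recycle` are hypothetical names, and the allocation and release steps are passed in as callables so the same class could wrap `cudaMalloc`/`cudaFree` for device blocks or `malloc` + `cudaHostRegister` / `cudaHostUnregister` + `free` for pinned host blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical fixed-size block pool. All blocks are allocated up front in
// the constructor, so the hot loop only pays for a mutex-protected vector
// push/pop instead of an allocation call.
class BlockPool {
public:
    using AllocFn = std::function<void*(std::size_t)>;  // returns one block
    using FreeFn  = std::function<void(void*)>;         // releases one block

    BlockPool(std::size_t block_bytes, std::size_t count,
              AllocFn alloc, FreeFn release)
        : release_(std::move(release)) {
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            void* p = alloc(block_bytes);
            assert(p != nullptr && "up-front allocation failed");
            free_.push_back(p);
        }
    }

    // Releases only the blocks currently in the pool; blocks still held by
    // callers must be recycled before the pool is destroyed.
    ~BlockPool() {
        for (void* p : free_) release_(p);
    }

    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    // Hands out a free block, or nullptr if the pool is exhausted.
    void* acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        if (free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Returns a block to the pool (e.g. from the output callback).
    void recycle(void* p) {
        std::lock_guard<std::mutex> lock(mu_);
        free_.push_back(p);
    }

    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mu_);
        return free_.size();
    }

private:
    mutable std::mutex mu_;
    std::vector<void*> free_;
    FreeFn release_;
};
```

In the structure above, lines 03-05 would become three `acquire()` calls per pool and line 11 three `recycle()` calls, with the pools created once before phase 2 starts. Sizing each pool at roughly 3 blocks per worker thread is an assumption based on the numbers in the question.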
The current thread count is fairly low (about 4), but the intent is to ramp this up significantly; the current test system has a small CPU core count.
Thanks