Interleaving cudaMalloc and kernels on multiple CPU threads - performance?

How well do cudaMalloc and cudaHostRegister perform when called (somewhat) frequently from multiple threads?

The application has the following structure:

phase 1: loads a substantial data structure (several GB) to the GPU.
This structure is read-only going forward.

phase 2: several CPU threads, each of which:

01   loop several hundred times: 
02      create a stream
03      allocate 3 pairs of associated memory blocks (~ 3MB each)
04          (so that's 3 cudaMalloc's and 3 malloc's)
05          pin the malloc'ed memory blocks (3 calls to cudaHostRegister)
06      loop a few hundred times doing
07         set up inputs (< 16 KB) and memcpy them host->dev
08         invoke kernel (~ 20 ms running time)
09         memcpyAsync outputs dev->host (2 MB)
10         (a callback is used to process the output and recycle the blocks)
11      unpin and free memory (3 calls each to cudaHostUnregister, free and cudaFree)
12      destroy the stream
13      perform a CPU-intensive operation (takes about 1/8 of a second)
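For concreteness, each worker thread looks roughly like this (a sketch with placeholder names and loop counts; error checking omitted, and the comments refer to the numbered lines above):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

constexpr int    NUM_OUTER   = 300;        // "several hundred" (line 01)
constexpr int    NUM_INNER   = 200;        // "a few hundred"   (line 06)
constexpr size_t BLOCK_BYTES = 3u << 20;   // ~3 MB per block   (line 03)

void workerThread()
{
    for (int outer = 0; outer < NUM_OUTER; ++outer) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);                       // line 02

        void *dev[3], *host[3];
        for (int i = 0; i < 3; ++i) {                    // lines 03-05
            cudaMalloc(&dev[i], BLOCK_BYTES);
            host[i] = malloc(BLOCK_BYTES);
            cudaHostRegister(host[i], BLOCK_BYTES, cudaHostRegisterDefault);
        }

        for (int inner = 0; inner < NUM_INNER; ++inner) {
            // lines 07-10: copy inputs host->dev, launch the ~20 ms kernel,
            // cudaMemcpyAsync the 2 MB output dev->host, enqueue a callback
            // that processes the output and recycles the blocks
        }
        cudaStreamSynchronize(stream);

        for (int i = 0; i < 3; ++i) {                    // line 11
            cudaHostUnregister(host[i]);
            free(host[i]);
            cudaFree(dev[i]);
        }
        cudaStreamDestroy(stream);                       // line 12
        // line 13: CPU-intensive post-processing (~125 ms)
    }
}
```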

GPU usage is low (around 20%, according to the NVIDIA Visual Profiler).
CPU usage is only modest, < 75%.

So neither CPU nor GPU seems to be the bottleneck.

Adding more threads doesn't affect the numbers, so there seems to be a bottleneck on some non-processor resource. My guess is it's the memory handling on lines 03, 04, 05 and 11. The Visual Profiler (which I'm new to) hasn't been much help on this.

How well do cudaMalloc and cudaHostRegister perform when called in this manner from multiple threads? Does cudaMalloc cause blockages? (maybe device syncs?) Is it worth pre-allocating and re-using these smallish blocks?

The current thread count is fairly low (about 4), but the intent is to ramp this up significantly; the current test system has a small CPU core count.

Thanks

IINM, cudaMemcpy does a device sync. Use cudaMemcpyAsync and sync with cudaStreamSynchronize where necessary.
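e.g. something like this (a sketch; note that hDst must be pinned, or the async copy silently falls back to synchronous behavior):

```cpp
#include <cuda_runtime.h>

// Copy results back without stalling the whole device: the copy is queued
// on `stream` and only that stream is waited on.
void copyBack(void *hDst, const void *dSrc, size_t nBytes, cudaStream_t stream)
{
    cudaMemcpyAsync(hDst, dSrc, nBytes, cudaMemcpyDeviceToHost, stream);
    // ... more GPU work can be queued, or CPU work done, here ...
    cudaStreamSynchronize(stream);   // syncs this stream only, not the device
}
```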

cudaMalloc is usually a synchronizing operation.
cudaMalloc has some cost, so it’s often recommended that you attempt to do allocations at the beginning of your application, and not in time-sensitive areas (e.g. loops). Reuse the allocations.

I would also suggest pinning memory once and re-using pinned allocations.
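Something like this per thread, for example (a sketch using the sizes from your post; cudaMallocHost gives you pinned memory directly, so the separate malloc + cudaHostRegister step goes away):

```cpp
#include <cuda_runtime.h>

// Sketch: allocate device memory and pinned host memory once per thread,
// then reuse the same buffers for every iteration. Sizes and the 3-buffer
// count are taken from the question; everything else is illustrative.
constexpr size_t BLOCK_BYTES = 3u << 20;   // ~3 MB

struct ThreadBuffers {
    void *dev[3];
    void *host[3];
};

void initBuffers(ThreadBuffers &b)         // call once, at thread start
{
    for (int i = 0; i < 3; ++i) {
        cudaMalloc(&b.dev[i], BLOCK_BYTES);
        cudaMallocHost(&b.host[i], BLOCK_BYTES);   // pinned from the start
    }
}

void freeBuffers(ThreadBuffers &b)         // call once, at thread exit
{
    for (int i = 0; i < 3; ++i) {
        cudaFreeHost(b.host[i]);
        cudaFree(b.dev[i]);
    }
}
```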

Have you read http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution ?

overall, you may need to

  1. alloc all buffers at the start
  2. use CUDA streams to break false dependencies between operations performed in different CPU threads
  3. use only pinned memory buffers in order to allow asynchronous transfers (but multi-gigabyte pinned buffers would be way too much, so you may need to move data in smaller chunks via an intermediate pinned staging buffer; see the sketch after this list)
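here is a sketch of point 3: streaming a large pageable source to the device through a fixed-size pinned staging buffer, instead of pinning the whole multi-gigabyte region at once (chunk size and all names are just illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstring>

constexpr size_t CHUNK_BYTES = 8u << 20;   // 8 MB staging buffer (illustrative)

// Upload a large pageable buffer in chunks via one pinned staging buffer.
void uploadInChunks(void *dDst, const char *hSrc, size_t total, cudaStream_t stream)
{
    void *staging;
    cudaMallocHost(&staging, CHUNK_BYTES);             // pinned staging buffer

    for (size_t off = 0; off < total; off += CHUNK_BYTES) {
        size_t n = total - off < CHUNK_BYTES ? total - off : CHUNK_BYTES;
        memcpy(staging, hSrc + off, n);                // pageable -> pinned
        cudaMemcpyAsync((char *)dDst + off, staging, n,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);                 // must finish before staging is reused
    }
    cudaFreeHost(staging);
}
```

with two staging buffers you could ping-pong between them and overlap the host memcpy with the previous transfer.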

in particular, read http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#explicit-synchronization and the next topic, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization

All the memcpy’s are already with cudaMemcpyAsync, sorry if that wasn’t clear.

Just to clarify: does it sync just the stream, or the entire device? Each CPU thread is using its own stream.

Thanks.

Edit: never mind - it has to be the entire device, since cudaMalloc has no associated stream.

yes, it is the entire device.