Question about Pinned memory


I have created a pipelined workflow system API and am in the process of testing it on a block decomposition for general matrix-matrix multiplication (GEMM).

My workflow consists of the following tasks:

(1) Read matrices A and B from disk
(2) Copy from CPU memory to GPU memory
(3) Detect when two loaded matrices are ready to initiate GEMM
(4) Execute GEMM using cuBLAS
(5) Copy the result from GPU memory to CPU memory
(6) Accumulate results on the CPU

Each of these tasks runs on a separate thread and executes asynchronously using separate CUDA streams. My intention is to overlap the various I/O sections with computation.

I pre-allocate a pool of memory buffers that are used for both of the copy operations. I want to use cudaMemcpyAsync for this, so they are allocated using cudaMallocHost to get pinned memory.

Here is a brief example of how the workflow processes the computation:

// NOTE: despite the "dev" names, these are pinned HOST allocations
cudaMallocHost(&devA, bytes);
cudaMallocHost(&devB, bytes);
cudaMallocHost(&devC, bytes);

cudaMemcpyAsync(devA, cpuA, bytes, cudaMemcpyHostToDevice, streamA);
cudaMemcpyAsync(devB, cpuB, bytes, cudaMemcpyHostToDevice, streamB);

cublasSetStream_v2(handle, streamABC);
cublasDgemm_v2(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
               &alpha, devA, n, devB, n, &beta, devC, n);

What I am seeing is that THREAD1 and THREAD2 (the read and copy tasks) complete quickly. THREAD3 shows 100% GPU utilization but takes a very long time to compute the matrix multiplications. I suspect the pinned memory is the cause and that the GEMM is thrashing against it.

If that is the case, I could use cudaMalloc to allocate devA, devB, and devC; however, then I could not overlap the PCI Express transfers with computation, since cudaMemcpyAsync needs pinned host memory to run truly asynchronously.

One alternative I am considering is to allocate a pinned staging buffer with cudaMallocHost for each copy thread, plus a pool of device memory allocated with cudaMalloc. I would copy host-to-device through the pinned staging buffer and then run a copy kernel to move the data from the pinned memory into the cudaMalloc device memory.
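In code, the staging step might look roughly like this (a sketch with hypothetical names; whether the pinned-to-cudaMalloc hop uses cudaMemcpyAsync or a custom copy kernel is an implementation detail):

```cuda
#include <cuda_runtime.h>

// Hypothetical staging step for one block: the disk-reader thread
// fills pinnedStage, then we push it into real device memory so the
// GEMM never touches host memory.
void stageBlock(double *pinnedStage,   // from cudaMallocHost, one per copy thread
                double *devA,          // from the cudaMalloc pool, GEMM operand
                size_t bytes, cudaStream_t stream)
{
    // Pinned -> device transfer over PCIe; asynchronous because the
    // source is pinned. The GEMM then reads devA from fast GPU DRAM
    // instead of faulting back across the bus.
    cudaMemcpyAsync(devA, pinnedStage, bytes,
                    cudaMemcpyHostToDevice, stream);
}
```

The key point is that the pinned buffer is only a transit area; the operands the kernel actually reads live in ordinary device memory.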

Anyways, I am fairly perplexed by this situation and how to get around it so any input would be greatly appreciated. The full test case/source code is found here:

You will need OpenBLAS and the HTGS API to compile the test suite.

This workflow is designed to test the API and I intend to compare it with cuBLASXT.

I also wanted to share the runtimes I am currently getting.

Multiplying two 16k x 16k matrices using GEMM:

CPU-only with OpenBLAS: 41945.2 ms
GPU with workflow above: 1.13477e+06 ms

I implemented the alternate method and ran some tests and it is now running much faster.

30100 ms with the workflow versus 1.13477e+06 ms in the older version.

I think the GEMM call was paging A LOT, which would definitely explain the slowdown.

Quick overview of how I fixed this issue:

First, I allocate one piece of reusable pinned memory.

Then I copy the data host-to-device through that pinned staging buffer, and from there do a device-to-device copy into a piece of memory allocated with cudaMalloc, which is what the GEMM operates on.

If anyone has better suggestions for this I will be happy to implement and test it!

Well, since matrix multiplication is a pretty slow operation, your entire architecture just adds complexity without real benefit. Use cudaMalloc to allocate GPU memory, copy data between CPU buffers and GPU memory with a simple cudaMemcpy call, and use simple I/O via fread or similar. Anyway, it will be much faster than the multiplication itself.

Have you measured the memory bandwidth for the pinned host-to-device and device-to-host copies? For PCI-E 3.0 x16 you should be seeing about 13 GB/s in each direction, but if you have an older motherboard or a CPU with too few lanes you could actually be getting half to a quarter of that speed.
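A minimal way to measure this (a standalone sketch using CUDA events; buffer size is arbitrary and error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256ull << 20;   // 256 MiB test buffer
    double *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);           // pinned host buffer
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a single pinned host-to-device transfer.
    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

The bandwidthTest sample that ships with the CUDA toolkit reports the same numbers if you would rather not roll your own.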

For a dense 16k x 16k matrix those memory operations are going to be expensive. It also seems you are using 64-bit doubles, which really need a Tesla GPU for good performance.
Do you really need double precision? GPUs use FMA operations for 32-bit floats, which are usually accurate enough (when compared against a 64-bit matrix multiply in MATLAB).

Which GPUs are you using for your tests? Which version of CUDA?

The results from the fix I described above show that multiplying 16k x 16k matrices without the API took ~54.6 seconds, while with the API it took 30.1 seconds.

For 32k x 32k matrices no API took 433.3 seconds, whereas with the API it took 230.5 seconds.

The versions that did not use the API were implemented using your suggested methodology (cudaMalloc + cudaMemcpy + fread, etc.). The API allowed the memcpy and the fread to overlap with computation.

So yes, there are real benefits to this API, particularly for matrices that fit into CPU memory but not into GPU memory. The API provides mechanisms for pipelining to overlap computation with I/O. The PCI Express should not be pushed aside; in many cases it is the determining factor in getting performance. The best way to reduce its impact is to overlap computation with the transfers, which is precisely what the API strives to enable. Also, the general overhead of my architecture is extremely lightweight, so there is minimal cost in scheduling data between the tasks.

Next, I am going to run the matrices using the cuBlasXT module to see how it compares.

Right now I’m using a Tesla C2075 and CUDA 7.5. I will see about computing the effective bandwidth I am getting from my copy operations. That will be a useful statistic to take a look at.

My motherboard has two Xeon E5-2600 v3 sockets, each of which has 40 PCIe lanes, so it's enough to feed two GPUs per CPU socket.

It seems the issue I was facing was that the memory used for the matrix multiplication was pinned host memory, so a lot of page faults were occurring, especially given the block decomposition and the fact that the matrices were out-of-core for the GPU memory.

I have a machine with two K40s and the same CPU socket I will be testing on shortly.

For 32k x 32k matrices no API took 433.3 seconds, whereas with the API it took 230.5 seconds.

i.e., just copying a few GB between CPU and GPU took 200 seconds? Check how much time cudaMalloc/cudaFree spends, or just try to use cudaMalloc/cudaFree only once, outside of cycle.

When multiplying 32k x 32k matrices, each matrix is ~7.6 GB at 64-bit precision, so the three matrices (the two operands and the result) total ~22.8 GB. Because of this I have to decompose the matrices with a block decomposition and multiply block by block, as it is not possible to fit the full matrices into GPU memory. Also, I allocate only once during task initialization and free memory once during shutdown.

The reason you are seeing the performance difference between the two is that with my API I am able to overlap the computation with the data-transfer costs. Not only that, but my version uses multiple threads for reading, copying, the matrix multiply, and accumulating the blocks. But yes, the PCI Express cost could account for part of that 200-second period, particularly considering the number of blocks that must be re-copied due to the size of these matrices.
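The overlap pattern can be sketched with a double-buffered loop (hypothetical names; the real pipeline hands blocks between threads via the API's task queues rather than a single loop):

```cuda
#include <cuda_runtime.h>

// While block i computes on streamCompute, block i+1 transfers on
// streamCopy, hiding PCIe latency behind the GEMM.
void pipelineBlocks(int numBlocks, size_t blockBytes,
                    double *pinnedSrc[],        // pinned host blocks
                    double *devBuf[2],          // two cudaMalloc'd buffers
                    cudaStream_t streamCopy, cudaStream_t streamCompute,
                    cudaEvent_t copied[2])
{
    for (int i = 0; i < numBlocks; ++i) {
        int buf = i & 1;                        // ping-pong between 2 buffers
        // Queue this block's transfer on the copy stream.
        cudaMemcpyAsync(devBuf[buf], pinnedSrc[i], blockBytes,
                        cudaMemcpyHostToDevice, streamCopy);
        cudaEventRecord(copied[buf], streamCopy);

        // The compute stream waits only for this block's transfer, so
        // the next block's copy proceeds concurrently with this GEMM.
        cudaStreamWaitEvent(streamCompute, copied[buf], 0);
        // launchGemm(streamCompute, devBuf[buf], ...);  // e.g. cublasDgemm_v2
    }
}
```

With enough blocks, the steady-state cost approaches max(transfer time, compute time) rather than their sum.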

Also what do you mean by “outside of cycle”?

Now I plan on benchmarking against the cublasXt library tomorrow as a proper comparison, as that version does the block decomposition internally and works on any matrix size that fits into CPU memory. I don't anticipate better performance than that implementation, but it will shed some light on the overhead of using my API.