Hello,
I have created a pipelined workflow system API (HTGS ·) and am in the process of testing it on a block decomposition for general matrix-matrix multiplication (GEMM).
My workflow consists of the following tasks:
(1) Read from disk for matrices A and B
(2) Copy from CPU memory to GPU memory
(3) Detect when two matrices loaded are ready to initiate GEMM
(3) Execute GEMM using cuBLAS
(4) Copy from GPU memory to CPU memory
(5) Accumulate results on the CPU
All of these tasks run on separate threads and run asynchronously using separate streams. My intention is to overlap the various I/O sections with computation.
I pre-allocate a pool of memory buffers that are used for both of the copy operations. I want to use cudaMemcpyAsync for this, so they are allocated using cudaMallocHost to get pinned memory.
Here is a brief example of how the workflow processes the computation:
cudaMallocHost(devA ... );
cudaMallocHost(devB ...);
cudaMallocHost(devC ...);
[THREAD1]
cudaMemcpyAsync(devA, cpuA, streamA)
syncStream(streamA)
[THREAD2]
cudaMemcpyAsync(devB, cpuB, streamB)
syncStream(streamB)
[THREAD3]
cublasSetStream_v2(handle, streamABC)
cublasDgemm_v2(devA, devB, devC ..)
What I am seeing is THREAD1 and THREAD2 are completing quickly. THREAD3 is using the GPU at 100% utilization but is taking a very long time to compute the matrix multiplications. I suspect that the pinned memory is the cause for this and it is thrashing with memory.
If that is the case, I could use cudaMalloc to allocate devA, devB, and devC; however, this would make it so I cannot overlap the PCI express transfers with computation.
One alternate method I am considering is to allocate a piece of memory using cudaMallocHost for each copy thread and then use a pool of memory allocated with cudaMalloc. This way, I would copy from host to device using the cudaMallocHost piece and then do a copy kernel from the pinned memory to the cudaMalloc device memory.
Anyways, I am fairly perplexed by this situation and how to get around it so any input would be greatly appreciated. The full test case/source code is found here: GitHub - usnistgov/HTGS-Tutorials
You will need openBLAS and the HTGS API to compile the test suite: http://www.openblas.net/ and GitHub - usnistgov/HTGS: The Hybrid Task Graph Scheduler API
This workflow is designed to test the API and I intend to compare it with cuBLASXT.