I need to compute the cross-product of a very huge matrix X (20GB) with a vector y. The data is too large to be fully loaded into physical memory. I have used memory mapping technique plus OpenMP to speed up the computation, but still wish to accelerate the computation further since that is the smallest unit of my whole algorithm which has 100 iterations.

I am new to GPU computing, but just think out loud here. My questions are:

would GPU parallel computing help in the data-larger-than-RAM cases? If so, are there existing successful applications in such cases?

Are there any good benchmarking experiments about GPU computing?

There are certainly people who have worked on out-of-core matrix computations with CUDA, but it is not clear to me what of that work is relevant to your specific use case. Generally speaking operations on large matrices can be broken down into operation on tiles or panels that can be made to fit GPUs of various capacity. What kind of matrix is this, dense or sparse? Does it have a particular structure (e.g. tri-diagonal, banded, triangular)? Is the data single-precision or double precision?

At this point, thousands of papers have been published on work with GPUs that contain performance numbers. Google Scholar (scholar.google.com) lets you find relevant papers with ease.

Thank you very much for the quick reply, njuffa. Below is the CUDA device information for my Mac. My goal here is to enable the computation with ordinary computer. It would not be my case for purchasing new devices.

The matrix I work with is dense, double-precision, with over 2500 rows and 1 million columns. There is no particular structure. However, I preprocessed it into a new binary file, which stores all elements of the matrix into a loooog vector.

I will check out papers on google scholar, but could you let me know if you have some ideas about how to do the GPU computing on my machine? Thanks a lot!

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GT 750M”
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147024896 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 926 MHz (0.93 GHz)
Memory Clock rate: 2508 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 750M

Break your A matrix into contiguous sets of rows, however many rows such that 2 such chunks will easily fit into GPU memory. Your GT750M has 2GB of memory, so let’s choose 1.2GB for this, i.e. 600MB per chunk. Since your matrix has 1M columns this would be 75 rows per “chunk”.

You would copy the first chunk of A (i.e. rows 0 to 74) to the GPU, plus copy the entire x vector (1 million double elements, so 8MB storage for that). Also need to allocate storage for 2 segments of b vector, each 75 rows long. You would then write a pipeline loop to do the following:

Execute a cublas gemv operation on the current chunk A’x = b’ in stream S1.

Execute a H->D cudaMemcpyAsync operation to transfer the next chunk of A to the device in stream S2

Execute a D->H cudaMemcpyAsync operation to transfer the results of the previous (iteration, not the one launched in step 1 of this iteration) gemv operation (b`) to the host in stream S3.

Rotate the definition of S1->S2->S3

repeat the steps for the next chunk.

This is just an outline of the general idea. I’m not trying to recite an exact recipe here.

Whether or not this will be efficient or useful on a GT750M (a GPU with 2 Kepler SMs, without particularly good DP floating point perf.) I can’t say. I have my doubts. But the plumbing to create the pipeline loop is fairly straightforward, and arithmetically the problem is easily separable.

Regarding the processing pipeline, there are many tutorials on the web which explain it some detail. Here is one such presentation:

Probably I should be a little more emphatic. If I understand your case correctly, I have strong doubts that using a GPU would be useful.

matrix-vector multiply has a relatively small amount of data-reuse and compute intensity, per byte of data transfer.

If you are working with an iterative solution method (e.g. conjugate gradient) that will re-use the A matrix, then the GPU can certainly provide interesting benefits. But if you have exactly one matrix-vector multiply to do per A matrix, and the A matrix is on the host and needs to be transferred to the device, the GPU is unlikely to be interesting for this operation. The data reuse and compute intensity per byte of transfer are too low.

Working with a matrix that won’t fit in GPU memory, necessitating transfer for each iteration, would wipe out any benefit even in an iterative algorithm like CG that could re-use the A matrix.

Furthermore, the GT750M has particularly low throughput for double-precision operations like multiply and add, on the order of 30GFlops/s. It’s quite possible that your CPU has higher throughput, and you might be better served simply by using a good CPU BLAS library like MKL or OpenBLAS.

If you had multiple RHS vectors to compute per A matrix, things could possibly be different, although I doubt that the GT750M would ever be that interesting on double-precision calculations.

Thank you very much for your detailed guidance, txbob!! Really helpful!

I was also thinking that GPU might not benefit too much in this case since data-reading is the bottleneck. The computation time is almost negligible as compared to the I/O read.

And yes, I am working with an iterative algorithm. Basically, the workflow looks like following.

Reading in the big matrix A and vector x, as in your example;

Compute matrix-vector multiplication, b = Ax

Do some other computation to update x’;

Redo cross-product, b’ = Ax’;

Repeat Step 3 and 4 for 100 times.

So the A-matrix is re-usable. But still, I am concerned whether GPU would speed up the computation or not. I am going to do some test. Please let me know if you have any other comments. Thank you very much!