Irregular access to a huge matrix that cannot fit in device memory

Hi, I have an application with a matrix of M columns, each column having F rows. This M*F matrix can be 20+ GB and is therefore beyond the capacity of any single GPU we have seen.

In the kernel, each thread irregularly and sparsely accesses many of the M columns. Any suggestions on how to speed up this type of application?

  1. Put the matrix in host memory and use unified memory so that the kernel can access it. This is likely to be slow.

  2. Stage the matrix in batches in device memory. Since the access is irregular, each thread cannot complete until the whole matrix has been staged in.

Thanks!

The third option I can think of is to use device-mapped pinned host memory, also called zero-copy. Irregular, limited, sparse access to huge datasets is a canonical use case for zero-copy techniques. There are CUDA samples that can introduce you to the concept:

[url]http://docs.nvidia.com/cuda/cuda-samples/index.html#simplezerocopy[/url]

Unified memory will not work if the data set that is referenced by the unified memory pointers is larger than available GPU memory.

A drawback of zero-copy is that data accesses can be no faster than the speed of the PCIE bus, since data accessed by the GPU in this fashion is resident in host memory, and will be transferred to the GPU upon access via PCIE. If you only access a given location once in your GPU kernel, however, this is not likely to be any less efficient than transferring the data once (e.g. via cudaMemcpy) which also must cross the PCIE bus. Therefore careful coding techniques, to only access a specific global location once (and also coalesce access if possible) will help mitigate this drawback of zero-copy.
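For illustration, here is a rough, untested sketch of how a mapped pinned (zero-copy) allocation might be set up. The kernel, the gatherColumns name, the index array, and the matrix dimensions are all placeholders I made up for the example; only the cudaSetDeviceFlags / cudaHostAlloc / cudaHostGetDevicePointer sequence is the actual zero-copy mechanism, and error checking is omitted:

[code]
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads a handful of scattered columns,
// each element exactly once. Reads cross PCIe on demand because the data
// lives in mapped pinned host memory.
__global__ void gatherColumns(const float *matrix, const int *colIdx,
                              float *out, size_t F, int colsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int c = 0; c < colsPerThread; ++c) {
        size_t col = colIdx[tid * colsPerThread + c];
        acc += matrix[col * F + (tid % F)];   // placeholder access pattern
    }
    out[tid] = acc;
}

int main()
{
    const size_t M = 1 << 18, F = 1 << 10;       // placeholder dimensions
    const size_t bytes = M * F * sizeof(float);

    cudaSetDeviceFlags(cudaDeviceMapHost);        // must precede first CUDA call that creates the context

    float *h_matrix;                              // pinned, device-mapped host buffer
    cudaHostAlloc((void **)&h_matrix, bytes, cudaHostAllocMapped);

    float *d_matrix;                              // device-side alias of the same memory
    cudaHostGetDevicePointer((void **)&d_matrix, h_matrix, 0);

    // ... fill h_matrix on the host, allocate colIdx/out with cudaMalloc, then e.g.:
    // gatherColumns<<<blocks, threads>>>(d_matrix, d_colIdx, d_out, F, colsPerThread);

    cudaFreeHost(h_matrix);
    return 0;
}
[/code]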

You did not mention the data type of the elements of the matrix, nor the details of the required access pattern. You may be able to perform your computations in ‘float’ instead of ‘double’, as part of a mixed-precision computation, which frequently works for iterative methods that require double precision only in the final refinement step. You may be able to split the matrix into tiles or panels, along the lines of traditional out-of-core sparse matrix methods, such that the portion resident on the GPU at any time does not exceed the 12 GB per-GPU maximum of high-end cards.
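To make the out-of-core idea concrete, here is a sketch of a panel-staging loop with double buffering so the copy of the next panel overlaps the compute on the current one. It assumes the work can be partitioned so that each launch only touches the panel currently resident on the device, and that h_matrix is pinned so the async copies truly overlap; processPanel, the panel width, and the per-column work are placeholders of mine:

[code]
#include <cuda_runtime.h>

// Placeholder kernel: real code would do the application's per-column work here.
__global__ void processPanel(const float *panel, size_t panelCols, size_t F,
                             size_t firstCol, float *partial)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (tid < panelCols)
        partial[firstCol + tid] += panel[tid * F];   // e.g. touch first row of each column
}

void processOutOfCore(const float *h_matrix, size_t M, size_t F, float *d_partial)
{
    const size_t panelCols = 1 << 14;                // sized so one panel fits in device memory
    const size_t panelBytes = panelCols * F * sizeof(float);

    float *d_panel[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {                    // double buffer: copy overlaps compute
        cudaMalloc(&d_panel[i], panelBytes);
        cudaStreamCreate(&stream[i]);
    }

    for (size_t col = 0, buf = 0; col < M; col += panelCols, buf ^= 1) {
        size_t cols = (col + panelCols <= M) ? panelCols : (M - col);
        cudaMemcpyAsync(d_panel[buf], h_matrix + col * F,
                        cols * F * sizeof(float),
                        cudaMemcpyHostToDevice, stream[buf]);
        processPanel<<<256, 256, 0, stream[buf]>>>(d_panel[buf], cols, F, col, d_partial);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(d_panel[i]);
    }
}
[/code]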

Thanks. This is an option I would try, but definitely not an optimal one. The reason is that the matrix is huge and each thread accesses a “sparse portion” of it. If texture memory could be bound to host memory, that might be better.

Thanks for the comments. I already use float instead of double in my computation. The M*F matrix can be arbitrarily huge in many cases, so using float, or even half precision, would not solve the problem. (I am using a K40, which does have 12 GB of memory.)

I may need to read the code on sparse matrix manipulation. For now I think splitting the matrix is also difficult because each thread’s access is irregular and spans the whole matrix. Basically, I need device global memory to serve as a “read-only cache” for a huge host memory.

“each thread’s access is irregular and spans across the whole matrix”

i would think that this would surface as a performance bottleneck, sooner or later
hence, any possible improvement in this regard should yield gains

one option may be to revise the algorithm, such that, even if they are irregular, accesses by thread blocks still generally fall in the same block/ tile/ region
in many cases, the input/ factors are not purely random, but semi-deterministic (or semi-non-deterministic then)
another option may be to have the host support the device by pre-processing the data before passing it to the device, concurrently with the device executing the algorithm
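As a rough illustration of the first option, the host could sort or bucket the per-thread column indices by tile before launching, so that consecutive work items (and hence thread blocks) touch the same tile. Everything here (WorkItem, binByTile, colsPerTile) is a hypothetical name of mine, not an existing API:

[code]
#include <algorithm>
#include <vector>

// One irregular access: which logical task it belongs to and which column it reads.
struct WorkItem {
    int threadId;
    int column;
};

// Group work items by which tile of the matrix they touch, so that one kernel
// launch (or one thread block) only needs one tile resident on the device.
void binByTile(std::vector<WorkItem> &items, int colsPerTile)
{
    std::sort(items.begin(), items.end(),
              [colsPerTile](const WorkItem &a, const WorkItem &b) {
                  return a.column / colsPerTile < b.column / colsPerTile;
              });
    // After sorting, consecutive items share a tile; the host can stage one tile
    // to the device, launch the kernel on that contiguous slice of 'items', and
    // prepare or copy the next tile concurrently, as suggested above.
}
[/code]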

Even though it would be the more performant approach, you cannot directly do your suggestion 2), as there is no guarantee that your threads will all run in parallel.

However, you can run separate kernels for each stride of the input matrix, where threads dump their state into device memory at the end of each kernel and the next kernel picks up from the previously dumped state.
That of course assumes you have enough device memory to keep the state for all threads.
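A minimal sketch of that idea, assuming the per-thread state fits in a single device array; the processStride kernel, the stride width, and the placeholder work inside the loop are mine, and error checking is omitted:

[code]
#include <algorithm>
#include <cuda_runtime.h>

// Each launch processes one stride (chunk of columns) and carries the
// per-thread state forward in device memory between launches.
__global__ void processStride(const float *stride, size_t strideCols, size_t F,
                              size_t firstCol, float *state, size_t numThreads)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;
    float s = state[tid];                    // resume from the previous launch
    for (size_t c = 0; c < strideCols; ++c)  // placeholder per-thread work
        s += stride[c * F + (tid % F)];
    state[tid] = s;                          // dump state for the next launch
}

void runStaged(const float *h_matrix, size_t M, size_t F, size_t numThreads)
{
    const size_t strideCols = 1 << 14;       // sized so one stride fits in device memory
    float *d_stride, *d_state;
    cudaMalloc(&d_stride, strideCols * F * sizeof(float));
    cudaMalloc(&d_state, numThreads * sizeof(float));
    cudaMemset(d_state, 0, numThreads * sizeof(float));

    unsigned blocks = (unsigned)((numThreads + 255) / 256);
    for (size_t col = 0; col < M; col += strideCols) {
        size_t cols = std::min(strideCols, M - col);
        cudaMemcpy(d_stride, h_matrix + col * F, cols * F * sizeof(float),
                   cudaMemcpyHostToDevice);
        processStride<<<blocks, 256>>>(d_stride, cols, F, col, d_state, numThreads);
    }
    cudaDeviceSynchronize();
    cudaFree(d_stride);
    cudaFree(d_state);
}
[/code]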

Thank you Tera. What you proposed is indeed a solution. However, since the intermediate state is much larger than the final result, this staged solution is not going to work well.

BTW: I tried using zero-copy host memory. As expected, it works but is VERY slow.
Any idea whether CUDA supports something like “a cache in device memory for host memory”?