Hi I have an application that has M columns and each column with F rows. This M*F matrix can be 20+ GB and therefore beyond the capacity of any single GPU we see.
In the kernel, each thread irregularly and sparsely accesses many columns of among the M. Any suggestion on how to speed up this type of application?
put the matrix in host memory and use unified memory so that the kernel can access. This is to be slow.
stage the matrix in batches in device memory. Since the access is irregular, each thread cannot complete until you stage the all matrix in.
The third option I can think of is to use device-mapped pinned host memory, also called zero-copy. Irregular, limited sparse access to huge datasets is a canonical use case for zero-copy techniques. There are cuda samples that can introduce you to the concept:
Unified memory will not work if the data set that is referenced by the unified memory pointers is larger than available GPU memory.
A drawback of zero-copy is that data accesses can be no faster than the speed of the PCIE bus, since data accessed by the GPU in this fashion is resident in host memory, and will be transferred to the GPU upon access via PCIE. If you only access a given location once in your GPU kernel, however, this is not likely to be any less efficient than transferring the data once (e.g. via cudaMemcpy) which also must cross the PCIE bus. Therefore careful coding techniques, to only access a specific global location once (and also coalesce access if possible) will help mitigate this drawback of zero-copy.
You did not mention the data type of the elements of the matrix, nor the details of the required access pattern. You may be able to perform your computations in ‘float’ instead of ‘double’, as part of a mixed-precision computation, which frequently works for iterative methods that require double precision only in the final refinement step. You may be able to split the matrix into tiles or panels, along the lines of traditional out-of-core sparse matrix methods, such that the portion resident on the GPU at any time does not exceed the 12 GB per-GPU maximum of high-end cards.
Thanks. This is an option I would try, but definitely not the optimized one. The reason is, that, the matrix is huge and each thread accesses a “sparse portion” of it. If texture memory can be bound to host memory, that might be better.
Thanks for the comments. I used float instead of double in my computation. The M*F matrix can be arbitrarily huge in many cases, therefore using float, or even half-precision would not work. (I am using K40 which does have 12GB memory.)
I may need to read the code on sparse matrix manipulation. For now I think split the matrix is also difficult because each thread’s access is irregular and spans across the whole matrix. Basically, I need device global memory serving as a “read-only cache” for a huge host memory.
“each thread’s access is irregular and spans across the whole matrix”
i would think that this would surface as a performance bottleneck, sooner or later
hence, any possible improvement in this regard should yield gains
one option may be to revise the algorithm, such that, even if they are irregular, accesses by thread blocks still generally fall in the same block/ tile/ region
in many cases, the input/ factors are not purely random, but semi-deterministic (or semi-non-deterministic then)
another option may be have the host support the device by pre-processing the data before passing it to the device, concurrently with the device executing the algorithm
Even though it would be the more performant approach, you cannot directly do your suggestion 2), as there is no guarantee that your threads will all run in parallel.
However, you can run separate kernels for each stride of the input matrix, where threads dump their state into device memory at the end of each kernel and the next kernel picks up from the previously dumped state.
That of course assumes you have got enough device memory to keep the state for all threads.
Thank you Tera. What you proposed is indeed a solution. However, since the intermediate state is much larger than the final result, this staged solution is not going to work well.
BTW: I tried using zero-copy host memory. As expected, it works but is VERY slow.
Any idea if CUDA would support something like “a cache on device memory, for host memory”?