Matrix-Vector Multiply, then Max of the Resultant Vector

Firstly, I apologize if this is a very basic question, but I am very new to CUDA. I would like to achieve the following:

  1. Load a matrix M of size m × w from CPU to GPU
  2. Load a vector W of size w from CPU to GPU
  3. Multiply M and W, which will result in a new vector V of size m (all in GPU)
  4. Calculate maximum scalar value X from vector V (all in GPU) and then copy scalar X from GPU to CPU

In the above:

  • m can be of size 2 million to 10 million
  • w can be of size 100 to 10,000.
  • Input Vector W is made of floating point numbers, between 0.0 and 1.0.
  • Matrix M, Vector V, and Scalar X output are all made of positive or negative floating point numbers

Unless you want to learn CUDA, it's best to use libraries for data sizes as large as you have mentioned.

Use a library like cuBLAS for matrix multiplication.
Use a library like thrust for the maximum operation.
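To make that concrete, here is a minimal sketch of the whole pipeline (my own illustration; error checking is omitted and the sizes and fill steps are placeholders). Everything stays in device memory and only the final scalar comes back to the host:

```
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <cstdio>

int main() {
    // Illustrative sizes only; the real m and w come from the application.
    const int m = 100000, w = 1000;

    thrust::device_vector<float> M(size_t(m) * w);  // matrix, column-major for cuBLAS
    thrust::device_vector<float> W(w);              // input vector
    thrust::device_vector<float> V(m);              // result vector, never leaves the GPU

    // ... fill M and W here, e.g. with cublasSetMatrix / cublasSetVector ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // V = M * W  (M is m x w, column-major, no transpose)
    cublasSgemv(handle, CUBLAS_OP_N, m, w, &alpha,
                thrust::raw_pointer_cast(M.data()), m,
                thrust::raw_pointer_cast(W.data()), 1, &beta,
                thrust::raw_pointer_cast(V.data()), 1);

    // The max reduction runs on the device; only this single float crosses PCIe.
    float X = *thrust::max_element(V.begin(), V.end());
    std::printf("max = %f\n", X);

    cublasDestroy(handle);
    return 0;
}
```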

@rreddy78 thanks for your reply.

From their respective websites I found these:
Matrix Multiply: cublasSgemv or cublasDgemv
VectorMaximum: thrust::max_element

A few further questions:

  • Can the matrix multiply and the vector maximum be done without moving intermediate data off the GPU, since I only need the scalar output?
  • How much memory do you think is needed for the above scenario?
  • I see there are two variants, cublasSgemv and cublasDgemv, one for single and one for double floating-point precision. Are there performance differences between them on GTX or RTX GPUs?

The performance of a matrix-vector multiply is limited by memory throughput. While GPUs generally provide higher memory throughput than the ordinary system memory of the host system, in this case it seems only a single such operation is offloaded to the GPU. For this both the matrix and the vector have to be sent to the GPU across the PCIe interconnect, which offers at best around 12 GB/sec throughput for a PCIe gen3 x16 link and 25 GB/sec for a PCIe gen4 x16 link.

So unless there is a significant amount of additional work that is performed on the data shipped off to the GPU, the overall effect of off-loading the work to the GPU will be a slowdown, i.e. you are better off performing this work on the host system.
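To put rough numbers on that (my own back-of-envelope, assuming the largest sizes mentioned above and 4-byte floats):

```
#include <cstdio>

int main() {
    const double m = 10e6, w = 10e3;                 // largest sizes mentioned above
    const double bytes = m * w * sizeof(float);      // ~4e11 bytes = ~400 GB for M alone
    const double pcie_gen3 = 12e9, pcie_gen4 = 25e9; // approx. B/s for x16 links
    std::printf("matrix: %.0f GB, gen3: %.0f s, gen4: %.0f s per full transfer\n",
                bytes / 1e9, bytes / pcie_gen3, bytes / pcie_gen4);
    // Prints roughly: matrix: 400 GB, gen3: 33 s, gen4: 16 s per full transfer
    return 0;
}
```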

@njuffa Thanks for your suggestions. The actual use case involves updating the W vector from host to device and checking the results repeatedly, for maybe millions of iterations. So I think there is a benefit to using the GPU; what's your view?

Note, matrix M does not change in these iterations.

I wonder whether this can be done with multiple threads in parallel, i.e. whether I can send multiple W vectors to a function and get the outputs.

Here are some of my thoughts (which may be corrected by others, as I am a beginner in CUDA too):

  1. Data has to be present in GPU (device) memory for the computation to occur (except on integrated GPU systems where memory is shared). For discrete GPUs, I believe the Unified Memory feature can automatically manage the migration of data between CPU and GPU, but that adds overhead.

  2. The memory required is huge. You can do the calculation: with the float data type (4 bytes per element), the matrix at the maximum configuration of 10 million × 10,000 elements is about 400 GB (~372 GiB)!

  3. Using float will definitely be significantly faster than double (which is 8 bytes in size). In fact, you could check whether a 16-bit floating-point format fits the use case.

Another point that I know of: it is possible to overlap kernel execution with data transfer in CUDA to reduce the overhead, but I do not know how to make use of this with libraries.
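For reference, a minimal sketch of that mechanism with cuBLAS might look like the following (my own sketch; buffer contents and the surrounding chunk loop are placeholders). The key pieces are pinned host memory, two streams, and cublasSetStream:

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    // Placeholder sizes: one column chunk of the matrix.
    const int m = 10000, cols = 256;
    const size_t chunkBytes = size_t(m) * cols * sizeof(float);

    float *hChunk;                       // pinned host staging buffer (needed for async copies)
    cudaMallocHost((void **)&hChunk, chunkBytes);

    float *dA, *dB, *dX, *dY;
    cudaMalloc((void **)&dA, chunkBytes);
    cudaMalloc((void **)&dB, chunkBytes);
    cudaMalloc((void **)&dX, cols * sizeof(float));
    cudaMalloc((void **)&dY, m * sizeof(float));

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, computeStream);   // cuBLAS work is issued into this stream

    const float alpha = 1.0f, beta = 0.0f;
    // Copy the *next* chunk into dB while cuBLAS computes on the *current* chunk in dA.
    cudaMemcpyAsync(dB, hChunk, chunkBytes, cudaMemcpyHostToDevice, copyStream);
    cublasSgemv(handle, CUBLAS_OP_N, m, cols, &alpha, dA, m, dX, 1, &beta, dY, 1);

    cudaStreamSynchronize(copyStream);        // in a loop: swap dA and dB and repeat
    cudaStreamSynchronize(computeStream);

    cublasDestroy(handle);
    return 0;
}
```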

@rreddy78 thanks. It's the matrix M that has the numbers; I can scale them to fit into 4-byte float. Vector W is already made of values between 0.0 and 1.0 and can use 4-byte float too. I read somewhere that there is no speed gain between 8-byte double and 4-byte float on GTX, but there is a gain on RTX. Given the size of the dataset, it is best to use 4-byte floats from a memory-usage perspective.

I will execute the above steps for each new set of vector W values; matrix M does not change between executions. How can I use multi-threading to calculate in parallel?

Certainly float should be preferred over double as it saves memory bandwidth.

Coming to a possible method, here are some of my initial thoughts (from a novice perspective only!):

  1. M is constant, but it is very large and won’t fit into GPU memory all at once. So it has to be sent part by part, column-wise, and a partial result vector computed each time. For this, cuBLAS can be used; the library will automatically parallelize the computation on the GPU.

  2. There can be two buffers on the GPU for the partial M. While one is being used, the other can be filled from the host side. This would hide the overhead of data movement.

  3. The storage format for M on the host might have to be column-wise, since M has to be split column-wise for transfer.

  4. Once the entire result vector is computed, thrust::max_element can be applied to it to get the result. A rough sketch of the column-wise splitting follows below.
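Here is that sketch (my own illustration, assuming M is stored column-major on the host; the double-buffering from point 2 is left out for brevity, sizes are placeholders, and error checking is omitted):

```
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

int main() {
    const int m = 100000;           // rows (illustrative)
    const int w = 1024;             // columns (illustrative)
    const int chunkCols = 256;      // columns per transfer

    std::vector<float> hM(size_t(m) * w, 1.0f);  // host matrix, column-major
    std::vector<float> hW(w, 0.5f);              // host input vector

    float *dM, *dW, *dV;
    cudaMalloc((void **)&dM, size_t(m) * chunkCols * sizeof(float));  // room for one chunk
    cudaMalloc((void **)&dW, w * sizeof(float));
    cudaMalloc((void **)&dV, m * sizeof(float));
    cudaMemcpy(dW, hW.data(), w * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dV, 0, m * sizeof(float));        // partial results accumulate into dV

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 1.0f;       // beta = 1: accumulate

    for (int c0 = 0; c0 < w; c0 += chunkCols) {
        const int cols = std::min(chunkCols, w - c0);
        // Columns [c0, c0 + cols) of M are contiguous because M is column-major.
        cudaMemcpy(dM, hM.data() + size_t(c0) * m,
                   size_t(m) * cols * sizeof(float), cudaMemcpyHostToDevice);
        // dV += M_chunk * W_chunk
        cublasSgemv(handle, CUBLAS_OP_N, m, cols, &alpha, dM, m,
                    dW + c0, 1, &beta, dV, 1);
    }
    // dV now holds the full result vector; apply thrust::max_element to it as before.

    cublasDestroy(handle);
    cudaFree(dM); cudaFree(dW); cudaFree(dV);
    return 0;
}
```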

cublasXt may be of interest here. However I concur with the comments made by njuffa.

If this is the only operation you intend to do on the data, a matrix-vector multiply by itself may not be that interesting to offload to the GPU. And although the comment that “I will execute the above steps for each new set of vector W values; matrix M does not change between executions.” would normally make the problem more interesting, if the M matrix will not fit in GPU memory, then I think the situation is still questionable from a performance perspective: each time you do a matrix-vector multiply, the entire M matrix will have to be transferred to the GPU again, and that is going to reduce performance.

If all of the W vectors are available at once, then this becomes not a matrix vector multiply problem, but instead a matrix-matrix multiply problem (treating all the W vectors collectively as a single matrix) and in that case the problem could probably be formulated differently, and perhaps cublasXt could be used to good effect.
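To illustrate that reformulation (a sketch of my own, with an assumed batch size k and placeholder sizes): pack the k W vectors column-wise into a w × k matrix, and one gemm call produces all k result vectors.

```
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    // Illustrative sizes: m rows and w columns of M, k W vectors batched together.
    const int m = 100000, w = 1024, k = 64;

    float *dM, *dWbatch, *dV;
    cudaMalloc((void **)&dM, size_t(m) * w * sizeof(float));       // M, column-major
    cudaMalloc((void **)&dWbatch, size_t(w) * k * sizeof(float));  // k W vectors as columns
    cudaMalloc((void **)&dV, size_t(m) * k * sizeof(float));       // k result vectors

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // V(m x k) = M(m x w) * Wbatch(w x k): one gemm call instead of k gemv calls.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, k, w,
                &alpha, dM, m, dWbatch, w, &beta, dV, m);
    // Column j of dV is the result for W vector j; take the max of each column
    // (e.g. thrust::max_element over that column) to get the k scalars.

    cublasDestroy(handle);
    cudaFree(dM); cudaFree(dWbatch); cudaFree(dV);
    return 0;
}
```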

@Robert_Crovella thanks for your reply. Unfortunately the W vectors are not known upfront and can have 10,000 factorial permutations (which is way too large and impossible to handle). Thus I’m intending to use a solver instead, with its code running on the host. The solver is likely to iterate 1-2 million times rather than over the huge number of permutations. I may be able to optimise the matrix to fit GPU memory; however, I would like to know these things:

  • If the GPU has 8 GB RAM, how much of it needs to be reserved for the inner workings of the GPU in order to perform the multiplication and then apply thrust::max_element? I assume the remainder can be used to load the matrix and vector W.

  • Is it possible to use 16 bit floats, something like bfloat16?

  • Will I be able to utilize multiple GPUs and share memory among them in order to perform the above task?

  • Will I be able to utilize host memory for GPU calculations?

  1. I wouldn’t know for sure without trying it. But on an 8 GB GPU, as a round estimate, assume you have 7 GB to use for data storage. The rest is “overhead”.
  2. Yes, with cublas or cublasLt
  3. Yes, with cublasXt
  4. Yes, with cublasXt

However, you can’t necessarily do all these things at the same time. For example, cublasXt doesn’t support 16-bit operations, as far as I can see.
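For reference, the cublasXt interface takes host pointers and sizes directly and handles the tiling across the selected devices internally; a rough sketch of my own (sizes illustrative, error checking omitted):

```
#include <cublasXt.h>
#include <vector>

int main() {
    // Illustrative sizes: k batched W vectors treated as a w x k matrix.
    const size_t m = 100000, w = 1024, k = 64;

    std::vector<float> A(m * w, 1.0f);   // M in ordinary host memory, column-major
    std::vector<float> B(w * k, 0.5f);   // batched W vectors, host memory
    std::vector<float> C(m * k, 0.0f);   // results, host memory

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[] = {0};                 // which GPUs to use
    cublasXtDeviceSelect(handle, 1, devices);

    const float alpha = 1.0f, beta = 0.0f;
    // cublasXt tiles the host operands and streams the tiles to the GPUs itself.
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, k, w,
                  &alpha, A.data(), m, B.data(), w, &beta, C.data(), m);

    cublasXtDestroy(handle);
    return 0;
}
```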

If you want to perform the decomposition of a large matrix-vector multiply yourself, rather than using cublasXt to handle the decomposition for you, then you could use 16-bit operations in ordinary cuBLAS or cublasLt.
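One possible way to do that (my own sketch, assuming cuBLAS 11 or newer): store the matrix chunk as __half, keep the computation and the result in FP32, and express the matrix-vector product as a GemmEx call with a single output column, since cublasSgemv has no 16-bit variant.

```
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 100000, w = 1024;   // illustrative sizes

    __half *dM, *dW;                  // 16-bit storage halves the transfer volume
    float  *dV;                       // keep the result and accumulation in FP32
    cudaMalloc((void **)&dM, size_t(m) * w * sizeof(__half));
    cudaMalloc((void **)&dW, w * sizeof(__half));
    cudaMalloc((void **)&dV, m * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Treat W as a w x 1 matrix: V(m x 1) = M(m x w) * W(w x 1),
    // FP16 inputs, FP32 compute and output.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, 1, w,
                 &alpha,
                 dM, CUDA_R_16F, m,
                 dW, CUDA_R_16F, w,
                 &beta,
                 dV, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cublasDestroy(handle);
    cudaFree(dM); cudaFree(dW); cudaFree(dV);
    return 0;
}
```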

I’ll repeat: using a large matrix that won’t fit in your ~7GB of GPU memory, and/or doing the decomposition yourself for a single GPU, and/or using cublasXt with a single GPU on a matrix that won’t fit in 7GB, is probably not going to be exciting from a performance perspective.

Using multiple GPUs is not going to change the practicalities very much. Because data movement between system memory and GPUs requires PCIe transfers, you will need a CPU with a lot of PCIe lanes (16 per GPU). In order to feed those, you will need high-bandwidth system memory: at least four DDR4 channels and preferably more.

So the host system would be configured like an expensive high-performance server-class system. But if you have that, you don’t need GPUs for this particular use case; you could simply do the entire work on the host. Systems with 32 CPU cores and 512 GB of system memory across 8 DDR4 channels appear to be readily available.

For a matrix-vector multiply, where the matrix can be split to fit (completely) in the memory of several GPUs, there may be some benefit in the sense that you would only have to transfer the matrix to the GPUs once, and then re-use it for each subsequent matrix-vector multiply. The vector to multiply would be sent to all the GPUs, so the vector transfer cost increases by the number of GPUs used. In this case, the result vector would have to be assembled from pieces on each GPU. But since the subsequent operation would be a max-finding op, this can be split in pieces, meaning the reassembly cost can be mostly eliminated also.
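A rough sketch of that row-wise multi-GPU split (my own illustration; sizes are placeholders, the copies of M and W are elided, and the per-GPU work is shown serialized for clarity):

```
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>
#include <algorithm>
#include <limits>
#include <vector>

int main() {
    const int w = 1024;                 // columns (illustrative)
    const int rowsPerGpu = 100000;      // row block assigned to each GPU (illustrative)
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    std::vector<cublasHandle_t> handle(nGpus);
    std::vector<float *> dM(nGpus), dW(nGpus), dV(nGpus);

    // One-time setup: each GPU permanently holds its block of rows of M.
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cublasCreate(&handle[g]);
        cudaMalloc((void **)&dM[g], size_t(rowsPerGpu) * w * sizeof(float));
        cudaMalloc((void **)&dW[g], w * sizeof(float));
        cudaMalloc((void **)&dV[g], rowsPerGpu * sizeof(float));
        // ... copy this GPU's row block of M into dM[g] (done only once) ...
    }

    // Per iteration: broadcast the new W, compute the partial gemv and a local max.
    const float alpha = 1.0f, beta = 0.0f;
    float globalMax = -std::numeric_limits<float>::infinity();
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        // ... copy the current host-side W into dW[g] ...
        cublasSgemv(handle[g], CUBLAS_OP_N, rowsPerGpu, w, &alpha,
                    dM[g], rowsPerGpu, dW[g], 1, &beta, dV[g], 1);
        thrust::device_ptr<float> v(dV[g]);
        float localMax = *thrust::max_element(v, v + rowsPerGpu);  // reduction on GPU g
        globalMax = std::max(globalMax, localMax);                 // one float per GPU
    }
    return 0;
}
```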

If the matrix cannot completely fit in the memory of the several GPUs, then I would go back to my previous comments: this doesn’t look to be an interesting op to perform on the GPU(s).

I would suggest taking a hard look at the higher-level task to be accomplished. In other words, algorithms. I consider it quite unlikely that a massive matrix-vector multiply is the only way to skin this particular kumquat.

@Robert_Crovella, just to understand a bit better what you and @njuffa reiterated.

Are you saying that it's not possible to hide the latency of data transfer between CPU and GPU by doing the compute and transfer in parallel?

Is this because, even if we do so, the bandwidth between the GPU SMs and GPU memory is reduced by a proportional amount, thus increasing the compute time of the kernel itself?

There could certainly be overlap. But there is not a whole lot of compute in matrix-vector computation. That is why it is bound by memory throughput. With small GPU memory connected to the host system by a tiny PCIe straw, all components (CPU, GPU, PCIe interconnect) would mostly be busy shuffling data around, concurrently. That is not an efficient use of energy.

Both in terms of execution time and energy usage, compute is dirt cheap while memory access is expensive. Therefore a key to efficient computing is to minimize moving data. In terms of BLAS: Use BLAS-3 operations heavily, use BLAS-1 and BLAS-2 operations lightly.

For an out-of-core scenario (problem size > GPU memory) even BLAS-3 operations can suffer massive reduction in efficiency, but are likely to still provide an overall win since a GPU can beat a CPU by a factor of 10 in FLOPS and a factor of 5 in local bandwidth.