Efficient repeated copying of a vector

I have a device float vector of length 784 and I want to copy it 60000 times into a contiguously-allocated array (i.e. the end result is a 60000 x 784 row-major matrix in which each column contains 60000 repetitions of the same value). I’m currently doing this by calling cublasScopy 60000 times from the host, but this obviously gives really bad performance due to kernel launch overhead.

Is there a way to do this with a single kernel call? Obviously I can write my own but I feel there must be an obvious cublas or other library function I’m overlooking.

PS Bonus points for anyone who can guess what canonical problem I’m working on based on the dimensions above :)

First rule of copying: Don’t copy. Why is data replication necessary? I would try thinking about a solution where the data is referenced via pointer, completely eliminating the need to physically replicate it. If you must replicate (for reasons not explained), do it in power-of-two steps: from one row create two contiguous rows, then from those two create four, and so on (sketched below). Far fewer copies, each significantly bigger in size, which should cut down on overhead. Ultimately you will still be limited by memory bandwidth.

Which gets us back to the first rule of copying …
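A minimal sketch of that doubling approach, assuming a device buffer d_mat of rows*cols floats whose first row has already been filled in (the function name and arguments here are hypothetical):

#include <algorithm>
#include <cuda_runtime.h>

// Double the number of valid rows with each device-to-device copy until all
// rows are populated; std::min handles a non-power-of-two row count.
void replicate_rows(float *d_mat, size_t rows, size_t cols)
{
  size_t filled = 1;                               // row 0 is already valid
  while (filled < rows) {
    size_t n = std::min(filled, rows - filled);    // rows copied this step
    cudaMemcpy(d_mat + filled*cols,                // destination: next block of rows
               d_mat,                              // source: the already-filled rows
               n*cols*sizeof(float),
               cudaMemcpyDeviceToDevice);
    filled += n;
  }
}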

The matrix I described above is going to be an input to cublasSgemm. Specifically, the calculation is
C = A*B + C (i.e. gemm with alpha = beta = 1), where the initial C is the matrix described above, with every row identical. The output of gemm will be overwriting the values in C.

This means that if I don’t construct C this way before calling gemm, I would have to call a separate kernel to perform the addition step. Doing it this way, I get the addition “for free” from gemm, since I’m calling it anyway to compute A*B.
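For reference, a hedged sketch of that gemm call (handle, d_A, d_B, d_C and the dimensions m, n, k are assumed to be set up elsewhere); beta = 1 is what makes the pre-filled C get added to A*B in place:

const float alpha = 1.0f, beta = 1.0f;        // C = 1*(A*B) + 1*C
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k,
            &alpha, d_A, m,                   // A is m x k
            d_B, k,                           // B is k x n
            &beta,  d_C, m);                  // C is m x n, pre-filled with the replicated rows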

you can do it with a cublas rank-1 update:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-ger

x is your existing length 784 row definition vector
y is a (column) vector of all 1s, length 60000.
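In other words, something like this sketch (handle, d_x, d_ones and d_C are assumed to exist already). Since cuBLAS is column-major, your 60000 x 784 row-major matrix is a 784 x 60000 column-major matrix, so m = 784 and n = 60000. Note that ger accumulates into A, so C must be zeroed first:

const int m = 784, n = 60000;
const float alpha = 1.0f;
cudaMemset(d_C, 0, (size_t)m*n*sizeof(float));  // ger computes A += alpha*x*y^T, so start from zero
cublasSger(handle, m, n, &alpha,
           d_x, 1,                              // x: the length-784 row to replicate
           d_ones, 1,                           // y: 60000 ones
           d_C, m);                             // A, with lda = m in the column-major view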

It should certainly be doable with thrust. I doubt either of these methods would be faster than writing your own kernel.

I can believe something I’d be capable of writing could outperform Thrust, but I’m surprised you think I could beat a cublas function. Would you say the same thing about e.g. gemm, or is there something specific about this operation?

The operation is memory bound. You want to do the minimum number of memory accesses theoretically possible in order to populate your matrix (or even better, like @njuffa said, don’t do the copying at all, but I don’t have any great suggestions there ATM). That theoretical minimum is 784 reads and 60000*784 writes (or possibly 59999*784 writes).

If you study the Sger operation carefully (read the arithmetic in the CUBLAS manual, please), and count the number of memory accesses (per element) needed to populate your matrix, I come up with a minimum of 3: one to write an initial value of 0, one to read it back, and one to write the rank-1 update plus sum. I don’t see how you get to optimality with that particular arithmetic.

A thrust device_vector allocation also automatically initializes the data to zero, so that’s already one write per element, plus you need at least one more for the write (copy) operation. Yes, you can work around this with a flat allocation and a thrust device pointer instead of a device_vector. So you might be able to get to “optimality” with thrust.
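For what it’s worth, a sketch of what that flat-allocation / device_ptr workaround could look like (the names here are mine): thrust::copy_n through a permutation iterator that maps output index i to source index i % cols. Note that it nominally issues a read per output element, although the 784-element source should stay resident in cache:

#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

struct mod_cols {
  size_t cols;
  __host__ __device__ size_t operator()(size_t i) const { return i % cols; }
};

// d_src_raw: length-cols device vector; d_dst_raw: flat cudaMalloc'ed rows*cols floats
void replicate_with_thrust(const float *d_src_raw, float *d_dst_raw,
                           size_t rows, size_t cols){
  thrust::device_ptr<const float> d_src(d_src_raw);
  thrust::device_ptr<float>       d_dst(d_dst_raw);
  auto idx = thrust::make_transform_iterator(
      thrust::make_counting_iterator<size_t>(0), mod_cols{cols});
  thrust::copy_n(thrust::make_permutation_iterator(d_src, idx),
                 rows*cols, d_dst);
}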

I consider myself a reasonably competent CUDA programmer. One of the implications is that I can write a pretty simple CUDA code and immediately figure out the number of memory accesses it needs or will generate. With thrust, I would have to write the code, then either wade through a templated library counting accesses, or analyze the result with a profiler, to see if it is really “optimal”. With CUBLAS, it’s a little better: I read the docs, then I write the code.

With CUDA, I just write the code, and I have a high level of confidence of being pretty near optimal. I’m darn sure in CUDA I can write a kernel that will read exactly 784 elements in an optimized (coalesced) fashion, so exactly 784 float reads, and do exactly 60000*784 coalesced float writes, with no extra reads or writes anywhere.
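Something like this sketch is what I have in mind (the names are mine): one thread per column; each thread reads its source element once and writes it to that column of every row, so the reads and the writes across a warp are both coalesced:

__global__ void replicate_row(const float * __restrict__ src,  // length cols
                              float * __restrict__ dst,        // rows x cols, row-major
                              int cols, int rows){
  int col = blockIdx.x*blockDim.x + threadIdx.x;
  if (col < cols){
    float val = src[col];                      // exactly one read per column: cols reads total
    for (int r = 0; r < rows; r++)
      dst[(size_t)r*cols + col] = val;         // exactly rows*cols coalesced writes
  }
}
// launch: replicate_row<<<(784+255)/256, 256>>>(d_src, d_dst, 784, 60000);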

I’m not anywhere near as confident of getting to that definition of optimality with CUBLAS (I’m quite sure not) and/or with thrust (maybe).

I would most certainly not say the same thing about a sort, or a matrix multiply, or a prefix sum. I think I can write a reasonably good reduction; however, I would always use a library call for a sort, matrix multiply, or prefix sum (except for silly cases like a 32-element sort or prefix sum).

For an algorithm that is memory bound, your goal is to make the minimum possible number of memory accesses, in an efficient way.

Thanks for this explanation; it makes perfect sense for a pure memory operation. It also gives me a lot of clarity in general about when to write my own kernel vs. use something from a library.

This is exactly what I am looking for, except with more like 32k (complex) elements: do a 1-to-1 mapped processing step on each chunk, then write each segment out, say, 16 times to an output array (16*32k writes). The reason is that I then want to call a CUFFT C2C batched plan on the resulting array.

I’ve looked for some sort of example that does this in such an optimal fashion, but didn’t see anything relevant. Any pointers welcome.

perhaps this will help:

$ cat t40.cu
#include <iostream>
#include <thrust/complex.h>

const size_t ds = 32768;
const size_t rows = 16;
typedef thrust::complex<float> mt;
const int nTPB = 256;

template <typename T>
__host__ __device__
T my_op(T x){
  return x*x;}

template <typename T, size_t _ds = ds, size_t _rows = rows>
__global__ void k(const T * __restrict__ din, T * __restrict__ dout){
  // one thread per input element: read it once, apply my_op, then write the
  // result to the same position in every output row (reads and writes coalesced)
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  T my_val;
  if (idx < _ds){
    my_val = my_op(din[idx]);
    for (size_t i=0; i < _rows; i++){
      dout[idx] = my_val;
      idx += _ds;}
  }
}

int main(){

  mt *d_in, *d_out, *h_in;
  h_in = new mt[ds];
  for (int i = 0; i < ds; i++) h_in[i] = mt((float)i, 1.0f);
  cudaMalloc(&d_in,  ds*sizeof(mt));
  cudaMalloc(&d_out, ds*rows*sizeof(mt));
  cudaMemcpy(d_in, h_in, ds*sizeof(mt), cudaMemcpyHostToDevice);
  k<<<(ds+nTPB-1)/nTPB,nTPB>>>(d_in, d_out);
  cudaMemcpy(h_in, d_out, ds*sizeof(mt), cudaMemcpyDeviceToHost); // note: reuses h_in as the host output buffer and copies back only the first row
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess){
  for (int i = 0; i < 5; i++)
    std::cout << h_in[i].real() << "," << h_in[i].imag() << std::endl;
  }
  else
    std::cout << cudaGetErrorString(err) << std::endl;
}

$ nvcc -o t40 t40.cu
$ ./t40
-1,0
0,2
3,4
8,6
15,8
$

This of course writes out the same values to each row, so not that interesting by itself. But you could create something similar to my_op and put it in the inner for-loop in the kernel, and it will modify each row before it is written.

dout[idx] = my_row_op(my_val, i);
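For example, a hypothetical my_row_op that scales each successive row differently might look like this (just an illustration, not part of the code above):

template <typename T>
__host__ __device__
T my_row_op(T x, size_t row){
  return x*static_cast<float>(row+1);}  // e.g. scale row i by (i+1)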

Thanks! When I tried a quick test case, it didn’t seem to match what I expected. I realized the issue is that you were reusing the same h_in variable as the output. Once I created/initialized a new output variable and copied ds*rows elements back to it, it worked as expected. :)