Efficient repeated copying of a vector

I have a device float vector of length 784 and I want to copy it 60000 times into a contiguously-allocated array (i.e. the end result is a 60000 x 784 row-major matrix in which each column contains 60000 repetitions of the same value). I’m currently doing this by calling cublasScopy 60000 times from the host, but this obviously gives really bad performance due to kernel launch overhead.

Is there a way to do this with a single kernel call? Obviously I can write my own but I feel there must be an obvious cublas or other library function I’m overlooking.

PS Bonus points for anyone who can guess what canonical problem I’m working on based on the dimensions above :)

First rule of copying: Don’t copy. Why is data replication necessary? I would try thinking about a solution where the data is referenced via pointer, completely eliminating the need to physically replicate it. If you must replicate (for reasons not explained), do it in power-of-two steps: from one row create two contiguous rows, then from those two create four, and so on (sketched below). Far fewer copies, each significantly bigger in size, which should cut down on overhead. Ultimately you will still be limited by memory bandwidth.

Which gets us back to the first rule of copying …
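A minimal sketch of that doubling approach, assuming a device buffer d_mat of rows*cols floats whose first row has already been filled in (the function name and arguments here are hypothetical):

#include <algorithm>
#include <cuda_runtime.h>

// Double the number of valid rows with each device-to-device copy until all
// rows are populated; std::min handles a non-power-of-two row count.
void replicate_rows(float *d_mat, size_t rows, size_t cols)
{
  size_t filled = 1;                               // row 0 is already valid
  while (filled < rows) {
    size_t n = std::min(filled, rows - filled);    // rows copied this step
    cudaMemcpy(d_mat + filled*cols,                // destination: next block of rows
               d_mat,                              // source: the already-filled rows
               n*cols*sizeof(float),
               cudaMemcpyDeviceToDevice);
    filled += n;
  }
}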

The matrix I described above is going to be an input to cublasSgemm. Specifically, the calculation is
C = A*B + C (i.e. gemm with alpha = beta = 1), where the initial C is the matrix described above, with every row identical. The output of gemm will be overwriting the values in C.

This means that if I don’t construct C this way before calling gemm, I would have to call a separate kernel to perform the addition step. Doing it this way, I get the addition “for free” from gemm, since I’m calling it anyway to compute A*B.
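For reference, a hedged sketch of that gemm call (handle, d_A, d_B, d_C and the dimensions m, n, k are assumed to be set up elsewhere); beta = 1 is what makes the pre-filled C get added to A*B in place:

const float alpha = 1.0f, beta = 1.0f;        // C = 1*(A*B) + 1*C
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k,
            &alpha, d_A, m,                   // A is m x k
            d_B, k,                           // B is k x n
            &beta,  d_C, m);                  // C is m x n, pre-filled with the replicated rows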

you can do it with a cublas rank-1 update:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-ger

x is your existing length 784 row definition vector
y is a (column) vector of all 1s, length 60000.
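In other words, something like this sketch (handle, d_x, d_ones and d_C are assumed to exist already). Since cuBLAS is column-major, your 60000 x 784 row-major matrix is a 784 x 60000 column-major matrix, so m = 784 and n = 60000. Note that ger accumulates into A, so C must be zeroed first:

const int m = 784, n = 60000;
const float alpha = 1.0f;
cudaMemset(d_C, 0, (size_t)m*n*sizeof(float));  // ger computes A += alpha*x*y^T, so start from zero
cublasSger(handle, m, n, &alpha,
           d_x, 1,                              // x: the length-784 row to replicate
           d_ones, 1,                           // y: 60000 ones
           d_C, m);                             // A, with lda = m in the column-major view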

It should certainly be doable with thrust. I doubt either of these methods would be faster than writing your own kernel.

I can believe something I’d be capable of writing could outperform Thrust, but I’m surprised you think I could beat a cublas function. Would you say the same thing about e.g. gemm, or is there something specific about this operation?

The operation is memory bound. You want to do the minimum number of memory accesses theoretically possible in order to populate your matrix (or even better, like @njuffa said, don’t do the copying at all, but I don’t have any great suggestions there ATM). That theoretical minimum is 784 reads and 60000*784 writes (or possibly 59999*784 writes).

If you study the Sger operation carefully (read the arithmetic in the CUBLAS manual, please), and count the number of memory accesses (per element) needed to populate your matrix, I come up with a minimum of 3: one to write an initial value of 0, one to read it back, and one to write the rank-1 update plus sum. I don’t see how you get to optimality with that particular arithmetic.

A thrust device_vector allocation also automatically initializes the data to zero, so that’s already one write per element, plus you need at least one more for the write (copy) operation. Yes, you can work around this with a flat allocation and a thrust device pointer instead of a device_vector. So you might be able to get to “optimality” with thrust.
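For what it’s worth, a sketch of what that flat-allocation / device_ptr workaround could look like (the names here are mine): thrust::copy_n through a permutation iterator that maps output index i to source index i % cols. Note that it nominally issues a read per output element, although the 784-element source should stay resident in cache:

#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

struct mod_cols {
  size_t cols;
  __host__ __device__ size_t operator()(size_t i) const { return i % cols; }
};

// d_src_raw: length-cols device vector; d_dst_raw: flat cudaMalloc'ed rows*cols floats
void replicate_with_thrust(const float *d_src_raw, float *d_dst_raw,
                           size_t rows, size_t cols){
  thrust::device_ptr<const float> d_src(d_src_raw);
  thrust::device_ptr<float>       d_dst(d_dst_raw);
  auto idx = thrust::make_transform_iterator(
      thrust::make_counting_iterator<size_t>(0), mod_cols{cols});
  thrust::copy_n(thrust::make_permutation_iterator(d_src, idx),
                 rows*cols, d_dst);
}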

I consider myself a reasonably competent CUDA programmer. One of the implications is that I can write a pretty simple CUDA code and immediately figure out the number of memory accesses it needs or will generate. With thrust, I would have to write the code, then either wade through a templated library counting accesses, or analyze the result with a profiler, to see if it is really “optimal”. With CUBLAS, it’s a little better: I read the docs, then I write the code.

With CUDA, I just write the code, and I have a high level of confidence of being pretty near optimal. I’m darn sure in CUDA I can write a kernel that will read exactly 784 elements in an optimized (coalesced) fashion, so exactly 784 float reads, and do exactly 60000*784 coalesced float writes, with no extra reads or writes anywhere.
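Something like this sketch is what I have in mind (the names are mine): one thread per column; each thread reads its source element once and writes it to that column of every row, so the reads and the writes across a warp are both coalesced:

__global__ void replicate_row(const float * __restrict__ src,  // length cols
                              float * __restrict__ dst,        // rows x cols, row-major
                              int cols, int rows){
  int col = blockIdx.x*blockDim.x + threadIdx.x;
  if (col < cols){
    float val = src[col];                      // exactly one read per column: cols reads total
    for (int r = 0; r < rows; r++)
      dst[(size_t)r*cols + col] = val;         // exactly rows*cols coalesced writes
  }
}
// launch: replicate_row<<<(784+255)/256, 256>>>(d_src, d_dst, 784, 60000);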

I’m not anywhere near as confident of getting to that definition of optimality with CUBLAS (I’m quite sure not) and/or with thrust (maybe).

I would most certainly not say the same thing about a sort, or a matrix multiply, or a prefix sum. I think I can write a reasonably good reduction; however, I would always use a library call for a sort, matrix multiply, or prefix sum (except for silly cases like a 32-element sort or prefix sum).

For an algorithm that is memory bound, your goal is to make the minimum possible number of memory accesses, in an efficient way.

Thanks for this explanation; it makes perfect sense for a pure memory operation. It also gives me a lot of clarity in general about when to write my own kernel vs. use something from a library.

This is exactly what I am looking for, except with more like 32k (complex) elements: do a 1-to-1 mapped processing step on each chunk, then write each segment out, say, 16 times to an output array (16*32k writes). The reason is that I then want to call a CUFFT C2C batched plan on the resulting array.

I’ve looked for some sort of example that does this in such an optimal fashion, but didn’t see anything relevant. Any pointers welcome.

perhaps this will help:

$ cat t40.cu
#include <iostream>
#include <thrust/complex.h>

const size_t ds = 32768;
const size_t rows = 16;
typedef thrust::complex<float> mt;
const int nTPB = 256;

template <typename T>
__host__ __device__
T my_op(T x){
  return x*x;}

template <typename T, size_t _ds = ds, size_t _rows = rows>
__global__ void k(const T * __restrict__ din, T * __restrict__ dout){
  // one thread per input element: read it once, apply my_op, then write the
  // result to the same position in every output row (reads and writes coalesced)
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  T my_val;
  if (idx < _ds){
    my_val = my_op(din[idx]);
    for (size_t i=0; i < _rows; i++){
      dout[idx] = my_val;
      idx += _ds;}
  }
}

int main(){

  mt *d_in, *d_out, *h_in;
  h_in = new mt[ds];
  for (int i = 0; i < ds; i++) h_in[i] = mt((float)i, 1.0f);
  cudaMalloc(&d_in,  ds*sizeof(mt));
  cudaMalloc(&d_out, ds*rows*sizeof(mt));
  cudaMemcpy(d_in, h_in, ds*sizeof(mt), cudaMemcpyHostToDevice);
  k<<<(ds+nTPB-1)/nTPB,nTPB>>>(d_in, d_out);
  cudaMemcpy(h_in, d_out, ds*sizeof(mt), cudaMemcpyDeviceToHost); // note: reuses h_in as the host output buffer and copies back only the first row
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess){
  for (int i = 0; i < 5; i++)
    std::cout << h_in[i].real() << "," << h_in[i].imag() << std::endl;
  }
  else
    std::cout << cudaGetErrorString(err) << std::endl;
}

$ nvcc -o t40 t40.cu
$ ./t40
-1,0
0,2
3,4
8,6
15,8
$

This of course writes out the same values to each row, so not that interesting by itself. But you could create something similar to my_op and put it in the inner for-loop in the kernel, and it will modify each row before it is written.

dout[idx] = my_row_op(my_val, i);
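For example, a hypothetical my_row_op that scales each successive row differently might look like this (just an illustration, not part of the code above):

template <typename T>
__host__ __device__
T my_row_op(T x, size_t row){
  return x*static_cast<float>(row+1);}  // e.g. scale row i by (i+1)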

Thanks! When I tried a quick test case, it didn’t seem to match what I expected. I realized the issue is that you were reusing the same h_in variable as the output. Once I created/initialized a new output variable and copied ds*rows elements back to it, it worked as expected. :)