Passing a vector of buffers to the kernel


I’m quite new to CUDA, so I’d like to ask you for advice on the following problem.
During my init phase I allocate N buffers (all of the same size) at several points.
To manage these buffers more easily I store them in a std::vector.
I assume I’m right in saying that I can use neither a std::vector nor a thrust::device_vector inside a kernel.
Nevertheless, I need to iterate over these buffers in my kernel.
What is the most efficient way to pass these N buffers to my kernel?

As far as I understand, when using thrust::device_vector there is no guarantee that separately allocated vectors are laid out contiguously in memory, right? So passing the start pointer to the kernel and shifting it N times by the buffer size is not a solution, right?

But what about storing the start pointers and the corresponding sizes in an N×2 array and passing that to the kernel? Would this work?

Does anyone have any suggestions?

Best Greg

Hello guys,

any idea if the following way would work?

  1. Allocate the memory for my N buffers on the device using “cudaMalloc”.
  2. Initialize the buffers with a start value using “cudaMemcpy”.
  3. Store the pointers received in step 1 in an array.
  4. Allocate the memory for this pointer array on the device using “cudaMalloc”.
  5. Upload the contents of the pointer array to the device using “cudaMemcpy”.

Now I just need to pass the pointer to my pointer array, received in step 4, to the kernel to be able to iterate over my buffers, right?
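A minimal sketch of those five steps (assuming float buffers, equal lengths, and a hypothetical kernel name my_kernel; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <vector>

// The kernel receives the device-side array of device pointers.
__global__ void my_kernel(float** buffers, int n, int len)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos >= len) return;
    for (int b = 0; b < n; ++b)
        buffers[b][pos] += 1.0f;   // example operation on every buffer
}

void setup_and_launch(int n, int len)
{
    std::vector<float>  init(len, 0.0f);
    std::vector<float*> h_ptrs(n);

    // Steps 1 + 2: allocate each buffer on the device and initialize it.
    for (int b = 0; b < n; ++b) {
        cudaMalloc(&h_ptrs[b], len * sizeof(float));
        cudaMemcpy(h_ptrs[b], init.data(), len * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    // Steps 3-5: store the device pointers in a host array, allocate a
    // device-side pointer array, and upload the pointers to it.
    float** d_ptrs;
    cudaMalloc(&d_ptrs, n * sizeof(float*));
    cudaMemcpy(d_ptrs, h_ptrs.data(), n * sizeof(float*),
               cudaMemcpyHostToDevice);

    // Pass the device pointer array (from step 4) to the kernel.
    my_kernel<<<(len + 255) / 256, 256>>>(d_ptrs, n, len);
    cudaDeviceSynchronize();
}
```

Note that every element of the pointer array must itself be a device pointer; the array is only readable by the kernel after it has been copied to device memory in step 5.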

cheers greg

A tip:

  1. Pack the n buffers into one, such that each buffer starts right after the previous one ends:

data: [ buffer 0 ][ buffer 1 ][ buffer 2 ]…[ buffer n-1 ]

The global size is buffer_0_length + buffer_1_length + buffer_2_length + … + buffer_n-1_length.

  2. The buffers can differ in length (number of memory cells) - create an additional array, ‘boffs’, that holds each buffer’s offset (the beginning of its data).
  3. boffs[0] = 0, boffs[1] = buffer_0_length, boffs[2] = boffs[1] + buffer_1_length, boffs[3] = boffs[2] + buffer_2_length, etc. (the size of the last buffer is not included)
  4. Passing both to the kernel will allow you to access the data using an index like so:
index_to_access_data = boffs[which_buffer] + pos_in_a_buffer;

With such a single global buffer (referred to here as ‘data’) you can reduce the number of cudaMemcpy calls to just two (one for ‘data’, one for ‘boffs’).
Of course, all buffers need to have cells of the same type (e.g. float4).


I got it,

thanks a lot cmaster.matso