Unknown vector size


I’m using CUDA functions in asynchronous mode to process a large vector. In the middle of the processing chain I can eliminate most of the vector and keep only ~10% of the elements, which are gathered into a smaller vector. The number of elements in the small vector is known only at run time.

How can I process the smaller vector without returning to the host?
In part of the processing chain I want to use the cuBLAS library, but it looks like I must know the size when calling the function.


Hi adi.panzer

I think that you are trying to hide the communication latency by processing chunks of the vector. I mean, processing the already-uploaded part of the vector while the rest is still uploading, am I right?

What I can suggest in this case is to create several streams that upload parts of the vector and process each part as soon as it is uploaded.

For example:

Stream 1: Chunk1 -> Process Chunk1 -> Chunk3 -> Process Chunk3
Stream 2: Chunk2 -> Process Chunk2 -> Chunk4 -> Process Chunk4

Enqueuing work on a stream is asynchronous, freeing your host for other management tasks.

About cuBLAS: as you can see in the documentation, the matrix and vector sizes are mandatory parameters in most of the functions. The interface is designed that way to stay compatible with the old Fortran-based BLAS library.
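To illustrate the point, here is a minimal sketch of a cuBLAS call (handle creation and error checking omitted; the function and wrapper names here are just examples): the length n is an ordinary host-side argument, so it must be known on the host at the moment the call is enqueued.

```cuda
#include <cublas_v2.h>

// Sketch only: scale a device vector of run-time length n.
// n is passed by value from the host; cuBLAS has no variant where
// the size lives in device memory.
void ScaleVector(cublasHandle_t handle, float *dVec, int n)
{
    const float alpha = 2.0f;
    cublasSscal(handle, n, &alpha, dVec, 1);
}
```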



Thanks for the reply.
The vector is uploaded all at once. During processing I decide to drop parts of the vector.
For example, in the first phase I calculate the magnitude of each element, and in further phases I want to process only the elements whose magnitude is greater than 10.
Of course I could process the entire vector and check the magnitude each time, but I want to create a shorter vector (~10%) containing only the valid entries and process only that shorter vector.


Hi @adi.panzer

OK, I think I’ve got it. You want to somehow “decimate” the vector, keeping only the values that match some criterion. In this case, the first idea that comes to my mind is simply creating a smaller vector. However, selecting the proper kind of memory will depend on the scope of your application. For example, if you are working with several blocks, the best approach is to store the smaller vector in global memory and then stage it into shared memory to speed up reads.

The reason I suggest creating a new vector is to get higher throughput by keeping the memory accesses coalesced.


Hi Luis,

The actual problem is more complicated. The “vector” is a very large array that passes through several processing stages. At some stage I can decide which elements are no longer relevant for further processing, and I want to exclude them from the next stages. I can’t predict where the relevant elements will be, and they are usually clustered. It is not feasible to do this in shared memory, as the result is still a large array (~50 MB).

The process of creating the small vector is simple and efficient: I know the indices of the good elements and squeeze them into a new vector. What I struggle with is how to process the new vector, as its length is unknown when I add the tasks to the relevant CUDA stream.
Currently I have placed a global variable in device memory that holds the number of elements. Each thread reads this number and can terminate its loop over the elements:

// dActiveBins is a __device__ variable written earlier in the stream
int ActiveBins = dActiveBins;  // this is the number of active elements
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;

for (int Bin = index; Bin < ActiveBins; Bin += stride) {
    // process element Bin of the small vector
}
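For completeness, the compaction pass that feeds such a counter could be sketched like this; all names here (dActiveBins, CompactKernel, the 10.0f threshold) are illustrative, not taken from the original code. Everything stays on one stream, so the host never needs to learn the size.

```cuda
// Sketch of an in-stream compaction pass, assuming a __device__ counter.
__device__ int dActiveBins;  // number of surviving elements

__global__ void ResetCount() { dActiveBins = 0; }

__global__ void CompactKernel(const float *in, float *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < n; i += stride) {
        if (fabsf(in[i]) > 10.0f) {               // keep "magnitude > 10"
            int dst = atomicAdd(&dActiveBins, 1); // claim a slot in the output
            out[dst] = in[i];                     // note: order is not preserved
        }
    }
}

// Host side, all enqueued on one stream -- no copy back to the host:
//   ResetCount<<<1, 1, 0, stream>>>();
//   CompactKernel<<<blocks, threads, 0, stream>>>(dIn, dOut, N);
//   ProcessKernel<<<blocks, threads, 0, stream>>>(dOut);  // reads dActiveBins
```

One caveat of the atomicAdd approach is that the output order is arbitrary; if the clustered ordering matters for later stages, a prefix-sum (scan) based compaction preserves order.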


Hi adi,

My initial idea was precisely to construct a smaller array in global memory and stage parts of it into shared memory (so each block only loads what it needs to process), using shared memory as a sort of cache for speed. You are right that creating the small vector is simple and efficient from the memory-consumption point of view, but breaking coalesced memory access would waste performance. My idea is old-fashioned, though.

If you have no way of knowing how much memory to reserve, you might be interested in thrust::device_vector, which is the CUDA counterpart of the C++ STL std::vector. It resizes automatically to your needs and lives in GPU device memory. If you want to boost your speed, you should also consider some caching strategy (recall the GPU memory hierarchy).
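As a rough sketch of what that could look like (the predicate and names below are made up for illustration), thrust::copy_if does the compaction and returns an iterator past the last copied element, so the new size falls out for free:

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Illustrative predicate: keep elements with magnitude > 10.
struct MagnitudeAbove10
{
    __host__ __device__ bool operator()(float x) const { return fabsf(x) > 10.0f; }
};

void Compact(const thrust::device_vector<float> &big)
{
    // Size the small vector pessimistically, then shrink it to fit.
    thrust::device_vector<float> small(big.size());
    auto end = thrust::copy_if(big.begin(), big.end(), small.begin(),
                               MagnitudeAbove10());
    small.resize(end - small.begin());  // actual size known only at run time
    // small.size() and thrust::raw_pointer_cast(small.data()) can now
    // feed a cuBLAS call with the exact length.
}
```

One thing to be aware of: obtaining copy_if’s return value implies a synchronization, so the host does learn the size at that point, unlike the pure device-side counter approach.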



I’ll look into it