Unknown vector size

Hi,

I’m using CUDA functions in asynchronous mode to process a large vector. In the middle of the processing chain I can eliminate most of the vector and keep only ~10% of the elements, which are gathered into a smaller vector. The number of elements in the small vector is known only at run time.

How can I process the smaller vector without returning to the host?
In part of the processing chain I want to use the cuBLAS library, but it looks like I must know the size when calling its functions.

Thanks,
Adi

Hi adi.panzer

I think you are trying to hide the communication latency by processing the vector in chunks, i.e. processing the already-uploaded part of the vector while the rest is still uploading, am I right?

What I can suggest in this case is to create several streams, upload the parts of the vector through them, and process each part as soon as it has been uploaded.

For example:

Stream 1: Chunk1 -> Process Chunk1 -> Chunk3 -> Process Chunk3
Stream 2: Chunk2 -> Process Chunk2 -> Chunk4 -> Process Chunk4

Adding work to a stream makes it asynchronous, freeing time on your host for other management tasks.
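
Here is a minimal sketch of that pattern. The kernel process, the chunk size, and the buffer names are just placeholders of mine, and hVec must be pinned host memory (cudaMallocHost) for the copies to be truly asynchronous:

#include <cuda_runtime.h>

__global__ void process(float *v, int n)          // placeholder per-chunk kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;                      // stand-in for the real work
}

void pipeline(float *hVec, float *dVec, int total, int chunkLen)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    int numChunks = (total + chunkLen - 1) / chunkLen;
    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t st = s[c % 2];               // alternate chunks between the streams
        int off = c * chunkLen;
        int len = (off + chunkLen > total) ? total - off : chunkLen;
        // copy and kernel are queued on the same stream, so each kernel
        // starts only after its own chunk has arrived
        cudaMemcpyAsync(dVec + off, hVec + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(len + 255) / 256, 256, 0, st>>>(dVec + off, len);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}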

About cuBLAS: as you can see in the documentation, the matrix and vector sizes are mandatory parameters in most of the functions. This is done to stay compatible with the original Fortran-based BLAS library.
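
For instance, even a simple BLAS-1 routine such as cublasSasum takes the element count explicitly. A sketch, assuming a device array dVec of n floats:

#include <cublas_v2.h>

// sum of absolute values of a device array dVec of n floats
float sumAbs(const float *dVec, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    float result = 0.0f;
    cublasSasum(handle, n, dVec, 1, &result);   // n must be known when the call is issued
    cublasDestroy(handle);
    return result;
}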

Regards,
Leon

Hi,

Thanks for the reply.
The vector is uploaded all at once. During processing I decide to drop parts of the vector.
For example, in the 1st phase I calculate the magnitude of each element, and in further phases I want to process only the elements whose magnitude is greater than 10.
Of course I could process the entire vector and check the magnitude each time, but I want to create a shorter vector (~10%) containing only the valid entries and process only that.

Thanks,
Adi

Hi @adi.panzer

Ok, I think I’ve got it. You want to somehow “decimate” the vector, keeping only the values that match some criterion. In this case, the first idea that comes to my mind is simply creating a smaller vector. However, selecting the proper kind of memory depends on the scope of your application. For example, if you are working with several blocks, the best approach is to store the smaller vector in global memory and then stage it into shared memory to speed up the reads.

The reason I suggest creating a new vector is that packing the valid elements together keeps the memory accesses coalesced, which gives higher throughput.
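
To illustrate the staging idea (the tile size and kernel name are placeholders of mine): each block loads its slice of the compacted vector into shared memory with one coalesced pass, then reuses it from there.

#define TILE 256

__global__ void processCompacted(const float *compacted, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n) tile[threadIdx.x] = compacted[i];   // one coalesced load per block
    __syncthreads();

    if (i < n) {
        float v = tile[threadIdx.x];               // repeated reads now hit shared memory
        // ... process v ...
    }
}

Note that the staging only pays off when each element is read more than once per block; for a single pass over the data, reading straight from global memory is just as good.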

Regards,
Leon.

Hi Leon,

The actual problem is more complicated. The “vector” is a very large array that passes through several processing stages. At some stage I can decide that certain elements are no longer relevant for further processing, and I want to exclude them from the next stages. I can’t predict where the relevant elements will be, and they are usually clustered. It is not feasible to do this in shared memory, since it is still a large array (~50 MB).

The process of creating the small vector is simple and efficient: I know the indices of the good elements and squeeze them into a new vector. What I struggle with is how to process the new vector, since its length is unknown at the time I add the tasks to the relevant CUDA stream.
Currently I placed a global variable in device memory that holds the number of elements. Each thread reads this number and can terminate its loop over the elements:

__device__ int dActiveBins;   // number of active elements, written by an earlier stage

__global__ void ProcessActive(float *Bins)
{
    int ActiveBins = dActiveBins;  // read the run-time count from device memory
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // grid-stride loop over only the active elements
    for (int Bin = index; Bin < ActiveBins; Bin += stride)
    {
        // ... process Bins[Bin] ...
    }
}
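
For reference, a compaction pass of this kind can fill dActiveBins with a single atomic append. A simplified sketch, using the magnitude > 10 example as the criterion (note that the output order is not preserved):

__global__ void Compact(const float *Big, float *Small, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && fabsf(Big[i]) > 10.0f)          // the magnitude criterion
    {
        int slot = atomicAdd(&dActiveBins, 1);   // claim the next free output slot
        Small[slot] = Big[i];
    }
}
// dActiveBins is zeroed (e.g. with cudaMemcpyToSymbol) before each launch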

Hi adi,

My initial idea was precisely to construct a smaller array in global memory and copy parts of it into shared memory (so each block only loads what it needs to process), as a sort of cache to speed things up. You are really right that processing the small vector is simple and efficient from the memory-consumption point of view, but breaking coalesced memory access will waste performance. My idea is a bit old-fashioned, though.

If you don’t know in advance how much memory to reserve, you might be interested in thrust::device_vector, which is the CUDA counterpart of the C++ STL std::vector. It is resized at run time to your needs and lives in GPU device memory. If you want to boost your speed, you still need to consider some caching strategy (recall the memory hierarchy of the GPU).
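
A rough sketch with Thrust, using your magnitude > 10 example as the predicate (the function and functor names are mine):

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <math.h>

struct IsValid                                    // the magnitude > 10 criterion
{
    __host__ __device__ bool operator()(float x) const { return fabsf(x) > 10.0f; }
};

thrust::device_vector<float> compactValid(const thrust::device_vector<float> &big)
{
    thrust::device_vector<float> packed(big.size());  // worst-case allocation
    auto last = thrust::copy_if(big.begin(), big.end(), packed.begin(), IsValid());
    packed.erase(last, packed.end());                 // shrink to the real count
    return packed;                                    // packed.size() is now known
}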

Regards,
Leon

Thanks,

I’ll look into it