Creating or resizing a thrust::device_vector launches a kernel which, if I am not mistaken, fills every element with a default-constructed value of the vector’s type.
I use a cached allocator with thrust::device_vector to eliminate the overhead of allocating temporary memory while keeping the convenience of the container. However, I have no interest in filling this temporary memory with anything each time I request it; that is unnecessary overhead.
Moreover, as far as I know, Thrust offers no way to choose the stream in which this kernel launches, and it ends up in the default stream. Creating a thrust::device_vector is therefore always a synchronizing operation, which can be disruptive.
In my particular case, with a cached allocator, allocation simply consists of retrieving a previously cudaMalloc’ed pointer, so creating or resizing a vector should not need to communicate with the GPU at all. Given that, I do not understand the design choice of making device_vector creation a necessarily GPU-blocking operation.
Is there any way to circumvent this kernel launch?