Does thrust::device_vector::resize() cause reallocation when resizing to a smaller size?

In the STL, a std::vector's allocated memory does not shrink when the vector is resized to a smaller size, unless we call shrink_to_fit().
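
For example, the host-side behavior can be observed directly through capacity() (the exact values printed are implementation details):

#include <iostream>
#include <vector>

int main(){

  std::vector<int> v(1048576);
  std::cout << "capacity after construction:  " << v.capacity() << std::endl;
  v.resize(32);        // size shrinks, but the allocation is kept
  std::cout << "capacity after resize(32):    " << v.capacity() << std::endl;
  v.shrink_to_fit();   // non-binding request to release the unused storage
  std::cout << "capacity after shrink_to_fit: " << v.capacity() << std::endl;
}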

Is it the same in Thrust? If shrinking a vector caused a re-allocation, that could hurt performance.

In the general case, resize() to a smaller size does not appear to cause a reallocation.
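
A quick sanity check, before going to the profiler, is to query the container's own capacity() (thrust::device_vector provides one, like std::vector). This only reports what the container thinks it holds, so treat the following as a minimal sketch rather than proof of what happens at the CUDA API level:

#include <thrust/device_vector.h>
#include <iostream>

int main(){

  thrust::device_vector<int> A(1048576);
  std::cout << "capacity before resize: " << A.capacity() << std::endl;
  A.resize(32);   // shrink; the original allocation should be kept
  std::cout << "capacity after resize:  " << A.capacity() << std::endl;
}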

Thrust is open source, so you can verify anything you want yourself, with enough effort. Personally, I usually prefer to write a test case and then use the profiler to explore the behavior. Here is an example:

$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main(){

  thrust::device_vector<int> A(1048576);  // initial allocation: 1M ints
  A.resize(32);                           // shrink to 32 elements; does this reallocate?
  thrust::sequence(A.begin(), A.end());   // operate on the (now smaller) vector
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==12621== NVPROF is profiling process 12621, command: ./t1
==12621== Profiling application: ./t1
==12621== Profiling result:
   Start  Duration  Name
175.95ms  8.8040us  cuDeviceGetPCIBusId
<snip>
302.11ms     697ns  cuDeviceGetUuid
302.11ms     420ns  cuDeviceGetAttribute
302.11ms     433ns  cuDeviceGetAttribute
302.11ms     437ns  cuDeviceGetAttribute
<// this is the initial allocation of the vector>
302.87ms  380.28ms  cudaMalloc 
683.16ms  2.8630us  cudaGetDevice
683.17ms  1.0970us  cudaGetLastError
683.17ms     299ns  cudaGetDeviceCount
683.17ms     184ns  cudaGetLastError
683.17ms     598ns  cudaGetDevice
683.17ms     184ns  cudaGetLastError
683.17ms  8.0490us  cudaFuncGetAttributes
683.18ms     182ns  cudaGetLastError
683.18ms     182ns  cudaGetLastError
683.18ms     506ns  cudaGetDevice
683.18ms     176ns  cudaGetLastError
683.19ms  2.1260us  cudaDeviceGetAttribute
683.19ms     177ns  cudaGetLastError
<// this is the kernel to initialize the vector allocation>
683.19ms  54.607us  cudaLaunchKernel (void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<
683.25ms     330ns  cudaPeekAtLastError
683.25ms     174ns  cudaPeekAtLastError
683.25ms     175ns  cudaGetLastError
683.25ms     173ns  cudaGetLastError
683.25ms  12.317us  cudaStreamSynchronize
683.26ms     216ns  cudaGetLastError
683.26ms     176ns  cudaGetLastError
683.27ms     741ns  cudaGetDevice
683.27ms     207ns  cudaGetLastError
683.27ms     176ns  cudaGetLastError
683.27ms     409ns  cudaGetDevice
683.27ms     174ns  cudaGetLastError
683.27ms     648ns  cudaDeviceGetAttribute
683.27ms     179ns  cudaGetLastError
<// this is the kernel launched by the sequence call>
683.28ms  12.869us  cudaLaunchKernel (void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__tabulate::functor<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::system::detail::generic::detail::comput
683.29ms     262ns  cudaPeekAtLastError
683.29ms     177ns  cudaPeekAtLastError
683.29ms     182ns  cudaGetLastError
683.29ms     171ns  cudaGetLastError
683.29ms  3.2460us  cudaStreamSynchronize
683.29ms     192ns  cudaGetLastError
683.29ms     180ns  cudaGetLastError
<// this is where the vector allocation gets freed; there are no prior calls to cudaFree>
683.30ms  228.90us  cudaFree
$ nvprof --print-gpu-trace ./t1
==12635== NVPROF is profiling process 12635, command: ./t1
==12635== Profiling application: ./t1
==12635== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
581.59ms  9.4090us           (2048 1 1)       (256 1 1)        16        0B        0B  Tesla V100-PCIE         1         7  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>(thrust::device_ptr<int>, int) [446]
581.64ms  3.2960us              (1 1 1)       (256 1 1)        16        0B        0B  Tesla V100-PCIE         1         7  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__tabulate::functor<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::system::detail::generic::detail::compute_sequence_value<int>, long>, long>, thrust::cuda_cub::__tabulate::functor<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::system::detail::generic::detail::compute_sequence_value<int>, long>, long>(thrust::device_ptr<int>, thrust::detail::normal_iterator<thrust::device_ptr<int>>) [461]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

(To reduce unnecessary clutter, I've snipped the early output from --print-api-trace, since there is a large number of CUDA API calls at application start-up that are uninteresting for this question.)

Parsing the above, we see in the API trace that there is only one cudaFree call, at the very end, after the sequence operation. Furthermore, from the GPU trace we can confirm that the first kernel launched by Thrust (triggered by the device_vector instantiation, and used to initialize the device vector) has a large thread count, consistent with the larger vector size at that point. The second kernel has a much smaller thread count, consistent with the size of the vector at the point of the sequence call.

So there is no reallocation of the vector between the instantiation and the sequence call: there are no cudaMalloc or cudaFree calls between the two kernels that correspond to the vector instantiation and the sequence call, even though a resize() operation occurs in that region.
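
As an aside, if you actually want to return the memory to the device (at the cost of a reallocation and a copy), thrust::device_vector offers shrink_to_fit(), mirroring std::vector. The sketch below assumes it reallocates to fit; you can confirm with the same nvprof approach, where you would expect an extra cudaMalloc/cudaFree pair at that point:

#include <thrust/device_vector.h>

int main(){

  thrust::device_vector<int> A(1048576);
  A.resize(32);       // no reallocation here, per the traces above
  A.shrink_to_fit();  // assumed to allocate a smaller buffer, copy, and free the old one
}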
