In the STL, a std::vector's allocated memory does not shrink unless we call shrink_to_fit().
Is the same true in thrust? Would shrinking a vector cause a re-allocation that could hurt performance?
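For reference, here is a minimal sketch of the std::vector behavior the question refers to (standard C++ only; the exact capacity after shrink_to_fit() is implementation-defined, since the standard makes it a non-binding request):

#include <vector>
#include <iostream>

int main(){
  std::vector<int> v(1048576);
  v.resize(32);                            // size shrinks, but...
  std::cout << v.capacity() << std::endl;  // ...capacity is still 1048576: no reallocation
  v.shrink_to_fit();                       // non-binding request to release unused capacity
  std::cout << v.capacity() << std::endl;  // typically 32 on common implementations
  return 0;
}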
In the general case, resizing to a smaller size via resize() does not appear to trigger a reallocation.
Thrust is open source, so with enough effort you can verify anything you want yourself. Personally, I usually prefer to write a test case and then use the profiler to explore the behavior. Here is an example:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
int main(){
  thrust::device_vector<int> A(1048576);  // initial allocation: 1M ints
  A.resize(32);                           // shrink to 32 elements
  thrust::sequence(A.begin(), A.end());   // fill the now-smaller vector
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==12621== NVPROF is profiling process 12621, command: ./t1
==12621== Profiling application: ./t1
==12621== Profiling result:
Start Duration Name
175.95ms 8.8040us cuDeviceGetPCIBusId
<snip>
302.11ms 697ns cuDeviceGetUuid
302.11ms 420ns cuDeviceGetAttribute
302.11ms 433ns cuDeviceGetAttribute
302.11ms 437ns cuDeviceGetAttribute
<// this is the initial allocation of the vector>
302.87ms 380.28ms cudaMalloc
683.16ms 2.8630us cudaGetDevice
683.17ms 1.0970us cudaGetLastError
683.17ms 299ns cudaGetDeviceCount
683.17ms 184ns cudaGetLastError
683.17ms 598ns cudaGetDevice
683.17ms 184ns cudaGetLastError
683.17ms 8.0490us cudaFuncGetAttributes
683.18ms 182ns cudaGetLastError
683.18ms 182ns cudaGetLastError
683.18ms 506ns cudaGetDevice
683.18ms 176ns cudaGetLastError
683.19ms 2.1260us cudaDeviceGetAttribute
683.19ms 177ns cudaGetLastError
<// this is the kernel to initialize the vector allocation>
683.19ms 54.607us cudaLaunchKernel (void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<
683.25ms 330ns cudaPeekAtLastError
683.25ms 174ns cudaPeekAtLastError
683.25ms 175ns cudaGetLastError
683.25ms 173ns cudaGetLastError
683.25ms 12.317us cudaStreamSynchronize
683.26ms 216ns cudaGetLastError
683.26ms 176ns cudaGetLastError
683.27ms 741ns cudaGetDevice
683.27ms 207ns cudaGetLastError
683.27ms 176ns cudaGetLastError
683.27ms 409ns cudaGetDevice
683.27ms 174ns cudaGetLastError
683.27ms 648ns cudaDeviceGetAttribute
683.27ms 179ns cudaGetLastError
<// this is the kernel launched by the sequence call>
683.28ms 12.869us cudaLaunchKernel (void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__tabulate::functor<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::system::detail::generic::detail::comput
683.29ms 262ns cudaPeekAtLastError
683.29ms 177ns cudaPeekAtLastError
683.29ms 182ns cudaGetLastError
683.29ms 171ns cudaGetLastError
683.29ms 3.2460us cudaStreamSynchronize
683.29ms 192ns cudaGetLastError
683.29ms 180ns cudaGetLastError
<// this is where the vector allocation gets freed; there are no prior calls to cudaFree>
683.30ms 228.90us cudaFree
$ nvprof --print-gpu-trace ./t1
==12635== NVPROF is profiling process 12635, command: ./t1
==12635== Profiling application: ./t1
==12635== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
581.59ms 9.4090us (2048 1 1) (256 1 1) 16 0B 0B Tesla V100-PCIE 1 7 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>(thrust::device_ptr<int>, int) [446]
581.64ms 3.2960us (1 1 1) (256 1 1) 16 0B 0B Tesla V100-PCIE 1 7 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__tabulate::functor<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::system::detail::generic::detail::compute_sequence_value<int>, long>, long>, thrust::cuda_cub::__tabulate::functor<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::system::detail::generic::detail::compute_sequence_value<int>, long>, long>(thrust::device_ptr<int>, thrust::detail::normal_iterator<thrust::device_ptr<int>>) [461]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
(To reduce unnecessary clutter, I've snipped the early output from --print-api-trace, since there are a large number of CUDA API calls at application start-up that are uninteresting for this question.)
Parsing the above, we see in the API trace that there is only one cudaFree call, at the very end, after the sequence operation. Furthermore, from the GPU trace we can confirm that the first kernel launched by thrust (triggered by the device_vector instantiation, used to initialize the device vector) has a large thread count (2048 blocks of 256 threads), consistent with the larger vector size at that point. The second kernel has a much smaller thread count (a single block of 256 threads), consistent with the size of the vector at the point of the sequence call.
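If you only care about allocation activity, you can filter the API trace down to just the relevant calls (nvprof writes its results to stderr, hence the redirection); for the program above this should show exactly the one cudaMalloc and the one cudaFree listed in the full trace:

$ nvprof --print-api-trace ./t1 2>&1 | grep -E "cudaMalloc|cudaFree"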
So there is no reallocation of the vector between the instantiation and the sequence call. There are no cudaMalloc or cudaFree calls in between the two kernels that correspond to the vector instantiation and the sequence call, even though a resize() operation occurs in that region.
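As a cross-check that doesn't require the profiler, thrust::device_vector also exposes capacity() and shrink_to_fit(), mirroring std::vector, so you can observe the allocation directly. Here is a minimal sketch (assuming a reasonably recent thrust; as in the STL, shrink_to_fit() is a request, so the final capacity is not guaranteed):

#include <thrust/device_vector.h>
#include <iostream>

int main(){
  thrust::device_vector<int> A(1048576);
  std::cout << A.capacity() << std::endl;  // 1048576
  A.resize(32);
  std::cout << A.capacity() << std::endl;  // still 1048576: shrinking the size keeps the allocation
  A.shrink_to_fit();                       // request that the unused capacity be released
  std::cout << A.capacity() << std::endl;  // typically 32 after the request
  return 0;
}

In other words, as with std::vector, it is shrink_to_fit() (not a downward resize()) that would trigger the copy-and-reallocate, so a plain shrinking resize() is cheap.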