Host memory allocation by libcudart

I am looking into why the host memory usage of some software went up significantly between two versions. According to massif (the heap profiler in Valgrind), the difference is due to allocations made somewhere inside libcudart.

Before:
| | ->39.02% (77,542,960B) 0x12F7DA25: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->39.02% (77,542,960B) 0x12F7529E: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->39.02% (77,542,960B) 0x12F826A5: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->39.02% (77,542,960B) 0x12F842BF: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->39.02% (77,542,960B) 0x12F7743C: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->39.02% (77,542,960B) 0x12F62678: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->39.02% (77,542,960B) 0x12F89BF7: cudaDeviceSynchronize (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->39.02% (77,542,960B) 0x10BF9D3A: device_selection::select(int*, char***) (device_selection.cxx:149)
| | | ->39.02% (77,542,960B) 0x10C1BE89: boot_gpu (gpu_functions.cxx:261)
| | | ->39.02% (77,542,960B) 0x10C17736: collective_impl_mpi::collective_impl_mpi(int*, char***) (collective_mpi.cxx:37)
| | | ->39.02% (77,542,960B) 0x10C14B5B: boot_collective (collective_common.cxx:72)
| | | ->39.02% (77,542,960B) 0x400AB7: main (main.cxx:8)

After:
| | ->61.92% (301,919,152B) 0x220C9A25: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->61.92% (301,919,152B) 0x220C129E: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->61.92% (301,919,152B) 0x220CE6A5: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->61.92% (301,919,152B) 0x220D02BF: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->61.92% (301,919,152B) 0x220C343C: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->61.92% (301,919,152B) 0x220AE678: ??? (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->61.92% (301,919,152B) 0x220D5BF7: cudaDeviceSynchronize (in /lib/Linux-x86_64/libcudart.so.10.0)
| | | ->61.92% (301,919,152B) 0x204F1E1A: device_selection::select(int*, char***) (device_selection.cxx:149)
| | | ->61.92% (301,919,152B) 0x20513F69: boot_gpu (gpu_functions.cxx:261)
| | | ->61.92% (301,919,152B) 0x2050F816: collective_impl_mpi::collective_impl_mpi(int*, char***) (collective_mpi.cxx:37)
| | | ->61.92% (301,919,152B) 0x2050CC3B: boot_collective (collective_common.cxx:72)
| | | ->61.92% (301,919,152B) 0x400AB7: main (main.cxx:8)

The call to cudaDeviceSynchronize at device_selection.cxx:149 is immediately preceded by a call to cudaSetDevice; a simplified sketch of the sequence follows.
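For reference, here is a minimal sketch of that sequence (the device index and error handling are placeholders, not the real selection logic). It also works as a standalone program whose massif profile could be compared against the full application's, to see how much of the allocation comes from context initialization alone:

// repro.cu -- simplified sketch of the sequence at device_selection.cxx:149;
// the device index and error handling here are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Context creation in the CUDA runtime API is lazy, so the first call
    // that needs a context -- per the massif stacks above, apparently the
    // cudaDeviceSynchronize below -- is where libcudart's allocations land.
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}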

What could be causing this growth? Could it be related to the fact that we added several CUDA kernels to the code between these two versions? And what can I do to get more information about where this memory is going?
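For what it's worth, one cross-check I can think of (independent of Valgrind) would be to bracket the same two calls with glibc heap statistics. mallinfo is glibc-specific and only sees allocations made through malloc, so anything libcudart maps directly with mmap would be missed; massif's --pages-as-heap=yes option should catch those instead:

// heap_delta.cu -- hypothetical cross-check: measure glibc heap growth
// across the lazy CUDA context initialization, without Valgrind.
#include <cstdio>
#include <malloc.h>        // mallinfo (glibc-specific)
#include <cuda_runtime.h>

// Bytes currently handed out by the glibc allocator. mallinfo's fields
// are ints, so this is only reliable below ~2 GB of heap.
static long heap_in_use() {
    struct mallinfo mi = mallinfo();
    return (long)mi.uordblks + (long)mi.hblkhd;
}

int main() {
    long before = heap_in_use();
    cudaSetDevice(0);
    cudaDeviceSynchronize();   // first call that needs a context
    long after = heap_in_use();
    std::printf("host heap grew by %ld bytes\n", after - before);
    return 0;
}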