HPC SDK C++ unified memory & minimum heap allocation size

Hi,

I’m interested particularly in the C++ Parallel Algorithm support.

As I understand it, all heap allocations are now unified memory and hence pageable between host and device.

I’m assuming this paging mechanism has a fixed page size, e.g. 4K.

Does this mean that the minimum allocation size from the heap is now 4K? And how does that affect, for example, linked-list elements, i.e. dynamically allocated, isolated instances of small data types?

Does the memory usage become much larger, and if so, how would one manage it?

With many thanks,

Leigh…

While the migration of data between the host and device is performed at page granularity, the allocation of data is not. Rather, allocations are made within a pool of unified memory, so multiple small objects can be allocated within the same page. We originally created this pool for OpenACC (though it is used with standard language parallelism as well) to help with the many small memory allocations typical of C++.

See: https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#acc-mem-unified

Hi Mat,

Thanks for your reply. I read the linked reference, but I am somewhat troubled: if two variables are allocated in the same 4K page, and one is used by the CPU while the other is used by the GPU, would the page thrash back and forth?

How is it possible to avoid this?

Thanks as always,

Leigh.

Hi Leigh,

Yes, if you hit this condition then you would indeed see the page migrating back and forth between the CPU and GPU. Unfortunately, C++ stdpar doesn’t yet include methods for managing memory allocation across multiple discrete memories, so for it to work today we must implicitly allocate all memory in the unified space.

The best way to avoid it is to offload all of the compute to the GPU, so the data is copied only once to and once from the device.

Failing that, you’d need to move to OpenACC or CUDA instead of stdpar, so you can manage the data movement explicitly yourself.

-Mat

I should also note that if you do see a case where performance is severely impacted by this scenario, please let us know and, if possible, provide an example we can inspect.

Hi Mat,

Thanks for that.

I’m having problems getting any information from the profiler. I run nvprof ./app_name and it gives me no data, just a warning about needing to call a particular function to flush the buffers. (Sorry, I don’t have the exact message in front of me at the moment.)

Do I need to compile with some extra flags, and what command line should I be running?

Thanks,

Leigh.

Hi Leigh,

Yes, I’ve seen this error before as well, but offhand I’ve forgotten the solution, so I’ll need to ping our profiler folks.
Though note that nvprof has been deprecated, so would you mind first trying Nsight Systems instead (e.g. nsys profile --stats=true ./main_stdpar)? Nsight Systems and Nsight Compute are nvprof’s replacements. See: https://docs.nvidia.com/nsight-systems/profiling/index.html#cli-profiling for details.

-Mat

Hi Mat,

Thanks for that.

Here’s the output from my run (it runs under GitLab CI, which is really useful as the run and results capture are automated)…

+ nvprof ./main_stdpar
Setting up variables
==5564== NVPROF is profiling process 5564, command: ./main_stdpar
Calculating the pythagorean distances
Running 1000 iterations
Calculation completed.
Elapsed time in nanoseconds : 57133003 ns
Elapsed time in microseconds : 57133 µs
Elapsed time in milliseconds : 57 ms
Elapsed time in seconds : 0 sec


==5564== Profiling application: ./main_stdpar
==5564== Profiling result:
No kernels were profiled.
No API activities were profiled.
==5564== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.