GPU Memory Deallocation and Frequency Scaling are not working

Hello,

I’m using PGI compilers 14.7 and 14.9 to compile OpenACC code for an NVIDIA Tesla GPU. This problem is related to the issues I’ve already described in this thread:
https://forums.developer.nvidia.com/t/runtime-error-caused-by-dynamic-loading-of-2-shared-objects/134200/1

The problem is that the GPU memory isn’t deallocated after the kernel has executed.
The memory is only deallocated when the program terminates.
Now I’ve noticed that frequency scaling on the GPU isn’t working either: the GPU frequency is reduced when the program terminates, but not after the GPU kernel has finished. As you can imagine, this is very bad for the GPU’s idle power consumption.

Here is an example of my OpenACC code:

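#include <stdint.h>
#include <openacc.h>

int32_t init_gpu(int32_t val) {
	int32_t value[2];
	value[0] = val;
	value[1] = -1;
	
	size_t size = 2 * sizeof(int32_t);
	
	/* Raw device allocation; only usable inside deviceptr() regions. */
	int32_t *d_value = acc_malloc(size);
	
	acc_memcpy_to_device(d_value, value, size);

	/* Compute d_value[1] = d_value[0] - 1 on the device. */
	#pragma acc kernels deviceptr(d_value)
	{
		d_value[1] = d_value[0] - 1;
	}
	
	acc_memcpy_from_device(value, d_value, size);

	/* Release the device buffer. */
	acc_free(d_value);
	
	return value[1];
}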

I’m compiling this code into a shared object with the following commands:

pgcc -acc=verystrict -ta=tesla,cuda5.5 -O0 -Minfo -fPIC -c -o init_gpu.o ../gpu/init_gpu.c
pgcc -acc=verystrict -ta=tesla,cuda5.5 -O0 -Minfo -shared -o libinitgpu.so init_gpu.o

This shared object is loaded with dlopen and unloaded with dlclose by a Java application through a JNI wrapper.
During execution, nvidia-smi shows that 65 MiB of GPU memory are allocated to the Java process until the Java process terminates.
In addition, the GPU core clock stays at 700 MHz and the memory clock at 2600 MHz, but it should be 324 MHz when nothing is executing on the GPU.

I’ve also tried calling the shared object from C code and using other OpenACC directives.
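For reference, a minimal C host for that test might look like the sketch below. The helper name call_init_gpu and the library path are hypothetical; dlopen, dlsym, dlclose and dlerror are the standard POSIX dynamic-loading API (on older glibc versions, link with -ldl).

```c
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>

/* Signature of init_gpu as exported by libinitgpu.so. */
typedef int32_t (*init_gpu_fn)(int32_t);

/* Hypothetical helper: load the library, call init_gpu(val), unload it.
 * Returns 0 and stores the result in *out on success; returns -1 if the
 * library or the symbol cannot be found (and leaves *out untouched). */
int call_init_gpu(const char *path, int32_t val, int32_t *out)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    /* POSIX requires the void* returned by dlsym to be convertible
     * to a function pointer. */
    init_gpu_fn init_gpu = (init_gpu_fn)dlsym(handle, "init_gpu");
    if (init_gpu == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    *out = init_gpu(val);

    /* As reported in this thread, unloading the library does not by
     * itself release the device context or its memory. */
    dlclose(handle);
    return 0;
}
```

With the library built as above, call_init_gpu("./libinitgpu.so", 5, &result) should store 4 in result, since the kernel computes val - 1.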

There needs to be a way to free the memory on the GPU without terminating the complete program.

Does anyone have an idea why this isn’t working?

Chris

Hi Chris,

Apologies for such a late reply. I was away at a conference last week and missed seeing your post.

My best guess is that the memory is being freed; the problem is that the device context is still open.

Can you try calling “acc_shutdown” once you’re done using the device?
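A minimal sketch of that suggestion, assuming the shared object exports two extra entry points (hypothetical names gpu_open and gpu_close, plus the query gpu_opened) that the JNI wrapper calls around a batch of work. acc_init, acc_shutdown and acc_device_nvidia come from the OpenACC runtime API in openacc.h; the _OPENACC guard is an assumption added so the file also builds as plain C, where only the bookkeeping runs.

```c
#include <stdbool.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical state tracking whether this library holds a device context. */
static bool gpu_is_open = false;

/* Hypothetical entry point: create the device context explicitly. */
void gpu_open(void)
{
    if (gpu_is_open)
        return;
#ifdef _OPENACC
    acc_init(acc_device_nvidia);
#endif
    gpu_is_open = true;
}

/* Hypothetical entry point: release the device context. Per this thread,
 * this is what frees the per-process GPU memory and lets the board
 * return to its idle state. */
void gpu_close(void)
{
    if (!gpu_is_open)
        return;
#ifdef _OPENACC
    acc_shutdown(acc_device_nvidia);
#endif
    gpu_is_open = false;
}

bool gpu_opened(void)
{
    return gpu_is_open;
}
```

The idempotent guards keep repeated calls from the Java side harmless; init_gpu itself would then run between gpu_open and gpu_close.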

  • Mat

Hi Mat,

thanks for your answer. This time I’m a little bit late with my reply.

Your guess is right. It works if I use acc_shutdown to close the device.

The problem is that it takes a long time to initialize the GPU again after calling acc_shutdown, and most of my GPU kernels run for only a few seconds.
I can’t achieve any speedup over the CPU if I call acc_shutdown after every kernel execution.
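The trade-off can be written as simple arithmetic. The symbols are introduced here for illustration: T_CPU is the CPU runtime of one task, T_GPU the GPU runtime including transfers, and T_init the device start-up time paid again after every acc_shutdown.

```latex
S = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{init}} + T_{\mathrm{GPU}}},
\qquad
S > 1 \iff T_{\mathrm{CPU}} - T_{\mathrm{GPU}} > T_{\mathrm{init}}
```

So shutting down after every kernel only pays off when each call saves more time than one re-initialization costs; with kernels that run for only a few seconds, that margin is easily consumed.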

Chris

Yes, it takes about a second per device to power the GPU back up. Unfortunately there’s not much that can be done. You’ll either need to keep the device powered up and use the extra power, or power it down and take the performance hit powering it back up each time.
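A middle ground that is not proposed in this thread, sketched here as an assumption: keep the context open while work arrives frequently and call acc_shutdown only after the device has been idle for some threshold, ideally well above the roughly one-second re-initialization cost quoted above. The type and function names are hypothetical, and times are plain seconds supplied by the caller (for example from clock_gettime with CLOCK_MONOTONIC), which keeps the policy testable without a GPU.

```c
#include <stdbool.h>

/* Hypothetical idle-timeout policy for one device context. */
typedef struct {
    double idle_limit_s;  /* idle time after which the context is released */
    double last_use_s;    /* time the last kernel finished */
    bool   context_open;  /* whether the context is currently initialized */
} gpu_idle_policy;

/* Call before launching work. Returns true if the caller must
 * (re)initialize the device first, e.g. with acc_init. */
bool policy_before_kernel(gpu_idle_policy *p)
{
    bool needs_init = !p->context_open;
    p->context_open = true;
    return needs_init;
}

/* Call after the kernel has finished. */
void policy_after_kernel(gpu_idle_policy *p, double now_s)
{
    p->last_use_s = now_s;
}

/* Call periodically from the thread that owns the device. Returns true
 * exactly once per idle period, when the caller should call acc_shutdown. */
bool policy_should_shutdown(gpu_idle_policy *p, double now_s)
{
    if (p->context_open && now_s - p->last_use_s >= p->idle_limit_s) {
        p->context_open = false;
        return true;
    }
    return false;
}
```

The struct is not thread-safe; if kernels are launched from several JNI threads, access to it would need a mutex. Whether acc_shutdown may safely be called from a different thread than the one that ran the kernels is something to verify against the PGI runtime.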

  • Mat

I have to ask once more because this seems a bit strange to me.
So the GPU frequency is only lowered after I’ve closed the device context, and it is completely normal that the memory is freed and usable by other GPU kernels only once the device context is closed?

Chris

The current Tesla GPUs don’t throttle the frequency; rather, the device is put into a stand-by state and powered down when the context is released.

Newer Maxwell-based GeForce devices have GPU Boost 2.0, which does adaptive frequency modulation, but this is not yet available on Tesla.