Runtime error caused by dynamic loading of 2 shared objects


I always get a runtime error when I execute my program. I think the error is caused by the dynamic loading of two shared objects that contain OpenACC code. I’ve been dealing with this problem for a few days, but I can’t find my mistake.

I’m using pgcc 14.7 to compile the OpenACC code and create the shared objects for an NVIDIA Kepler GPU.

Here is the output of the program:

SourceImage: lenna.tiff 
TargetImage: lenna.png
Algorithm: service_dim 
Resource: gpu 
Load Library: 
Load Library: 
Read source image: lenna.tiff width: 512 height: 512
Load Function: run_service_dim_gpu 
Executed run_service_dim_gpu(). Status Code = 0
Wrote target image: lenna.png
Read source image: lenna.tiff width: 512 height: 512
Load Function: run_service_grey_gpu 
call to cuModuleGetFunction returned error 500: Not found

First, both libraries are opened with dlopen, as you can see in the following example:

void *handler = dlopen(, RTLD_LAZY| RTLD_GLOBAL);
void *handler2 = dlopen("", RTLD_LAZY | RTLD_GLOBAL );

This works fine. In the next step I load the function from the first shared object with dlsym():

run_service = (run_service_t) dlsym(handler,"run_service_dim_gpu");

run_service_t is defined as

 typedef int32_t (*run_service_t)(PixelPacket *, PixelPacket *, int32_t, int32_t);

As you can see in the output above, everything works fine: I can call the function without any problems.
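For reference, here is a minimal sketch of how the dlopen() and dlsym() calls could be wrapped so that a failure is reported through dlerror() instead of silently yielding a null pointer. The names `open_library` and `lookup_symbol` are hypothetical helpers for illustration only; they are not part of the program described in this post.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: open a shared object with the same flags as in the
 * post and print the dlerror() text if it cannot be loaded. */
static void *open_library(const char *path)
{
    void *handle = dlopen(path, RTLD_LAZY | RTLD_GLOBAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
    }
    return handle;
}

/* Hypothetical helper: look up `name` through `handle`.
 * Returns NULL and prints the dlerror() text if the lookup fails. */
static void *lookup_symbol(void *handle, const char *name)
{
    dlerror();                        /* clear any stale error state */
    void *sym = dlsym(handle, name);
    const char *err = dlerror();      /* non-NULL only if dlsym failed */
    if (err != NULL) {
        fprintf(stderr, "dlsym(%s) failed: %s\n", name, err);
        return NULL;
    }
    return sym;
}
```

With these helpers, the lookup from the post would read `run_service = (run_service_t) lookup_symbol(handler, "run_service_dim_gpu");` followed by a NULL check. Since dlsym() searches the object behind the given handle and its dependencies, each function would be looked up through the handle of the library that defines it.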

But then I do the same for the function in the second shared object:

run_service = (run_service_t) dlsym(handler2,"run_service_grey_gpu");

and call the function with:

statuscode=run_service(pixpack_target, pixpack_source, rows, columns);

Now I just get the error message:

call to cuModuleGetFunction returned error 500: Not found

At the end I close the libraries, but I never reach this point.


I’ve already tried opening the first library, calling the function and closing the first library before opening the second library and calling the second function.

This works, but later on I want to open more than two libraries and execute all the functions in a random order. Closing a library and reopening it later would take unnecessarily long.

Here is one example for the OpenACC code, the other function is very similar:


#include "service_grey.h"

int32_t run_service_grey_gpu(PixelPacket *pixpack_target, PixelPacket *pixpack_source, int32_t rows, int32_t columns) {
	// Transform RGB to grey picture
	int32_t pos,i,j;
	#pragma acc kernels copyin(pixpack_source[0:columns*rows]) copyout(pixpack_target[0:columns*rows])
	#pragma acc loop independent
	for (i=0; i<rows; ++i) {
		#pragma acc loop independent private(pos)
		for (j=0; j<columns; ++j) {
			pos = i*columns+j;
			// Each channel receives the average of the three source channels
			pixpack_target[pos].red = (pixpack_source[pos].red+pixpack_source[pos].green+pixpack_source[pos].blue)/3;
			pixpack_target[pos].green = (pixpack_source[pos].red+pixpack_source[pos].green+pixpack_source[pos].blue)/3;
			pixpack_target[pos].blue = (pixpack_source[pos].red+pixpack_source[pos].green+pixpack_source[pos].blue)/3;
			pixpack_target[pos].opacity = 0;
		}
	}
	return 0;
}
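As a side note, a variant of this kernel that can be checked on the CPU without ImageMagick or a GPU might look like the sketch below. It is not the code from the shared object: `PixelPacket` here is a stand-in struct with 16-bit channels (an assumption, since the real layout depends on the ImageMagick build), `run_service_grey` is a hypothetical name, and the two loops are flattened into one. A compiler without OpenACC support ignores the pragma and runs the loop on the host.

```c
#include <stdint.h>

/* Stand-in for ImageMagick's PixelPacket (assumption: 16-bit channels). */
typedef struct {
    uint16_t blue, green, red, opacity;
} PixelPacket;

/* Hypothetical variant: compute the average once per pixel and reuse it. */
int32_t run_service_grey(PixelPacket *target, const PixelPacket *source,
                         int32_t rows, int32_t columns)
{
    int32_t n = rows * columns;
    int32_t pos;

    #pragma acc kernels loop independent copyin(source[0:n]) copyout(target[0:n])
    for (pos = 0; pos < n; ++pos) {
        /* Sum in a wider type so three 16-bit channels cannot overflow. */
        uint32_t sum = (uint32_t)source[pos].red
                     + (uint32_t)source[pos].green
                     + (uint32_t)source[pos].blue;
        uint16_t grey = (uint16_t)(sum / 3);

        target[pos].red = grey;
        target[pos].green = grey;
        target[pos].blue = grey;
        target[pos].opacity = 0;
    }
    return 0;
}
```

Running this on a small hand-made image makes it possible to compare the GPU result against a known-good host result.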

I’m using the following commands to create the shared objects:

pgcc -acc -ta=tesla,cuda6.0,keep -O0 -Minfo -fPIC -c -o service_grey_gpu.o ../csrc/service_grey/service_grey_gpu.c
pgcc -acc -ta=tesla,cuda6.0,keep -O0 -Minfo -shared -o service_grey_gpu.o

There are no errors or warnings during compilation. Here is the output:

      8, Generating copyin(pixpack_source[:rows*columns])
         Generating copyout(pixpack_target[:rows*columns])
         Generating Tesla code
     11, Loop is parallelizable
     13, Loop is parallelizable
         Accelerator kernel generated
         11, #pragma acc loop gang /* blockIdx.y */
         13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
pgcc -acc -ta=tesla,cuda6.0,keep -O0 -Minfo -shared -o service_grey_gpu.o

I’ve also tried compiling the shared objects without the -acc flag, so that the functions run on the CPU. This works well, but I want to use the GPU.

So my assumption is that the problem lies in the combination of loading the libraries and executing on the GPU.

Hopefully someone has an idea why this isn’t working.


I still haven’t solved this problem.

Hi Chris,

I’ve been looking into this and asking folk here for ideas. I’ll probably need to create a reproducing example in order to understand what’s wrong.

I’ll post once I have more information.

- Mat

Hi Mat,

Have you gotten any further with my problem?

In the meantime I’ve run into a new problem, which might be related to my old one.

As I’ve already written, I’m executing different shared objects that contain OpenACC code.

Since I am still not able to open all the shared objects at the same time, I load the first shared object, execute it, close it and open the next shared object. I’ve already described how I open the shared objects in the first post.

Now my problem is that the GPU memory doesn’t get deallocated.
I’m using about 2 GB of memory per execution, so I can execute the shared object two times and on the third run I get this error:

call to cuMemAlloc returned error 2: Out of memory
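For readers decoding these messages: the numbers are CUresult values from the CUDA driver API. The sketch below maps the two codes seen in this thread to their names. `cu_error_name` is a hypothetical helper, and the values are copied by hand from cuda.h rather than included from it, so that the snippet builds without the CUDA toolkit.

```c
/* Hypothetical helper: name a small subset of CUresult codes.
 * Values copied from the CUDA driver API header (cuda.h). */
static const char *cu_error_name(int code)
{
    switch (code) {
    case 0:   return "CUDA_SUCCESS";
    case 2:   return "CUDA_ERROR_OUT_OF_MEMORY";  /* seen with cuMemAlloc */
    case 500: return "CUDA_ERROR_NOT_FOUND";      /* seen with cuModuleGetFunction */
    default:  return "unknown (see cuda.h)";
    }
}
```

So error 2 from cuMemAlloc is the driver reporting that the allocation did not fit, and error 500 from cuModuleGetFunction in the first post means the requested kernel name was not found in the loaded module.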

The command “pgaccelinfo” also shows me that the GPU is running out of memory. Here is the complete output of pgaccelinfo at the beginning:

CUDA Driver Version:           6000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  331.62  Wed Mar 19 18:20:03 PDT 2014

Device Number:                 0
Device Name:                   Tesla K20c
Device Revision Number:        3.5
Global Memory Size:            5032706048
Number of Multiprocessors:     13
Number of SP Cores:            2496
Number of DP Cores:            832
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    705 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             2600 MHz
Memory Bus Width:              320 bits
L2 Cache Size:                 1310720 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1705460 microseconds
Current free memory:           4951470080
Upload time (4MB):             3316 microseconds ( 706 ms pinned)
Download time:                 1579 microseconds ( 678 ms pinned)
Upload bandwidth:              1264 MB/sec (5940 MB/sec pinned)
Download bandwidth:            2656 MB/sec (6186 MB/sec pinned)
PGI Compiler Option:           -ta=tesla:cc35
clGetDeviceIDs returns code -1

After one execution the amount of free memory is:
Current free memory: 2689093632
After the next execution it changes to
Current free memory: 497475584
And with the next execution I get the out of memory error.
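The three readings can be turned into a per-run figure with a little arithmetic. The sketch below is only illustrative; `bytes_consumed` and `runs_remaining` are hypothetical helper names, and the inputs are the "Current free memory" values quoted above.

```c
#include <stdint.h>

/* Bytes that stayed allocated between two "Current free memory" readings. */
static uint64_t bytes_consumed(uint64_t free_before, uint64_t free_after)
{
    return free_before - free_after;
}

/* Number of further runs that still fit if each run keeps `per_run` bytes. */
static uint64_t runs_remaining(uint64_t free_now, uint64_t per_run)
{
    return per_run == 0 ? UINT64_MAX : free_now / per_run;
}
```

With the values above, the first run keeps 4951470080 - 2689093632 = 2262376448 bytes and the second keeps 2689093632 - 497475584 = 2191618048 bytes, so roughly 2.2 GB per run. The remaining 497475584 bytes are less than one run needs, which matches the failure on the third execution.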

You can see my OpenACC code in the first post. As you can see, I’m using the copyin and copyout clauses to manage the data transfer. The OpenACC API specification clearly states that the memory is allocated automatically and deallocated at the end of the region.

I’ve tried to use the #pragma acc exit data directive with a delete clause to free the memory, without any success.
I’ve also used the acc_allocs() and acc_frees() functions, which show that memory gets allocated two times and freed two times.

The only ways to free the memory are to terminate the program that loads and executes the shared objects, or to call acc_shutdown() after each GPU kernel execution.

Terminating the program is obviously not really an option. Calling acc_shutdown() is also not a solution for me, because shutting down and initializing the GPU again takes about 2 seconds.

So is there any other way to force the GPU to free all memory?
And why isn’t the memory freed automatically?