CUDA shared memory CNN (convolutional neural network)

rududoo · July 20, 2017, 10:14am

Hi,

I have a question regarding a CNN model that is loaded into VRAM.
The model is loaded by a process when it first starts (the process is a Linux application written in C++, compiled with gcc and linked against the CUDA libraries).

Is it possible to share the model loaded in VRAM among different processes? I want to launch another process that uses the same model without loading it into VRAM a second time (otherwise there would be two identical copies of the model in VRAM, occupying twice the memory). I would like to use something like shared-memory IPC on Linux: I create a shared memory segment in one process, which somehow maps the memory loaded in VRAM, and from another process I attach to that segment and so get access to the CNN model.

Regards,
Radu

Robert_Crovella · July 20, 2017, 11:31am

Yes, CUDA has an IPC facility that allows a device pointer, and the data (allocation) it represents, to be shared with CUDA code running in another (64-bit Linux) process.

There is a CUDA IPC API, and a CUDA IPC sample code demonstrating the necessary concepts.

rududoo · July 20, 2017, 11:37pm

Thanks for the response.

I’ve updated the code, and now I have a main process that creates the CNN.
For some layers of the CNN, those that are cudaMalloc-ed, I created IPC handles and saved them to a file (from the main process). Then I launched another process that read that file and used the model already loaded. So the IPC apparently worked.

I say apparently because if I launch a second process that reads those handles and attaches to the model in VRAM, and then yet another one, at some point I get an error like “misaligned address: Resource temporarily unavailable” in one of the processes (not the main one) and it crashes. After a while the same error is thrown by another of the launched processes, and it crashes too.

What could be the reason for this kind of misalignment “after a while”?

Could it be that the handles later end up pointing to invalid data? (Maybe I made a wrong assumption and something in the CNN model does change in memory. As I said, I am saving only some layers of the model, e.g. the convolution layers but not the max pool layers. I assumed that everything that was cudaMalloc-ed would never be freed, because I didn’t notice any place in the code where the model might be changed.)

Robert_Crovella · July 21, 2017, 12:15am

“Resource temporarily unavailable” may arise from an inability to fork() or otherwise create a new process. I can’t immediately suggest a cause for “misaligned address”, but this may be a cascade of errors: the process issue may give rise to a data interpretation problem.

This is really just speculation. I would investigate limits on creating new processes (actual limits, resource limits, running out of a resource like swap space, etc.) and carefully check for errors reported by either CUDA (are you doing careful CUDA error checking?) or any system calls you may be making.
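As a starting point for the resource-limit check, here is a minimal sketch that prints a few per-process limits from inside the application. The helper names `describe_limit` and `print_process_limits` are made up for illustration, and the choice of limits is an assumption about what might matter here, not a diagnosis.

```cpp
#include <sys/resource.h>
#include <cassert>
#include <cstdio>
#include <string>

// Formats one limit as "<name>: soft=<n> hard=<n>" ("unlimited" for RLIM_INFINITY).
// Hypothetical helper, not part of any library.
std::string describe_limit(const char* name, int resource) {
    assert(name != nullptr);
    struct rlimit lim;
    if (getrlimit(resource, &lim) != 0) {
        return std::string(name) + ": getrlimit failed";
    }
    auto show = [](rlim_t v) {
        return v == RLIM_INFINITY ? std::string("unlimited") : std::to_string(v);
    };
    return std::string(name) + ": soft=" + show(lim.rlim_cur) +
           " hard=" + show(lim.rlim_max);
}

// Prints the limits most likely to matter when fork() or thread creation fails.
void print_process_limits() {
    std::puts(describe_limit("RLIMIT_NPROC", RLIMIT_NPROC).c_str());     // processes/threads per user
    std::puts(describe_limit("RLIMIT_NOFILE", RLIMIT_NOFILE).c_str());   // open file descriptors
    std::puts(describe_limit("RLIMIT_AS", RLIMIT_AS).c_str());           // virtual address space
    std::puts(describe_limit("RLIMIT_MEMLOCK", RLIMIT_MEMLOCK).c_str()); // locked (pinned) memory
}
```

Calling `print_process_limits()` once at startup in both the main and the attached processes would show whether they run under different limits, for example when one is started from a service and the other from a shell.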

If that turns up nothing useful, I’m out of ideas and I would suggest a minimal reproducer might be in order. Such a problem may depend on exact OS and exact OS settings (e.g. resource limits) and maybe even other things like amount of system memory.

If all of this fails, consider having a master process that fields work from the other processes over ordinary Linux IPC and then issues that work to the GPU from a single process.
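The single-owner layout could be sketched roughly as follows, using a Unix socket pair and fork(). Everything here is illustrative: `predict` is a hypothetical stand-in for running the network on the GPU, the `Request`/`Response` structs are invented, and a real deployment would more likely use a named Unix domain socket so that unrelated processes can connect.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

struct Request  { std::int32_t frame_id; };
struct Response { std::int32_t frame_id; std::int32_t label; };

// Hypothetical stand-in for passing a frame through the net on the GPU.
static std::int32_t predict(std::int32_t frame_id) { return frame_id % 10; }

// Loop run by the one process that owns the GPU and the loaded model.
// Messages are tiny, so this sketch does not handle partial reads or writes.
static void serve(int fd) {
    Request req;
    while (read(fd, &req, sizeof req) == static_cast<ssize_t>(sizeof req)) {
        Response resp{req.frame_id, predict(req.frame_id)};
        if (write(fd, &resp, sizeof resp) != static_cast<ssize_t>(sizeof resp)) break;
    }
}

// Forks the GPU owner. Returns the client's end of the socket, or -1 on failure.
int start_gpu_owner(pid_t* owner_pid) {
    assert(owner_pid != nullptr);
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }
    if (pid == 0) {
        // Child: the owner. A real owner would initialize CUDA and load the
        // model here, after fork(), since a CUDA context is not usable across fork().
        close(fds[0]);
        serve(fds[1]);
        _exit(0);
    }
    close(fds[1]);
    *owner_pid = pid;
    return fds[0];
}

// Client side: send one request and wait for the answer.
bool submit_work(int fd, std::int32_t frame_id, Response* out) {
    Request req{frame_id};
    if (write(fd, &req, sizeof req) != static_cast<ssize_t>(sizeof req)) return false;
    return read(fd, out, sizeof *out) == static_cast<ssize_t>(sizeof *out);
}

// Closing the client's end delivers EOF to the owner, which ends its serve loop.
void stop_gpu_owner(int fd, pid_t owner_pid) {
    close(fd);
    waitpid(owner_pid, nullptr, 0);
}
```

With this layout only one copy of the model lives in VRAM and no device pointers cross process boundaries, at the cost of copying each frame and result through the socket.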

rududoo · July 21, 2017, 12:43am

Regarding resources: without this ‘hack’ I could have up to 8 processes running without any problems. The only issue was that each process loaded the same convolutional net, which resulted in VRAM usage of 7 GB out of 8 GB.
So I thought it would be a good idea to share the model among the processes.
Now, with this partial sharing (as I said, I am not mapping the whole net; I wanted to check incrementally whether it works), I would save at least 3 GB of VRAM across 8 processes, but as I discovered, this sharing might not be trivial.

I am trying to find a starting point to debug this or at least to understand the crash.
As an observation, when the crash occurs I can notice in the system log:

(GPC 3, TPC 3): Physical Multiple Warp Errors
[234355.055134] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x51de48=0x3000f 0x51de50=0x24 0x51de44=0xd3eff2 0x51de4c=0x17f
[234355.055173] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 4): Misaligned Address
[234355.055177] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 3, TPC 4): Physical Multiple Warp Errors
[234355.055180] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x51e648=0xf 0x51e650=0x24 0x51e644=0xd3eff2 0x51e64c=0x17f
[234355.056566] NVRM: Xid (PCI:0000:01:00): 43, Ch 00000030, engmask 00000101

If I start the main process and then, after 30 minutes, start a ‘slave’ that reads the handles saved by main, everything looks fine: the model works perfectly in the slave, and the process occupies less VRAM than before sharing. (I waited 30 minutes because I suspected that the handles used in the slave might become inconsistent, that is, that during those 30 minutes the main process might have changed the model in VRAM so that the saved handles would be invalid for the slave.)

Keefedc · July 21, 2017, 3:06am

Of course CUDA will allow you to do that.

rududoo · July 21, 2017, 9:13am

I saw something interesting when I used cuda-memcheck.
I said in my previous post that if I launched only one process after the main one, I didn’t see any crash.
With cuda-memcheck applied to the second process (which reads the handles and so does not reload the whole net into VRAM), that process crashes within a couple of seconds with a ‘misaligned address’ error. The trace looks like this:

======== Misaligned Shared or Local Address
========= at 0x00000570 in maxwell_scudnn_128x32_relu_small_nn
========= by thread (32,0,0) in block (40,0,0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x213a85]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x9b6241]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x9d5053]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x6f0a4e]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x3d35b7]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x3d541b]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x360f3d]
========= Host Frame:/usr/local/lib/libcudnn.so.6 [0x5ca21]
========= Host Frame:/usr/local/lib/libcudnn.so.6 (cudnnConvolutionForward + 0x69) [0x5d2d9]
========= Host Frame:/home/rududoo/app_no_gpu [0x2b15b5]
========= Host Frame:/home/rududoo/app_no_gpu [0x2b4213]
========= Host Frame:/home/rududoo/app_no_gpu [0x2b595d]
========= Host Frame:/home/rududoo/app_no_gpu [0x18ef4d]
========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x1073dd]

CUDA Error: unspecified launch failure
CUDA Error: unspecified launch failure: Resource temporarily unavailable
OpenCV Error: Gpu API call (NCV Assertion Failed: cudaError_t=29, file=/home/rududoo/opencv2.4.13/modules/gpu/src/nvidia/core/NCV.cu, line=487
) in NCVDebugOutputHandler, file /home/rududoo/opencv2.4.13/modules/gpu/src/cascadeclassifier.cpp, line 173
OpenCV Error: Gpu API call (NCV Assertion Failed: cudaError_t=29, file=/home/rududoo/opencv2.4.13/modules/gpu/src/nvidia/core/NCV.cu, line=487
) in NCVDebugOutputHandler, file /home/rududoo/opencv2.4.13/modules/gpu/src/cascadeclassifier.cpp, line 173
terminate called after throwing an instance of ‘cv::Exception’
what(): /home/rududoo/opencv2.4.13/modules/gpu/src/cascadeclassifier.cpp:173: error: (-217) NCV Assertion Failed: cudaError_t=29, file=/home/rududoo/opencv2.4.13/modules/gpu/src/nvidia/core/NCV.cu, line=487
in function NCVDebugOutputHandler

========= Error: process didn’t terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found

So there is an issue with this IPC implementation and the issue appears immediately when using the cuda-memcheck tool.
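One detail of the output above may be worth ruling out. The line “CUDA Error: unspecified launch failure: Resource temporarily unavailable” has the shape that perror() produces: the message, a colon, then the text for whatever value errno holds at that moment. The CUDA runtime reports failures through its cudaError_t return values, so if the application prints CUDA failures via perror() (an assumption, since the error handler is not shown in the thread), the suffix could be a stale EAGAIN left behind by some unrelated earlier call. The sketch below only reproduces the formatting; `perror_text` and `simulate_report` are hypothetical helpers.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Builds the text perror(msg) would print: "<msg>: <strerror(errno)>".
std::string perror_text(const std::string& msg) {
    assert(!msg.empty());
    return msg + ": " + std::strerror(errno);
}

// Simulates the suspected sequence: an earlier, unrelated call leaves EAGAIN
// in errno, and a later CUDA failure is then reported perror-style.
std::string simulate_report() {
    errno = EAGAIN;  // assumption: left over from e.g. a non-blocking read
    return perror_text("CUDA Error: unspecified launch failure");
}
```

If that is what happens here, the “Resource temporarily unavailable” part says nothing about the GPU fault itself, and the misaligned-address report from cuda-memcheck is the part to follow.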

rududoo · July 21, 2017, 11:31am

Right now I am stuck.

Basically the IPC is like this:

if (!mattach) {
    // Main process: allocate the workspace and export an IPC handle for it.
    net.workspace = cuda_make_array(0, (workspace_size-1)/sizeof(float)+1);
    cudaIpcMemHandle_t handle;
    memset(&handle, 0, sizeof(handle));
    cudaIpcGetMemHandle(&handle, net.workspace);

    // Write the opaque handle to the file one byte at a time.
    for (size_t i = 0; i < sizeof(handle); i++) {
        int ret = fprintf(fp, "%c", handle.reserved[i]);
        if (ret != 1)
            printf("ret = %d\n", ret);
    }
} else {
    // Attached process: read the handle back and map the existing allocation.
    cudaIpcMemHandle_t handle;
    memset(&handle, 0, sizeof(handle));
    printf("sizeof handle %zu\n", sizeof(handle));
    for (size_t i = 0; i < sizeof(handle); i++) {
        int ret = fscanf(fp, "%c", handle.reserved + i);
        if (ret == EOF)
            printf("received EOF\n");
        else if (ret != 1)
            printf("fscanf returned %d\n", ret);
    }
    cudaIpcOpenMemHandle((void **)&net.workspace, handle, cudaIpcMemLazyEnablePeerAccess);
}
fclose(fp);

So the main process takes the ‘then’ branch, and each subsequently launched process takes the ‘else’ branch (it attaches to the VRAM allocation using the handle).

Before this code, there was only this line: net.workspace = cuda_make_array(0, (workspace_size-1)/sizeof(float)+1); - so for each process new space was allocated from VRAM.

I do not understand how and why the address becomes misaligned.

The crash occurs later, after the network is loaded, during the call to a ‘prediction function’ that passes a frame through the net for prediction (see the call stack in my previous post).
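For the handle exchange itself, a simpler and easier-to-check alternative to the byte-by-byte loop is to write and read the handle as one binary record. The sketch below is host-only and uses `OpaqueHandle`, a stand-in with the same 64-byte layout as cudaIpcMemHandle_t, so it can be tried without a GPU; the function names are invented for illustration. It would not by itself explain the misaligned-address fault, but it removes the file transfer as a suspect.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Stand-in for cudaIpcMemHandle_t, which is an opaque struct of 64 reserved bytes.
struct OpaqueHandle { char reserved[64]; };
static_assert(sizeof(OpaqueHandle) == 64, "handle must be exactly 64 bytes");

// Exporter side: write the whole handle as one binary record.
bool write_handle(const char* path, const OpaqueHandle& h) {
    assert(path != nullptr);
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return false;
    bool ok = std::fwrite(&h, sizeof h, 1, fp) == 1;
    return std::fclose(fp) == 0 && ok;  // fclose flushes; its failure also counts
}

// Importer side: succeeds only if all 64 bytes were read.
bool read_handle(const char* path, OpaqueHandle* h) {
    assert(path != nullptr && h != nullptr);
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return false;
    bool ok = std::fread(h, sizeof *h, 1, fp) == 1;
    std::fclose(fp);
    return ok;
}
```

In the real application the importer would pass the handle it read to cudaIpcOpenMemHandle and check the returned cudaError_t before using the pointer. The mapping also stays valid only while the exporting process keeps the allocation alive.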
