CUDA IPC within a Python C module fails when combined with Theano.

Hello everyone,

I have a problem that is at the intersection of a number of tools and technologies, so I am having a hard time solving it. The problem is described in great detail (including code samples) on a Stack Overflow post I made:

http://stackoverflow.com/questions/40700635/cuda-ipc-memcpy-mpi-fails-in-theano-works-in-pycuda

What I am trying to do is to perform a peer-to-peer memcpy in Python between GPU memory buffers that were declared through Theano’ss CudaNdarray. To this end, I have created a simple python module that takes as input the pointer to the array that backs a CudaNdarray buffer, creates a cudaIpcMemHandle_t with it, exchanges said handle with another process via MPI, gets a UVA pointer to the foreign buffer with cudaOpenMemHandle, and then attempts a cudaMemcpy to move the foreign buffer into a local array. Basically, after all of this, the two processes should have exchanged buffers. It sounds a little strange: using MPI two transfer handles and the trying to do a p2p Memcpy, when I could just use CUDA-aware MPI to perform the transfer for me. This I have done successfully, but my mission is to understand these tools on a deeper level and the task at hand is an exercise to that effect.

So, I decided to test the module in two different ways: using memory allocated using CudaNdarray and using pycuda to allocate memory with its mem_alloc method. The interesting thing is the pycuda example works but the Theano version does not! It doesn’t crash: the result of the cudaMemcpy is that the local buffer is filled with garbage (when it was initialized to all 0’s). It almost looks like the device is copying data from the wrong place. The buffers I use, however, are very large, so I would expect that if it were trying to copy from the wrong place, it would result in a seg fault

At the end of the days, my Python module receives as input just a pointer, so it seems to me that it shouldn’t matter how the memory is allocated, so long as the pointer is correct. Interestingly, if I try to move the buffers around with OpenMPI (which is CUDA-aware) it works, so I know the problem isn’t that I am not getting the underlying pointer correctly. Equally, if I do an equivalent test in C++ using the same function call as is used in my Python module, it also works; it looks like Theano is doing something extra that I don’t understand.

My only guess had to do with CUDA contexts: was it possible, I wondered, that Theano did some fancy footwork with pushing and popping contexts? However, after grepping around its code, I saw nothing that supported this theory: Theano seems to wait for the first application-level CUDA call to create a default context and just leaves it at that.

The only other hint of an idea I have is this: I noticed that when you launch Theano, it spawns several threads (I saw 5 in addition to the main thread, so 6 total). Is it possible that this is somehow responsible for my woes? My understanding after CUDA 4, threads in the same processes share a context, so I again don’t see how that might be an issue.

Here is the relevant portion of the Python C module. The _sillycopy is called by a function named “sillycopy” which is not shown here; sillycopy deals with some Python boilerplate to parse arguments and is the method that is actually exposed to the Python language when the module is imported.

void _sillycopy(float *source, float* dest, int n, MPI_Comm comm) {
 int localRank;
 int localSize;
 MPI_Comm_rank(comm, &localRank);
 MPI_Comm_size(comm, &localSize);

 //  Figure out which process is to the "left".
 // m() performs a mod and treats negative numbers
 // appropriately 
 int neighbor = m(localRank - 1, localSize); 

 // Create a memory handle for *source and do a
 // wasteful Allgather to distribute to other processes
 // (could just use an MPI_Sendrecv, but irrelevant right now)
 cudaIpcMemHandle_t *memHandles = new cudaIpcMemHandle_t[localSize];
 cudaIpcGetMemHandle(memHandles + localRank, source);
 MPI_Allgather(
  memHandles + localRank, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
  memHandles, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
  comm);

 // Open the neighbor's mem handle so we can do a cudaMemcpy
 float *sourcePtr;
 cudaIpcOpenMemHandle((void**)&sourcePtr, memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess);    

 // Copy!
 cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault);
 cudaIpcCloseMemHandle(sourcePtr);
 delete [] memHandles;
}

Here is the pycuda code that does work.

mpi.init()
drv.init()
# Make sure each process uses a different GPU
dev = drv.Device(mpi.rank())  
ctx = dev.make_context()
atexit.register(ctx.pop)
shape = (2**26,)

# allocate host memory
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)

# allocate device memory
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)

# copy host to device
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)

# A few more host buffers
a_p = np.zeros(shape, np.float32)
b_p = np.zeros(shape, np.float32)

# Sanity check: this should fill a_p with 1's
drv.memcpy_dtoh(a_p, a_gpu)
# Verify that
print(a_p[0:10])

# Note that int(a_gpu) returns a pointer to the a_gpu's underlying
# device array as a long integer.  sillycopy parses this and casts
# it to a float*

sillymodule.sillycopy(
    int(a_gpu),
    int(b_gpu),
    shape[0])

# After this, b_p should have all one's
drv.memcpy_dtoh(b_p, b_gpu) 
print(c_p[0:10])

And now here is the Thano code that doesn’t work. Note that it is a nearly identical copy of the pycuda example.

import os
import simplempi as mpi
mpi.init()

# select's one gpu per process
os.environ['THEANO_FLAGS'] = "device=gpu{}".format(mpi.rank())
import theano.sandbox.cuda as cuda
import time
import numpy as np
import time
import sillymodule

shape = (2 ** 24, )

# Allocate host data
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)

# Allocate device data
a_gpu = cuda.CudaNdarray.zeros(shape)
b_gpu = cuda.CudaNdarray.zeros(shape)


# Copy from host to device
a_gpu[:] = a[:]
b_gpu[:] = b[:]

# Should print 1's as a sanity check
print(np.asarray(a_gpu[0:10]))

# a_gpu.gpudata is equivalent to pycuda's int(a_gpu): it just gets me the pointer
# to the underlying device memory.
 
sillymodule.sillycopy(
    a_gpu.gpudata,
    b_gpu.gpudata,
    shape[0])

# Should print 1's
print(np.asarray(b_gpu[0:10]))