cuMemAlloc failed: not initialized

Hey All!

I am trying to get GPGPU code up and running on Ubuntu 12.04, on an Amazon EC2 G2 instance with a GRID card. deviceQuery reports:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          6.0 / 6.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Clock rate:                                797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

I am able to compile and run the code included in the CUDA samples.

I am using PyCUDA 2013.1. As a regular user, I can run the examples included with it.
In an interactive Python shell, I can enter a simple program line by line and it loads, computes, and unloads from the GPU successfully.
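
For reference, the kind of thing that works in the interactive shell is roughly this (a minimal sketch with a trivial doubling kernel, not my actual code):

import numpy
import pycuda.driver as cuda
import pycuda.autoinit    # creates and activates a context on device 0
from pycuda.compiler import SourceModule

# Trivial kernel: double every element in place
mod = SourceModule("""
__global__ void twice(float *a)
{
    a[threadIdx.x] *= 2.0f;
}
""")
twice = mod.get_function("twice")

a = numpy.random.randn(32).astype(numpy.float32)
d_a = cuda.mem_alloc(a.nbytes)     # no problem here, run as a regular user
cuda.memcpy_htod(d_a, a)
twice(d_a, block=(32, 1, 1), grid=(1, 1, 1))

result = numpy.empty_like(a)
cuda.memcpy_dtoh(result, d_a)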

My trouble starts when I attempt to do the same thing in a task-oriented context: Django as the web framework, Celery as the task manager, and a simple function decorated with @task:

import time
import pycuda.driver as cuda
import pycuda.autoinit    # sets up the CUDA context at import time
from pycuda.compiler import SourceModule
import numpy

@task
def color_shift_average(image, shift, direction="fromWhite", log=1):
    if log == 1:
        print ("----------> CUDA CONVERSION")

    # Pull the pixel data out of the image into a float32 array
    px = numpy.array(image)
    print px
    px = px.astype(numpy.float32)

    # Allocate device memory and copy the pixels to the GPU
    d_px = cuda.mem_alloc(px.nbytes)
    cuda.memcpy_htod(d_px, px)

    # Kernel grid and block size: one thread per pixel, with a bounds check
    BLOCK_SIZE = 1024
    block = (BLOCK_SIZE, 1, 1)
    checkSize = numpy.int32(image.size[0] * image.size[1])
    grid = (int(image.size[0] * image.size[1] / BLOCK_SIZE) + 1, 1, 1)

    # Kernel source text
    kernel = """

    #include <stdlib.h>
    #include <stdio.h>
    .........
    """
    # Compile the kernel and get a handle to the function
    mod = SourceModule(kernel)
    func = mod.get_function("foo")
    #    image, L,                     a,                   b
    func(d_px, numpy.float32(-10.0), numpy.float32(0.0), numpy.float32(0.0),
         checkSize, block=block, grid=grid)

    # Copy the result back from the GPU and convert to bytes
    bwPx = numpy.empty_like(px)
    cuda.memcpy_dtoh(bwPx, d_px)
    bwPx = bwPx.astype(numpy.uint8)
    .... (and so on) ...

I get a cuMemAlloc failed: not initialized error on the line containing

d_px = cuda.mem_alloc(px.nbytes)

Things I have tried:

  • Doing the imports at task level instead of module level
  • Decreasing concurrency to a single worker (--concurrency=1)
  • Checking permissions on /dev/nv* (All 666)
  • Explicitly initializing a CUDA context instead of relying on pycuda.autoinit (a sketch of how a task picks this up follows the list):
    import pycuda.driver as cuda

    # Initialize CUDA
    cuda.init()

    from pycuda.tools import make_default_context
    global context
    context = make_default_context()
    device = context.get_device()

    def _finish_up():
        global context
        context.pop()
        context = None

        from pycuda.tools import clear_context_caches
        clear_context_caches()

    import atexit
    atexit.register(_finish_up)
    
  • Reaching out to the PyCUDA developers on their mailing list
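
For the explicit-init attempt, the task-side wiring looks roughly like this (a sketch; "cuda_setup" is just a placeholder name for a module holding the init code from the last bullet):

import cuda_setup   # hypothetical module containing the init code above

@task
def color_shift_average(image, shift, direction="fromWhite", log=1):
    cuda_setup.context.push()    # make the shared context current in this thread
    try:
        # ... mem_alloc / kernel launch / memcpy, as in the code above ...
        pass
    finally:
        cuda_setup.context.pop()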

Because the same code runs in a stand-alone Python file (i.e., not in the Django/Celery/Apache-WSGI environment), I know it is not an issue with my kernel itself. I assume this is a permissions, threading, or user issue, but I am unsure how to go about testing that assumption and fixing it. I could use some expertise here.
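
If it helps to reproduce this, a stripped-down version of the failing path (using the task-level imports from the first bullet above; the task name is just illustrative) looks like this:

from celery import task   # same @task decorator as above
import numpy

@task
def cuda_smoke_test():
    # Task-level imports, as in the first thing I tried
    import pycuda.driver as cuda
    import pycuda.autoinit    # should create a context inside the worker process

    a = numpy.zeros(16, dtype=numpy.float32)
    d_a = cuda.mem_alloc(a.nbytes)   # <-- cuMemAlloc failed: not initialized
    cuda.memcpy_htod(d_a, a)
    return "ok"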

Thanks!
-Forrest