cuMemAlloc failed: unspecified launch failure in code that previously worked

Hello,

I have the following PyCUDA code, which doesn’t work. Don’t pay attention to the number of blocks I use; this is just test code and I only want it to run for the moment. I know that with this number of blocks I will not multiply my matrices correctly.

The very strange thing about this code is that the first time I run it, I get an error on the line just before the end:

print(c_gpu.get())

The error is: LaunchError: cuMemcpyDtoH failed: unspecified launch failure

But when I run it a second time, without having changed anything, the error is on the line

a_gpu = gpuarray.to_gpu(a_cpu)

And this time it is:

LaunchError: cuMemAlloc failed: unspecified launch failure

Also, I downloaded some example scripts from the PyCUDA documentation that work when I run them on their own, but if I run my program first and then one of the example programs, I also get the error:

LaunchError: cuMemAlloc failed: unspecified launch failure

as soon as the example tries to allocate memory on the GPU.

Do you know what causes this problem? My feeling is that I somehow ask for too much memory at first, which makes CUDA refuse any further allocation afterwards, but I don’t see where I would have made such a mistake here.
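
To test that idea, this is the kind of check I was planning to run around my allocations. It is just a sketch; as far as I understand, pycuda.driver.mem_get_info() reports the free and total memory of the current context:

#!python
    import pycuda.autoinit  # creates the CUDA context
    import pycuda.driver as drv

    # How much GPU memory is free before I allocate anything?
    free_before, total = drv.mem_get_info()
    print("before: %d MB free out of %d MB" % (free_before // 1024**2, total // 1024**2))

    # ... the allocations and the kernel call from the code below would go here ...

    # And how much is free afterwards?
    free_after, _ = drv.mem_get_info()
    print("after:  %d MB free out of %d MB" % (free_after // 1024**2, total // 1024**2))

If the “after” number were close to zero, that would support the too-much-memory theory, but for 3x3 matrices I doubt it.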

What’s more, my CUDA kernel works perfectly in C++ (I tried to “translate” everything to PyCUDA, and that is where the problems started…).

#!python
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import numpy as np
    from pycuda import driver, compiler, gpuarray, tools
    import math
    
    # -- initialize the device
    import pycuda.autoinit
    
    kernel_code_template = """
    __global__  void MatMult(float* C, float* A, float*B, int dimAx, int dimBx, int dimCx, int dimCy)
    {
      int row = blockDim.y*blockIdx.y+threadIdx.y;
      int col = blockDim.x*blockIdx.x+threadIdx.x;
    
    	double Result = 0;
    
    	if (row<=dimCy-1 && col<=dimCx-1)
    	{
    		for (int k = 0; k < dimAx; k++)
    		{
    			Result += A[k + dimAx*row] * B[col + dimBx*k];
    		}
    
    		C[col + row*dimCx] = Result;
    	}
    }
    """
    
    MATRIX_SIZE = 3

    # I create my variables (float32, to match the float* parameters of the kernel):
    a_cpu = np.asarray([[0, 1, 2], [10, 11, 12], [20, 21, 22]], dtype=np.float32)
    b_cpu = np.asarray([[0, 0, 0], [1, 2, 3], [4, 8, 12]], dtype=np.float32)
    
    a_gpu = gpuarray.to_gpu(a_cpu) # LINE OF PROBLEM 2 ----------------------
    b_gpu = gpuarray.to_gpu(b_cpu)
    
    size_Ax = a_cpu.shape[1]
    size_Bx = b_cpu.shape[1]

    size_Ay = a_cpu.shape[0]

    size_Cx = size_Bx  # Cx = Bx because of the matrix product
    size_Cy = size_Ay  # Cy = Ay
    # create an empty gpu array for the result (C = A * B)
    c_gpu = gpuarray.empty((size_Cy, size_Cx), np.float32)

    # get the kernel code from the template
    kernel_code = kernel_code_template
    # compile the kernel code
    mod = compiler.SourceModule(kernel_code)
    
    # get the kernel function from the compiled module
    matrixmul = mod.get_function("MatMult")
    
    # Scalar dimensions: converted to np.int32 and copied to the GPU one by one.
    size_Ax = np.int32(size_Ax)
    size_AxGpu = pycuda.driver.mem_alloc(size_Ax.nbytes)
    pycuda.driver.memcpy_htod(size_AxGpu, size_Ax)

    size_Bx = np.int32(size_Bx)
    size_BxGpu = pycuda.driver.mem_alloc(size_Bx.nbytes)
    pycuda.driver.memcpy_htod(size_BxGpu, size_Bx)

    size_Cx = np.int32(size_Cx)
    size_CxGpu = pycuda.driver.mem_alloc(size_Cx.nbytes)
    pycuda.driver.memcpy_htod(size_CxGpu, size_Cx)

    size_Cy = np.int32(size_Cy)
    size_CyGpu = pycuda.driver.mem_alloc(size_Cy.nbytes)
    pycuda.driver.memcpy_htod(size_CyGpu, size_Cy)

    # call the kernel on the card
    matrixmul(
        # inputs
        a_gpu, b_gpu, 
        # output
        c_gpu, 
        size_AxGpu,size_BxGpu,size_CxGpu,size_CyGpu,
        # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
        block = (MATRIX_SIZE, MATRIX_SIZE, 1),
        )
    
    print(c_gpu.get()) # LINE OF PROBLEM 1 ----------------------
    print(np.allclose(c_gpu.get(), a_cpu.dot(b_cpu)))  # compare with the CPU result

If anyone knows the problem, I would gladly accept help.

[edit]: It is probably linked to the fact that I’m a beginner in Python, but I realized that I need to close and reopen the Python console to make things work again. That doesn’t solve my problem, but maybe it helps to understand it.
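
To narrow things down, I will also try adding an explicit synchronisation right after the kernel call, so that a launch failure is reported on the launch itself rather than on the next memcpy or allocation. A rough sketch of what I mean (assuming Context.synchronize() is the right call for this):

#!python
    import pycuda.autoinit
    import pycuda.driver as drv

    # ... build a_gpu, b_gpu, c_gpu and call matrixmul(...) exactly as above ...

    try:
        # Wait for the kernel to finish here, so that a launch failure shows up
        # at this line instead of at the next cuMemcpyDtoH / cuMemAlloc call.
        drv.Context.synchronize()
    except drv.LaunchError as err:
        print("the kernel launch itself failed:", err)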

Thank you !

Comments are available on your cross-posting:

https://stackoverflow.com/questions/47803640/cumemalloc-failed-unspecified-launch-failure-happens-in-a-code-that-previously