"cuMemAlloc failed: unspecified launch failure" in PyCUDA code that previously worked


I have the following PyCUDA code, which doesn’t work. Don’t pay attention to the number of blocks I use: it is just test code and I only want it to compile for now. I know that with this number of blocks my matrices will not be multiplied correctly.

The very strange thing about this code is that the first time I run it, I get an error on the line just before the end:

print(c_gpu.get())

The error is: LaunchError: cuMemcpyDtoH failed: unspecified launch failure

But when I run it a second time without changing anything, the error is on this line:

a_gpu = gpuarray.to_gpu(a_cpu)

And the error is:

LaunchError: cuMemAlloc failed: unspecified launch failure

Also, I downloaded some example scripts from the PyCUDA documentation. They work when I run them first, but if I run this program and then one of the examples, the example fails too:

LaunchError: cuMemAlloc failed: unspecified launch failure

It happens as soon as the example tries to allocate memory on the GPU.

Do you know what causes this problem? It feels as if my first run requests too much memory, so that CUDA refuses every allocation after it, but I don’t see where I made a mistake.

What’s more, my CUDA function works perfectly in C++ (I tried to “translate” everything to PyCUDA, and that is where the problems started).

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import numpy as np
    from pycuda import driver, compiler, gpuarray, tools
    import math
    # -- initialize the device
    import pycuda.autoinit
    kernel_code_template = """
    __global__ void MatMult(float* C, float* A, float* B, int dimAx, int dimBx, int dimCx, int dimCy)
    {
        int row = blockDim.y*blockIdx.y + threadIdx.y;
        int col = blockDim.x*blockIdx.x + threadIdx.x;
        double Result = 0;
        if (row <= dimCy-1 && col <= dimCx-1)
        {
            for (int k = 0; k < dimAx; k++)
            {
                Result += A[k + dimAx*row] * B[col + dimBx*k];
            }
            C[col + row*dimCx] = Result;
        }
    }
    """
    # I create my variables :
    a_gpu = gpuarray.to_gpu(a_cpu) # LINE OF PROBLEM 2 ----------------------
    b_gpu = gpuarray.to_gpu(b_cpu)
    size_Cx=size_Bx # Cx=Bx because of matrix product
    size_Cy=size_Ay # Cy=Ay
    # create empty gpu array for the result (C = A * B)
    c_gpu = gpuarray.empty((size_Cy, size_Cx), np.float32)
    # get the kernel code from the template
    kernel_code = kernel_code_template
    # compile the kernel code
    mod = compiler.SourceModule(kernel_code)
    # get the kernel function from the compiled module
    matrixmul = mod.get_function("MatMult")
    size_AxGpu = pycuda.driver.mem_alloc(size_Ax.nbytes)
    pycuda.driver.memcpy_htod(size_AxGpu, size_Ax)
    size_BxGpu = pycuda.driver.mem_alloc(size_Bx.nbytes)
    pycuda.driver.memcpy_htod(size_BxGpu, size_Bx)
    size_CxGpu = pycuda.driver.mem_alloc(size_Cx.nbytes)
    pycuda.driver.memcpy_htod(size_CxGpu, size_Cx)
    size_CyGpu = pycuda.driver.mem_alloc(size_Cy.nbytes)
    pycuda.driver.memcpy_htod(size_CyGpu, size_Cy)

    # call the kernel on the card
    matrixmul(
        # inputs
        a_gpu, b_gpu,
        # output
        c_gpu,
        size_AxGpu, size_BxGpu, size_CxGpu, size_CyGpu,
        # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
        block = (MATRIX_SIZE, MATRIX_SIZE, 1),
        )
    print(c_gpu.get()) # LINE OF PROBLEM 1 ----------------------
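
For reference, here is a CPU-only sketch of what I expect the kernel to compute. It uses the same flat row-major indexing as `MatMult` and needs only NumPy, so it runs without a GPU. The function name `matmult_reference` and the small 2x3 and 3x2 test matrices are hypothetical, chosen for illustration; they are not the matrices from my script.

```python
import numpy as np


def matmult_reference(a_flat, b_flat, dim_ax, dim_bx, dim_cx, dim_cy):
    """CPU version of the MatMult kernel body, with the same flat indexing."""
    c_flat = np.zeros(dim_cx * dim_cy, dtype=np.float32)
    for row in range(dim_cy):        # plays the role of the thread's row index
        for col in range(dim_cx):    # plays the role of the thread's column index
            result = 0.0
            for k in range(dim_ax):
                result += a_flat[k + dim_ax * row] * b_flat[col + dim_bx * k]
            c_flat[col + row * dim_cx] = result
    return c_flat


# Hypothetical test matrices (assumption: row-major float32, as on the GPU side).
a_cpu = np.arange(6, dtype=np.float32).reshape(2, 3)
b_cpu = np.arange(6, dtype=np.float32).reshape(3, 2)
dim_ay, dim_ax = a_cpu.shape
dim_by, dim_bx = b_cpu.shape

# C has as many columns as B and as many rows as A.
c_ref = matmult_reference(a_cpu.ravel(), b_cpu.ravel(), dim_ax, dim_bx, dim_bx, dim_ay)
c_ref = c_ref.reshape(dim_ay, dim_bx)

# Compare against NumPy's own matrix product.
print(np.allclose(c_ref, a_cpu @ b_cpu))
```

Since this agrees with `a_cpu @ b_cpu`, the index arithmetic inside the kernel looks right to me, which makes me suspect the problem is on the host side of the PyCUDA script rather than in the kernel body.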

If anyone knows the problem, I would gladly accept help.

[edit]: This is probably related to my being a beginner in Python, but I realized that I have to close and reopen the Python console before anything works again. That doesn’t solve my problem, but it may help to understand it.

Thank you !
