Hello,
I have the following PyCUDA code, which doesn’t work. Please ignore the number of blocks I use: this is just test code, and for now I only want it to compile. I know that with this number of blocks my matrices will not be multiplied correctly.
The very strange thing is that the first time I run it, I get an error on the line just before the end:
print(c_gpu.get())
The error is: LaunchError: cuMemcpyDtoH failed: unspecified launch failure
But when I run it a second time without changing anything, the error is on this line:
a_gpu = gpuarray.to_gpu(a_cpu)
And the error is:
LaunchError: cuMemAlloc failed: unspecified launch failure
Also, I downloaded some example scripts from the PyCUDA documentation. They work when I run them first, but if I run this program and then an example script, the example also fails with:
LaunchError: cuMemAlloc failed: unspecified launch failure
as soon as it tries to allocate memory on the GPU.
Do you know what causes this problem? My guess is that I request too much memory the first time, so CUDA refuses every allocation after that, but I don’t see where I made a mistake.
What’s more, my CUDA function works perfectly in C++. The problem only started when I tried to “translate” everything to PyCUDA.
#!python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import math
# -- initialize the device
import pycuda.autoinit
kernel_code_template = """
__global__ void MatMult(float* C, float* A, float* B, int dimAx, int dimBx, int dimCx, int dimCy)
{
    int row = blockDim.y*blockIdx.y+threadIdx.y;
    int col = blockDim.x*blockIdx.x+threadIdx.x;
    double Result = 0;
    if (row<=dimCy-1 && col<=dimCx-1)
    {
        for (int k = 0; k < dimAx; k++)
        {
            Result += A[k + dimAx*row] * B[col + dimBx*k];
        }
        C[col + row*dimCx] = Result;
    }
}
"""
MATRIX_SIZE=3
# I create my variables :
a_cpu=np.asarray([[0,1,2],[10,11,12],[20,21,22]])
b_cpu=np.asarray([[0,0,0],[1,2,3],[4,8,12]])
a_gpu = gpuarray.to_gpu(a_cpu) # LINE OF PROBLEM 2 ----------------------
b_gpu = gpuarray.to_gpu(b_cpu)
size_Ax=a_cpu.shape[1]
size_Bx=b_cpu.shape[1]
size_Ay=a_cpu.shape[0]
size_Cx=size_Bx # Cx=Bx because of matrix product
size_Cy=size_Ay # Cy=Ay
# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((size_Cy, size_Cx), np.float32)
# get the kernel code from the template
kernel_code=kernel_code_template
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
matrixmul = mod.get_function("MatMult")
size_Ax=np.int32(size_Ax)
size_Ax=size_Ax.astype(np.int32)
size_AxGpu = pycuda.driver.mem_alloc(size_Ax.nbytes)
pycuda.driver.memcpy_htod(size_AxGpu, size_Ax)
size_Bx=np.int32(size_Bx)
size_Bx=size_Bx.astype(np.int32)
size_BxGpu = pycuda.driver.mem_alloc(size_Bx.nbytes)
pycuda.driver.memcpy_htod(size_BxGpu, size_Bx)
size_Cx=np.int32(size_Cx)
size_Cx=size_Cx.astype(np.int32)
size_CxGpu = pycuda.driver.mem_alloc(size_Cx.nbytes)
pycuda.driver.memcpy_htod(size_CxGpu, size_Cx)
size_Cy=np.int32(size_Cy)
size_Cy=size_Cy.astype(np.int32)
size_CyGpu = pycuda.driver.mem_alloc(size_Cy.nbytes)
pycuda.driver.memcpy_htod(size_CyGpu, size_Cy)
# call the kernel on the card
matrixmul(
    # inputs
    a_gpu, b_gpu,
    # output
    c_gpu,
    size_AxGpu,size_BxGpu,size_CxGpu,size_CyGpu,
    # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
    block = (MATRIX_SIZE, MATRIX_SIZE, 1),
    )
print(c_gpu.get()) # LINE OF PROBLEM 1 ----------------------
# np.allclose needs two arrays: compare the GPU result with the CPU product
np.allclose(c_gpu.get(), np.dot(a_cpu, b_cpu))
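For checking the GPU output once the launch succeeds, a CPU reference that mirrors the kernel's flat, row-major indexing may be useful. The following is only a sketch in pure Python; the function name `mat_mult_reference` is hypothetical and not part of the script above, and the matrices are flattened by hand.

```python
def mat_mult_reference(a_flat, b_flat, dim_ax, dim_bx, dim_cx, dim_cy):
    """CPU version of the MatMult kernel loop, using the same flat indexing."""
    c_flat = [0.0] * (dim_cx * dim_cy)
    for row in range(dim_cy):          # plays the role of the thread's row index
        for col in range(dim_cx):      # plays the role of the thread's col index
            result = 0.0
            for k in range(dim_ax):
                result += a_flat[k + dim_ax * row] * b_flat[col + dim_bx * k]
            c_flat[col + row * dim_cx] = result
    return c_flat


# The two 3x3 test matrices from the script, flattened row by row.
a_flat = [0, 1, 2, 10, 11, 12, 20, 21, 22]
b_flat = [0, 0, 0, 1, 2, 3, 4, 8, 12]

expected = mat_mult_reference(a_flat, b_flat, 3, 3, 3, 3)
print(expected)
# [9.0, 18.0, 27.0, 59.0, 118.0, 177.0, 109.0, 218.0, 327.0]
```

If the values printed by `c_gpu.get()` differ from this list, the difference points at the launch arguments or the data types rather than at the kernel arithmetic.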
If anyone knows what the problem is, I would be grateful for help.
[edit]: This is probably because I’m a beginner in Python, but I realized that I have to close and reopen the Python console to make things work again. That doesn’t solve my problem, but it may help to understand it.
Thank you!
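A possible explanation, offered as an assumption rather than a confirmed diagnosis: the kernel declares its four sizes as plain `int` and expects its pointers in the order `C, A, B`, but the launch passes device allocations for the sizes and the arrays in the order `A, B, C`. In addition, `np.asarray` on Python integers produces an integer array, not `float32`. A launch failure generally leaves the CUDA context unusable for the rest of the process, which would be consistent with every later allocation failing until the console is restarted. The sketch below shows one way the arguments could be prepared; `prepare_kernel_args` and `launch` are hypothetical helper names, and `launch` assumes `pycuda.autoinit` has already been imported and that `matrixmul` comes from `mod.get_function("MatMult")`.

```python
import numpy as np


def prepare_kernel_args(a_cpu, b_cpu):
    """Host-side preparation only; this part needs no GPU."""
    # The kernel reads float*, so convert to contiguous float32 explicitly.
    a = np.ascontiguousarray(a_cpu, dtype=np.float32)
    b = np.ascontiguousarray(b_cpu, dtype=np.float32)
    if a.shape[1] != b.shape[0]:
        raise ValueError("inner dimensions must match")
    # int parameters are passed by value as sized NumPy scalars,
    # not as pointers to device memory.
    dim_ax = np.int32(a.shape[1])
    dim_bx = np.int32(b.shape[1])
    dim_cx = np.int32(b.shape[1])   # Cx = Bx
    dim_cy = np.int32(a.shape[0])   # Cy = Ay
    return a, b, (dim_ax, dim_bx, dim_cx, dim_cy)


def launch(matrixmul, a_cpu, b_cpu, block):
    """Hypothetical wrapper around the compiled MatMult function."""
    # Imported lazily so that prepare_kernel_args can be used without a GPU.
    from pycuda import gpuarray

    a, b, dims = prepare_kernel_args(a_cpu, b_cpu)
    a_gpu = gpuarray.to_gpu(a)
    b_gpu = gpuarray.to_gpu(b)
    c_gpu = gpuarray.empty((int(dims[3]), int(dims[2])), np.float32)
    # Argument order follows the kernel signature: C first, then A and B.
    matrixmul(c_gpu, a_gpu, b_gpu, *dims, block=block)
    return c_gpu.get()
```

With the variables from the script, a call could look like `launch(matrixmul, a_cpu, b_cpu, (MATRIX_SIZE, MATRIX_SIZE, 1))`, run in a fresh Python process so that no earlier failed launch is still affecting the context.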