cudaMalloc fails (also cublas) when linking .cu object file

jregalado · January 6, 2022, 4:05pm

Hi,

I am developing a C application that uses cublas to accelerate some matrix multiplications. Unfortunately I am running into some strange behavior when I link my application to .cu code.

Problem:
Application (using cublas) runs without problems when not linking to the cu code. When linking cu code, application fails at cudaMalloc and/or cublasCreate.

My C application has the following structure. I have included only the code I think is relevant.

Files:
* main.c
* matrixmul.c
* other c files
* kernels.cu

matrixmul.c has functions that use cublas routines (sgemm, saxpy), as well as routines that allocate device memory with cudaMalloc.

#include <stdint.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// I use a global cublas handle
cublasHandle_t handle;

// This handle is initialized with a function (returns 1 on success, 0 on failure)
uint8_t handleinit(void)
{
    cublasStatus_t cublase = cublasCreate(&handle);
    if (cublase != CUBLAS_STATUS_SUCCESS) return 0;
    return 1;
}

This works without problems and it’s compiled with:

gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -c matrixmul.c -o matrixmul.o
gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -c <other c files> -o <other o files>
ar rcs mylib.a matrixmul.o <other o files>
gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -Llibs/cuda/usr/local/cuda-11.5/lib64 -o myapp main.c mylib.a -lcuda -lcudart -lcublas -lm -lpthread -lz  -lstdc++ -lpthread

As previously mentioned, this works fine. The problem arises when I link my application to the .cu code.

kernels.cu

#include <stdint.h>

__global__ void mse(float *A, float *B, uint64_t N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        atomicAdd(&B[N], fdividef(powf(A[i]-B[i], 2), N));
}

extern "C" void kernel(float *A, float *B, uint64_t N)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    // Launch configuration order is <<<grid size, block size>>>
    mse<<<blocksPerGrid, threadsPerBlock>>>(A, B, N);
    cudaDeviceSynchronize();
}

The previous .cu code is called from matrixmul.c, where it is declared as:

matrixmul.c

extern void kernel(float *A, float *B, uint64_t N);
 
void mse_compute(float *a, float *b, uint64_t n)
{
    kernel(a,b,n);
}

The .cu code is compiled with:
nvcc -c kernels.cu -o kernels.o

This is then linked to the rest of my application with:

gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -Llibs/cuda/usr/local/cuda-11.5/lib64 -o myapp main.c kernels.o mylib.a -lcuda -lcudart -lcublas -lm -lpthread -lz  -lstdc++ -lpthread

When I run my application that now includes the .cu code, it either fails to create the cublas handle, or my cudaMalloc calls fail with error code 222, which is cudaErrorUnsupportedPtxVersion. I would greatly appreciate some help with this.

Thanks in advance!!!

Forgot to add:
Ubuntu 20.04
Driver Version: 470.86 CUDA Version: 11.4
GTX 1050Ti

jregalado · January 6, 2022, 5:03pm

So I completely and totally brain farted with this one. You know how after the holiday break it takes some time to get your wits together? Well, I am just barely getting my mental abilities back.

After a system update, my drivers regressed to 470.86, which supports CUDA 11.4. I had been developing with the 495.46 drivers, which support CUDA 11.5. I do not have a system install of the CUDA toolkit; I manage the deb package manually. Hence the cudaErrorUnsupportedPtxVersion: kernels.o was built with the 11.5 toolkit, and the 11.4 driver cannot JIT-compile its PTX. After reinstalling the 495.46 drivers, everything is working now.
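
For anyone hitting the same error, a startup check along these lines might catch the mismatch early. This is only a sketch: the helper names (cuda_major, cuda_minor, driver_covers_runtime) are made up for illustration and are not part of any CUDA library. It assumes the two integers come from cudaDriverGetVersion() and cudaRuntimeGetVersion(), which report versions encoded as 1000 * major + 10 * minor (for example 11040 for CUDA 11.4). The comparison is deliberately conservative, since an older driver of the same major version can still run code that needs no PTX JIT compilation.

```c
#include <stdio.h>

/* CUDA encodes versions as 1000 * major + 10 * minor, e.g. 11050 for 11.5. */
int cuda_major(int version) { return version / 1000; }
int cuda_minor(int version) { return (version % 1000) / 10; }

/*
 * Hypothetical helper: returns 1 when the CUDA version supported by the
 * installed driver is at least the version of the runtime the application
 * was built against, 0 otherwise. Prints a warning on mismatch.
 *
 * Assumed usage with the real runtime API (needs cuda_runtime.h):
 *     int drv = 0, rt = 0;
 *     cudaDriverGetVersion(&drv);
 *     cudaRuntimeGetVersion(&rt);
 *     if (!driver_covers_runtime(drv, rt)) { ... bail out early ... }
 */
int driver_covers_runtime(int driver_version, int runtime_version)
{
    if (driver_version >= runtime_version)
        return 1;

    fprintf(stderr,
            "Driver supports CUDA %d.%d but the runtime is CUDA %d.%d; "
            "PTX from the newer toolkit may fail to load.\n",
            cuda_major(driver_version), cuda_minor(driver_version),
            cuda_major(runtime_version), cuda_minor(runtime_version));
    return 0;
}
```

With the versions from this thread, driver_covers_runtime(11040, 11050) returns 0 for the 470.86 driver, and driver_covers_runtime(11050, 11050) returns 1 once the 495.46 driver is back.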
