Hi,
I am developing a C application that uses cublas to accelerate some matrix multiplications. Unfortunately I am running into some strange behavior when I link my application to .cu code.
Problem:
The application (using cuBLAS) runs without problems as long as the .cu code is not linked in. As soon as the .cu object is linked, the application fails at cudaMalloc and/or cublasCreate.
My C application has the following structure; I have only included the code I think is relevant.
Files:
* main.c
* matrixmul.c
* other c files
* kernels.cu
matrixmul.c has functions using cuBLAS routines (sgemm, saxpy), as well as routines that allocate device memory with cudaMalloc.
#include <stdint.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
//I use a global cublas handle
cublasHandle_t handle;
//This handle is initialized with a function
uint8_t handleinit(void)
{
    cublasStatus_t cublase = cublasCreate(&handle);
    if (cublase != CUBLAS_STATUS_SUCCESS) return 0;
    return 1;
}
This works without problems and it’s compiled with:
gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -c matrixmul.c -o matrixmul.o
gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -c <other c files> -o <other o files>
ar rcs mylib.a matrixmul.o <other o files>
gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -Llibs/cuda/usr/local/cuda-11.5/lib64 -o myapp main.c mylib.a -lcuda -lcudart -lcublas -lm -lpthread -lz -lstdc++
As previously mentioned, this works fine. The problem arises when I link the .cu code into my application.
kernels.cu
#include <stdint.h>
__global__ void mse(float *A, float *B, uint64_t N)
{
    uint64_t i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        atomicAdd(&B[N], fdividef(powf(A[i] - B[i], 2), N));
}

extern "C" void kernel(float *A, float *B, uint64_t N)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    mse<<<blocksPerGrid, threadsPerBlock>>>(A, B, N);
    cudaDeviceSynchronize();
}
The .cu code above is called from matrixmul.c, where it is declared as:
matrixmul.c
extern void kernel(float *A, float *B, uint64_t N);

void mse_compute(float *a, float *b, uint64_t n)
{
    kernel(a, b, n);
}
The .cu code is compiled with:
nvcc -c kernels.cu -o kernels.o
This is then linked to the rest of my application with:
gcc -Wall -Wextra -pedantic -std=c11 -g -I. -Ilibs/cuda/usr/local/cuda-11.5/include -Llibs/cuda/usr/local/cuda-11.5/lib64 -o myapp main.c kernels.o mylib.a -lcuda -lcudart -lcublas -lm -lpthread -lz -lstdc++
When I run my application that now includes the .cu code, it fails either at creating the cuBLAS handle, or my cudaMalloc calls fail with error code 222, which is cudaErrorUnsupportedPtxVersion. I would greatly appreciate some help with this.
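For diagnosis, this is a minimal check I can run (a sketch using the CUDA runtime API, no kernel launch involved) that compares the CUDA version the binary was built against with the version the installed driver supports, since error 222 usually indicates that embedded PTX is newer than what the driver can JIT-compile:

```c
/* Sketch: compare the CUDA runtime version the app was built against
 * with the version the installed driver supports. A runtime newer than
 * the driver (e.g. 11.5 toolkit vs 11.4 driver) can cause
 * cudaErrorUnsupportedPtxVersion when PTX must be JIT-compiled. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int runtimeVer = 0, driverVer = 0;
    cudaRuntimeGetVersion(&runtimeVer); /* encoded as 1000*major + 10*minor, e.g. 11050 */
    cudaDriverGetVersion(&driverVer);   /* e.g. 11040 for a CUDA 11.4 driver */
    printf("runtime: %d, driver: %d\n", runtimeVer, driverVer);
    if (runtimeVer > driverVer)
        printf("runtime is newer than the driver supports\n");
    return 0;
}
```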
Thanks in advance!!!
Forgot to add:
Ubuntu 20.04
Driver Version: 470.86 CUDA Version: 11.4
GTX 1050Ti