Problem initializing CUBLAS on GTX 480 Huge delay calling cublasInit() on GTX 480

We’ve just bought a GTX 480 and are having trouble using CUBLAS with it. When we run the simpleCUBLAS SDK program there is a delay of over 90 seconds when calling cublasInit(). This has only affected us since we started using the 480.

System information:

OS: Windows XP SP3
Driver version: 197.75
CUDA toolkit version: 3.0
CUDA SDK version: 3.0
Compiler for host code: Visual Studio 2005
CPU: Intel Xeon X3350@2.66GHz
4GBytes RAM

I’m guessing it’s caused by some kind of hardware/software mismatch. Any help would be appreciated.

Phil McLauchlan

Chief Scientist
Mirriad Ltd

PS I get the same delay if I just use the lines

cudaSetDevice(0);
cudaThreadSynchronize();

at the start of main(). So the problem is not specific to CUBLAS.

I’ve just downloaded and tried the Cuda 3.1 beta. Still get the huge delay starting up Cuda. Now I’m on

Driver 257.15
Cuda 3.1
Cuda SDK 3.1

I’ve tried adding cuInit(0) at the start, no effect. I’m not sure what else to try. Obviously there’s something wrong because I see the problem in one of the SDK examples (simpleCUBLAS).

I came across the same problem on my linux box, and it’s normal :

my first try with a GTX480 was on my code compiled for an sm_13 arch, so the runtime driver transform the ptx code ta a sm_20 version and this operation can take forever for huge code. As soon i compile for sm_20 arch the delay disappear.

Alexish, this looks like the correct thing. Makes sense to me. Only problem is I already have defined sm_20…

Yes, but does CUBLAS actually contain an sm_20 build set of functions? (keeping in mind that CUBLAS is supplied as a binary only library). So what you do in your own code might make no difference… Just a guess.

I found the problem! When I first built the code I was using sm_11 and set the CUDA_FORCE_PTX_JIT environment variable to make sure I would be running sm_20 cuda binaries (this forces the compiled binaries to be translated to the actual harware architecture). When I switched to sm_20 I forgot to switch the environment variable off. I’m still surprised that the forcing option had any effect, because with sm_20 set the compiled binary should match the GTX480 hardware. Maybe someone from nVidia could explain this? Anyway I removed the environment variable and now the delay is gone.

So thanks a lot to Alexish for pointing me in the right direction.

CUDA_FORCE_PTX_JIT instructs the driver to ignore binary code contained in executables and instead forces it to always JIT directly from PTX. In other words, it overrides normal behavior where the driver tries to locate matching binary code first, and only JITs from PTX when no matching binary code is found.

We have this same problem with CUBLAS on Ocelot. We do lazy translation (code generation from PTX to something else only when you try to execute a specific kernel), but we still have to parse all of the PTX of every program that is included to figure out which kernels are in which libraries. If you load a library like CUBLAS with a huge number of functions, it can take 10s of seconds to load and parse everything even if you only end up calling a single kernel. This takes 50s on my machine, but it is in the ballpark of the 90s that you are seeing.

We could actually cut this down significantly (by 50x or more) if nvidia would embed symbol tables (concise lists of kernels, functions, variables, and their line numbers) in the fat-binaries that are embedded in CUDA programs. There are some other tricky optimizations that we could do to cut this down, but I am reluctant to try them because a symbol table would be such a simpler solution.

Given the GPU side of the toolchain has moved to using elf containers, I am amazed they don’t already write symbol tables into cubin files…

They may actually exist for native binaries (not PTX), but PTX is more like an assembly language (there is no binary format) and you can get the same information by parsing it. Kernels are grouped together into modules (similar to object binary files) that contain kernels from the same compilation unit (the same .cu file). Right now, I don’t think that there is a reliable way to determine which kernels are in which module without parsing the entire module. There may be ELF entries associated with each PTX module, but we ignore them in ocelot because they don’t contain the actual PTX code (and because we would have to reverse engineer the format that nvidia uses to embed them in cuda programs). Based on these performance results, it looks like the driver is also ignoring them.

Even if the ELF entries contained symbol tables that name the kernels and symbols in each PTX module, it would still be necessary to parse an entire module at a time, which would be a lot better in cases where libraries contained many modules, but it would still be less than ideal for cases were a single module contained many kernels.

Alright so I went back and checked to make sure that we hadn’t missed a symbol table.

A fat binary contains the following fields:

typedef struct __cudaFatCudaBinaryRec {

	unsigned long			magic;

	unsigned long			version;

	unsigned long			gpuInfoVersion;

	char*				   key;

	char*				   ident;

	char*				   usageMode;

	__cudaFatPtxEntry			 *ptx;

	__cudaFatCubinEntry		   *cubin;

	__cudaFatDebugEntry		   *debug;

	void*				  debugInfo;

	unsigned int				   flags;

	__cudaFatSymbol			   *exported;

	__cudaFatSymbol			   *imported;

	struct __cudaFatCudaBinaryRec *dependends;

	unsigned int				   characteristic;

	__cudaFatElfEntry			 *elf;

} __cudaFatCudaBinary;

binary->ptx is the actual PTX source. It looks like binary->exported and binary->imported would be symbol tables. However, they are always NULL pointers at runtime.

The ELF section is a little bit more informative. I wrote a simple program to extract all of the elf sections from a cuda binary as it executes. For this simple module:

#include <stdio.h>

extern "C" __global__ void sequence(int *A, int N) {

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	if (i < N) {

		A[i] = 2*i;

	}

}

extern "C" __global__ void testShr(int *A, const int *B) {

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	int b;

	__shared__ int storage[256];

	

	storage[threadIdx.x] = B[i];

	__syncthreads();

	if (i & 1) {

		b = storage[threadIdx.x ^ 1] * 2;

	}

	else {

		b = storage[threadIdx.x ^ 1] * 3;

	}

	A[i] = b;

}

int main(int argc, char *arg[]) {

	const int N = 1024;

	int *A_host, *A_gpu =0;

	int errors = 0;

	size_t bytes = sizeof(int)*N;

	if (cudaMalloc((void **)&A_gpu, bytes) != cudaSuccess) {

		printf("cudaMalloc() - failed to allocate %d bytes on device\n", (int)bytes);

		return -1;

	}

	A_host = (int *)malloc(bytes);

	for (int i = 0; i < N; i++) {

		A_host[i] = -1;

	}

	

	cudaMemcpy(A_gpu, A_host, bytes, cudaMemcpyHostToDevice);

	dim3 grid((N+31)/32,1);

	dim3 block(32, 1);

	sequence<<< grid, block >>>(A_gpu, N);

	cudaMemcpy(A_host, A_gpu, bytes, cudaMemcpyDeviceToHost);

	for (int i = 0; i < N; i++) {

		if (A_host[i] != 2*i) {

			++errors;

		}

	}

	int *B_gpu = 0;

	if (cudaMalloc((void **)&B_gpu, bytes) != cudaSuccess) {

		printf("cudaMalloc() - failed to allocate %d bytes on device\n", (int)bytes);

		cudaFree(A_gpu);

		free(A_host);

		return -1;

	}

	

	sequence<<< grid, block >>>(A_gpu, N);

	testShr<<< grid, block >>>(B_gpu, A_gpu);

	

	if (cudaMemcpy(A_host, B_gpu, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {

		printf("cudaMemcpy(A, B) - failed to copy %d bytes from device to host\n", (int)bytes);

		cudaFree(A_gpu);

		cudaFree(B_gpu);

		free(A_host);

	}

	

	for (int i = 0; (errors < 5) && i < N; ++i) {

		int b;

		if (i & 1) {

			b = (i ^ 1) * 2 * 2;

		}

		else {

			b = (i ^ 1) * 2 * 3;

		}

		int got = A_host[i];

		if (b != got) {

			printf("ERROR 1 [%d] - expected: %d, got: %d\n", i, b, got);

			++errors;

		}

	}

	cudaFree(B_gpu);

	cudaFree(A_gpu);

	free(A_host);

	if (errors) {

		printf("Pass/Fail : Fail\n");

	}

	else {

		printf("Pass/Fail : Pass\n");

	}

	return 0;

}

I get two ELF binaries, the second of which is blank. When I look at the symbol tables using objdump I get the following:

SYMBOL TABLE:

00000000 l	d  *ABS*	00000000 .shstrtab

00000000 l	d  *ABS*	00000000 .strtab

00000000 l	d  *ABS*	00000000 .symtab

00000000 l	d  *UND*	00000000 

00000000 l	d  *UND*	00000000 

00000000 l	d  .text.testShr	000000a0 .text.testShr

00000000 l	d  .nv.info.testShr	00000000 .nv.info.testShr

00000000 l	d  .nv.info	00000000 .nv.info

00000000 l	d  .text.sequence	00000050 .text.sequence

00000000 l	d  .nv.info.sequence	00000000 .nv.info.sequence

00000000 l	d  .nv.shared.testShr	00000000 .nv.shared.testShr

00000000 l	d  .nv.constant16.testShr	00000000 .nv.constant16.testShr

00000000 l	d  .nv.constant0.testShr	00000000 .nv.constant0.testShr

00000000 l	d  .nv.constant0.sequence	00000000 .nv.constant0.sequence

00000000 g	 F .text.testShr	000000a0 0x10 testShr

00000000 g	 F .text.sequence	00000050 0x10 sequence

The two kernel names (testShr and sequence) are listed prominently as global symbols. So this could definitely be a way to get this information. This may appear in a future release of ocelot… Thanks avidday for the suggestion.

One final update on this. It turns out that subsequent calls to cudaRegisterFunction create a mapping between kernel symbols and the module that they are contained in. Using this, I updated Ocelot to only load a module when either a kernel, global, or texture is accessed by a cuda function. This reduced the execution time for the SimpleCUBLAS example from the CUDA SDK from ~50s to ~.05s using full PTX translation for every kernel.

For anyone at nvidia, I would recommend including something similar in a future release… If anyone is interested, here are the changes necessary to enable it:
http://code.google.com/p/gpuocelot/source/detail?r=592