strange behaviour

Hello, guys!

I’d like to ask for your help with a problem I don’t fully understand.

I am running a computer with an Nvidia Quadro FX 570 graphics card, and I’d like to get started with CUDA programming, but it doesn’t work. I’ve installed the following packages:



and the driver for my card: “”.

I searched for a tutorial and found one, but some of its code doesn’t run. The code from the first part of the tutorial works fine, but the code from the second part does not. I modified it to print error messages and more information about what goes wrong:


#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

// reference implementation: increment every element on the CPU
void incrementArrayOnHost(float* a, int N)
{
	int i;
	for(i = 0; i < N; i++)
	{
		a[i] = a[i] + 1.f;
	}
}

// one thread per element; the guard skips threads past the end of the array
__global__ void incrementArrayOnDevice(float* a, int N)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if(idx < N)
	{
		a[idx] = a[idx] + 1.f;
	}
}

int main(int argc, char* argv[])
{
	float* a_h;
	float* b_h;	// pointers to host memory
	float* a_d;	// pointer to device memory
	int i;
	int N = 10;
	size_t size = N * sizeof(float);
	int blockSize = 1;

	// allocate arrays on host
	a_h = (float *)malloc(size);
	b_h = (float *)malloc(size);

	// allocate array on device
	cudaMalloc((void **) &a_d, size);

	// initialization of host data
	for(i = 0; i < N; i++)
	{
		a_h[i] = (float)i;
	}
	// note: b_h has not been written yet, so its values here are indeterminate
	for(i = 0; i < N; i++)
	{
		printf("a_h[%d]: %f\t", i, a_h[i]);
		printf("b_h[%d]: %f\n", i, b_h[i]);
	}

	// copy data from host to device
	cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);

	// do calculation on host
	incrementArrayOnHost(a_h, N);

	// do calculation on device:
	// Part 1 of 2. Compute execution configuration
	int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);

	// Part 2 of 2. Call incrementArrayOnDevice kernel
	incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);
	printf("incrementArrayOnDevice(%s);\n", cudaGetErrorString(cudaGetLastError()));

	// Retrieve result from device and store in b_h
	cudaMemcpy(b_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
	for(i = 0; i < N; i++)
	{
		printf("a_h[%d]: %f\t", i, a_h[i]);
		printf("b_h[%d]: %f\n", i, b_h[i]);
	}

	// check results
	for(i = 0; i < N; i++)
	{
		assert(a_h[i] == b_h[i]);
	}

	// cleanup
	free(a_h);
	free(b_h);
	cudaFree(a_d);
	return 0;
}

And I got the following output:

user@Linux:~/Desktop/CUDA$ ./a.out
a_h[0]: 0.000000	b_h[0]: 0.000000
a_h[1]: 1.000000	b_h[1]: 0.000000
a_h[2]: 2.000000	b_h[2]: 0.000000
a_h[3]: 3.000000	b_h[3]: 0.000000
a_h[4]: 4.000000	b_h[4]: 0.000000
a_h[5]: 5.000000	b_h[5]: 0.000000
a_h[6]: 6.000000	b_h[6]: 0.000000
a_h[7]: 7.000000	b_h[7]: 0.000000
a_h[8]: 8.000000	b_h[8]: 0.000000
a_h[9]: 9.000000	b_h[9]: 0.000000
incrementArrayOnDevice(invalid device function );
a_h[0]: 1.000000	b_h[0]: 0.000000
a_h[1]: 2.000000	b_h[1]: 1.000000
a_h[2]: 3.000000	b_h[2]: 2.000000
a_h[3]: 4.000000	b_h[3]: 3.000000
a_h[4]: 5.000000	b_h[4]: 4.000000
a_h[5]: 6.000000	b_h[5]: 5.000000
a_h[6]: 7.000000	b_h[6]: 6.000000
a_h[7]: 8.000000	b_h[7]: 7.000000
a_h[8]: 9.000000	b_h[8]: 8.000000
a_h[9]: 10.000000	b_h[9]: 9.000000
a.out: int main(int, char**): Assertion `a_h[i] == b_h[i]' failed.
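
The `invalid device function` message above only shows up because of the added `cudaGetLastError()` call; every other runtime call in the program fails silently. A minimal sketch of checking each call in the same way follows. `CUDA_CHECK` and `blocksFor` are hypothetical helper names introduced here for illustration; they are not part of the CUDA toolkit or of the tutorial.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print file, line and the
// runtime's error string, then exit, whenever a runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements
// (ceiling division, same result as the expression in the posted code).
static int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

__global__ void incrementArrayOnDevice(float* a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + 1.f;
}

int main(void)
{
    const int N = 10;
    const int blockSize = 4;
    const size_t size = N * sizeof(float);
    float a_h[N];
    float* a_d = NULL;

    for (int i = 0; i < N; i++)
        a_h[i] = (float)i;

    CUDA_CHECK(cudaMalloc((void**)&a_d, size));
    CUDA_CHECK(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice));

    incrementArrayOnDevice<<<blocksFor(N, blockSize), blockSize>>>(a_d, N);
    CUDA_CHECK(cudaGetLastError());  // reports a failed kernel launch

    // cudaMemcpy blocks until the kernel has finished, so errors raised
    // while the kernel was running are reported here as well.
    CUDA_CHECK(cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost));

    for (int i = 0; i < N; i++)
        printf("a_h[%d]: %f\n", i, a_h[i]);

    CUDA_CHECK(cudaFree(a_d));
    return 0;
}
```

With a check like this, the program would stop at the first failing call instead of continuing to the assertion with stale data in `b_h`.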



There were no compile errors or warnings. Then I tried to compile and run the examples from the SDK, and only some of them ran.

Here are some of the console outputs, so you can get an idea of what has gone wrong:

user@Linux:/opt/NVIDIA_CUDA_SDK/C/bin/linux/release$ ./3dfd
3DFD running on: Quadro FX 570
Total GPU Memory: 255.3125 MB

Unable to allocate 351.5625 Mbytes of GPU memory


user@Linux:/opt/NVIDIA_CUDA_SDK/C/bin/linux/release$ ./dct8x8
CUDA sample DCT/IDCT implementation

Loading test image: barbara.bmp... [512 x 512]... Success
Running Gold 1 (CPU) version... Success
Running Gold 2 (CPU) version... Success
cudaSafeCall() Runtime API error in file <>, line 195 : feature is not yet implemented.
Running CUDA 1 (GPU) version...

user@Linux:/opt/NVIDIA_CUDA_SDK/C/bin/linux/release$ ./Mandelbrot
[ CUDA Mandelbrot & Julia Set ]
Initializing GLUT...
Loading extensions: No error
OpenGL window created.
> Compute SM 1.1 Device Detected
> Device 0: <Quadro FX 570>
Data initialization done.
Starting GLUT main loop...
Press [s] to toggle between GPU and CPU implementations
Press [j] to toggle between Julia and Mandelbrot sets
Press [r] or [R] to decrease or increase red color channel
Press [g] or [G] to decrease or increase green color channel
Press [b] or [B] to decrease or increase blue color channel
Press [e] to reset
Press [a] or [A] to animate colors
Press [c] or [C] to change colors
Press [d] or [D] to increase or decrease the detail
Press [p] to record main parameters to file params.txt
Press [o] to read main parameters from file params.txt
Left mouse button + drag = move (Mandelbrot or Julia) or animate (Julia)
Press [m] to toggle between move and animate (Julia) for left mouse button
Middle mouse button + drag = Zoom
Right mouse button = Menu
Press [?] to print location and scale
Press [q] to exit
Creating GL texture...
Texture created.
Creating PBO...
cudaSafeCall() Runtime API error in file <Mandelbrot.cpp>, line 892 : feature is not yet implemented.
cudaSafeCall() Runtime API error in file <Mandelbrot.cpp>, line 468 : feature is not yet implemented.


And here is the output from deviceQuery:

user@Linux:/opt/NVIDIA_CUDA_SDK/C/bin/linux/release$ ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "Quadro FX 570"
  CUDA Driver Version:                           0.0
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 267714560 bytes
  Number of multiprocessors:                     16
  Number of cores:                               128
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.92 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Press ENTER to exit...


Do you have any idea what I need to do to get CUDA working?

Could it be that my device driver is too old and that I need to install a new one? If so, how do I uninstall the old one?

Could it be that other hardware in my computer is causing the problem?

I’d be very grateful for any helpful answer!


BTW: My OS is:

Linux Linux #1 SMP Tue Sep 25 20:41:25 BST 2007 x86_64 x86_64 x86_64 GNU/Linux

And the distribution is called Slamd64 12.0

That driver is far too old to work. For the CUDA 2.3 toolkit version you are using, you need a 190 series driver (190.53 is the current stable release driver). Until you update the driver, nothing is going to work.
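
A mismatch like this can also be detected from code. The sketch below, which is only an illustration, asks the runtime for both version numbers with `cudaDriverGetVersion()` and `cudaRuntimeGetVersion()` and compares them. CUDA encodes a version as 1000 * major + 10 * minor (2030 for CUDA 2.3); `versionMajor` and `versionMinor` are hypothetical helper names. The assumption that a driver version of 0 means "missing or too old" is based on the `0.0` shown by deviceQuery in this thread.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helpers: decode CUDA's 1000 * major + 10 * minor encoding.
static int versionMajor(int v) { return v / 1000; }
static int versionMinor(int v) { return (v % 1000) / 10; }

int main(void)
{
    int driverVersion = 0;
    int runtimeVersion = 0;

    // Latest CUDA version supported by the installed driver.
    cudaError_t err = cudaDriverGetVersion(&driverVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Version of the CUDA runtime the program was built against.
    err = cudaRuntimeGetVersion(&runtimeVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           versionMajor(driverVersion), versionMinor(driverVersion),
           versionMajor(runtimeVersion), versionMinor(runtimeVersion));

    // Assumption: a driver version of 0, or one below the runtime version,
    // means the driver must be updated before kernels can run.
    if (driverVersion < runtimeVersion) {
        printf("The driver is older than the runtime; update the driver.\n");
        return 1;
    }
    return 0;
}
```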

Thanks for your very quick reply. But I have a (stupid) question: how do I do the update? Will the new driver recognize the old one and perform a “clean” installation, so that I don’t end up with two drivers installed at the same time, or do I have to uninstall the old driver first?


The driver installer will detect older driver installations and ask whether you want to remove them. I have never had a problem updating drivers using the NVIDIA driver installation script.

I don’t know for sure whether this has anything to do with the issues you’ve encountered, but if it is possible for you, I’d strongly suggest upgrading your Linux installation too. Slamd64 12.0 is a very, very old distribution, and now that an officially supported 64-bit Slackware branch is available, I see no reason to use anything else. Furthermore, various CUDA releases (always the ones built for the latest RHEL version) have worked perfectly for me on Slackware for years; at the moment I’m using the CUDA 3.0 beta (you could find the packages here) with Slackware64 13.0, and it works like a charm.

Thanks to both of you for your answers! Installing the new driver was no problem: it detected the old one and removed it. That fixed the problem with CUDA and gave me about 20% extra bandwidth :)

@cgorac: I don’t yet see the point in upgrading a well running system. “Never touch a running system” ;)


Makes sense; that kind of decision is of course a matter of personal taste… I mentioned it just as an alternative to try, and only because Slackware has a rock-solid upgrade mechanism (I’ve been tracking -current for years without any kind of issue), and there are some goodies to pick up along the way: security fixes first of all, but also upgrades to the development tools and occasional additions to the toolchain. For example, CMake was recently upgraded in -current, and the CMake CUDA module is now included.