CUDA on iMac with NVIDIA GeForce 9400: Successful and Failed Tests

The following is the Terminal output for the few CUDA SDK sample tests I ran that failed in one fashion or another. Most of the failures are outright Runtime API errors, although the first is simply an inability to allocate 351.5625 MB of GPU memory.

Most of the CUDA tests run without a hitch; only the tests below fail in any way at all.

I was wondering how common this type of failure is on an NVIDIA GeForce 9400 with 253.6875 MB of GPU memory.

Any known explanation?

[codebox]cyberos-imac:release cybero$ ./3dfd

3DFD running on: GeForce 9400

Total GPU Memory: 253.6875 MB

480x480x400

Unable to allocate 351.5625 Mbytes of GPU memory

TEST PASSED!

cyberos-imac:release cybero$ ./BlackScholes

Initializing data…

…allocating CPU memory for options.

…allocating GPU memory for options.

cudaSafeCall() Runtime API error in file <BlackScholes.cu>, line 134 : out of memory.

cyberos-imac:release cybero$ ./MersenneTwister

Initializing data for 24000000 samples…

cudaSafeCall() Runtime API error in file <MersenneTwister.cu>, line 110 : out of memory.

cyberos-imac:release cybero$ ./asyncAPI

cudaSafeCall() Runtime API error in file <asyncAPI.cu>, line 60 : out of memory.

cyberos-imac:release cybero$ ./bandwidthTest

Running on…

  device 0:GeForce 9400

Quick Mode

Host to Device Bandwidth for Pageable memory

.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 352.5

Quick Mode

Device to Host Bandwidth for Pageable memory

.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 1483.5

Quick Mode

Device to Device Bandwidth

cudaSafeCall() Runtime API error in file <bandwidthTest.cu>, line 760 : out of memory.

cyberos-imac:release cybero$ ./fastWalshTransform

Initializing data…

…allocating CPU memory

…allocating GPU memory

cudaSafeCall() Runtime API error in file <fastWalshTransform.cu>, line 111 : out of memory.

cyberos-imac:release cybero$ ./histogram

Initializing data…

…allocating CPU memory.

…generating input data

…allocating GPU memory and copying input data

cudaSafeCall() Runtime API error in file <main.cpp>, line 64 : out of memory.

cyberos-imac:release cybero$ ./simpleMultiGPU

CUDA-capable device count: 1

main(): generating input data…

main(): waiting for GPU results…

cudaSafeCall() Runtime API error in file <simpleMultiGPU.cpp>, line 57 : out of memory.

cyberos-imac:release cybero$ ./simpleStreams

[ simpleStreams ]

Device name : GeForce 9400

CUDA Capable SM 1.1 hardware with 2 multi-processors

scale_factor = 0.5000

array_size = 8388608

cudaSafeCall() Runtime API error in file <simpleStreams.cu>, line 125 : out of memory.

[/codebox]

Please find attached the full Terminal session [including PATH export - how exciting]
Terminal_Saved_OutputCUDA_tests.txt.zip (8.41 KB)
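
In case it helps anyone hitting the same wall, a quick way to see how much device memory is actually free before a sample attempts a big allocation is to ask the runtime directly. This is only a minimal sketch, not one of the SDK samples, and it assumes a runtime recent enough to export cudaMemGetInfo (if that call is missing from the 2.3 runtime, the driver API equivalent, cuMemGetInfo, appears later in the thread); the 351.5625 MB request is just the figure 3dfd reports.

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

/* Minimal sketch: report free vs. total device memory, then attempt the
   kind of allocation the failing samples make. Not an SDK sample. */
int main(void)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device memory: %.2f MB free of %.2f MB total\n",
           freeBytes  / (1024.0 * 1024.0),
           totalBytes / (1024.0 * 1024.0));

    /* Illustrative request: 3dfd wants ~351.56 MB, more than this card has. */
    size_t wanted = (size_t)(351.5625 * 1024.0 * 1024.0);
    void  *devPtr = NULL;
    err = cudaMalloc(&devPtr, wanted);
    if (err != cudaSuccess)
        printf("Allocating %.2f MB fails here: %s\n",
               wanted / (1024.0 * 1024.0), cudaGetErrorString(err));
    else
        cudaFree(devPtr);

    return 0;
}[/codebox]

Built with nvcc (or gcc linked against -lcudart), it prints the free-memory figure that the failing samples never check before calling cudaMalloc.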

I am seeing similar results, including an error when just trying to allocate memory on the device. The output from the ‘deviceQuery’ application in the SDK also suggests that the card is capable of running CUDA applications; see the log below.

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce 9400"

  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 131792896 bytes
  Number of multiprocessors:                     2
  Number of cores:                               16
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.80 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     Yes
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED

Same problem for me, with the same binaries.
Just after installing the 2.3.0 driver.

Douglas Aguiar.

Same problem. Any help from the NVIDIA folks would be appreciated. I am attempting to run bandwidthTest on OS X 10.5.8 on a MacBook Pro.

Card Info:

NVIDIA GeForce 9400M:

Chipset Model: NVIDIA GeForce 9400M
Type: Display
Bus: PCI
VRAM (Total): 256 MB
Vendor: NVIDIA (0x10de)
Device ID: 0x0863
Revision ID: 0x00b1
ROM Revision: 3448
gMux Version: 1.8.8

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: “GeForce 9400M”
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 266010624 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.10 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 53331, CUDA Runtime Version = 3.0, NumDevs = 1, Device = GeForce 9400M

[bandwidthTest]
./bandwidthTest Starting…

Running on…

Device 0: GeForce 9400M
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1641.9

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1154.7

bandwidthTest.cu(720) : cudaSafeCall() Runtime API error : out of memory.

I think this is just a matter of Quartz memory usage. A GPU driving an active display doesn’t have as much free memory as a dedicated compute GPU, and establishing a CUDA context consumes something on the order of 40-50 MB of device memory. On a 256 MB card with a display attached, that starts to get rather tight rather quickly.

If you build and run this, it will give you an idea of how much device memory your CUDA programs actually have at their disposal:

#include <cuda.h>

#ifndef gpuAssert
#include <stdio.h>
#include <stdlib.h>
/* Abort with the failing status, file and line on any driver API error. */
#define gpuAssert( condition ) {if( (condition) != 0 ) { fprintf( stderr, "\n FAILURE %d in %s, line %d\n", condition, __FILE__, __LINE__ ); exit( 1 );}}
#endif

#define maxstring (128)
#define constMb   (1048576)

/* Written against the CUDA 2.x/3.x driver API, where memory sizes are
   unsigned int rather than size_t. */
int main(void)
{
    /* GPU data */
    int          nDevices;
    int          deviceNumber;
    CUdevice     deviceHandle;
    CUcontext    context;
    char         deviceName[maxstring];
    int          deviceCC[2];
    int          deviceCompMode;
    unsigned int deviceMemoryTot;
    CUdevprop    deviceProps;
    unsigned int memFree;       /* free device memory reported by the driver */
    unsigned int memTotal;
    char         compModeString[maxstring];

    gpuAssert( cuInit(0) );
    gpuAssert( cuDeviceGetCount(&nDevices) );

    for(deviceNumber = 0; deviceNumber < nDevices; deviceNumber++) {
        gpuAssert( cuDeviceGet(&deviceHandle, deviceNumber) );
        gpuAssert( cuDeviceGetName(deviceName, maxstring, deviceHandle) );
        gpuAssert( cuDeviceGetProperties(&deviceProps, deviceHandle) );
        gpuAssert( cuDeviceTotalMem(&deviceMemoryTot, deviceHandle) );
        gpuAssert( cuDeviceGetAttribute(&deviceCompMode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, deviceHandle) );
        gpuAssert( cuDeviceComputeCapability(&deviceCC[0], &deviceCC[1], deviceHandle) );

        switch (deviceCompMode) {
        case CU_COMPUTEMODE_PROHIBITED:
            sprintf(compModeString, "Compute Prohibited mode");
            break;
        case CU_COMPUTEMODE_DEFAULT:
            sprintf(compModeString, "Normal mode");
            break;
        case CU_COMPUTEMODE_EXCLUSIVE:
            sprintf(compModeString, "Compute Exclusive mode");
            break;
        default:
            sprintf(compModeString, "Unknown");
            break;
        }

        fprintf(stdout, "\n%d %s, %d MHz, %u Mb, Compute Capability %d.%d, %s\n",
                deviceNumber, deviceName, deviceProps.clockRate/1000,
                deviceMemoryTot / constMb, deviceCC[0], deviceCC[1], compModeString);

        if ( cuCtxCreate(&context, CU_CTX_SCHED_AUTO, deviceHandle) == CUDA_SUCCESS ) {
            CUdeviceptr memPool;
            int         allocated = 0;

            /* Start from the free-memory figure the driver reports, then back
               off 1 MB at a time until a single allocation of that size works. */
            gpuAssert( cuMemGetInfo( &memFree, &memTotal ) );
            while( memFree >= constMb ) {
                if ( cuMemAlloc( &memPool, memFree ) == CUDA_SUCCESS ) {
                    allocated = 1;
                    break;
                }
                memFree -= constMb;
            }

            if ( allocated ) {
                fprintf(stdout, "Successfully allocated %u Mb memory on device\n", memFree/constMb);
                cuMemFree(memPool);
            } else {
                fprintf(stdout, "Could not allocate even 1 Mb on device\n");
            }
            cuCtxDestroy(context);
        }
    }

    return 0;
}

This is what it does on Linux; building it on Snow Leopard should be pretty much the same:

avid@cuda:~$ gcc -g memfree.c -I$CUDA_INSTALL_PATH/include -L$CUDA_INSTALL_PATH/lib64 -lcuda

avid@cuda:~$ ./a.out 

0 GeForce GTX 275, 1460 MHz, 895 Mb, Compute Capability 1.3, Normal mode

Successfully allocated 773 Mb memory on device

1 GeForce GTX 275, 1460 MHz, 895 Mb, Compute Capability 1.3, Normal mode

Successfully allocated 858 Mb memory on device
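
On Mac OS X the equivalent compile line should be something like the following, assuming the toolkit is in its default /usr/local/cuda location (adjust the include and library paths if it was installed elsewhere):

gcc -g memfree.c -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda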

Thanks for your quick reply. I woke up and realized that I probably shouldn’t be running a GPU memory test while a Windows VM is running in the background. Mea culpa. Thanks again.