CUDA on Nvidia ION under Linux: does it work?

I have two desktop machines (GTX 260 & 9500 GT) for CUDA work under Linux (numerical simulations), but I often need to work on the go, so I've decided to buy a netbook, one of the new ION machines; an HP Mini 311, specifically. So the question is: does CUDA on the ION/ION LE work smoothly under Linux (openSUSE 11.x)?

Any useful info is appreciated:)

Yes, it does work. Here is the SDK deviceQuery output for the GeForce 8200 (the ION precursor for AMD boards) on a CentOS 5.4 GPU compute node:

[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/deviceQuery

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "GeForce GTX 275"

  CUDA Driver Version:						   2.30

  CUDA Runtime Version:						  2.30

  CUDA Capability Major revision number:		 1

  CUDA Capability Minor revision number:		 3

  Total amount of global memory:				 939261952 bytes

  Number of multiprocessors:					 30

  Number of cores:							   240

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 16384

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									1.48 GHz

  Concurrent copy and execution:				 Yes

  Run time limit on kernels:					 No

  Integrated:									No

  Support host page-locked memory mapping:	   Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Device 1: "GeForce 8200"

  CUDA Driver Version:						   2.30

  CUDA Runtime Version:						  2.30

  CUDA Capability Major revision number:		 1

  CUDA Capability Minor revision number:		 1

  Total amount of global memory:				 265617408 bytes

  Number of multiprocessors:					 1

  Number of cores:							   8

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									1.20 GHz

  Concurrent copy and execution:				 No

  Run time limit on kernels:					 No

  Integrated:									Yes

  Support host page-locked memory mapping:	   Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit...

By default, CUDA apps pick the GTX 275.

And here are the SDK nbody -benchmark numbers (a whopping 11.543 GFLOP/s!):

[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/nbody -benchmark --device=1

Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.

Using device 1: GeForce 8200

1024 bodies, total time for 100 iterations: 181.675 ms

= 0.577 billion interactions per second

= 11.543 GFLOP/s at 20 flops per interaction
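For what it's worth, both figures follow directly from the reported timing; a quick sanity check (an all-pairs nbody does numBodies² interactions per iteration):

```shell
# Recompute the nbody figures from the raw timing above.
awk 'BEGIN {
    interactions = 1024 * 1024 * 100      # all-pairs, 100 iterations
    rate = interactions / 0.181675        # 181.675 ms total
    printf "%.3f billion interactions per second\n", rate / 1e9
    printf "%.3f GFLOP/s\n", rate * 20 / 1e9   # 20 flops per interaction
}'
```

This prints 0.577 billion interactions per second and 11.543 GFLOP/s, matching the benchmark's own report.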

Also, here are the 8200 bandwidthTest numbers (with those of the GTX 275 for comparison):

[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/bandwidthTest --memory=pinned --device=1

Running on......

	  device 1:GeForce 8200

Quick Mode

Host to Device Bandwidth for Pinned memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   2261.8

Quick Mode

Device to Host Bandwidth for Pinned memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   2259.9

Quick Mode

Device to Device Bandwidth

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   4327.3

&&&& Test PASSED

Press ENTER to exit...

[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/bandwidthTest --memory=pinned --device=0

Running on......

	  device 0:GeForce GTX 275

Quick Mode

Host to Device Bandwidth for Pinned memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   2714.1

Quick Mode

Device to Host Bandwidth for Pinned memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   2864.6

Quick Mode

Device to Device Bandwidth

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432			   107162.8

&&&& Test PASSED

Press ENTER to exit...

The internal memory bandwidth of 4.3 GB/s is, shall we say, on the low side for a CUDA-capable device, but the host-to-device and device-to-host rates (pinned memory) are almost the same as for the GTX 275. (Although off-topic, it should be noted that these are 2U rackmount machines and that the GTX 275s are connected with a flexible riser. The device-to-host number doubles when the GT200 GPU is either mounted directly in the PCIe x16 2.0 slot or connected with a rigid riser. Unfortunately, the flexible riser was necessary to fit the GPU in the 2U box.) Anyway, the ION is compute capability 1.1. Have fun!
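To put "on the low side" in numbers, a quick check of the two device-to-device figures from the bandwidthTest runs above:

```shell
# Ratio of device-to-device bandwidth: GTX 275 vs. GeForce 8200 (MB/s).
awk 'BEGIN { printf "%.1fx\n", 107162.8 / 4327.3 }'
```

That comes out to about a 24.8x gap; with an IGP, "device memory" is just a carve-out of system RAM, so this is expected.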

Then rephrase your question to cover the ION2 instead, which, as far as I can see, is compute capability 1.2 rather than 1.1:

http://tech.icrontic.com/news/nvidia-unveils-ion-2-platform/

Thanks a lot! I'm actually not sure that the 9400M is good just because the 8200 was good, but it's an argument, nevertheless.

The only thing I don't understand is the difference between the ION and the ION LE. Is it right that under Linux there is no difference between them?

I hesitate to call the performance of the GeForce 8200 (or any other IGP) “good”. If asked to choose, I would probably go with “bad”, but maybe that is unfair, since there are probably whole classes of CUDA applications (yet unwritten?) that would be a good match for the IGP.

I am not an expert on IONs; however, the ION LE, the lower-powered version, has the same number of cores (8) as the 8200 and appears to have equivalent specs. YMMV.

There are a lot of real-time signal processing applications that fit in a single MP quite fine (and a standard ION, not the LE, has 2), and they aren't bandwidth-limited. In those cases you're purely clock-limited, and the clock of an ION is quite close to that of any high-end GeForce card, so the ION is pretty damn fine there (2-8x the FLOPs of various low-power VIA/Atom/ARM processors).

Oh, and then you have tricks for the ION like dual-purpose kernels, which execute different code sets (‘virtual kernels’) on each MP, effectively getting a poor man’s asynchronous kernel execution (with varying kernels) over the 2 MPs.

I think this is a good point. As you imply, the performance of Atom/VIA/ARM processors is also “bad” in an HPC context, but not all computing is HPC.

So, my HP Mini 311 is here. CUDA works perfectly (but slowly, of course :)) under openSUSE 11.2. Here is the deviceQuery output:

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "ION LE"

  CUDA Driver Version:						   3.0

  CUDA Runtime Version:						  3.0

  CUDA Capability Major revision number:		 1

  CUDA Capability Minor revision number:		 1

  Total amount of global memory:				 131792896 bytes

  Number of multiprocessors:					 2

  Number of cores:							   16

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  2147483647 bytes

  Texture alignment:							 256 bytes

  Clock rate:									1.10 GHz

  Concurrent copy and execution:				 No

  Run time limit on kernels:					 Yes

  Integrated:									Yes

  Support host page-locked memory mapping:	   Yes

  Compute mode:								  Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 134575159, CUDA Runtime Version = 3.0, NumDevs = 1, Device = ION LE

The default compiler in openSUSE 11.2 is GCC 4.4, which nvcc does not support. Thus, to get it working, do the following:

  1. install GCC 4.3

  2. mkdir /opt/gcc43; ln -s /usr/bin/gcc-4.3 /opt/gcc43/gcc; ln -s /usr/bin/g++-4.3 /opt/gcc43/g++

  3. in nvcc.profile add line “compiler-bindir=/opt/gcc43”
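The steps above as a copy-paste sketch (assuming the gcc43/gcc43-c++ packages are installed and running as root; the location of nvcc.profile depends on where the toolkit was installed, /usr/local/cuda/bin by default):

```shell
# Create a wrapper directory that points nvcc at GCC 4.3.
mkdir -p /opt/gcc43
ln -sf /usr/bin/gcc-4.3 /opt/gcc43/gcc
ln -sf /usr/bin/g++-4.3 /opt/gcc43/g++
# Then add this line to nvcc.profile (e.g. /usr/local/cuda/bin/nvcc.profile):
#   compiler-bindir=/opt/gcc43
```

The compiler-bindir option just tells nvcc which host compiler directory to use, so no other part of the toolchain needs to change.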