I have two desktop machines (GTX 260 & 9500 GT) for CUDA work under Linux (numerical simulations), but I often need to work on the road, so I've decided to buy a netbook - one of the new ION machines, an HP Mini 311 in particular. So the question is: does CUDA on the ION/ION LE work smoothly under Linux (openSUSE 11.x)?
Yes, it does work. Here is the SDK deviceQuery output for the GeForce 8200, the ION's precursor for AMD boards, on a CentOS 5.4 GPU compute node:
[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA
Device 0: "GeForce GTX 275"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 939261952 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.48 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Device 1: "GeForce 8200"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 265617408 bytes
Number of multiprocessors: 1
Number of cores: 8
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.20 GHz
Concurrent copy and execution: No
Run time limit on kernels: No
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Test PASSED
Press ENTER to exit...
By default, CUDA apps pick the GTX 275.
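(If you want a program to run on the IGP instead, you can pick it explicitly through the runtime API. Here's a minimal sketch of my own, not from the SDK - it just enumerates the devices and selects the integrated one:)

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // Print every device and select the integrated one (the 8200/ION)
    // instead of relying on the default choice of device 0.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (integrated: %s)\n",
               i, prop.name, prop.integrated ? "yes" : "no");
        if (prop.integrated)
            cudaSetDevice(i);  // later CUDA calls in this thread use this device
    }
    return 0;
}
```

(The SDK samples take the shortcut of a --device=N flag, which is what I use below.)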
And here are the SDK nbody -benchmark numbers (a whopping 11.543 GFLOP/s!):
[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/nbody -benchmark --device=1
Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
Using device 1: GeForce 8200
1024 bodies, total time for 100 iterations: 181.675 ms
= 0.577 billion interactions per second
= 11.543 GFLOP/s at 20 flops per interaction
Also, here are the 8200 bandwidthTest numbers (with those of the GTX 275 for comparison):
[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/bandwidthTest --memory=pinned --device=1
Running on......
device 1:GeForce 8200
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2261.8
Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2259.9
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4327.3
&&&& Test PASSED
Press ENTER to exit...
[root@bdgpu-n01 ~]# /usr/local/cuda_sdk/C/bin/linux/release/bandwidthTest --memory=pinned --device=0
Running on......
device 0:GeForce GTX 275
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2714.1
Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2864.6
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 107162.8
&&&& Test PASSED
Press ENTER to exit...
The internal memory bandwidth of 4.3 GB/s is, shall we say, on the low side for a CUDA-capable device, but the host-to-device and device-to-host rates (pinned memory) are almost the same as for the GTX 275. (Although off-topic, it should be noted that these are 2U rackmount machines and that the GTX 275s are connected with a flexible riser. The device-to-host number doubles when the GT200 GPU is either directly mounted in the PCIe x16 2.0 slot or connected with a rigid riser. Unfortunately the flexible riser was necessary to fit the GPU in the 2U box.) Anyway, the ION is compute capability 1.1. Have fun!
I hesitate to call the performance of the GeForce 8200 (or any other IGP) “good”. If asked to choose, I would probably go with “bad”, but maybe that is unfair, since there are probably whole classes of CUDA applications (yet unwritten?) that would be a good match for the IGP.
I am not an expert on IONs; however, the ION LE, the lower-powered version, has the same number of cores (8) as the 8200, and appears to have equivalent specs. YMMV
There are a lot of real-time signal processing applications that fit in a single MP quite well (and a standard ION (not LE) has 2…), and aren't bandwidth limited… in those cases you're purely clock limited, and the clock of an ION is quite close to that of any high-end GeForce card - so there the ION is pretty damn fine (2-8x the FLOPs of various low-power VIA/Atom/ARM processors)…
Oh, and then you have tricks for the ION like dual-purpose kernels, which execute different code paths (‘virtual kernels’) on each MP, effectively getting a poor man's asynchronous kernel execution (with varying kernels) over the 2 MPs…
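(The dual-purpose kernel trick can be sketched roughly like this - my own illustrative code, not a tested implementation. It assumes a 2-block launch, so that with 2 MPs the scheduler puts one block on each; note the hardware doesn't strictly guarantee that mapping:)

```cuda
#include <cuda_runtime.h>

// One physical kernel, two 'virtual kernels': branch on blockIdx.x so
// each of the ION's 2 MPs runs a different code path in the same launch.
__global__ void dualKernel(float *a, float *b, int n)
{
    if (blockIdx.x == 0) {
        // virtual kernel A, e.g. a scaling pass
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            a[i] *= 2.0f;
    } else {
        // virtual kernel B, e.g. an offset pass, running concurrently on the other MP
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            b[i] += 1.0f;
    }
}

// launch with exactly 2 blocks, e.g.:  dualKernel<<<2, 128>>>(d_a, d_b, n);
```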
So, my HP Mini 311 is here. CUDA works perfectly (but slowly, of course :)) under openSUSE 11.2. Here is the deviceQuery output:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "ION LE"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 131792896 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.10 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 134575159, CUDA Runtime Version = 3.0, NumDevs = 1, Device = ION LE
The default compiler in openSUSE 11.2 is GCC 4.4, which is not supported by the CUDA toolchain. So, to get it working, do the following: