simpleStreams: non-streamed and 4-stream results much lower than expected

Hello all CUDA experts!

Today my GeForce GTX 295 arrived and we have just started to get to know each other.

I am new to CUDA and have just run deviceQuery and simpleStreams. I am a bit confused, because my simpleStreams results (the non-streamed and 4-stream lines below) are much lower than expected. What could be wrong, and is it fixable? Are both of my GPUs working? Any help is appreciated.

These are my outputs:

Device Query:


There is 1 device supporting CUDA

Device 0: "GeForce GTX 295"
Major revision number: 1
Minor revision number: 3
Total amount of global memory: 939196416 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.24 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default

Test PASSED

Press ENTER to exit…

simpleStreams:

running on: GeForce GTX 295

memcopy: 27.51
kernel: 214.54
non-streamed: 124.47 (242.05 expected)
4 streams: 26.42 (221.42 expected with compute capability 1.1 or later)

Test PASSED

Press ENTER to exit…
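For reference, both "expected" figures can be reproduced from the measured memcopy and kernel times: 27.51 + 214.54 = 242.05 for the non-streamed case, and 214.54 + 27.51 / 4 ≈ 221.42 for 4 streams. The sketch below assumes simpleStreams derives its expected values this way (it matches both printed numbers exactly) and that all values are elapsed times in milliseconds, so a lower number means a faster run. The function name `expected_times` is hypothetical.

```python
def expected_times(memcpy_ms, kernel_ms, nstreams):
    """Hypothetical helper: return (non_streamed, streamed) expected times.

    Assumes the formulas simpleStreams appears to use for its
    "expected" values; inputs and outputs are assumed to be in ms.
    """
    # Without streams, the copy and the kernel run back to back.
    non_streamed = memcpy_ms + kernel_ms
    # With copy/compute overlap across nstreams, only about
    # 1/nstreams of the copy time is left exposed.
    streamed = kernel_ms + memcpy_ms / nstreams
    return non_streamed, streamed


# Measured values from the simpleStreams run above.
non_streamed, streamed = expected_times(27.51, 214.54, 4)
print(f"non-streamed expected: {non_streamed:.2f}")
print(f"4 streams expected: {streamed:.2f}")
```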