simpleStreams FAILED

Hi,

I am fairly new to CUDA. My code runs kernels concurrently using streams. It was working fine until recently, and then it started giving incorrect results.

The code now gives correct values when I use 4 streams but fails with more than 4. The simpleStreams sample from the SDK also fails when I run it. Its output is:

[simpleStreams] starting...

[ simpleStreams ]

> > Using CUDA device [0]: GeForce GTX 560 Ti

Device: <GeForce GTX 560 Ti> canMapHostMemory: Yes

> CUDA Capable: SM 2.1 hardware

> 8 Multiprocessor(s) x 48 (Cores/Multiprocessor) = 384 (Cores)

> scale_factor = 1.0000

> array_size   = 16777216

> Using CPU/GPU Device Synchronization method (cudaDeviceScheduleBlockingSync)

> mmap() allocating 64.00 Mbytes (generic page-aligned system memory)

> cudaHostRegister() registering 64.00 Mbytes of generic allocated system memory

Starting Test

memcopy:	12.14

kernel:		46.80

non-streamed:	58.85 (58.93 expected)

8 streams:	49.29 (48.31 expected with compute capability 1.1 or later)

-------------------------------

21181: 4744 5000

[simpleStreams] test results...

FAILED

Press ENTER to exit...

The card is a GeForce GTX 560 Ti, and I am running CUDA 4.0 on Ubuntu 10.04. I would be grateful for any responses.
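For anyone who wants to try reproducing this, here is a minimal sketch of the kind of multi-stream launch I mean. It is not my actual code and not the SDK sample: the kernel `add_value`, the helpers `first_mismatch` and `run_with_streams`, and all sizes are made up for illustration. It checks every CUDA call and, on a wrong result, prints a line in the same "index: got expected" shape as the mismatch line in the output above (my assumption about what that line means).

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with file/line and the CUDA error string if a runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Hypothetical kernel: adds `value` to each of the first n elements.
__global__ void add_value(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += value;
}

// Host-only check: index of the first element != expected, or -1 if all match.
int first_mismatch(const int *a, int n, int expected)
{
    for (int i = 0; i < n; ++i)
        if (a[i] != expected) return i;
    return -1;
}

// Runs `reps` kernel launches per stream, each stream on its own chunk.
// Returns the first mismatching index, or -1 if the result is correct.
int run_with_streams(int nstreams)
{
    const int chunk = 1 << 16;          // elements per stream (arbitrary)
    const int n = chunk * nstreams;
    const int value = 5, reps = 10;     // every element should end at 50

    int *d_a = NULL;
    CHECK(cudaMalloc(&d_a, n * sizeof(int)));
    CHECK(cudaMemset(d_a, 0, n * sizeof(int)));

    std::vector<cudaStream_t> streams(nstreams);
    for (int s = 0; s < nstreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    for (int r = 0; r < reps; ++r) {
        for (int s = 0; s < nstreams; ++s) {
            add_value<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
                d_a + s * chunk, chunk, value);
            CHECK(cudaGetLastError());  // catches launch failures
        }
    }
    CHECK(cudaDeviceSynchronize());     // catches errors during execution

    std::vector<int> h_a(n);
    CHECK(cudaMemcpy(&h_a[0], d_a, n * sizeof(int), cudaMemcpyDeviceToHost));

    int bad = first_mismatch(&h_a[0], n, value * reps);
    if (bad >= 0) std::printf("%d: %d %d\n", bad, h_a[bad], value * reps);

    for (int s = 0; s < nstreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_a));
    return bad;
}

int main()
{
    const int counts[2] = {4, 8};       // the two cases described in this post
    for (int k = 0; k < 2; ++k) {
        int bad = run_with_streams(counts[k]);
        std::printf("%d streams: %s\n", counts[k], bad < 0 ? "PASSED" : "FAILED");
    }
    return 0;
}
```

If something this small passes with 8 streams while simpleStreams still fails, that would at least suggest the problem is not in stream handling alone.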

Thanks