Hi,
I am fairly new to CUDA. I am running kernels concurrently in my code using streams. Until recently it worked fine, but then it started giving incorrect results.
The code now gives correct values when I use 4 streams but fails with more than 4. The simpleStreams sample from the SDK also fails when I run it. Its output is:
[simpleStreams] starting...
[ simpleStreams ]
> > Using CUDA device [0]: GeForce GTX 560 Ti
Device: <GeForce GTX 560 Ti> canMapHostMemory: Yes
> CUDA Capable: SM 2.1 hardware
> 8 Multiprocessor(s) x 48 (Cores/Multiprocessor) = 384 (Cores)
> scale_factor = 1.0000
> array_size = 16777216
> Using CPU/GPU Device Synchronization method (cudaDeviceScheduleBlockingSync)
> mmap() allocating 64.00 Mbytes (generic page-aligned system memory)
> cudaHostRegister() registering 64.00 Mbytes of generic allocated system memory
Starting Test
memcopy: 12.14
kernel: 46.80
non-streamed: 58.85 (58.93 expected)
8 streams: 49.29 (48.31 expected with compute capability 1.1 or later)
-------------------------------
21181: 4744 5000
[simpleStreams] test results...
FAILED
Press ENTER to exit...
The card is a GeForce GTX 560 Ti, and I am running CUDA 4.0 on Ubuntu 10.04. I would be grateful for any responses.
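In case it helps with diagnosis, below is a minimal sketch of the kind of multi-stream check I have in mind. It is not my actual code and not the SDK sample: the kernel `fill_chunk`, the helper `count_mismatches`, and the `CUDA_CHECK` macro are hypothetical names made up for illustration, and the chunk size is arbitrary. It only uses standard runtime calls (`cudaStreamCreate`, `cudaMemcpyAsync`, `cudaGetLastError`, `cudaDeviceSynchronize`) and checks every return code, so a failure should show up as an error string rather than as silently wrong values.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical kernel: writes `value` into every element of its chunk.
__global__ void fill_chunk(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Launches one kernel per stream on disjoint chunks, copies each chunk back
// in the same stream, and returns how many elements differ from `value`.
int count_mismatches(int nstreams, int chunk, int value)
{
    const int total = nstreams * chunk;
    int *d_data = NULL;
    int *h_data = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_data, total * sizeof(int)));
    // Pinned host memory, so the async copies can overlap with kernels.
    CUDA_CHECK(cudaMallocHost((void **)&h_data, total * sizeof(int)));
    CUDA_CHECK(cudaMemset(d_data, 0, total * sizeof(int)));

    cudaStream_t *streams =
        (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
    for (int s = 0; s < nstreams; ++s)
        CUDA_CHECK(cudaStreamCreate(&streams[s]));

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;
    for (int s = 0; s < nstreams; ++s) {
        int *d_chunk = d_data + s * chunk;
        fill_chunk<<<blocks, threads, 0, streams[s]>>>(d_chunk, chunk, value);
        CUDA_CHECK(cudaGetLastError());  // reports launch errors
        CUDA_CHECK(cudaMemcpyAsync(h_data + s * chunk, d_chunk,
                                   chunk * sizeof(int),
                                   cudaMemcpyDeviceToHost, streams[s]));
    }
    // Waits for all streams and reports errors raised during execution.
    CUDA_CHECK(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int i = 0; i < total; ++i)
        if (h_data[i] != value) ++mismatches;

    for (int s = 0; s < nstreams; ++s)
        CUDA_CHECK(cudaStreamDestroy(streams[s]));
    free(streams);
    CUDA_CHECK(cudaFreeHost(h_data));
    CUDA_CHECK(cudaFree(d_data));
    return mismatches;
}
```

Calling `count_mismatches(4, 1 << 16, 5000)` and `count_mismatches(8, 1 << 16, 5000)` from `main` and printing the results would show whether the stream count alone changes the outcome. A nonzero count with no reported CUDA error would suggest the problem lies outside the application code.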
Thanks