Array element count limitation

Hello,

I’m testing CUDA. I wrote a simple program to test performance. It implements the following algorithm (a rough code sketch follows the list):

  • create two one-dimensional arrays in host memory (h_A, h_B)

  • increment every element of h_A and store the result in h_B

  • transfer h_A to the device array d_A

  • copy d_A to the device array d_B

  • increment every element of d_B on the device

  • transfer d_B back to the host array h_A

  • compare h_A and h_B element by element (a “for” loop) to check that the result is correct
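To make the flow concrete, here is a minimal sketch of the same algorithm (this is not my actual attached code; the kernel name incrementKernel and the initial values are just placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* hypothetical kernel: adds 1.0f to every element */
__global__ void incrementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main(void)
{
    const int N = 4194240;              /* element count under test */
    size_t bytes = N * sizeof(float);

    /* host arrays: h_A is the input, h_B is the CPU reference result */
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) {
        h_A[i] = (float)i;
        h_B[i] = h_A[i] + 1.0f;         /* increment on the CPU */
    }

    /* device arrays */
    float *d_A, *d_B;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, d_A, bytes, cudaMemcpyDeviceToDevice);

    /* increment on the GPU, one thread per element */
    int threadsPerBlock = 64;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<blocksPerGrid, threadsPerBlock>>>(d_B, N);

    /* copy the result back and compare against the CPU reference */
    cudaMemcpy(h_A, d_B, bytes, cudaMemcpyDeviceToHost);
    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (h_A[i] != h_B[i])
            ++errors;
    printf("%s (%d mismatches)\n", errors ? "FAILED" : "PASSED", errors);

    cudaFree(d_A);
    cudaFree(d_B);
    free(h_A);
    free(h_B);
    return 0;
}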

Now the PROBLEM: if the array size is under 4194240 elements, everything works fine; above this limit the result is incorrect.

It’s not a memory limit problem: 4194240 elements × sizeof(float) (4 bytes on my PC) = 16,776,960 bytes ≈ 16 MB per array, and both the host and the device have much more free memory than that.

Where did I go wrong? >.<

Program attached.
arraylimit.txt (9 KB)

You are probably running out of resources somewhere (either memory or wall-clock seconds). If you have the host thread spin in a cudaThreadSynchronize() call until your kernel finishes and then examine the return status, you will know for sure. My guess is that the GPU driver watchdog timer is killing your kernel for taking too long, but that is just a guess.
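Something along these lines would tell you which case you are in (the kernel and variable names are placeholders for whatever you have in your code):

/* launch as before */
incrementKernel<<<blocksPerGrid, threadsPerBlock>>>(d_B, N);

/* catches launch-time problems such as an invalid grid/block configuration */
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel invocation: %s\n", cudaGetErrorString(err));

/* blocks the host until the kernel finishes; catches execution-time problems
   such as the watchdog timer killing a long-running kernel */
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel execution: %s\n", cudaGetErrorString(err));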

Thanks for the advice, avidday.

I added CUDA error handling to see whether the problem was in thread synchronization or in the kernel itself, and I get this error: “kernel invocation: invalid configuration argument”.

So I changed the number of threads per block to 512, and the problem disappeared.

I need to look more closely at kernel sizing to get a better idea of what’s going on.
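For reference, the sizing I am switching to looks roughly like this (the kernel name is a placeholder for the one in my attached code):

/* with 64 threads per block, 5,000,000 elements need ~78,125 blocks;
   with 512 threads per block, the same array needs only ~9,766 blocks */
int N = 5000000;
int threadsPerBlock = 512;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   /* round up */
incrementKernel<<<blocksPerGrid, threadsPerBlock>>>(d_B, N);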

Thanks again.
arraylimit2.txt (9.41 KB)

There is a hardware limitation of 512 threads per block. That is why your kernel is failing to launch.
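You can also query the actual limits of your card at runtime instead of hard-coding them; a minimal standalone query (device 0 assumed):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}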

Yes, I know about this limit. I had 64 threads per block, and an array of 5 million elements crashed the kernel. I raised the number of threads to 512 and everything works great now. That also explains the magic number: with 64 threads per block, 4194240 elements / 64 = 65535 blocks, which is exactly the one-dimensional grid size limit, so any larger array produces an invalid launch configuration.