Kernel computation on multiple array elements

you explain that one thread can work on as many input elements as required.

I guess I wasn’t clear enough. Let me try again: a CUDA thread can work on as many input elements and as many output elements as the programmer desires. The programming model places no limit on either. What I mentioned above is just a common arrangement that many programmers choose because it is often advantageous: N input elements per thread, 1 output element per thread.
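To make that concrete, here is a minimal sketch of both cases. The names `sumChunks`, `scaleAll`, and `runDemo` are made up for this illustration, as are the sizes and launch configurations. The first kernel follows the arrangement described above (N inputs in, 1 output out, per thread). The second uses a grid-stride loop so that each thread reads and writes several elements. Error handling is reduced to a single check to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads elemsPerThread consecutive inputs
// and writes exactly one output (their sum).
__global__ void sumChunks(const float* in, float* out, int numOut, int elemsPerThread)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numOut) return;              // guard threads past the end
    float acc = 0.0f;
    for (int k = 0; k < elemsPerThread; ++k)
        acc += in[t * elemsPerThread + k];
    out[t] = acc;
}

// Hypothetical kernel: grid-stride loop. With fewer threads than elements,
// each thread handles several inputs and writes several outputs.
__global__ void scaleAll(const float* in, float* out, int n, float factor)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = factor * in[i];
}

// Runs both kernels and compares against a CPU reference.
// Returns true only if every result matches and no CUDA error was reported.
bool runDemo()
{
    const int elemsPerThread = 4;         // assumed value, for illustration
    const int numOut = 1000;
    const int n = numOut * elemsPerThread;

    std::vector<float> hIn(n), hSum(numOut), hScaled(n);
    for (int i = 0; i < n; ++i) hIn[i] = float(i % 8);  // small exact values

    float *dIn = nullptr, *dSum = nullptr, *dScaled = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dSum, numOut * sizeof(float));
    cudaMalloc(&dScaled, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per output element; each thread consumes 4 inputs.
    sumChunks<<<(numOut + 255) / 256, 256>>>(dIn, dSum, numOut, elemsPerThread);
    // Only 256 threads for 4000 elements; each thread loops over several.
    scaleAll<<<2, 128>>>(dIn, dScaled, n, 2.0f);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(hSum.data(), dSum, numOut * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hScaled.data(), dScaled, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dSum); cudaFree(dScaled);

    for (int t = 0; t < numOut; ++t) {
        float expected = 0.0f;
        for (int k = 0; k < elemsPerThread; ++k)
            expected += hIn[t * elemsPerThread + k];
        ok = ok && (hSum[t] == expected);
    }
    for (int i = 0; i < n; ++i)
        ok = ok && (hScaled[i] == 2.0f * hIn[i]);
    return ok;
}

int main()
{
    bool ok = runDemo();
    std::printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Which split of work per thread is best depends on the problem. The sketch only shows that both arrangements are legal, not that either is the faster choice.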
