I have a few questions:
- Are CUDA kernels similar to a do loop? For example, I want to perform 3 simple operations in 3 loops on an array of integer elements (this is just an example):
a. Add 1 to each element of the array of integers
b. Tag the elements in the array which are even
c. Add one to the tagged elements
For the above operations, do I need to invoke 3 kernels? Will the pseudo code look something like:
Host
    allocate memory on the device
    call kernel (add one to all elements of the array)
Host
    call kernel (tag even elements)
Host
    call kernel (add 1 to the even elements of the array)
Host
    I/O to a file
    free memory on device
end
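To make the question concrete, here is a rough sketch of what I imagine the three-kernel version would look like. The kernel names, the separate tag array, the `run_three_kernels` helper, and the block size of 256 are all my own assumptions for illustration, and I have added the host-to-device and device-to-host copies that the pseudo code leaves out. Error checking is kept minimal.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Step a: each thread adds 1 to one element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Step b: tag[i] is set to 1 where data[i] is even, 0 otherwise.
__global__ void tag_even(const int *data, int *tag, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tag[i] = (data[i] % 2 == 0) ? 1 : 0;
}

// Step c: add 1 only to the elements tagged in step b.
__global__ void add_one_to_tagged(int *data, const int *tag, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && tag[i]) data[i] += 1;
}

// Hypothetical host helper: runs the three steps in place on `host`.
// Returns false if an allocation or the final copy fails.
bool run_three_kernels(std::vector<int> &host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return true;
    size_t bytes = n * sizeof(int);

    int *d_data = nullptr;
    int *d_tag = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_tag, bytes) != cudaSuccess) {
        cudaFree(d_data);
        return false;
    }

    // Copy the input array into global memory on the device.
    cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n

    // Kernels launched on the same (default) stream run in order,
    // so each step sees the results of the previous one.
    add_one<<<blocks, threads>>>(d_data, n);
    tag_even<<<blocks, threads>>>(d_data, d_tag, n);
    add_one_to_tagged<<<blocks, threads>>>(d_data, d_tag, n);

    // This copy waits for the kernels above to finish.
    cudaError_t err = cudaMemcpy(host.data(), d_data, bytes,
                                 cudaMemcpyDeviceToHost);

    cudaFree(d_tag);
    cudaFree(d_data);
    return err == cudaSuccess;
}
```

If I understand correctly, calling `run_three_kernels` on `{1, 2, 3, 4}` should leave `{3, 3, 5, 5}` in the vector, after which the host can write it to a file. Since each thread only touches its own element, I suspect the three steps could also be merged into a single kernel, but I would like to know whether that is the usual approach.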
- Can we allocate and deallocate only the "global" memory on the device, or do we also have access to the shared, register and local memory? How do programmers normally use memory (global only, or do they also use local, shared and registers)? Does global memory have any latency issues? What do programmers normally do to avoid those latency issues?
- For graphics, do we just copy the global memory data to the texture memory and then render it on screen, or are there more efficient ways to deal with this?