A small oddity I have not found an explanation for anywhere in the CUDA manuals or other C++ references.
I tried the sample project that is generated immediately when creating a new project in CUDA 8.0.
This worked fine.
I then added a variable in host memory, copied the given source with my modification into all the required places, and added it to the existing addWithCuda sample. It now looks like this.
Note that a and b are constant input vectors, c is the original example's result vector, and x is my goofing-around vector with one element.
cudaError_t addWithCuda(int *a, int *b, int *c, unsigned int size, int *x);
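For context, here is roughly what the modified helper looks like; this is a sketch based on the stock CUDA 8.0 addWithCuda template, and the dev_x buffer with its two copies is my addition (so that part may be exactly where things go wrong):

```cuda
// Sketch of the modified helper, following the CUDA 8.0 project template.
// dev_x and its cudaMalloc/cudaMemcpy calls are the lines I added;
// error checking is trimmed here for brevity.
cudaError_t addWithCuda(int *a, int *b, int *c, unsigned int size, int *x)
{
    int *dev_a = 0, *dev_b = 0, *dev_c = 0, *dev_x = 0;
    cudaError_t cudaStatus = cudaSetDevice(0);

    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_x, 1 * sizeof(int));   // x has one element

    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_x, x, 1 * sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<1, size>>>(dev_c, dev_a, dev_b, dev_x);
    cudaStatus = cudaDeviceSynchronize();

    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(x, dev_x, 1 * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c); cudaFree(dev_x);
    return cudaStatus;
}
```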
Result: nothing is returned in x; incrementing x on the device has no effect.
Modifying it further:
cudaError_t addWithCuda(int *x, int *a, int *b, int *c, unsigned int size);
Result: x is returned properly through cudaMemcpy as expected.
However, incrementing x on the device still has no effect, except in the final stages by
the last two elements in the called kernel code, normally the last few threads.
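The increment I am testing inside the kernel is roughly this (a sketch; names follow the template, and the x line is the part in question):

```cuda
// Sketch of the kernel; dev_x is the single-element vector passed in.
__global__ void addKernel(int *c, const int *a, const int *b, int *x)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
    x[0]++;   // every thread increments the same element of x
}
```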
Is there some specific limitation or preference regarding:
A) the order in which parameters are passed?
B) which kernels may work on global or shared memory and which may not?
(Other than that shared memory in one SM is only accessible to the threads of the block it belongs to.)
I could do this as a reference (x&); however, if *x works in some cases but not in all, this seems a bit obscure to me.
After all, the kernel does not rewrite itself in between the 15 tested threads working with it.