Hi,
Great basic tutorial - I think it's evident from the forums that many new users need it.
A few suggestions though:
-. Personally I think a PDF/HTML is far better than PPT.
-. An explanation of emulation mode, and of the difference between emulation and release, is something many new users fail to understand.
-. In the samples you put cudaMemcpy after the kernel invocation - many people fail to understand that cudaMemcpy will implicitly
synchronize (as cudaThreadSynchronize does), and therefore you see code that calls kernels and doesn't synchronize correctly. Maybe a description of
implicit and explicit synchronization should be added as well. Page 29 talks about it, but there is no code sample showing how/where to use it.
-. Doubles vs floats - arch sm_XX is also something new users don't take into account.
-. More about why a kernel would fail and how to see what's causing it. People run kernels (which fail because of too many resources or
access violations) and think that after ten minutes of coding they’ve achieved a x1000 performance boost. Users should understand
how to check for errors. Page 30 addresses this a bit; I think it can be extended, as this is one of the most common pitfalls for new users.
-. Maybe some more info on kernel resources: register pressure and how to see the kernel resource usage: --ptxas-options="-v -mem"
-. Differences between shared memory and global memory - people think that to boost the application they simply need to use shared
memory instead of global memory. Sometimes people fail to understand that it's not just a matter of choosing which memory to use:
you need to understand how to load the data, sync the threads and use shared memory wisely in order to gain performance.
-. I would also suggest that people get familiar with threading issues on the CPU before coding for the GPU. People who don't understand
CPU threads, synchronization issues, data dependencies etc. will never be able to use GPUs correctly.
-. Maybe add some “nVidia methodology” on how to find the bottlenecks, debug (for example on Windows without a debugger), reduce
resource pressure and stuff like that. I know I’d like to hear what nVidia thinks :)
-. Maybe mention the dead-code optimizer. People sometimes don't understand that the kernel was optimized out and think that the kernel gave a x1000 boost.
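
To make the synchronization and error-checking points concrete, here is a rough sketch of the kind of sample I have in mind. The kernel scale, the wrapper run_scale and the macro CHECK_CUDA are made-up names for illustration, not taken from the tutorial. I also assume a newer toolkit where cudaDeviceSynchronize is the current name for what I called cudaThreadSynchronize above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the tutorial): print the CUDA error
// and bail out of the calling function. No cleanup on error, for brevity.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element of data.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Copies host data to the device, runs the kernel, copies the result back.
// Returns 0 on success, 1 if any CUDA call or the kernel itself failed.
int run_scale(float *host, int n)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);
    CHECK_CUDA(cudaMalloc((void **)&dev, bytes));
    CHECK_CUDA(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, n);   // returns immediately (asynchronous)

    // Launch-time failures (e.g. too many resources requested) show up here.
    CHECK_CUDA(cudaGetLastError());
    // Explicit sync: failures during execution (e.g. a bad access) show up here.
    CHECK_CUDA(cudaDeviceSynchronize());

    // Blocking copy; it would also wait for the kernel implicitly.
    CHECK_CUDA(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFree(dev));
    return 0;
}
```

With something like this, a kernel that never ran is reported as an error instead of looking like a x1000 speedup. If you time the kernel, stop the timer only after the explicit synchronization, otherwise you are only measuring the launch.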
I understand that some of those issues might add some more pages, but I think that those (along with what the document already addresses
and what Dominik wrote) are the most common issues and misunderstandings new users face.
my 1 cent,
eyal