Obvious Unanswered Question

I am new to CUDA programming in Visual C++. In EVERY example of dot product that I have seen, e.g.,

https://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf

NOWHERE in the code is the summation variable (dev_c[0] in that example) EXPLICITLY initialized to 0.

The OBVIOUS question is: where is it being initialized to zero, and where is such zero initialization covered in the CUDA documentation?

Indeed, when I run dot the first time it works as expected, but if I run it a second time on the same dev_c, the new answer is added to the old one. So, to get the correct answer the second time, I use cudaMemset(dev_c, 0, sizeof(int)). But this is a time-expensive operation. Please show me a faster way, maybe within the code of dot.

It needs to be initialized to zero. Training decks such as that one occasionally have oversights like that.

As you’ve already suggested, if you add something like:

cudaMemset(dev_c, 0, sizeof(int));

to the code on slide 60, prior to the kernel launch, that should address the oversight. I don’t recommend trying to do it in the kernel itself, as it introduces another race condition.
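
Putting it together, the pattern looks something like this (a minimal sketch; the names c, dev_a, dev_b, blocks, and threads are assumed from the slide’s surrounding code, not taken from it):

// Zero the single-int accumulator before every launch that reuses it.
cudaMemset(dev_c, 0, sizeof(int));
dot<<<blocks, threads>>>(dev_a, dev_b, dev_c);
cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

// A second launch on the same dev_c: zero it again first, otherwise
// the new result is added onto the old one.
cudaMemset(dev_c, 0, sizeof(int));
dot<<<blocks, threads>>>(dev_a, dev_b, dev_c);

A 4-byte cudaMemset typically costs on the order of a kernel launch, so it is unlikely to be the dominant cost here.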

It’s not covered in the documentation (e.g. docs.nvidia.com) because the CUDA documentation primarily addresses the language, not specific algorithms or implementations.

I think if you study any of the reduction examples in the CUDA sample codes, you will find appropriate initializations, as needed.

It is not “occasionally.” Show me one dot-product example currently on the web where the summation variable is explicitly initialized to zero. This “oversight” isn’t restricted to NVIDIA employees.

I suspect that cudaMalloc is doing the initialization to zero while NVIDIA pretends that it is a coincidence. Prove that it is not a coincidence by showing me one example where cudaMalloc produces nonzero values.

Here’s an example:

https://stackoverflow.com/questions/32968071/cuda-dot-product

Here’s another example:

https://github.com/ugovaretto/cuda-training/blob/master/src/004_1_parallel-dot-product-atomics.cu

It also strikes me that if cudaMalloc were initializing data to zero, then your previously stated observation would not make sense:

After all, the exact same cudaMalloc operation is being invoked the first time you run the code, as well as the second, for dev_c

What’s different, of course, is the previous state of the memory. If the previous state of the memory matters, then cudaMalloc is not modifying the memory state.
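
If you want to convince yourself, a small experiment along these lines can make the point concrete (a sketch only: the allocator is free to hand back any block, so the observed value is not guaranteed, but freshly allocated device memory frequently contains stale data):

// Sketch: cudaMalloc makes no guarantee about the contents of the
// memory it returns. Write a nonzero value, free, allocate again,
// and read it back.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int *d = nullptr;
    int h = 123;
    cudaMalloc(&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice); // write nonzero
    cudaFree(d);

    cudaMalloc(&d, sizeof(int)); // often returns the block just freed
    h = 0;
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost); // read it back
    printf("after fresh cudaMalloc: %d\n", h); // frequently 123, not 0
    cudaFree(d);
    return 0;
}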

  1. “After all, the exact same cudaMalloc operation is being invoked the first time you run the code, as well as the second, for dev_c”

I did not say or imply that cudaMalloc was invoked the second time I ran dot. In fact, I implied that I did not by saying “same dev_c.”

It took me a long time to edit this because the edit function didn’t work on my IE browser before I switched to Firefox.

Sorry, I now see the initializations in 1. and 2.

The edit function is not working. I am going to try switching browsers.

I believe it is, and these are the lines that do it:

*c = 0;
...
cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);
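
(Zeroing the host variable and copying it to the device is functionally equivalent to calling cudaMemset on dev_c; either way the accumulator holds zero before the kernel launches.)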

I believe it is, and this is the line that does it:

https://github.com/ugovaretto/cuda-training/blob/master/src/004_1_parallel-dot-product-atomics.cu#L128
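
The shape of that example is roughly the following (a simplified sketch, not the file’s exact code; the float accumulator and the names dev_a, dev_b, dev_c, and n here are assumptions):

// Simplified sketch of an atomics-based dot product.
__global__ void dot(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(c, a[i] * b[i]); // every thread accumulates into c[0]
}

// Host side: the referenced line corresponds to zeroing the accumulator
// before the launch, e.g.:
float zero = 0.0f;
cudaMemcpy(dev_c, &zero, sizeof(float), cudaMemcpyHostToDevice);
dot<<<(n + 255) / 256, 256>>>(dev_a, dev_b, dev_c, n);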

Yes, I was confused here. When you said “the second time I ran dot,” I was thinking you ran the program dot a second time. I now assume you meant that you called the dot function twice in the same code.

Not correct. I’m not sure why you think that. However, I can see that you’re not happy with my responses, so I’ll stop responding now. It’s OK if we disagree about things; you don’t have to take my word for anything.

My bad. You are correct on 1. and 2. I tried to correct myself before you saw the post, but the edit function didn’t work on my IE browser.