I have a piece of C# code that I've parallelized using CUDA. It performs matrix-vector multiplication and prints out the results. I then have a small program that compares the CPU results to the GPU results.
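For reference, here is a minimal sketch of the kind of check the comparison program might do. It assumes the CPU and GPU results are stored as `float` arrays (the function name `count_mismatches` and the tolerance values are hypothetical). It compares with a combined absolute/relative tolerance rather than exact equality, because GPU floating-point results can legitimately differ from CPU results in the last bits (for example, due to a different summation order or fused multiply-add):

```c
#include <math.h>   /* isnan */
#include <stddef.h>
#include <stdio.h>

/* Absolute value without linking libm. */
static float abs_f(float x) { return x < 0.0f ? -x : x; }

/* Hypothetical helper: counts entries where cpu[i] and gpu[i] differ by more
   than abs_tol + rel_tol * max(|cpu[i]|, |gpu[i]|). NaNs always count as
   mismatches. Prints the first few mismatches to help locate the problem. */
size_t count_mismatches(const float *cpu, const float *gpu, size_t n,
                        float abs_tol, float rel_tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        float a = cpu[i], b = gpu[i];
        float scale = abs_f(a) > abs_f(b) ? abs_f(a) : abs_f(b);
        if (isnan(a) || isnan(b) || abs_f(a - b) > abs_tol + rel_tol * scale) {
            if (bad < 5)
                printf("mismatch at index %zu: cpu=%g gpu=%g\n", i, a, b);
            ++bad;
        }
    }
    return bad;
}
```

If results only mismatch by tiny amounts, a tolerance check like this would pass them; a mismatch that persists well beyond any reasonable tolerance (or that depends on run order) points to a genuine bug, such as uninitialized memory or an out-of-bounds access.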
I have several generated matrices that I am testing this with. When I load the matrices in the order 0, 1, 2, 3, the last set of results does not match. However, if I load them in a different order (specifically 2, 1, 0, 3), all of the results match.
As far as I can tell, I am freeing all of the memory used by CUDA and C, but I wonder whether there is some function that can "reset" my GPU to make sure nothing is left hanging around. I should also mention that each run is completely separate from the others: I run the program, which loads a matrix, does its CUDA magic and then exits; then I run the program again with the next matrix, and so on.
Any insight would be greatly appreciated.