Thank you for the post. Before reading it, I tried yet another way to do the convolution.
Sadly, it produces quite a few errors. Could you please take a look?
https://paste.ofcode.org/cBcUkhpxHFHGDAufRPg3dB
The errors are:
Error MSB3721 The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -gencode=arch=compute_50,code="sm_50,compute_50" --use-local-env --cl-version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o x64\Release\kernel.cu.obj "D:\Licenta\CUDATest\CUDATest\kernel.cu"" exited with code 2. CUDATest C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\BuildCustomizations\CUDA 8.0.targets 689
Error no instance of overloaded function "GaussianBlur" matches the argument list CUDATest D:\Licenta\CUDATest\CUDATest\kernel.cu 305
Error (active) expected an expression CUDATest d:\Licenta\CUDATest\CUDATest\kernel.cu 305
Edit:
It was a typo. Now it builds, but I get no output and it is slower than before.
Even later edit:
It's working now, but it's slower. I tried to use shared memory and got this:
cudaDeviceSynchronize returned error code 77
The message is printed by this section of the host code:
// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching GaussianBlur!\n", cudaStatus);
    goto Error;
}
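A bare error number is hard to act on, so here is a minimal sketch of a check that prints the error's name and description instead. `checkCuda`, `dummyKernel` and `launchAndCheck` are hypothetical names used only for illustration; they are not in the pasted code. As far as I can tell from the CUDA 8.0 headers, code 77 is `cudaErrorIllegalAddress`, meaning the kernel read or wrote memory outside a valid allocation, but the printed name is what should be trusted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the original code): reports a CUDA error
// by name and description rather than by number. Returns true on success.
static bool checkCuda(cudaError_t status, const char* what)
{
    if (status == cudaSuccess) {
        return true;
    }
    fprintf(stderr, "%s failed: %s (%s, code %d)\n", what,
            cudaGetErrorString(status), cudaGetErrorName(status),
            static_cast<int>(status));
    return false;
}

// Placeholder kernel standing in for GaussianBlur.
__global__ void dummyKernel(int* out)
{
    if (out != nullptr) {
        *out = 42;
    }
}

// Checks the launch and the execution separately:
// cudaGetLastError reports a bad launch configuration, while
// cudaDeviceSynchronize reports faults raised while the kernel ran.
static bool launchAndCheck(int* d_out)
{
    dummyKernel<<<1, 1>>>(d_out);
    if (!checkCuda(cudaGetLastError(), "kernel launch")) {
        return false;
    }
    return checkCuda(cudaDeviceSynchronize(), "kernel execution");
}
```

Running the program under `cuda-memcheck` (shipped with the CUDA 8.0 toolkit) should additionally point at the kernel and address of the faulting access.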
This is the version with the new, "clean" convolution kernel, which fails with the shared memory error: https://paste.ofcode.org/QGpvRRUT6gYJdBD5hu8ysT
Sadly, this "clean" version is a lot slower than my "messy" first version. The new one takes around 2500-3000 ms; the old one takes around 650 ms.
Here is my attempt to add shared memory to the old one: https://paste.ofcode.org/TnsKqEixDD4htjfHG2Qa2z
The old version hits the same error code 77. Can someone help me fix this?
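For comparison, here is a minimal sketch of a shared memory convolution that keeps every index in range. It is not taken from any of the pastes and rests on several assumptions: a single-channel `float` image, a 5x5 filter (`RADIUS` 2), 16x16 thread blocks, and clamp-to-edge borders. All names (`blurShared`, `blurOnDevice`, `c_weights`, `TILE`, `RADIUS`) are hypothetical. Two usual sources of illegal-address faults in this pattern are halo loads that read outside the image and shared arrays indexed past their declared size; I have not confirmed that either applies to the code linked above.

```cuda
#include <cuda_runtime.h>

#define TILE   16                    // threads per block edge (assumed)
#define RADIUS 2                     // filter radius, 5x5 filter (assumed)
#define KSIZE  (2 * RADIUS + 1)
#define SHARED (TILE + 2 * RADIUS)   // tile edge including the halo

__constant__ float c_weights[KSIZE * KSIZE];

__device__ __forceinline__ int clampInt(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void blurShared(const float* in, float* out, int width, int height)
{
    __shared__ float tile[SHARED][SHARED];

    // Image coordinates of tile[0][0], which lies RADIUS pixels outside
    // the block's output area.
    const int x0 = blockIdx.x * TILE - RADIUS;
    const int y0 = blockIdx.y * TILE - RADIUS;

    // Cooperative load: each thread strides over the tile, halo included.
    // Coordinates are clamped, so no global read leaves the image.
    for (int ty = threadIdx.y; ty < SHARED; ty += TILE) {
        for (int tx = threadIdx.x; tx < SHARED; tx += TILE) {
            const int gx = clampInt(x0 + tx, 0, width - 1);
            const int gy = clampInt(y0 + ty, 0, height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    }
    // No thread has returned yet, so the whole block reaches the barrier.
    __syncthreads();

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) {
        return;  // threads past the image edge write nothing
    }

    float sum = 0.0f;
    for (int ky = 0; ky < KSIZE; ++ky) {
        for (int kx = 0; kx < KSIZE; ++kx) {
            // Largest index is (TILE - 1) + 2 * RADIUS = SHARED - 1.
            sum += c_weights[ky * KSIZE + kx]
                 * tile[threadIdx.y + ky][threadIdx.x + kx];
        }
    }
    out[y * width + x] = sum;
}

// Hypothetical host wrapper: returns the first CUDA error encountered.
cudaError_t blurOnDevice(const float* h_in, float* h_out,
                         const float* h_weights, int width, int height)
{
    float* d_in = nullptr;
    float* d_out = nullptr;
    const size_t bytes = static_cast<size_t>(width) * height * sizeof(float);

    cudaError_t st = cudaMalloc(&d_in, bytes);
    if (st == cudaSuccess) st = cudaMalloc(&d_out, bytes);
    if (st == cudaSuccess) st = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (st == cudaSuccess) st = cudaMemcpyToSymbol(c_weights, h_weights,
                                                   KSIZE * KSIZE * sizeof(float));
    if (st == cudaSuccess) {
        const dim3 block(TILE, TILE);
        const dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
        blurShared<<<grid, block>>>(d_in, d_out, width, height);
        st = cudaGetLastError();
    }
    // This blocking copy also surfaces faults raised while the kernel ran.
    if (st == cudaSuccess) st = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return st;
}
```

The bounds check on `x` and `y` comes after `__syncthreads()` on purpose: returning earlier would leave part of the block out of the barrier when the image size is not a multiple of `TILE`. Whether shared memory actually beats the 650 ms version depends on the filter size and memory access pattern, so it is worth timing both.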