Hi,
I’m teaching a course on parallel computation using GPUs. This year we’re moving the course from Linux to Windows. Most of the programs ported cleanly, but one that worked fine on Linux gives errors on Windows once the data set gets reasonably large.
I’ve pruned the program down to a minimal example that illustrates the behavior. It runs fine for 512x512 and 1024x1024 matrices, but 2048x2048 matrices result in “unspecified launch failure” from cudaGetLastError() after launching the kernel.
We’ve already learned about:
- setting LARGEADDRESSAWARE
- running the GPU as a computing device only (not used for graphics), so Windows does not time us out
- setting Visual Studio to the x64 build
A few more notes:
- Visual Studio 2017, build 15.8.9
- NVIDIA Nsight Visual Studio Edition 6.0.0.18227
- GPU = Quadro K620
- This pruned-down example runs fine with a Release build; the error occurs only in the Debug/x64 build
Many of the lines of code will look quite irrelevant; remember that this example is heavily pruned, and removing those lines of code caused the issue to vanish.
Here’s the code:
#include <iostream>
#include <cstdlib>
using namespace std;
#include <cuda_runtime.h>

#define LOG(args) cout << args << endl
#define DIE(args) { cout << args << endl; exit(0); }
#define ERR_CHK(status, args) if (status != cudaSuccess) DIE (args << " (error code " << cudaGetErrorString(status) << ")")

const int BS = 32;

__global__ void mat_mult (float *dev_mem) {
    __shared__ float SA[BS][BS];
    __shared__ float SB[BS][BS];
    float sum = 0.0;
    for (int kB = 0; kB < 64; ++kB) {
        SA[0][0] = 0.0;
        SB[0][0] = 0.0;
        for (int kI = 0; kI < BS; ++kI)
            sum += (SA[0][0]) * (SB[0][0]);
    }
    dev_mem[0] = 0;
}

static void run (int N) {
    LOG (endl << "Working on " << N << "x" << N << " matrices.");

    int sizeBytes = 16;
    float *dev_mem = NULL;
    cudaError_t err = cudaMalloc((void **)&dev_mem, sizeBytes);
    ERR_CHK (err, "Failed to allocate device mem");

    int NBLK = N / BS;  // # of blocks in any one dimension
    dim3 grid (NBLK, NBLK), block (BS, BS);
    mat_mult <<<grid, block>>> (dev_mem);
    err = cudaGetLastError();
    ERR_CHK (err, "Failed to launch/complete the mat_mult() kernel");

    err = cudaDeviceSynchronize();
    ERR_CHK (err, "Failed to synchronize");

    err = cudaFree(dev_mem);
    ERR_CHK (err, "Failed to free device memory");
    LOG ("Success");
}

// Main() lives on the CPU.
int main() {
    run (512);
    run (1024);
    run (2048);
    run (4096);
    return (0);
}