Hello,
I created a simple CUDA C program to run on the GPU. It works great when run directly as an .exe; however, whenever I run it under nvprof, my system becomes unstable (the screen freezes every few seconds) and requires a reboot. Any insight would be greatly appreciated.
I have experimented and found the following:
1.) specifying 256 threads per block causes nvprof to NEVER return (see the error-check sketch after this list)
2.) specifying 32 threads works, but still causes the system instability
3.) the array size seems to have no effect
4.) memory/CPU usage on both cards stays very low while profiling
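As a follow-up to item 1, this is the minimal error checking I can add right after the launch in call_GPU() (in place of the bare cudaDeviceSynchronize() call), to see whether nvprof is hiding a launch failure. These are standard runtime API calls, not something already in my code:

add_gpu<<<blocks, threads_per_blocks>>>(a, b, c, SIZE);

// Catch launch-configuration errors (bad block/thread counts, etc.).
cudaError_t launch_err = cudaGetLastError();
if (launch_err != cudaSuccess) {
    printf("launch failed: %s\n", cudaGetErrorString(launch_err));
}

// Catch errors raised while the kernel was actually running.
cudaError_t sync_err = cudaDeviceSynchronize();
if (sync_err != cudaSuccess) {
    printf("kernel failed: %s\n", cudaGetErrorString(sync_err));
}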
I DO get the warning below when starting nvprof. Might this be the issue, and how do I correct it? I have identical cards, as shown in the device info further down:
==7596== NVPROF is profiling process 7596, command: .\whatup.exe
==7596== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation
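Since the warning mentions a pair of devices without peer-to-peer support, one thing I can check myself is what cudaDeviceCanAccessPeer reports for the two cards. A minimal standalone sketch (this is just my own diagnostic idea, not something from my program below):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    // Query whether each device can directly map the other's memory.
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("Device 0 -> 1 peer access: %d\n", can01);
    printf("Device 1 -> 0 peer access: %d\n", can10);
    return 0;
}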
System Specs:
Windows 10 Pro x64
i7-6800K
64 GB DDR4
2 x EVGA 980Ti (SLI)
NVIDIA Drivers:
Display Driver: 390.65
Geforce Experience: 3.12.0.84
3D Vision Controller Driver: 390.41
3D Vision: 390.65
CUDA v9.1
Here is the program output:
Number of blocks: 1
Number of threads per block: 32
value 0.000000
value 2.000000
value 4.000000
value 6.000000
value 8.000000
value 10.000000
value 12.000000
value 14.000000
value 16.000000
value 18.000000
COMPLETED****** ← GPU process completes
Here is the output from cudaGetDeviceProperties(…):
Device Count: 2
Device 0: GeForce GTX 980 Ti
Device 0, MaxThreadsPerBlock: 1024
Device 0, TotalGlobalMem: 6442450944
Device 0, SharedMemPerBlock: 49152
Device 0, Major: 5
Device 0, Minor: 2
Device 0, ClockRate: 1190000
Device 0, ECCEnabled: 0
Device 0, TccDriver: 0
Device 0, ComputeMode: 0
Device 1: GeForce GTX 980 Ti
Device 1, MaxThreadsPerBlock: 1024
Device 1, TotalGlobalMem: 6442450944
Device 1, SharedMemPerBlock: 49152
Device 1, Major: 5
Device 1, Minor: 2
Device 1, ClockRate: 1190000
Device 1, ECCEnabled: 0
Device 1, TccDriver: 0
Device 1, ComputeMode: 0
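For reference, this is roughly how I gather those properties (a minimal sketch of the query loop; the field names come from the standard cudaDeviceProp struct):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Device Count: %d\n", count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("Device %d, MaxThreadsPerBlock: %d\n", d, prop.maxThreadsPerBlock);
        printf("Device %d, TotalGlobalMem: %zu\n", d, prop.totalGlobalMem);
        printf("Device %d, SharedMemPerBlock: %zu\n", d, prop.sharedMemPerBlock);
        printf("Device %d, Major: %d\n", d, prop.major);
        printf("Device %d, Minor: %d\n", d, prop.minor);
        printf("Device %d, ClockRate: %d\n", d, prop.clockRate);
        printf("Device %d, ECCEnabled: %d\n", d, prop.ECCEnabled);
        printf("Device %d, TccDriver: %d\n", d, prop.tccDriver);
        printf("Device %d, ComputeMode: %d\n", d, prop.computeMode);
    }
    return 0;
}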
1.) SLI is enabled and working (at least according to the NVIDIA Control Panel and EVGA Precision); I describe a single-GPU isolation test after this list
2.) The code runs fine on the GPU; I just cannot profile it without it killing my machine and requiring a reboot (otherwise the screen freezes every few seconds)
3.) I have tried a 'clean' install of the NVIDIA drivers
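Regarding item 1: since the nvprof warning singles out the multi-GPU setup, a test I can try (my own idea; it may well change nothing) is to pin everything to one card before any allocations:

// At the top of call_GPU(), before the cudaMallocManaged() calls:
cudaSetDevice(0); // restrict allocations and the kernel launch to device 0

Alternatively, I understand the runtime honors the CUDA_VISIBLE_DEVICES environment variable, so setting CUDA_VISIBLE_DEVICES=0 before launching nvprof should hide the second card entirely.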
… and finally, here is the CUDA C code, which I compile via: nvcc myfile.cu -o thefinal.exe
#include <stdio.h>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements index, index+stride, ...
__global__
void add_gpu(float *a, float *b, float *c, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

void call_GPU() {
    int SIZE = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory, accessible from both host and device.
    cudaMallocManaged(&a, SIZE * sizeof(float)); // wow, if this file is .c, this complains it needs a 3rd param, and if one is used it crashes!
    cudaMallocManaged(&b, SIZE * sizeof(float));
    cudaMallocManaged(&c, SIZE * sizeof(float));
    // Initialize the inputs on the host.
    for (int i = 0; i < SIZE; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }
    int blocks = 1;
    int threads_per_blocks = 32;
    printf("Number of blocks: %d\n", blocks);
    printf("Number of threads per block: %d\n", threads_per_blocks);
    add_gpu<<<blocks, threads_per_blocks>>>(a, b, c, SIZE);
    cudaDeviceSynchronize(); // wait for the kernel before reading results
    for (int i = 0; i < 10; i++) {
        printf("value %f\n", c[i]);
    }
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
}

int main() {
    printf("Ok, running on GPU\n");
    call_GPU();
    printf("COMPLETED******\n");
    return 0;
}