Hi, I’m a beginner in CUDA and I have a few questions; I hope someone can help :)
(the topic title “Number of Blocks” refers to the last question)
I have a laptop with an Nvidia GeForce 310M, running Windows 7 x64.
Sometimes, when there is a bug in my program, running it produces a blue screen and the system restarts, and I don’t know why. From what I’ve observed, it happens most often when I use a large amount of memory. As an example, here is a program of mine that adds two vectors. For now, ignore kernel(); I launch kernel1() instead of it.
//Summing up two arrays in GPU.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio>
#include <cmath>
#define n 10000000 // length of arrays
int *dev_a,*dev_b,*dev_c; // pointers to device memory
int i,R=0,x,y;
int *a,*b,*c; // pointers to host memory
int d[n]; // array for checking the answer
__global__ void kernel(int *a,int *b,int *c)
{
int i = (n-1)/blockDim.x+1; // chunk size: elements per thread, rounded up
int j = i * (threadIdx.x+1); // one past the end of this thread's chunk
i *= threadIdx.x; // start of this thread's chunk
if(j>n) j=n; // clamp the last chunk to the array length
for(;i<j;i++)
c[i]=a[i]+b[i];
}
__global__ void kernel1(int *a,int *b,int *c)
{
int i = blockIdx.x * blockDim.x + threadIdx.x; // global index of this thread
if(i<n) // guard: the last block may run past the end of the array
c[i]=a[i]+b[i];
}
// simple linear-congruential-style generator; much faster than the standard rand(), so I used it :)
int Rand(){return (R=R*1000000007+872345641);}
int main()
{
// allocation of host memory. Pinned (page-locked) memory is faster to copy to the device.
cudaHostAlloc((void**)&a,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);
cudaHostAlloc((void**)&b,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);
cudaHostAlloc((void**)&c,n*sizeof(int),cudaHostAllocPortable);
// write some random values and calculate answer in advance.
for(i=0;i<n;i++)
{
x = Rand();
y = Rand();
a[i]=x;
b[i]=y;
d[i]=x+y;
}
// allocation of device memory
cudaMalloc((void**)&dev_a,n*sizeof(int));
cudaMalloc((void**)&dev_b,n*sizeof(int));
cudaMalloc((void**)&dev_c,n*sizeof(int));
// copy memory from host to device
cudaMemcpy(dev_a,a,n*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,n*sizeof(int),cudaMemcpyHostToDevice);
// launch kernel
// kernel<<<1,512>>>(dev_a,dev_b,dev_c);
kernel1<<<n/512+1,512>>>(dev_a,dev_b,dev_c);
// wait till kernel is done
cudaDeviceSynchronize();
// copy memory from device to host
cudaMemcpy(c,dev_c,n*sizeof(int),cudaMemcpyDeviceToHost);
// check for errors
for(i=0;i<n;i++)
if(c[i]!=d[i])
{
printf("Error i=%d\n",i);
return 0;
}
// free allocated memory on host and device
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
cudaFreeHost(a);
cudaFreeHost(b);
cudaFreeHost(c);
return 0;
}
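By the way, the program above ignores all return codes; here is a minimal error-checking sketch I could wrap around each call (the CHECK macro is my own, not part of the CUDA API; it also needs <cstdlib> for exit()):

```cuda
// Minimal error-checking sketch (CHECK is my own macro, not a CUDA API):
#define CHECK(call) do { \
    cudaError_t e = (call); \
    if (e != cudaSuccess) { \
        printf("CUDA error: %s (line %d)\n", cudaGetErrorString(e), __LINE__); \
        exit(1); \
    } \
} while(0)

// usage in main():
// CHECK(cudaMalloc((void**)&dev_a, n*sizeof(int)));
// kernel1<<<n/512+1, 512>>>(dev_a, dev_b, dev_c);
// CHECK(cudaGetLastError());      // reports launch-configuration errors
// CHECK(cudaDeviceSynchronize()); // reports errors raised during kernel execution
```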
I run it and it doesn’t have any problems; I also ran it in the Visual Profiler with no problems. However, when I changed
cudaHostAlloc((void**)&a,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);
cudaHostAlloc((void**)&b,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);
to this:
cudaHostAlloc((void**)&a,n*sizeof(int),cudaHostAllocPortable);
cudaHostAlloc((void**)&b,n*sizeof(int),cudaHostAllocPortable);
and ran it in the Visual Profiler, at first it ran fine, but on the 5th run (the profiler runs the same program 7 times) the screen went partially black; then parts of it came back along with a message saying the display driver had recovered successfully; then it went black again, and finally a blue screen appeared and the laptop restarted. If I run the program just once, there are no errors. Sometimes the blue screen even appears when I run the program from Visual Studio. Can anyone help me find a solution? I read somewhere that the GPU can overheat; is that the problem?
I also have a server with a GTX 580. In the past I ran a different program on it that was not correct, and the system (Windows Server 2008) simply stopped responding to anything I did: the mouse didn’t move and nothing on the screen changed no matter what I did, so I had to restart it. Can you tell me what the problem is, and how can I fix it?
Okay, now the last question.
I ran “Analysis Activity” from VS2010 on the server. When I run the program with kernel(), the kernel itself takes 0.103813 seconds; with kernel1() it takes 0.000693 seconds, about 150x faster! The only difference between them is that kernel() launches 1 block of 512 threads and each thread sums its own 1/512th portion of the whole array, while kernel1() launches a large number of blocks and each thread sums a single element. Can anyone explain why there is a difference, and why it is so large? I didn’t expect any difference at all. Do several blocks run on the GPU at the same time? I thought only one block runs, because there is one core. And if several blocks do run together, why is there a difference? Can you give me some advice on how to maximize global memory throughput?