Number of Blocks

Hi, I’m a beginner in CUDA and I have some questions; hoping for some help :)

(The topic name, “Number of Blocks”, refers to the last question.)

I have a laptop with an Nvidia GeForce 310M. The OS is Windows 7 x64.

Sometimes, when I don’t write a program correctly, running it produces a blue screen and the system restarts. I don’t know what the problem is. From what I’ve observed, it happens most often when I use a large amount of memory. For example, let’s take a look at my program, which adds two vectors element by element. For now, ignore kernel(); I run kernel1() instead.

// Summing up two arrays on the GPU.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio>
#include <cmath>

#define n 10000000              // length of the arrays

int *dev_a,*dev_b,*dev_c;       // pointers to device memory
int i,R=0,x,y;
int *a,*b,*c;                   // pointers to host memory
int d[n];                       // array for checking the answer

// one block only: each thread sums a contiguous chunk of roughly n/blockDim.x elements
__global__ void kernel(int *a,int *b,int *c)
{
	int i = (n-1)/blockDim.x+1;     // chunk size per thread (ceiling of n/blockDim.x)
	int j = i * (threadIdx.x+1);    // end of this thread's chunk
	i *= threadIdx.x;               // start of this thread's chunk
	if(j>n) j=n;
	for(;i<j;i++)
		c[i]=a[i]+b[i];
}

// many blocks: each thread sums exactly one element
__global__ void kernel1(int *a,int *b,int *c)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i<n)
		c[i]=a[i]+b[i];
}

// this function is much faster than the standard rand() function, so I used it :)
int Rand(){return (R=R*1000000007+872345641);}

int main()
{
	// allocate pinned host memory; it is faster to copy to the device.
	cudaHostAlloc((void**)&a,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);
	cudaHostAlloc((void**)&b,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);
	cudaHostAlloc((void**)&c,n*sizeof(int),cudaHostAllocPortable);

	// write some random values and calculate the answer in advance.
	for(i=0;i<n;i++)
	{
		x = Rand();
		y = Rand();
		a[i]=x;
		b[i]=y;
		d[i]=x+y;
	}

	// allocate device memory
	cudaMalloc((void**)&dev_a,n*sizeof(int));
	cudaMalloc((void**)&dev_b,n*sizeof(int));
	cudaMalloc((void**)&dev_c,n*sizeof(int));

	// copy memory from host to device
	cudaMemcpy(dev_a,a,n*sizeof(int),cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b,b,n*sizeof(int),cudaMemcpyHostToDevice);

	// launch the kernel
//	kernel<<<1,512>>>(dev_a,dev_b,dev_c);
	kernel1<<<n/512+1,512>>>(dev_a,dev_b,dev_c);

	// wait until the kernel is done
	cudaDeviceSynchronize();

	// copy the result from device to host
	cudaMemcpy(c,dev_c,n*sizeof(int),cudaMemcpyDeviceToHost);

	// check for errors
	for(i=0;i<n;i++)
		if(c[i]!=d[i])
		{
			printf("Error i=%d\n",i);
			return 0;
		}

	// free allocated memory on host and device
	cudaFree(dev_a);
	cudaFree(dev_b);
	cudaFree(dev_c);
	cudaFreeHost(a);
	cudaFreeHost(b);
	cudaFreeHost(c);
	return 0;
}

I run it and it doesn’t have any problems. I also ran it in the Visual Profiler; no problems there either. However, when I changed

	cudaHostAlloc((void**)&a,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);
	cudaHostAlloc((void**)&b,n*sizeof(int),cudaHostAllocPortable | cudaHostAllocWriteCombined);

to

	cudaHostAlloc((void**)&a,n*sizeof(int),cudaHostAllocPortable);
	cudaHostAlloc((void**)&b,n*sizeof(int),cudaHostAllocPortable);

and ran it in the Visual Profiler again, it worked fine at first, but on the 5th run (the profiler runs the same program 7 times) the screen went partially black, then partly came back with a message saying the driver had recovered successfully, then went black again, and finally a blue screen appeared and the laptop restarted. If I run the program just once, there are no errors. Sometimes the blue screen also appears when I run it from Visual Studio. Can anyone help me find a solution? I read somewhere that the GPU can become overheated; is that the problem?
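One thing I realize is that I never check any return codes. Something like the sketch below (CHECK is just a helper macro of my own, not part of CUDA) would at least show whether one of the allocations, copies, or the kernel launch actually fails before the driver resets:

// helper macro (my own), needs <cstdlib> for exit()
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
	printf("CUDA error: %s (line %d)\n", cudaGetErrorString(e), __LINE__); exit(1); } } while(0)

// usage, for example:
// CHECK(cudaMemcpy(dev_a,a,n*sizeof(int),cudaMemcpyHostToDevice));
// kernel1<<<n/512+1,512>>>(dev_a,dev_b,dev_c);
// CHECK(cudaGetLastError());        // reports launch configuration errors
// CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran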

I also have a server with a GTX 580. In the past I ran a different (incorrect) program on it, and the system (Windows Server 2008) simply stopped reacting to anything I did: the mouse didn’t move and nothing on the screen changed no matter what I pressed, so I restarted it. Can you tell me what the problem is there? How can I correct it?

Okay, now the last question.

I ran “Analysis Activity” from VS2010 on the server. When I run this program with kernel(), the kernel itself runs for 0.103813 seconds. When I run kernel1(), it runs for 0.000693 seconds. That’s about 150x faster! The only difference between them is that kernel() is launched with 1 block of 512 threads, and each thread takes a 1/512th portion of the whole array and sums it, while in kernel1() each thread of each block sums 1 element and I launch a large number of blocks. Can anyone explain why there is a difference, and why it is so large? I didn’t expect any difference at all. Do several blocks run on the GPU at the same time? I thought only 1 block runs, because there is one core. And if several blocks do run together, why is there a difference? Can you give me some advice on how to maximize global memory throughput?

Did anyone actually read it?

I found out that multiple blocks do run at the same time on the GPU, and the drastic speedup probably also happened because of memory coalescing. (Still, I would like to know in detail how blocks are scheduled on the GPU.) Now the only remaining problem is the BLUE SCREEN. Can anyone help?

Yes, multiple blocks, multiple threads per block. If you want an overview of how things happen, have a look at the CUDA programming guide. Honestly, I bought a book on CUDA but found the programming guide to be of more use to me. Behold:

http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/CUDA_C_Programming_Guide.pdf
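About your throughput question: as far as I understand it (I’m fairly new to this myself, so treat it as a sketch rather than gospel), kernel1() is fast because consecutive threads of a warp read consecutive elements of a and b, so the accesses coalesce into a few wide memory transactions, and the many blocks keep all of the multiprocessors busy. In kernel() every thread walks its own private chunk, so at any moment the threads of a warp touch addresses that are thousands of elements apart, and only one block (one multiprocessor) does any work. A grid-stride loop keeps the coalesced pattern of kernel1() while working for any number of blocks; roughly like this (kernel2 is just my name for it):

// Sketch: grid-stride loop. On each iteration consecutive threads still touch
// consecutive elements (coalesced), and one launch covers any n.
__global__ void kernel2(int *a, int *b, int *c)
{
	int stride = blockDim.x * gridDim.x;
	for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
		c[i] = a[i] + b[i];
}

// e.g. kernel2<<<64,512>>>(dev_a,dev_b,dev_c); // any grid size covers the whole array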

I’m working on something similar in spirit; one thing I can spot is this line:

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i<n)
		c[i]=a[i]+b[i];

You’re running 512 threads in each block, which is fine, but because you use:

kernel1<<<n/512+1,512>>>

You will overrun your buffer unless n is perfectly divisible by 512 (and in that case you don’t need the +1 at all). I’m a bit of a newbie to CUDA, but this looks like a programming error.

My solution was to check for overflow in my kernel, and only write into memory if we’re inside bounds:

	/* Prevent overflow. */
	if(pixelNum < im_width * im_height)
	{
		/* Safe to write into buffer, inside bounds. */
	}

There may be a more ‘correct’ way to do this, but it’s working for me, for now.

Thanks for your reply :)

Actually, I do check for overflow with “if(i<n)”, so the program won’t try to write into unallocated memory :) However, the kernel invocation is not quite right: the number of blocks should be (n-1)/512+1. That avoids launching a whole extra block when n happens to be divisible by 512. I knew this from the start, but since this was a test program and I knew n wasn’t divisible by 512 in my case, I got lazy and didn’t correct it :D
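For completeness, this is the launch I should have written, using ceiling division (the variable names are just mine, for readability):

	int threadsPerBlock = 512;
	int numBlocks = (n - 1) / threadsPerBlock + 1;   // ceil(n/512): no spare block when n is a multiple of 512
	kernel1<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);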

But right now I’m more concerned about the other problem, the blue screen. Do you have any idea why it happens? :(