GPU execution question

Hello, I am trying to get acquainted with CUDA before changing any of my existing code to use GPU acceleration. I started by buying a mediocre card (GTX 550 Ti) and writing a simple program that I thought would load the GPU, let me watch its temperature over time, and let me compare its performance to a CPU-only program. The program defines a 3D image and has the GPU do some meaningless calculations on each pixel of the image. I print the value of one pixel before and after sending the image to the GPU. I also added a loop on both the CPU and the GPU side to ensure a high degree of loading. What I’ve noticed is that over time the device code seems to stop executing. For example, the value of a pixel will be calculated correctly for the first 200 or so iterations of the CPU loop, but after that it no longer changes from its initialization value.

Also, I have noticed that my computer freezes during device execution and unfreezes during CPU execution. For example, if I type during device execution, the letters do not appear until the current device code terminates.

Just wondering if anyone can fill me in on what’s going on. Code below, thanks.


#include <iostream>

#include <cuda_runtime.h>

using namespace std;

__global__ void calcs(float *image,int N_2D){

	int bid=blockIdx.x;

	int tid=threadIdx.x;

	int iter_size=N_2D/blockDim.x;

	//Loop just to load the processor

	for(unsigned int j=0;j<1000;++j){

		//Loop over this thread's pixels/elements

		for(unsigned int i_dev=0;i_dev<iter_size;++i_dev){

			//Just calculate some stuff and set the current pixel/element equal to it

		}

	}

}

int main(){

	//X and Y dimensions

	const int n_dim=512;

	//Z dimension

	const int z_dim=190;

	//Pixels in one z element

	int N_2D=n_dim*n_dim;

	//3D image and device container

	float * image=new float[n_dim*n_dim*z_dim];

	float * image1;

	//Get device properties so we know max number of threads

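	cudaDeviceProp prop;

	//Device 0 is assumed here

	cudaGetDeviceProperties(&prop,0);

	//Loop just to make sure the program runs for a while

	for(unsigned int i=0;i<1000;++i){

		cout<<"iteration :"<<i<<endl;

		//Initialize image array to i

		for(unsigned int z=0;z<z_dim;++z){

			for(unsigned int y=0;y<n_dim;++y){

				for(unsigned int x=0;x<n_dim;++x){

					//x varies fastest; the exact layout is an assumption

					image[z*N_2D+y*n_dim+x]=i;

				}

			}

		}

		//Allocate memory on the device for image

		cudaMalloc((void**)&image1,n_dim*n_dim*z_dim*sizeof(float));

		//Copy image to the device

		cudaMemcpy(image1,image,n_dim*n_dim*z_dim*sizeof(float),cudaMemcpyHostToDevice);

		//Execute device function with number of blocks = z_dim and maximum number of threads

		calcs<<<z_dim,prop.maxThreadsPerBlock>>>(image1,N_2D);

		//Copy result back to host

		cudaMemcpy(image,image1,n_dim*n_dim*z_dim*sizeof(float),cudaMemcpyDeviceToHost);

		//Free device memory

		cudaFree(image1);
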

		//Output results to ensure consistency

		cout<<"result: "<<image[3]<<endl;

	}

	delete[] image;

	return 0;

}
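
For the CPU-only comparison mentioned at the top, here is a rough sketch of what a reference loop could look like in plain C++. It is not the program above. It assumes the partitioning that the kernel's variables suggest: one block per z slice, with each thread owning a contiguous run of N_2D/blockDim.x pixels. The names calcs_cpu and touch_pixel, and the arithmetic inside touch_pixel, are made up for illustration, since the actual per-pixel calculation is not shown in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-pixel arithmetic; the real
// calculation is not shown in the post.
inline float touch_pixel(float v) {
    return v * 0.5f + 1.0f;
}

// CPU-only reference that mirrors the assumed GPU decomposition:
// one "block" per z slice, and each "thread" owning a contiguous run
// of N_2D / threads_per_block pixels inside that slice.
// If N_2D is not a multiple of threads_per_block, the leftover pixels
// at the end of each slice are skipped, as integer division implies.
void calcs_cpu(std::vector<float>& image, int n_dim, int z_dim,
               int threads_per_block, int repeats) {
    assert(threads_per_block > 0);
    const int N_2D = n_dim * n_dim;
    assert(image.size() >= static_cast<std::size_t>(N_2D) * z_dim);
    const int iter_size = N_2D / threads_per_block;

    for (int bid = 0; bid < z_dim; ++bid) {
        for (int tid = 0; tid < threads_per_block; ++tid) {
            // First pixel owned by this (block, thread) pair.
            const std::size_t base =
                static_cast<std::size_t>(bid) * N_2D +
                static_cast<std::size_t>(tid) * iter_size;
            // Outer loop only exists to load the processor.
            for (int j = 0; j < repeats; ++j) {
                for (int i = 0; i < iter_size; ++i) {
                    image[base + i] = touch_pixel(image[base + i]);
                }
            }
        }
    }
}
```

Calling calcs_cpu(image, 512, 190, 1024, 1000) on a std::vector<float> with 512*512*190 elements, and timing the call with std::chrono::steady_clock, would give a CPU baseline to set against the time spent in the kernel. The 1024 here is only a guess at what the device reports as its maximum threads per block.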
