Cuda makes my pc crazy

kr1_karin · September 14, 2010, 9:28am

I have a little problem with CUDA last few days.
I have written a very simple program that applies a median filter on a 1D array. It works fine with a float array of little size. Today, I am trying to test it on a very larger array of size 1 440 000. I have set a macro that detects cuda_error after each call to a cuda function.
All is working fine until I reach the cudaMemcpy DeviceToHost which freezes my pc. If I comment out this line of code, this is the cudaFree that freezes my pc…

I don’t know what is happening, has somebody met the same problem ?

jan.heckman · September 14, 2010, 12:08pm

That’s not a lot of information. On the face of it, I can only think that the presumably needed first copy from host to device does something undesirable, maybe out of bounds? Perhaps comment out the first copy and see what happens then to fix the real culprit.

jan.heckman · September 14, 2010, 12:08pm

That’s not a lot of information. On the face of it, I can only think that the presumably needed first copy from host to device does something undesirable, maybe out of bounds? Perhaps comment out the first copy and see what happens then to fix the real culprit.

kr1_karin · September 14, 2010, 2:54pm

I also think this is my mistake, I allocate a float array of size 1 440 000 * sizeof(float) and I bind it to a 1D texture with cudaBindTexture(). Am I out of bound for the texture ? And if so, why cudaBindTexture() doesn’t return an error ?

kr1_karin · September 14, 2010, 2:54pm

I also think this is my mistake, I allocate a float array of size 1 440 000 * sizeof(float) and I bind it to a 1D texture with cudaBindTexture(). Am I out of bound for the texture ? And if so, why cudaBindTexture() doesn’t return an error ?

kr1_karin · September 15, 2010, 9:53am

Here is the code, if someone could help, I’ll be very very grateful.

Nx = 1 440 000

float *device_data_in;

	float *device_data_out;

	size_t size = Nx * sizeof(float);

	cudaMalloc((void**)&device_data_in,  size);

	CUDA_ERRCK

	cudaMalloc((void**)&device_data_out, size);

	CUDA_ERRCK

	cudaMemcpy(device_data_in,  data_in,  size, cudaMemcpyHostToDevice);

	CUDA_ERRCK

		cudaMemcpy(device_data_out, data_out, size, cudaMemcpyHostToDevice);

	CUDA_ERRCK

	cudaMemset(device_data_out, 0.0f, size);

	CUDA_ERRCK

	

	inputTex.addressMode[0] = cudaAddressModeClamp;

	//cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

		cudaBindTexture(0, inputTex, device_data_in, size);

	CUDA_ERRCK

	// Invoke kernel

	dim3 threadsPerBlock(thread, 0, 0);

	dim3 blocksPerGrid(DIVUP(Nx, thread), 0, 0);

	size_t sharedmemsize = thread * sizeof(float);

	Kernel<<<DIVUP(Nx, thread), thread, sharedmemsize>>>(device_data_out, Nx, order, isodd, start, stop);

	CUDA_ERRCK

	cudaMemcpy(data_out, device_data_out, size, cudaMemcpyDeviceToHost);

	CUDA_ERRCK

	cudaFree(device_data_in);

	CUDA_ERRCK

	cudaFree(device_data_out);

	CUDA_ERRCK

kr1_karin · September 15, 2010, 9:53am

Here is the code, if someone could help, I’ll be very very grateful.

Nx = 1 440 000

float *device_data_in;

	float *device_data_out;

	size_t size = Nx * sizeof(float);

	cudaMalloc((void**)&device_data_in,  size);

	CUDA_ERRCK

	cudaMalloc((void**)&device_data_out, size);

	CUDA_ERRCK

	cudaMemcpy(device_data_in,  data_in,  size, cudaMemcpyHostToDevice);

	CUDA_ERRCK

		cudaMemcpy(device_data_out, data_out, size, cudaMemcpyHostToDevice);

	CUDA_ERRCK

	cudaMemset(device_data_out, 0.0f, size);

	CUDA_ERRCK

	

	inputTex.addressMode[0] = cudaAddressModeClamp;

	//cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

		cudaBindTexture(0, inputTex, device_data_in, size);

	CUDA_ERRCK

	// Invoke kernel

	dim3 threadsPerBlock(thread, 0, 0);

	dim3 blocksPerGrid(DIVUP(Nx, thread), 0, 0);

	size_t sharedmemsize = thread * sizeof(float);

	Kernel<<<DIVUP(Nx, thread), thread, sharedmemsize>>>(device_data_out, Nx, order, isodd, start, stop);

	CUDA_ERRCK

	cudaMemcpy(data_out, device_data_out, size, cudaMemcpyDeviceToHost);

	CUDA_ERRCK

	cudaFree(device_data_in);

	CUDA_ERRCK

	cudaFree(device_data_out);

	CUDA_ERRCK

avidday · September 15, 2010, 10:28am

Deleted.

avidday · September 15, 2010, 10:28am

Deleted.

kr1_karin · September 16, 2010, 12:06pm

I have found my problem, it’s my fault, I was not writing in the correct chunk of memory.

I have now an another problem. My program works fine when I use this kernel :

(Grid configuration : 32 threads/block, num_bloc = N / thread ->one thread processes one array value)

const int x = blockIdx.x * blockDim.x + threadIdx.x;

const int idx = threadIdx.x;

// Compute & put the float value in shared memory //

__syncthreads();

d_out[x] = median[idx];

I’d like to have better coalesced writes in my global memory. So, I tried :

const int x = blockIdx.x * blockDim.x + threadIdx.x;

const int idx = threadIdx.x;

// Compute & put the float value in an array called "median" in shared memory //

__syncthreads();

int index = blockDim.x / 4;

if (idx >= index) return;

d_out += blockIdx.x * blockDim.x;

((float4 *)d_out)[index] = median_float4[index];

But it gives to me bad results…

kr1_karin · September 16, 2010, 12:06pm

I have found my problem, it’s my fault, I was not writing in the correct chunk of memory.

I have now an another problem. My program works fine when I use this kernel :

(Grid configuration : 32 threads/block, num_bloc = N / thread ->one thread processes one array value)

const int x = blockIdx.x * blockDim.x + threadIdx.x;

const int idx = threadIdx.x;

// Compute & put the float value in shared memory //

__syncthreads();

d_out[x] = median[idx];

I’d like to have better coalesced writes in my global memory. So, I tried :

const int x = blockIdx.x * blockDim.x + threadIdx.x;

const int idx = threadIdx.x;

// Compute & put the float value in an array called "median" in shared memory //

__syncthreads();

int index = blockDim.x / 4;

if (idx >= index) return;

d_out += blockIdx.x * blockDim.x;

((float4 *)d_out)[index] = median_float4[index];

But it gives to me bad results…

Topic		Replies	Views
Urgent help with threads please! CUDA Programming and Performance	21	10783	March 6, 2008
Using unified memory causes system crash CUDA Programming and Performance	28	5779	February 4, 2019
Limitations of a CUDA kernel reached? CUDA Programming and Performance	3	4323	March 7, 2011
CUDA kernels keep on crashing CUDA Programming and Performance	6	3642	October 27, 2008
Cant modify data on the GPU CUDA Programming and Performance	16	10239	December 20, 2008
CudaAPI calls in functions, compiler/linking bug? CUDA Programming and Performance	6	360	August 17, 2023
Can a Kernel be too big?? CUDA_ERROR_NO_BINARY_FOR_GPU error 209 CUDA Programming and Performance	11	2974	November 13, 2017
Cuda code performance CUDA Programming and Performance	14	3089	December 16, 2014
[Beginner] Memory is reseted in the kernel CUDA Programming and Performance	5	1318	October 29, 2010
An Easy Introduction to CUDA C and C++ Technical Blog	48	1105	July 19, 2018

Cuda makes my pc crazy

Related topics