CUDA 2.1 Beta Problem/Bugs (Linux)

OS: Fedora 10, using KDE 4.1.3 (latest stable for Fedora) with KWin's compositing effects enabled.

NVIDIA driver: NVIDIA-Linux-x86-180.06-pkg1.run (2.1 Beta enabled)

CUDA Toolkit: 2.1 Beta

CUDA SDK: 2.1 Beta

GNU compiler: gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)

CPU: Core 2 Duo 1.67 GHz

RAM: 3 GB DDR-2

GPU: NVIDIA GeForce 8400M GS

(HP Pavilion dv6775us)

Problems:

1.) No error is given when the kernel is launched with invalid launch parameters:

Code snippet:

CUDA_SAFE_CALL( cudaThreadSynchronize() );

	dim3 dimBlock (4096);
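	// (4096 threads per block exceeds the 512-thread-per-block limit of this GPU; the launch configuration is set wrong on purpose, see below)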

	dim3 dimGrid (ROWS/TBLOCK); //1 thread block for every BLOCK_SIZE^2 columns of A

	PRINT_N (dimBlock.x);

	PRINT_N (dimGrid.x);

	MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)

	CUT_CHECK_ERROR("MatTest() execution failed\n");

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

Output (complete):

Initializing data...

...allocating CPU memory.

Matrix is 4096x4096

Vector is 4096x1

Using device 0: GeForce 8400M GS

Exec time only on CPU: 26.816000 (ms)

...allocating GPU memory.

...copying input data to GPU mem.

Data init done.

Executing GPU kernel...

"Using Shared Memory..."

---4096 ---

---16 ---

Reading back GPU result...

Transfer + Exec + Readback time on GPU with CUDA: 35.998001 (ms)

Execution time on GPU with CUDA: 0.076999 (ms)

Transfer to GPU with CUDA: 35.476002 (ms)

Transfer from GPU with CUDA: 0.445000 (ms)

CPU results (C/C++):

C_CPU.x= 2.000000 C_CPU.y= 1.000000 C_CPU.z= 1.000000 C_CPU.w= 1.000000

GPU results (CUDA):

C_GPU.x= 4.000000 C_GPU.y= 2.000000 C_GPU.z= 2.000000 C_GPU.w= 2.000000

Index: 0

a[0]: 2.000000 , b[0]: 4.000000

h_C_CPU != h_C_GPU ... :(.

Shutting down...

(disregard the numerical results for now; they are set wrong on purpose)

On Windows (CUDA 2.0) this causes an error to be reported (even in the “Release” configuration, so both platforms are being compared on the same footing).
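
For reference, here is a minimal sketch of checking the launch directly through the runtime API instead of the cutil macros (cudaGetLastError(), cudaGetErrorString() and cudaThreadSynchronize() are plain runtime calls; the kernel and pointers are the ones from the snippet above):

	MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B);

	// an invalid launch configuration (e.g. too many threads per block)
	// is reported by cudaGetLastError() right after the launch...
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess)
		printf("Kernel launch failed: %s\n", cudaGetErrorString(err));

	// ...while errors raised during execution show up when synchronizing
	err = cudaThreadSynchronize();
	if (err != cudaSuccess)
		printf("Kernel execution failed: %s\n", cudaGetErrorString(err));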

2.) Device memory is not correctly initialized/set/freed…

Let’s say I run my Matrix * Vector operation using a working code path in my application (C-preprocessor #if/#endif blocks select the code paths), and thus I have the output set (whether with correct results or not)… the application then shuts down, freeing the device memory too…

CUDA_SAFE_CALL( cudaFree(d_C) );

CUDA_SAFE_CALL( cudaFree(d_B)   );

CUDA_SAFE_CALL( cudaFree(d_A)   );

free(h_C_GPU);

If I execute the application again without even calling the kernel:

#if SHARED_MEM == 1

	printf ("\n\n\"Using Shared Memory...\"\n\n");

#endif

#if SHARED_MEM == 0

	printf ("\n\n\"Not using Shared Memory...\"\n\n");

#endif

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	dim3 dimBlock (4096);

	dim3 dimGrid (ROWS/TBLOCK); //1 thread block for every BLOCK_SIZE^2 columns of A

	PRINT_N (dimBlock.x);

	PRINT_N (dimGrid.x);

	//MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)

	CUT_CHECK_ERROR("MatTest() execution failed\n");

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	//fromGPU

	start_timer(&timer_toRAM);

	printf("Reading back GPU result...\n\n");

	CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

	stop_timer(timer_toRAM, &t_toRAM_ms);

	//data transferred

	stop_timer(timer1, &timer1_ms);//Timer stopped

but still allocating and initializing the data (all of the following is of course run before the code block posted a few lines above):

void init_test1_data_CUDA (float** h_C_GPU,

		float * &d_A, float * &d_B,  float * &d_C)

{

	*h_C_GPU = (float *)calloc(N_EL, sizeof(float));

	for(int i = 0; i < ROWS; i++){

		(*h_C_GPU)[i] = 0.0f;

	}

	printf("...allocating GPU memory.\n");

	CUDA_SAFE_CALL( cudaMalloc((void **)&d_A, DATA_SZ)   ); //input matrix

	CUDA_SAFE_CALL( cudaMalloc((void **)&d_B, DATA_V)   ); //input vector

	CUDA_SAFE_CALL( cudaMalloc((void **)&d_C, DATA_V)   ); //result vector
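
	// NOTE: the three cudaMemset calls below are the mistaken versions corrected further down
	// (they pass the address of the host pointer variable, and a size in elements instead of bytes)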

	CUDA_SAFE_CALL(cudaMemset((void **)&d_A, 0, ROWS*COLS));

	CUDA_SAFE_CALL(cudaMemset((void **)&d_B, 0, ROWS));

	CUDA_SAFE_CALL(cudaMemset((void **)&d_C, 0, ROWS));

	return;

}

And then I retrieve the output like so:

CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

The h_C_GPU array contains the same values as after the previous kernel invocation, as if the device memory had never been freed, re-allocated, and memset to 0 in the pass in which the application was run with the kernel invocation commented out… but NO ERROR is thrown (and CUDA_SAFE_CALL, going by cutil.h, should catch any error returned by cudaMalloc or cudaMemset…).

Attaching the whole project in a zip file… to be unpacked into …/NVIDIA_CUDA_SDK/projects/
MatrixTest.zip (13 KB)

One mistake: re-running this on Windows showed this again (I thought I had taken it out already, but I guess I forgot…)… I’ll also re-upload the .zip file…

CUDA_SAFE_CALL(cudaMemset((void **)&d_A, 0, ROWS*COLS));

CUDA_SAFE_CALL(cudaMemset((void **)&d_B, 0, ROWS));

CUDA_SAFE_CALL(cudaMemset((void **)&d_C, 0, ROWS));

which should of course have been:

CUDA_SAFE_CALL(cudaMemset((void *)d_A, 0, DATA_SZ));

CUDA_SAFE_CALL(cudaMemset((void *)d_B, 0, DATA_V));

CUDA_SAFE_CALL(cudaMemset((void *)d_C, 0, DATA_V));

where, for example, DATA_V = N_EL * sizeof(float). (The wrong version passes the address of the host-side pointer variable instead of the device pointer, and a size in elements instead of bytes.)

(I’ll boot back into Linux and see if that changes anything… but again, the CUDA runtime does not give me any error there either…)

//MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)

	CUDA_SAFE_CALL(cudaMemset((void *)d_C, 5, DATA_V));

	CUT_CHECK_ERROR("MatTest() execution failed\n");

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	//fromGPU

	start_timer(&timer_toRAM);

	printf("Reading back GPU result...\n\n");

	CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

How come the h_C_GPU array is filled with 0’s? The cudaMemset (still running on Windows at the moment) produces no error and should fill the d_C array on the device with 5’s, but as you can see the output is all 0’s…

Initializing data...

...allocating CPU memory.

Matrix is 4096x4096

Vector is 4096x1

Exec time only on CPU: 103.758018 (ms)

Using device 0: GeForce 8400M GS

...allocating GPU memory.

...copying input data to GPU mem.

Data init done.

"Using Shared Memory..."

---256 ---

---16 ---

Reading back GPU result...

Transfer + Exec + Readback time on GPU with CUDA: 38.185299 (ms)

Execution time on GPU with CUDA: 0.375187 (ms)

Transfer to GPU with CUDA: 37.595490 (ms)

Transfer from GPU with CUDA: 0.214622 (ms)

CPU results (C/C++):

C_CPU.x= 2.000000 C_CPU.y= 1.000000 C_CPU.z= 1.000000 C_CPU.w= 1.000000

GPU results (CUDA):

C_GPU.x= 0.000000 C_GPU.y= 0.000000 C_GPU.z= 0.000000 C_GPU.w= 0.000000

Index: 0

a[0]: 2.000000 , b[0]: 0.000000

h_C_CPU != h_C_GPU ... :(.

Shutting down...

Press ENTER to exit...

The output looks the way it does because the data is handled like this after it is copied back…

CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

	stop_timer(timer_toRAM, &t_toRAM_ms);

	//data transferred

	stop_timer(timer1, &timer1_ms);//Timer stopped

	float exec_time_CUDA = (timer1_ms - t_toGDDR_ms - t_toRAM_ms);

	////

	printf ("\n\nTransfer + Exec + Readback time on GPU with CUDA: %f (ms)\n", timer1_ms);

	printf ("\nExecution time on GPU with CUDA: %f (ms)\n", exec_time_CUDA);

	printf ("\nTransfer to GPU with CUDA: %f (ms)\n", t_toGDDR_ms);

	printf ("\nTransfer from GPU with CUDA: %f (ms)\n", t_toRAM_ms);

	//Y + x/100 * Y = Z ...  x = ((Z-Y)/Y) * 100

	printf ("Risultati CPU (C/C++):\n");

	printf ("\nC_CPU.x= %f C_CPU.y= %f C_CPU.z= %f C_CPU.w= %f\n", h_C_CPU[0], h_C_CPU[1],

		h_C_CPU[2], h_C_CPU[3]);

	printf ("Risultati GPU (CUDA):\n");

	printf ("\nC_GPU.x= %f C_GPU.y= %f C_GPU.z= %f C_GPU.w= %f\n", h_C_GPU[0], h_C_GPU[1],

		h_C_GPU[2], h_C_GPU[3]);

	if (!vectorEQ(h_C_CPU, h_C_GPU, COLS)) printf("\nh_C_CPU != h_C_GPU ... :(.\n");

	else printf("\nh_C_CPU == h_C_GPU... :).\n");

	printf("\nShutting down...\n");

	CUDA_SAFE_CALL( cudaFree(d_C) );

	CUDA_SAFE_CALL( cudaFree(d_B)   );

	CUDA_SAFE_CALL( cudaFree(d_A)   );

	free(h_C_GPU);

#endif
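
As a sanity check on that timing arithmetic, using the Windows run above: exec_time_CUDA = 38.185299 - 37.595490 - 0.214622 = 0.375187 ms, which is exactly the "Execution time on GPU with CUDA" that gets printed.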

h_C_CPU = where the result calculated by the C/C++ code is stored.

h_C_GPU = where the result calculated by the CUDA or CUBLAS code (depending on the codepath) is stored.

matA = source matrix in host memory

vecB = source vector in host memory

d_A = source matrix in device memory (matA → d_A)

d_B = source vector in device memory

d_C (or d_C1 in the CUBLAS code) = destination vector in device memory.

OK, about the cudaMemset part… sorry about that…

http://forums.nvidia.com/index.php?showtopic=72209&hl=cudaMemset

This pretty much explains it (why it works with 0, but not with an int like 5, when setting an array of floats): cudaMemset sets each byte of the range, not each element…
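
A minimal sketch of what that means in this project (d_C, DATA_V and N_EL as defined above; the host staging buffer h_fill is just for illustration):

	// cudaMemset writes a byte value, so this fills every byte with 0x05;
	// each 4-byte float then holds the bit pattern 0x05050505 (~6e-36),
	// which prints as 0.000000 with %f
	CUDA_SAFE_CALL( cudaMemset((void *)d_C, 5, DATA_V) );

	// to actually put 5.0f in every element, stage the values on the host and copy them over
	float *h_fill = (float *)malloc(DATA_V);
	for (int i = 0; i < N_EL; i++)
		h_fill[i] = 5.0f;
	CUDA_SAFE_CALL( cudaMemcpy(d_C, h_fill, DATA_V, cudaMemcpyHostToDevice) );
	free(h_fill);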

Update on code performance status…

http://forums.nvidia.com/index.php?s=&…st&p=487032