OS: Fedora 10, running KDE 4.1.3 (latest stable for Fedora) with KWin's compositing effects enabled.
NVIDIA driver: NVIDIA-Linux-x86-180.06-pkg1.run (CUDA 2.1 Beta support)
CUDA Toolkit: 2.1 Beta
CUDA SDK: 2.1 Beta
GNU compiler: gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)
CPU: Core 2 Duo 1.67 GHz
RAM: 3 GB DDR-2
GPU: NVIDIA GeForce 8400M GS
(HP Pavilion dv6775us)
Problems:
1.) No error is reported when the kernel is launched with invalid parameters:
Code snippet:
CUDA_SAFE_CALL( cudaThreadSynchronize() );
dim3 dimBlock (4096);
dim3 dimGrid (ROWS/TBLOCK); // one thread block per BLOCK_SIZE^2 columns of A
PRINT_N (dimBlock.x);
PRINT_N (dimGrid.x);
MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)
CUT_CHECK_ERROR("MatTest() execution failed\n");
CUDA_SAFE_CALL( cudaThreadSynchronize() );
Output (complete):
Initializing data...
...allocating CPU memory.
Matrix is 4096x4096
Vector is 4096x1
Using device 0: GeForce 8400M GS
Exec time only on CPU: 26.816000 (ms)
...allocating GPU memory.
...copying input data to GPU mem.
Data init done.
Executing GPU kernel...
"Using Shared Memory..."
---4096 ---
---16 ---
Reading back GPU result...
Transfer + Exec + Readback time on GPU with CUDA: 35.998001 (ms)
Execution time on GPU with CUDA: 0.076999 (ms)
Transfer to GPU with CUDA: 35.476002 (ms)
Transfer from GPU with CUDA: 0.445000 (ms)
CPU results (C/C++):
C_CPU.x= 2.000000 C_CPU.y= 1.000000 C_CPU.z= 1.000000 C_CPU.w= 1.000000
GPU results (CUDA):
C_GPU.x= 4.000000 C_GPU.y= 2.000000 C_GPU.z= 2.000000 C_GPU.w= 2.000000
Index: 0
a[0]: 2.000000 , b[0]: 4.000000
h_C_CPU != h_C_GPU ... :(.
Shutting down...
(Disregard the numerical results for now; they are set wrong on purpose.)
On Windows (CUDA 2.0) the same launch causes an error to be reported (even in the “Release” configuration, so both platforms are being compared on the same footing).
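For reference, here is the kind of bare-bones check I would expect to flag the bad launch configuration (a minimal, self-contained sketch, not my actual code: the MatTest body is a stub, and the 4096-thread block is deliberately over the 512-threads-per-block limit of this GPU):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void MatTest(float *C, const float *A, const float *B)
{
    // stub body: only the launch configuration matters here
    if (blockIdx.x == 0 && threadIdx.x == 0)
        C[0] = A[0] + B[0];
}

int main(void)
{
    float *d_A = 0, *d_B = 0, *d_C = 0;
    cudaMalloc((void **)&d_A, 4096 * sizeof(float));
    cudaMalloc((void **)&d_B, 4096 * sizeof(float));
    cudaMalloc((void **)&d_C, 4096 * sizeof(float));

    dim3 dimBlock (4096);   // deliberately larger than the 512-thread block limit
    dim3 dimGrid (16);

    MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B);

    // explicit checks, independent of the cutil macros
    cudaError_t err = cudaGetLastError();
    printf("after launch:      %s\n", cudaGetErrorString(err));
    err = cudaThreadSynchronize();
    printf("after synchronize: %s\n", cudaGetErrorString(err));

    cudaFree(d_C); cudaFree(d_B); cudaFree(d_A);
    return 0;
}

I would expect cudaGetLastError() right after the launch to return an invalid-configuration error here.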
2.) Device memory is not correctly initialized/set/freed…
Let’s say I run my Matrix * Vector operation through a working code path in my application (C-preprocessor #if/#endif blocks select the code path), so the output vector gets filled (whether with correct results or not). The application then shuts down, freeing the device memory as well:
CUDA_SAFE_CALL( cudaFree(d_C) );
CUDA_SAFE_CALL( cudaFree(d_B) );
CUDA_SAFE_CALL( cudaFree(d_A) );
free(h_C_GPU);
If I execute the application again without even calling the kernel:
#if SHARED_MEM == 1
printf ("\n\n\"Using Shared Memory...\"\n\n");
#endif
#if SHARED_MEM == 0
printf ("\n\n\"Not using Shared Memory...\"\n\n");
#endif
CUDA_SAFE_CALL( cudaThreadSynchronize() );
dim3 dimBlock (4096);
dim3 dimGrid (ROWS/TBLOCK); // one thread block per BLOCK_SIZE^2 columns of A
PRINT_N (dimBlock.x);
PRINT_N (dimGrid.x);
//MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)
CUT_CHECK_ERROR("MatTest() execution failed\n");
CUDA_SAFE_CALL( cudaThreadSynchronize() );
//fromGPU
start_timer(&timer_toRAM);
printf("Reading back GPU result...\n\n");
CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );
stop_timer(timer_toRAM, &t_toRAM_ms);
//data transferred
stop_timer(timer1, &timer1_ms);//Timer stopped
but still allocating and initializing the data (all of the following, of course, runs before the code block posted just above):
void init_test1_data_CUDA (float** h_C_GPU,
float * &d_A, float * &d_B, float * &d_C)
{
*h_C_GPU = (float *)calloc(N_EL, sizeof(float));
for(int i = 0; i < ROWS; i++){
(*h_C_GPU)[i] = 0.0f;
}
printf("...allocating GPU memory.\n");
CUDA_SAFE_CALL( cudaMalloc((void **)&d_A, DATA_SZ) ); //input matrix
CUDA_SAFE_CALL( cudaMalloc((void **)&d_B, DATA_V) ); //input vector
CUDA_SAFE_CALL( cudaMalloc((void **)&d_C, DATA_V) ); //result vector
CUDA_SAFE_CALL(cudaMemset((void **)&d_A, 0, ROWS*COLS));
CUDA_SAFE_CALL(cudaMemset((void **)&d_B, 0, ROWS));
CUDA_SAFE_CALL(cudaMemset((void **)&d_C, 0, ROWS));
return;
}
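As an aside, here is the same allocation/initialization sequence written without the cutil macros, printing each raw return code (just a sketch with a hypothetical check() helper; note that in this sketch the device pointers and byte counts are passed to cudaMemset directly, and DATA_SZ/DATA_V are spelled out in bytes):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// hypothetical helper: print the raw cudaError_t instead of relying on CUDA_SAFE_CALL
static void check(cudaError_t err, const char *what)
{
    printf("%-12s -> %s\n", what, cudaGetErrorString(err));
    if (err != cudaSuccess) exit(EXIT_FAILURE);
}

int main(void)
{
    const size_t ROWS = 4096, COLS = 4096;
    const size_t DATA_SZ = ROWS * COLS * sizeof(float);  // input matrix, in bytes
    const size_t DATA_V  = ROWS * sizeof(float);         // input/result vector, in bytes

    float *d_A = 0, *d_B = 0, *d_C = 0;
    check(cudaMalloc((void **)&d_A, DATA_SZ), "malloc d_A");
    check(cudaMalloc((void **)&d_B, DATA_V),  "malloc d_B");
    check(cudaMalloc((void **)&d_C, DATA_V),  "malloc d_C");

    check(cudaMemset(d_A, 0, DATA_SZ), "memset d_A");
    check(cudaMemset(d_B, 0, DATA_V),  "memset d_B");
    check(cudaMemset(d_C, 0, DATA_V),  "memset d_C");

    cudaFree(d_C); cudaFree(d_B); cudaFree(d_A);
    return 0;
}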
Back in the actual application, I then retrieve the output like so:
CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );
The h_C_GPU buffer contains the same values as after the previous kernel invocation, as if the VRAM had never been freed, re-allocated, and memset to 0 during the run in which the kernel invocation was commented out… yet NO ERROR is thrown (and CUDA_SAFE_CALL, going by cutil.h, should catch any error returned by cudaMalloc or cudaMemset…).
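To rule out the readback path itself, this is the minimal standalone repro I would run between the two passes (again just a sketch, assuming a 4096-element result vector): it allocates, memsets, copies back, and counts non-zero elements.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 4096;
    const size_t BYTES = N * sizeof(float);

    float *d_C = 0;
    float *h_C = (float *)malloc(BYTES);

    cudaMalloc((void **)&d_C, BYTES);
    cudaMemset(d_C, 0, BYTES);                          // zero the device buffer
    cudaMemcpy(h_C, d_C, BYTES, cudaMemcpyDeviceToHost); // read it straight back

    size_t nonzero = 0;
    for (size_t i = 0; i < N; ++i)
        if (h_C[i] != 0.0f) ++nonzero;
    printf("non-zero elements after memset + readback: %zu\n", nonzero);
    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_C);
    free(h_C);
    return 0;
}

If the device memory were really being re-allocated and zeroed, I would expect this to report 0 non-zero elements every time.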