[font=“Courier New”][font=“Courier New”]Hello. So far I haven’t received any answers to my questions here, so I hope to have better luck this time.
I am trying to learn matrix operations in CUDA (on a Tesla C1060). I have an N*N symmetric matrix stored in column-major order. I want to increment the upper-half elements (diagonal included) by a constant. My simple code launches a total of N*N threads, of which only N(N+1)/2 are active.
I launch the kernel 1000 times and then check the matrix. With “CorrectKernel” the results are as expected whether TYPE is double or float. With “WrongKernel”, if TYPE is float the results are as expected (the upper half is updated to 2000 and the rest of the matrix is 0), but if TYPE is double the upper half is updated correctly while the rest of the matrix holds random values. It seems that with “double” either the memory gets messed up (alignment problems?) or the idle threads don’t stay idle! What am I doing wrong? I appreciate any help.
[font=“Courier New”][font=“Courier New”]
[codebox]#define TYPE double
#define N 256
TYPE *matrix1, *matrix2;
TYPE alpha = 2;
[/font][/font][font=“Courier”][font=“Courier”]CUDA_SAFE_CALL( cudaMalloc( (void**) &matrix1, N*N*sizeof(TYPE)));
CUDA_SAFE_CALL( cudaMalloc( (void**) &matrix2, N*N*sizeof(TYPE)));
for(int i=0; i<1000; i++)
CorrectKernel <<<N,N>>>(matrix1, alpha );
for(int i=0; i<1000; i++)
WrongKernel <<<N,N>>>(matrix2, alpha );
__global__ void CorrectKernel(TYPE *base, const TYPE alpha)
{
    // column-major index: blockIdx.x is the column, threadIdx.x is the row
    int bx = blockIdx.x * N + threadIdx.x;
    if (threadIdx.x <= blockIdx.x) {
        base[bx] += alpha;
    } else {
        base[bx] = 0; //force the idle threads to do something
    }
    __syncthreads();
}
__global__ void WrongKernel(TYPE *base, const TYPE alpha)
{
    // column-major index: blockIdx.x is the column, threadIdx.x is the row
    int bx = blockIdx.x * N + threadIdx.x;
    if (threadIdx.x <= blockIdx.x) {
        base[bx] += alpha; //only active threads increment, idle threads stay idle
    }
    __syncthreads();
}[/codebox]
[/font][/font]
[font=“Courier New”][font=“Courier New”]
[/font][/font][/font][/font]