Problem with global memory access when multiple threads try to access the same location

Hi,
I am kind of stuck on what should be a very simple problem…

Here are the details:

  • I have a 2225 x 9635 2D array as input.
  • From it I want to generate a 2225 x 2225 array in which each element is the dot product of two rows of the input array.
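For checking GPU output on a small input, the target computation can be written as a plain host-side reference. This is only a sketch: `similarity_reference` is a hypothetical helper name that does not appear in the thread, and it assumes row-major storage as in the kernel. Unlike the kernel, it computes the diagonal rather than setting it to 1.

```cuda
#include <cstddef>
#include <vector>

// Host-side reference (hypothetical helper, not from the thread):
// W[i][j] = sum over k of A[i][k] * A[j][k], with A stored row-major.
std::vector<float> similarity_reference(const std::vector<float>& a,
                                        int docs, int terms)
{
    std::vector<float> w(static_cast<std::size_t>(docs) * docs, 0.0f);
    for (int i = 0; i < docs; ++i) {
        for (int j = 0; j <= i; ++j) {          // lower triangle incl. diagonal
            float dot = 0.0f;
            for (int k = 0; k < terms; ++k) {
                dot += a[static_cast<std::size_t>(i) * terms + k] *
                       a[static_cast<std::size_t>(j) * terms + k];
            }
            w[static_cast<std::size_t>(i) * docs + j] = dot;
            w[static_cast<std::size_t>(j) * docs + i] = dot;  // mirror entry
        }
    }
    return w;
}
```

Running this on a handful of rows and comparing against the device result is usually enough to tell a wrong kernel apart from a kernel that never ran.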

Here is the kernel:
#define DOCS  2225
#define TERMS 9635

__global__ void compute_similarity(float *term_doc_M, float *weight_mat, float cut_off)
{
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int by = blockIdx.y;
    int ty = threadIdx.y;

    int i = by * blockDim.y + ty;   // row of the output
    int j = bx * blockDim.x + tx;   // column of the output

    // Zero this thread's own output element
    if (i < DOCS && j < DOCS)
        weight_mat[i * DOCS + j] = 0;

    float tmp = 0, tmp1 = 0, tmp2 = 0;

    if (i < DOCS && j < DOCS)
    {
        if (i == j)
        {
            weight_mat[i * DOCS + j] = 1;
            return;
        }
        if (j > i)
            return;

        // Dot product of rows i and j of the input
        for (int k = 0; k < TERMS; k++)
        {
            tmp += term_doc_M[j * TERMS + k] * term_doc_M[i * TERMS + k];
        }
        // Store the result and its mirror across the diagonal
        weight_mat[i * DOCS + j] = tmp;
        weight_mat[j * DOCS + i] = tmp;
    }
}

Now, the problem is that I am getting all 0s as the answer. I don’t know why…

Here are the things I have verified:

  1. The input array contains correct data, and it is not all 0s.
  2. When I try to access the ith row, it gives correct data. But when I try to access the jth row, it returns all 0s :( It does not seem to follow any logic. If I try to access a single element instead of the entire row for j, it returns the correct value again… I am totally clueless about what is happening here… :(

Any help is really appreciated!

Thank you!

What’s your grid and block size?

Block size: 16
Grid size: ceiling[2225/16] x ceiling[2225/16]
i.e. 140 x 140
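The 140 comes from rounding 2225 / 16 up. A small sketch of that arithmetic, where `ceil_div` and `BLOCK` are hypothetical names introduced here for illustration:

```cuda
#define DOCS  2225
#define BLOCK 16   // threads per block edge, as stated above

// Integer ceiling division: the smallest n with n * b >= a (for a, b > 0).
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// 2225 / 16 = 139 remainder 1, so 140 blocks are needed per dimension.
static_assert(ceil_div(DOCS, BLOCK) == 140, "grid edge should be 140");

// The last block overhangs the matrix by 140 * 16 - 2225 = 15 threads,
// which is why the kernel needs its i < DOCS && j < DOCS guard.
static_assert(ceil_div(DOCS, BLOCK) * BLOCK - DOCS == 15, "15 idle threads per edge");
```

On the host this would typically feed something like `dim3 dimGrid(ceil_div(DOCS, BLOCK), ceil_div(DOCS, BLOCK));`.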

Why not use the standard matrix multiplication routine?
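If "standard routine" is read as cuBLAS (an assumption; the thread does not name a library), the whole 2225 x 2225 result is a single SGEMM of the input with its own transpose. A minimal sketch, where `similarity_gemm` is a hypothetical wrapper name and the buffers are assumed to be allocated on the device already:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical wrapper, not from the thread.
// d_A: device pointer to the row-major docs x terms input.
// d_W: device pointer to the docs x docs output.
cublasStatus_t similarity_gemm(cublasHandle_t handle,
                               const float* d_A, float* d_W,
                               int docs, int terms)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // cuBLAS is column-major, so the row-major docs x terms buffer is seen
    // as a terms x docs matrix B = A^T with leading dimension `terms`.
    // W = B^T * B = A * A^T. W is symmetric, so its storage order is moot.
    return cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                       docs, docs, terms,
                       &alpha, d_A, terms,
                       d_A, terms,
                       &beta, d_W, docs);
}
```

Note that this yields the true dot products on the diagonal, whereas the kernel in the question writes 1 there.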

It is very similar to that… Is it the case that I cannot do a matrix multiplication for matrices of size, let’s say, 2000 x 9000 and 9000 x 2000? Because my kernel does not launch at all…

I think that’s because weight_mat[i* DOCS + j ] is reset to 0!

This is useless because you compute it later, so remove it!

Yves

@globs47:
Hmm… yeah, a redundant step. But I don’t think that should affect the problem I have posted here… I had just put it in for some other check :P

Check it, because you carry out the computation only for j <= i while the initialization is done even for j > i, which means that some threads may re-initialize an element that another thread has already computed!

Yves
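To make the point concrete: thread (i, j) with j > i zeroes weight_mat[i*DOCS + j], while thread (j, i) writes its result to that same element as the mirror entry, and the order of the two writes is not defined. A sketch of a variant without that overlap is below. It assumes the sizes are passed as parameters instead of macros, and `compute_similarity_norace` is a hypothetical name, not the poster's code.

```cuda
#include <cuda_runtime.h>

// Sketch only: no up-front zeroing, and every element of weight_mat is
// written by exactly one thread, so no two threads touch the same location.
__global__ void compute_similarity_norace(const float* term_doc_M,
                                          float* weight_mat,
                                          int docs, int terms)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // output column

    // Threads outside the matrix or strictly above the diagonal do nothing;
    // element (i, j) with j > i is written by thread (j, i) as its mirror.
    if (i >= docs || j > i)
        return;

    if (i == j) {
        // Kept from the original kernel; assumes rows are normalized.
        weight_mat[i * docs + j] = 1.0f;
        return;
    }

    float dot = 0.0f;
    for (int k = 0; k < terms; ++k)
        dot += term_doc_M[j * terms + k] * term_doc_M[i * terms + k];

    weight_mat[i * docs + j] = dot;   // lower triangle
    weight_mat[j * docs + i] = dot;   // mirrored upper triangle
}
```

It would be launched with the same 16 x 16 blocks and 140 x 140 grid, passing 2225 and 9635 as the last two arguments.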

Your kernel doesn’t launch? You didn’t say that before; you said it returned all zeros.
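The two symptoms can be told apart by checking the error state right after the launch. A kernel that fails to launch or aborts while running leaves the output buffer untouched, which may well read back as zeros. A minimal sketch, where `launch_ok` is a hypothetical helper name:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: call right after a <<<...>>> launch.
// Returns true if the launch and the kernel execution both succeeded.
bool launch_ok(const char* what)
{
    // Errors from the launch itself (for example an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Wait for the kernel; returns errors raised while it was running.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

For example, calling `launch_ok("compute_similarity")` on the line after the kernel launch prints the CUDA error string if anything went wrong.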

Just to be clear, you have …

dim3 dimBlock(16,16);
dim3 dimGrid(140,140);
compute_similarity<<<dimGrid,dimBlock>>>(…)
