Issues when using shared memory

I just want to read points from a file and compute the distance between each pair of points in the map. To reduce the run time, I started using shared memory, but now my kernel doesn't work anymore. Can anyone help me?

Here is my code; it is really simple:

#include<iostream>
#include<sstream>
#include<string>
#include<fstream>
const int maprow=32;
const int mapcolumn=32;
const int points=maprow*mapcolumn;
//const int maxnumber=9999;
//const int startpoint=1;
//const int endpoint=15;

void readfile(int* map)
{
 std::ifstream read;
 read.open("/home/wuhaoran/gridmap.txt");

  for(int i=0;i<maprow;i++)
  {
    for(int j=0;j<mapcolumn;j++)
    {
      read>>map[maprow*i+j];
    }
  }
  /*for(int i=0;i<points;i++)
  {
    std::cout<<map[i]<<" ";
  }*/
}

__global__ void calDistance(int* dev_map, int* dev_distance)
{
  int blockId = blockIdx.x + blockIdx.y * gridDim.x;  
  int threadID = blockId * (blockDim.x * blockDim.y)+ (threadIdx.y * blockDim.x) + threadIdx.x;  
  //int pointA=threadID/points;
  //int pointB=threadID%points;
  int pointA=blockId%gridDim.x*gridDim.x+threadIdx.x;
  int pointB=blockId/gridDim.x*gridDim.x+threadIdx.y;//two points distance
  __shared__ int mapmap[points];
  __shared__ int distancedistance[points];
   mapmap[threadIdx.x+threadIdx.y*blockDim.x]=dev_map[threadIdx.x+threadIdx.y*blockDim.x];
  __syncthreads();
  if(pointB==pointA)
    {distancedistance[threadIdx.x+threadIdx.y*blockDim.x]=0;}
   else if(mapmap[pointB]==1||mapmap[pointA]==1)
    {distancedistance[threadIdx.x+threadIdx.y*blockDim.x]=10000;}
   else if(pointB==pointA+32||pointB==pointA-32||pointB==pointA+1||pointB==pointA-1)
    {
	  if(mapmap[pointB+1]==1||mapmap[pointB-1]==1||mapmap[pointB+32]==1||mapmap[pointB-32]==1)
	    distancedistance[threadIdx.x+threadIdx.y*blockDim.x]=20;
	  else
	    distancedistance[threadIdx.x+threadIdx.y*blockDim.x]=2;
    }
   else if(pointB==pointA+32+1||pointB==pointA-32+1||pointB==pointA+32-1||pointB==pointA-32-1)
    {
          if(mapmap[pointB+1]==1||mapmap[pointB-1]==1||mapmap[pointB+32]==1||mapmap[pointB-32]==1)
	    distancedistance[threadIdx.x+threadIdx.y*blockDim.x]=200;
	  else
	    distancedistance[threadIdx.x+threadIdx.y*blockDim.x]=3;
    }
   else
    {distancedistance[threadIdx.x+threadIdx.y*blockDim.x]=10000;}
    __syncthreads();
    dev_distance[pointA+pointB*points]=distancedistance[threadIdx.x+threadIdx.y*blockDim.x];
}

int main()
{
   int map[points]={0};
   int distance[points*points]={0};
   readfile(map);
   //initialize the distance map
   for(int i=0;i<points*points;i++)
    {distance[i]=10000;}
   int* dev_distance, *dev_map;
   cudaMalloc( (void**)&dev_distance, points*points*sizeof(int) );
   cudaMalloc( (void**)&dev_map, points*sizeof(int) );
   cudaMemcpy(dev_map,map,points*sizeof(int),cudaMemcpyHostToDevice );
   dim3 blockpergrid(32,32);
   dim3 threadperblock(32,32);
   calDistance<<<blockpergrid,threadperblock>>>(dev_map, dev_distance);
   cudaMemcpy(distance,dev_distance,points*points*sizeof(int),cudaMemcpyDeviceToHost);
   cudaDeviceSynchronize();
   for(int i=0;i<(points*points);i++)
   {
   		 std::cout<<distance[i]<<" ";
   		 if(i%points==31)
   		 {std::cout<<std::endl;}
   }
   cudaFree(dev_distance);
   cudaFree(dev_map);
}

You have 1024 blocks each working on the same data. That couldn’t possibly be correct.

Dear Robert,
Can you explain that more specifically? Why am I working on the same data? I just want to initialize the distance matrix, and I think pointA and pointB will change with blockId.

There appear to be a number of problems with this code. I don’t have your data file of course, but when I run your code under cuda-memcheck, I get an invalid shared memory read.
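One plausible source of an out-of-range shared memory read, offered as a guess since the data file is not available: the neighbour lookups mapmap[pointB+1], mapmap[pointB-1], mapmap[pointB+32] and mapmap[pointB-32] have no bounds check, so for points in the first or last row of the map the index can fall outside 0..points-1. Whether a particular read actually happens depends on the map data, because || short-circuits. A minimal host-side sketch of a bounds-checked lookup follows; isObstacle and hasObstacleNeighbour are hypothetical helper names that are not in the original code, and the same logic could be marked __device__ for use inside the kernel.

```cpp
#include <cassert>

const int maprow = 32;
const int mapcolumn = 32;
const int points = maprow * mapcolumn;

// Hypothetical helper: true only if p is a valid cell index AND that cell
// holds an obstacle (value 1). The range test runs first, so map[p] is
// never read out of bounds.
inline bool isObstacle(const int* map, int p)
{
  assert(map != nullptr);
  return p >= 0 && p < points && map[p] == 1;
}

// Hypothetical helper mirroring the four neighbour reads in calDistance.
// Like the original, it treats p+1 and p-1 as neighbours even across a
// row edge; only the out-of-range accesses are removed.
inline bool hasObstacleNeighbour(const int* map, int p)
{
  return isObstacle(map, p + 1) || isObstacle(map, p - 1) ||
         isObstacle(map, p + mapcolumn) || isObstacle(map, p - mapcolumn);
}
```

In the kernel, the two inner if conditions would then become hasObstacleNeighbour(mapmap, pointB).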

If you get beyond that point, I suspect that your pointA and pointB calculation is either broken or will allow threads in different blocks to have the same pointA and pointB, meaning threads in different blocks will be updating the output at the same location. This would be a race condition. We can agree that each thread in each block computes a different pointA and pointB, right? (After all, you are adding threadIdx.x and threadIdx.y.) And we can agree there are 32x32 threads in each block, right? And there are only 32x32 points total, right? So are you saying that among all 32x32 threads in each block, and across all 32x32 blocks, there are no duplications in the calculation of pointA and pointB? I ask because pointA and pointB are what drive the location of the output to be written to here:

dev_distance[pointA+pointB*points]=...
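Rather than reasoning about the index arithmetic by eye, one option is to replay it on the host and count how many distinct output slots the threads touch. The sketch below assumes the 32x32 grid and 32x32 block launch from main; IndexReport and replayIndexing are hypothetical names introduced only for this check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int points = 32 * 32;
const int gridDimX = 32, gridDimY = 32;    // assumed launch: dim3 blockpergrid(32,32)
const int blockDimX = 32, blockDimY = 32;  // assumed launch: dim3 threadperblock(32,32)

struct IndexReport
{
  long long threads;          // number of (block, thread) combinations replayed
  long long distinctOutputs;  // number of different dev_distance slots written
};

// Replays the pointA/pointB arithmetic from calDistance for every thread
// and records which elements of dev_distance would be written.
IndexReport replayIndexing()
{
  std::vector<char> seen(static_cast<std::size_t>(points) * points, 0);
  IndexReport report = {0, 0};
  for (int by = 0; by < gridDimY; by++)
    for (int bx = 0; bx < gridDimX; bx++)
      for (int ty = 0; ty < blockDimY; ty++)
        for (int tx = 0; tx < blockDimX; tx++)
        {
          int blockId = bx + by * gridDimX;
          int pointA = blockId % gridDimX * gridDimX + tx;  // same expression as the kernel
          int pointB = blockId / gridDimX * gridDimX + ty;  // same expression as the kernel
          long long out = pointA + static_cast<long long>(pointB) * points;
          assert(out >= 0 && out < static_cast<long long>(points) * points);
          if (!seen[static_cast<std::size_t>(out)])
          {
            seen[static_cast<std::size_t>(out)] = 1;
            report.distinctOutputs++;
          }
          report.threads++;
        }
  return report;
}
```

Comparing the two counts settles the question: if distinctOutputs equals threads, no two threads share an output slot; if it is smaller, several threads write the same element and the race is real.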

Your code has enough defects in it that it’s not possible for me to give a precise description.

Get your code to run without any defects reported by cuda-memcheck, and then provide a data file so someone else could run your code.

As an aside, I think this code indexing is complicated enough that I’d be surprised if you have a very good handle on the code indexing you have written. When I have to go research operator precedence between % and * in C++, that to me is borderline unmaintainable code. But that’s just my opinion. If it were me, I wouldn’t write code that way.
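For reference, %, * and / share the same precedence in C++ and associate left to right, so blockId%gridDim.x*gridDim.x parses as (blockId%gridDim.x)*gridDim.x. One way to make the intent visible is to build the two point indices directly from the block coordinates. This is only a sketch of that idea; PointPair and pointsFor are hypothetical names, and gridDimX stands in for gridDim.x.

```cpp
#include <cassert>

struct PointPair
{
  int a;  // corresponds to pointA in the kernel
  int b;  // corresponds to pointB in the kernel
};

// Hypothetical helper: same arithmetic as the kernel, written from the raw
// block coordinates. With blockId = bx + by*gridDimX and 0 <= bx < gridDimX,
// blockId % gridDimX is bx and blockId / gridDimX is by.
inline PointPair pointsFor(int bx, int by, int tx, int ty, int gridDimX)
{
  assert(bx >= 0 && bx < gridDimX);
  PointPair p;
  p.a = bx * gridDimX + tx;
  p.b = by * gridDimX + ty;
  return p;
}
```

Inside the kernel this would be called as pointsFor(blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, gridDim.x) after marking the helper __device__.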