Problem with vector comparison

rgouyon · October 25, 2022, 7:47am

Hello

I’m doing comparisons between two very long vectors (several million points). The question is to know if these two vectors have points in common and where are these points in each vector.
The vectors are a set of geographic coordinates (longitude, latitude).
Before going to the real vector I made a small test on smaller vectors but it does not work because only the first element are tested and I could not well know why.
Here is my code


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t compVect(unsigned char *c, const int *lona, const int *lata, const int *lonb, const int *latb, unsigned int sizeA, unsigned int sizeB);

__global__ void compKernel(unsigned char *c, const int *lona, const int *lata, const int *lonb, const int *latb, int nbA, int nbB)
{
	int i = threadIdx.x + blockIdx.x*blockDim.x;
	int j = threadIdx.y + blockIdx.y*blockDim.y;
	int k = i + j * nbA;		
	while (k < nbA*nbB)
	{
		if (lona[i] == lonb[j] && lata[i] == latb[j])
			c[k] = 1;
		else
			c[k] = 0;
		i += gridDim.x*blockDim.x;
		j += gridDim.y*blockDim.y;
		k = i + j * nbA;
	}
}

int main()
{
	int *lon[2];
	int *lat[2];
	int nb[2];
#ifdef TEST
	lon[0] = new int[15];
	lat[0] = new int[15];
	lon[1] = new int[7];
	lat[1] = new int[7];
	nb[0] = 15;
	nb[1] = 7;
	for (int i = 0; i < 15; i++)
	{
		lon[0][i] = i + 1;
		lat[0][i] = 2 * i;
		
	}
	
	for (int i = 0; i < 7; i++)
	{
		lon[1][i] = 3 * i + 1;
		lat[1][i] = 4 * i;
	
	}
	
	lon[0][14] = lon[1][6];
	lat[0][14] = lat[1][6];
	for (int i = 0; i < 15; i++)
	{
		printf("L=%i l=%i   ", lon[0][i], lat[0][i]);
	}
	printf("\n");
	for (int i = 0; i < 7; i++)
	{
		printf("L=%i l=%i   ", lon[1][i], lat[1][i]);
	}
	unsigned char *c = new unsigned char[nb[0] * nb[1]];
	// Add vectors in parallel.
	cudaError_t cudaStatus = compVect(c, lon[0], lat[0], lon[1], lat[1], nb[0], nb[1]);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "addWithCuda failed!");
		return 1;
	}
	for (int i = 0; i < nb[0] * nb[1]; i++)
	{
		if (c[i] == 1)
		{
			int y = i / nb[0];
			int x = i % nb[0];
			printf("A=%i B=%i", x, y);
		}
	}


	// cudaDeviceReset must be called before exiting in order for profiling and
	// tracing tools such as Nsight and Visual Profiler to show complete traces.
	cudaStatus = cudaDeviceReset();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceReset failed!");
		return 1;
	}

	return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t compVect(unsigned char *c, const int *lona, const int *lata, const int *lonb, const int *latb, unsigned int sizeA, unsigned int sizeB)
{
	int *dev_lona = 0;
	int *dev_lonb = 0;
	int *dev_lata = 0;
	int *dev_latb = 0;

	unsigned char *dev_c = 0;
	cudaError_t cudaStatus;

	// Choose which GPU to run on, change this on a multi-GPU system.
	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
		goto Error;
	}

	// Allocate GPU buffers for three vectors (two input, one output)    .
	cudaStatus = cudaMalloc((void**)&dev_c, sizeA*sizeB * sizeof(unsigned char));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}

	cudaStatus = cudaMalloc((void**)&dev_lona, sizeA * sizeof(int));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	cudaStatus = cudaMalloc((void**)&dev_lata, sizeA * sizeof(int));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	cudaStatus = cudaMalloc((void**)&dev_lonb, sizeB * sizeof(int));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	cudaStatus = cudaMalloc((void**)&dev_latb, sizeB * sizeof(int));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	// Copy input vectors from host memory to GPU buffers.
	cudaStatus = cudaMemcpy(dev_lona, lona, sizeA * sizeof(int), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}
	cudaStatus = cudaMemcpy(dev_lata, lata, sizeA * sizeof(int), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}
	cudaStatus = cudaMemcpy(dev_lonb, lonb, sizeB * sizeof(int), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}
	cudaStatus = cudaMemcpy(dev_latb, latb, sizeB * sizeof(int), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}
	// Launch a kernel on the GPU with one thread for each element.
	dim3 grid(2, 2);
	compKernel << <grid, grid >> > (dev_c, dev_lona, dev_lata, dev_lonb, dev_latb,sizeA,sizeB);

	// Check for any errors launching the kernel
	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
		goto Error;
	}

	// cudaDeviceSynchronize waits for the kernel to finish, and returns
	// any errors encountered during the launch.
	cudaStatus = cudaDeviceSynchronize();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
		goto Error;
	}

	// Copy output vector from GPU buffer to host memory.
	cudaStatus = cudaMemcpy(c, dev_c, sizeA * sizeB * sizeof(unsigned char), cudaMemcpyDeviceToHost);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}

Error:
	cudaFree(dev_c);
	cudaFree(dev_lona);
	cudaFree(dev_lonb);
	cudaFree(dev_lata);
	cudaFree(dev_latb);
	return cudaStatus;
}

Robert_Crovella · October 25, 2022, 9:30pm

For two vectors each of several million points, your c vector would need to be larger than 1 terabyte. There aren’t any CUDA GPUs currently with that much memory. So you’d have to do something a bit unusual like managed memory with oversubscription, if your system has that much memory.

Regarding the code, you don’t do a “2-dimensional” grid stride loop the way you have written. That pattern you have will only “stride” down the “diagonal” of the output space. To create a 2D grid-stride loop, will require 2 loops, or else some very involved indexing.

rgouyon · October 26, 2022, 8:23pm

In fact what I exactly wanted to do is this:

for(int i=0;i<nbA;i++)
{
 for(int j=0;j<nbB;j++)
   {
      k=j+i*nbA;
      if (vecA[i]==vecB[j])
       c[k]=1;
     else
       c[k]=0;
  }
}

And you are right for the numbers of points in each vectors I have to reduce them.

Robert_Crovella · October 26, 2022, 8:43pm

Do you see how you have two loops there? You will want two loops for the grid-stride loop method, also.

__global__ void compKernel(unsigned char *c, const int *lona, const int *lata, const int *lonb, const int *latb, int nbA, int nbB)
{
        for (int j = threadIdx.y + blockIdx.y*blockDim.y; j < nbB; j+=gridDim.y*blockDim.y)
          for (int i = threadIdx.x + blockIdx.x*blockDim.x; i < nbA; i+=gridDim.x*blockDim.x){
	        int k = i + j * nbA;		
	        if (lona[i] == lonb[j] && lata[i] == latb[j])
			  c[k] = 1;
		    else
		      c[k] = 0;
	    }
}

When I make that change to your code, instead of output like this:

A=0 B=0

I get output like this:

A=0 B=0 A=14 B=6

rgouyon · October 27, 2022, 6:20am

This is the good result.
Thanks a lot

cbuchner1 · October 27, 2022, 1:15pm

I am wondering if there may be a more efficient approach.

what about sorting the vectors along e.g the x dimension (maybe not physically rearranging the vectors - but rather creating a vector of indices in the rearranged order). Then you only have to compare vector elements with identical x coordinates.

Alternatively building a quadtree of coordinates (or maybe using some other space partitioning method), and then doing comparisons within the leaf nodes only.

This could even use the hardware raytracing cores of CUDA, if you leverage e.g. the optiX library for it.

Christian

Robert_Crovella · October 27, 2022, 1:40pm

If the “density” of matches is low, I would want to first make sure that only the latitude (or only the longitude) is being compared if there is no match there. The compiler may have already discovered that, but I would want to confirm that in some fashion.

After that, depending on GPU and actual problem size, the biggest problem I see is that (at least) one of the vectors is being read multiple times. On an A100 for example, a L2 cache carveout for that vector might be interesting to explore, and of course loading the frequently-read vector into shared memory might be interesting to explore. This type of problem might benefit from carefully crafting the work to fit the actual GPU being run on.

The problem could be subdivided to “fit” on smaller GPUs by dividing one of the inputs into segments, and processing a segment at a time, perhaps dividing the work up with a cooperative groups grid-sync. That plus the previous ideas may allow you to get to the ideal scenario where you are only reading each input vector exactly once.

rgouyon · October 27, 2022, 8:03pm

In fact the vectors are a geographical path (a coast line to be more precise) so the order of the points in the vectors is important.
Generally the vectors overlap at their extremities and sometimes one vector is entirely contained within another. But for this last case I use another faster algorithm.
So one solution to accelerate things may be testing the extremities and stop the search as soon as there is no more match.

Topic		Replies	Views
HELP with vector sum CUDA Programming and Performance	6	2216	May 11, 2010
beginner question Checking if GPU is the answer to me CUDA Programming and Performance	20	12081	September 4, 2008
Using unified memory causes system crash CUDA Programming and Performance	28	5857	February 4, 2019
Cuda code performance CUDA Programming and Performance	14	3138	December 16, 2014
a problem complex array add with cuda ????? CUDA Programming and Performance	2	983	August 17, 2017
Time consuming comparison between 820M GPU combination code and pure C++ on CPU CUDA Programming and Performance	2	488	May 22, 2019
Getting started with CUDA ... cannot add simple vectors CUDA Programming and Performance	9	20928	January 31, 2011
1080 does not support doubles? CUDA Programming and Performance	5	553	October 30, 2018
Cuda application crashes works fine for small data and crashes for big data CUDA Programming and Performance	3	413	October 12, 2021
When i use CUDA, my program gets slower CUDA Programming and Performance	5	529	April 2, 2019

Problem with vector comparison

Related topics