Copy hash table from GPU to CPU

Hello everyone,

I ran into a problem when copying a hash table from the GPU to the CPU. I tested the hash table on the GPU and it is built properly there. However, when I copy the whole table to the host with the function copy_table_to_host given below, entry->next always ends up pointing to itself. I don’t know why this happens. Does anyone know how to solve this problem?

#define imin(a,b) (a<b?a:b)
const int N = 1024 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);
struct Lock {
	int *mutex;
	Lock(void) {
		int state = 0;
		cudaMalloc((void**)&mutex, sizeof(int));
		cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
	}
	__device__ void lock(void) {
		while (atomicCAS(mutex, 0, 1) != 0);
		__threadfence();
	}
	__device__ void unlock(void) {
		atomicExch(mutex, 0);
		__threadfence();
	}
};
struct Entry{
	unsigned int key;
	float value;
	Entry *next;
};

struct Table{
	size_t count;
	Entry **entries;
	Entry *pool;
};
__device__ __host__ size_t hash(unsigned int value, size_t count){
	return value % count;
}

void initialize_table(Table &table, int nb_entries, int nb_elements){
	table.count = nb_entries;
	cudaMalloc((void**)&table.entries, nb_entries*sizeof(Entry*));
	cudaMemset(table.entries, 0, nb_entries*sizeof(Entry*));
	cudaMalloc((void**)&table.pool, nb_elements*sizeof(Entry));
}
void free_table(Table &table)
{
	cudaFree(table.pool);
	cudaFree(table.entries);
}
void copy_table_to_host(const Table &table, Table &hostTable){
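	// table.entries and table.pool hold device addresses. After the raw bytes
	// are copied to the host, every Entry::next pointer and every bucket head
	// still contains a device-pool address, so the loops below re-base them:
	// host_ptr = device_ptr - table.pool + hostTable.pool.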
	hostTable.count = table.count;
	hostTable.entries = (Entry**)calloc(table.count, sizeof(Entry*));
	hostTable.pool = (Entry*)malloc(N*sizeof(Entry));

	cudaMemcpy(hostTable.entries, table.entries, table.count*sizeof(Entry*),cudaMemcpyDeviceToHost);
	cudaMemcpy(hostTable.pool, table.pool, N * sizeof(Entry), cudaMemcpyDeviceToHost);

	for (int i = 0; i<N; i++){
		if (hostTable.pool[i].next != NULL)
			hostTable.pool[i].next = (Entry*)((size_t)hostTable.pool[i].next - (size_t)table.pool + (size_t)hostTable.pool);
	}
	for (int i=0;i<table.count;i++){
		if (hostTable.entries[i] != NULL)
			hostTable.entries[i] = (Entry*)((size_t)hostTable.entries[i] - (size_t)table.pool + (size_t)hostTable.pool);
	}
}
__global__ void add_to_table(unsigned int *keys, float *values, Table table, Lock *lock)
{
	int tid = threadIdx.x + blockIdx.x*gridDim.x;
	int stride = blockDim.x*gridDim.x;
	while (tid<N){
		unsigned int key = keys[tid];
		size_t hashValue = hash(key, table.count);
		for (int i=0;i<32;i++)
		{
			if ((tid % 32) == i)
			{
				Entry *location = &(table.pool[tid]);
				location->key = key;
				location->value = values[tid];
				lock[hashValue].lock();
				location->next = table.entries[hashValue];
				table.entries[hashValue] = location;
				lock[hashValue].unlock();
			}
		}
		tid += stride;
	}
}

void hash_table_test()
{
	unsigned int *buffer = (unsigned int*)malloc(N*sizeof(unsigned int));
	for (int i = 0; i<N; i++) buffer[i] = (unsigned int)i;
	float *values;
	values = (float*)malloc(N*sizeof(float));
	for (int i = 0; i<N; i++)
		values[i] = (float)i*0.01;

	unsigned int *dev_keys;
	float *dev_values;

	cudaMalloc((void**)&dev_keys, N * sizeof(unsigned int));
	cudaMalloc((void**)&dev_values, N * sizeof(float));
	cudaMemcpy(dev_keys,buffer, N * sizeof(unsigned int),cudaMemcpyHostToDevice);
	cudaMemcpy(dev_values, values, N * sizeof(float), cudaMemcpyHostToDevice);

	Table table;
	initialize_table(table, HASH_ENTRIES, N);

	Lock lock[HASH_ENTRIES];
	Lock *dev_lock;

	cudaMalloc((void**)&dev_lock, HASH_ENTRIES * sizeof(Lock));
	cudaMemcpy(dev_lock, lock, HASH_ENTRIES * sizeof(Lock), cudaMemcpyHostToDevice);

	add_to_table<<<blocksPerGrid, threadsPerBlock>>>(dev_keys, dev_values, table, dev_lock);

	Table hostTable;
	copy_table_to_host(table, hostTable);

	free_table(table);
	cudaFree(dev_lock);
	cudaFree(dev_keys);
	cudaFree(dev_values);
	free(buffer);

}

I suggest providing the value of HASH_ENTRIES you are using, although it probably doesn’t matter.

If what you are saying is true, I presume that traversing one of the built tables and printing out the key and value should produce the same values every time (since you claim that following entry->next just returns to the same entry). However, that’s not what I see. It’s not obvious to me what problem you are having. Perhaps you should provide a printout and a complete code to make it crystal clear to others.

$ cat t421.cu
#include <stdio.h>
#define imin(a,b) (a<b?a:b)
const int N = 1024 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);
struct Lock {
        int *mutex;
        Lock(void) {
                int state = 0;
                cudaMalloc((void**)&mutex, sizeof(int));
                cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
        }
        __device__ void lock(void) {
                while (atomicCAS(mutex, 0, 1) != 0);
                __threadfence();
        }
        __device__ void unlock(void) {
                atomicExch(mutex, 0);
                __threadfence();
        }
};
struct Entry{
        unsigned int key;
        float value;
        Entry *next;
};

struct Table{
        size_t count;
        Entry **entries;
        Entry *pool;
};
__device__ __host__ size_t hash(unsigned int value, size_t count){
        return value % count;
}

void initialize_table(Table &table, int nb_entries, int nb_elements){
        table.count = nb_entries;
        cudaMalloc((void**)&table.entries, nb_entries*sizeof(Entry*));
        cudaMemset(table.entries, 0, nb_entries*sizeof(Entry*));
        cudaMalloc((void**)&table.pool, nb_elements*sizeof(Entry));
}
void free_table(Table &table)
{
        cudaFree(table.pool);
        cudaFree(table.entries);
}
void copy_table_to_host(const Table &table, Table &hostTable){
        hostTable.count = table.count;
        hostTable.entries = (Entry**)calloc(table.count, sizeof(Entry*));
        hostTable.pool = (Entry*)malloc(N*sizeof(Entry));

        cudaMemcpy(hostTable.entries, table.entries, table.count*sizeof(Entry*),cudaMemcpyDeviceToHost);
        cudaMemcpy(hostTable.pool, table.pool, N * sizeof(Entry), cudaMemcpyDeviceToHost);

        for (int i = 0; i<N; i++){
                if (hostTable.pool[i].next != NULL)
                        hostTable.pool[i].next = (Entry*)((size_t)hostTable.pool[i].next - (size_t)table.pool + (size_t)hostTable.pool);
        }
        for (int i=0;i<table.count;i++){
                if (hostTable.entries[i] != NULL)
                        hostTable.entries[i] = (Entry*)((size_t)hostTable.entries[i] - (size_t)table.pool + (size_t)hostTable.pool);
        }
}
__global__ void add_to_table(unsigned int *keys, float *values, Table table, Lock *lock)
{
        int tid = threadIdx.x + blockIdx.x*gridDim.x;
        int stride = blockDim.x*gridDim.x;
        while (tid<N){
                unsigned int key = keys[tid];
                size_t hashValue = hash(key, table.count);
                for (int i=0;i<32;i++)
                {
                        if ((tid % 32) == i)
                        {
                                Entry *location = &(table.pool[tid]);
                                location->key = key;
                                location->value = values[tid];
                                lock[hashValue].lock();
                                location->next = table.entries[hashValue];
                                table.entries[hashValue] = location;
                                lock[hashValue].unlock();
                        }
                }
                tid += stride;
        }
}
#define HASH_ENTRIES 10
void hash_table_test()
{
        unsigned int *buffer = (unsigned int*)malloc(N*sizeof(unsigned int));
        for (int i = 0; i<N; i++) buffer[i] = (unsigned int)i;
        float *values;
        values = (float*)malloc(N*sizeof(float));
        for (int i = 0; i<N; i++)
                values[i] = (float)i*0.01;

        unsigned int *dev_keys;
        float *dev_values;

        cudaMalloc((void**)&dev_keys, N * sizeof(unsigned int));
        cudaMalloc((void**)&dev_values, N * sizeof(float));
        cudaMemcpy(dev_keys,buffer, N * sizeof(unsigned int),cudaMemcpyHostToDevice);
        cudaMemcpy(dev_values, values, N * sizeof(float), cudaMemcpyHostToDevice);

        Table table;
        initialize_table(table, HASH_ENTRIES, N);

        Lock lock[HASH_ENTRIES];
        Lock *dev_lock;

        cudaMalloc((void**)&dev_lock, HASH_ENTRIES * sizeof(Lock));
        cudaMemcpy(dev_lock, lock, HASH_ENTRIES * sizeof(Lock), cudaMemcpyHostToDevice);

        add_to_table << < blocksPerGrid, threadsPerBlock >> > (dev_keys, dev_values, table, dev_lock);


        Table hostTable;
        copy_table_to_host(table, hostTable);
        Entry *cur = hostTable.entries[0];
        for (int i =0; i < 10; i++){
         printf("current key: %d, current value: %f\n", cur->key, cur->value);
         cur = cur->next;}

        free_table(table);
        cudaFree(dev_lock);
        cudaFree(dev_keys);
        cudaFree(dev_values);
        free(buffer);

}

int main(){

  hash_table_test();
}
$ nvcc -o t421 t421.cu
$ cuda-memcheck ./t421
========= CUDA-MEMCHECK
current key: 1041560, current value: 10415.599609
current key: 1041530, current value: 10415.299805
current key: 1041550, current value: 10415.500000
current key: 1041520, current value: 10415.200195
current key: 1041540, current value: 10415.400391
current key: 1041510, current value: 10415.099609
current key: 1033370, current value: 10333.700195
current key: 1033340, current value: 10333.400391
current key: 1033360, current value: 10333.599609
current key: 1033330, current value: 10333.299805
========= ERROR SUMMARY: 0 errors
$

In the future, if you don’t provide a complete code, just as I have done here, to make it as easy as possible for others to help you, I will assume you are not that serious about wanting help.

Hi Robert_Crovella,

I really appreciate your suggestions. In my posted code, #define HASH_ENTRIES was missing, as well as the main function and the necessary header files. I will pay attention to the completeness of my code when asking for help.

Over the last few days I fought this problem and tried many debugging methods. Finally, I found that it is caused by a really simple mistake, which also exists in the code you posted, in line 68:

int tid = threadIdx.x + blockIdx.x*gridDim.x;

The thread index for a one-dimensional grid should be computed as follows:

int tid = threadIdx.x + blockIdx.x*blockDim.x;
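
To make the difference concrete, here is a minimal sketch of the standard grid-stride pattern (just an illustration of mine, not taken from the program above), with a comment on what the wrong formula does:

__global__ void index_demo(int *out, int n)
{
	// Standard 1-D grid-stride loop: every thread starts at a unique index
	// and advances by the total number of threads in the grid.
	int tid = threadIdx.x + blockIdx.x * blockDim.x;   // correct
	int stride = blockDim.x * gridDim.x;
	// With the wrong formula, tid = threadIdx.x + blockIdx.x * gridDim.x,
	// e.g. blockDim.x == 256 and gridDim.x == 32 make (block 0, thread 32)
	// and (block 1, thread 0) both start at tid == 32. In add_to_table that
	// means several threads use the same table.pool[tid] slot and can push
	// the same Entry onto a bucket twice, leaving that node with
	// next == itself (the symptom I originally saw).
	while (tid < n) {
		out[tid] = tid;
		tid += stride;
	}
}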

However, I still don’t understand why you were able to get a sensible printout with this mistake in place.

By the way, I use Visual Studio 2015 to build, compile and run my code. The command line in my environment is as follows:

“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin\nvcc.exe” -dlink -o x64\Debug\hashtable.device-link.obj -Xcompiler "/EHsc /nologo /Zi "

If I use the following parameters to test the code with the mistake in line 68 fixed, I get the result below.

#define HASH_ENTRIES 101
const int N =  1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);

the result:
current key: 505, current value: 5.05
current key: 404, current value: 4.04
current key: 1010, current value: 10.1
current key: 303, current value: 3.03
current key: 909, current value: 9.09
current key: 202, current value: 2.02
current key: 808, current value: 8.08
current key: 101, current value: 1.01
current key: 707, current value: 7.07
current key: 0, current value: 0

Hi Robert_Crovella,

I ran into another problem with the same piece of code. I changed the values of parameters such as N, threadsPerBlock and blocksPerGrid and ran the code. Some combinations worked and some did not. To be clear, I give below the code which I used for these tests.

For example, if I use the following set of parameters, the keys of all the entries linked to hostTable.entries[0] are printed:

#define HASH_ENTRIES 101
const int N =  1024*1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = 32;

However, if I change blocksPerGrid to 64, no keys can be printed because hostTable.entries[0] == NULL:

#define HASH_ENTRIES 101
const int N =  1024*1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = 64;

On the other hand, after changing N to 2*1024*1024, I tried lots of threadsPerBlock and blocksPerGrid values, and none of them worked:

#define HASH_ENTRIES 101
const int N =  2*1024*1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = 32;

After the code, I also give the specifications of my graphics card. I suspect that the problem is caused by a memory leak, but I did not find the real reason.
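
For reference, here is a rough estimate of mine of the main allocations made with the parameters in the code below (a back-of-the-envelope sketch, assuming a 64-bit build so that sizeof(Entry) is 16 bytes):

#include <stdio.h>

// Rough sizes for N = 100*1024*1024 entries (assumption: Entry is
// 4 (key) + 4 (value) + 8 (next) = 16 bytes on a 64-bit build).
int main(void)
{
	const long long n = 100LL * 1024 * 1024;
	printf("pool   : %lld MiB (allocated on the device and again on the host)\n", (n * 16) >> 20);
	printf("keys   : %lld MiB\n", (n * 4) >> 20);
	printf("values : %lld MiB\n", (n * 4) >> 20);
	return 0;
}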

I have been stuck on this problem for more than a week.
Could you please kindly help me analyse it?

#include "cuda.h"
#include "cuda_runtime.h"
#include <device_launch_parameters.h>
#include <device_functions.h>

#include <stdio.h>
#define imin(a,b) (a<b?a:b)
#define HASH_ENTRIES 1024
const int N =  100*1024*1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = 64;

struct Lock {
	int *mutex;
	Lock(void) {
		int state = 0;
		cudaMalloc((void**)&mutex, sizeof(int));
		cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
	}
	__device__ void lock(void) {
		while (atomicCAS(mutex, 0, 1) != 0);
		__threadfence();
	}
	__device__ void unlock(void) {
		atomicExch(mutex, 0);
		__threadfence();
	}
};
struct Entry {
	unsigned int key;
	float value;
	Entry *next;
};

struct Table {
	size_t count;
	Entry **entries;
	Entry *pool;
};
__device__ __host__ size_t hash(unsigned int value, size_t count) {
	return value % count;
}

void initialize_table(Table &table, int nb_entries, int nb_elements) {
	table.count = nb_entries;
	cudaMalloc((void**)&table.entries, nb_entries * sizeof(Entry*));
	cudaMemset(table.entries, 0, nb_entries * sizeof(Entry*));
	cudaMalloc((void**)&table.pool, nb_elements * sizeof(Entry));
}
void free_table(Table &table)
{
	cudaFree(table.pool);
	cudaFree(table.entries);
}
void copy_table_to_host(const Table &table, Table &hostTable) {
	hostTable.count = table.count;
	hostTable.entries = (Entry**)calloc(table.count, sizeof(Entry*));
	hostTable.pool = (Entry*)malloc(N * sizeof(Entry));

	cudaMemcpy(hostTable.entries, table.entries, table.count * sizeof(Entry*), cudaMemcpyDeviceToHost);
	cudaMemcpy(hostTable.pool, table.pool, N * sizeof(Entry), cudaMemcpyDeviceToHost);

	for (int i = 0; i<N; i++) {
		if (hostTable.pool[i].next != NULL)
			hostTable.pool[i].next = (Entry*)((size_t)hostTable.pool[i].next - (size_t)table.pool + (size_t)hostTable.pool);
	}
	for (int i = 0; i<table.count; i++) {
		if (hostTable.entries[i] != NULL)
			hostTable.entries[i] = (Entry*)((size_t)hostTable.entries[i] - (size_t)table.pool + (size_t)hostTable.pool);
	}
}
__global__ void add_to_table(unsigned int *keys, float *values, Table table, Lock *lock)
{
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	int stride = blockDim.x*gridDim.x;
	while (tid<N) {
		unsigned int key = keys[tid];
		size_t hashValue = hash(key, table.count);
		for (int i = 0; i<32; i++)
		{
			if ((tid % 32) == i)
			{
				Entry *location = &(table.pool[tid]);
				location->key = key;
				location->value = values[tid];
				lock[hashValue].lock();
				location->next = table.entries[hashValue];
				table.entries[hashValue] = location;					
				lock[hashValue].unlock();
			}
		}
		tid += stride;
	}
}

void hash_table_test()
{
	unsigned int *buffer = (unsigned int*)malloc(N * sizeof(unsigned int));
	for (int i = 0; i<N; i++) buffer[i] = (unsigned int)i%(1024*1024);
	float *values;
	values = (float*)malloc(N * sizeof(float));
	for (int i = 0; i<N; i++)
		values[i] = (float)((i % ( 1024*1024))*0.01);

	printf("N=%d\n",N);
	printf("threadsPerBlock=%d\n", threadsPerBlock);
	printf("blocksPerGrid=%d\n", blocksPerGrid);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);

	unsigned int *dev_keys;
	float *dev_values;

	cudaMalloc((void**)&dev_keys, N * sizeof(unsigned int));
	cudaMalloc((void**)&dev_values, N * sizeof(float));
	cudaMemcpy(dev_keys, buffer, N * sizeof(unsigned int), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_values, values, N * sizeof(float), cudaMemcpyHostToDevice);

	Table table;
	initialize_table(table, HASH_ENTRIES, N);

	Lock lock[HASH_ENTRIES];
	Lock *dev_lock;

	cudaMalloc((void**)&dev_lock, HASH_ENTRIES * sizeof(Lock));
	cudaMemcpy(dev_lock, lock, HASH_ENTRIES * sizeof(Lock), cudaMemcpyHostToDevice);

	add_to_table<<<blocksPerGrid, threadsPerBlock>>>(dev_keys, dev_values, table, dev_lock);

	Table hostTable;
	copy_table_to_host(table, hostTable);

	Entry *head;
	int count = 0;
	for (head = hostTable.entries[0]; head != NULL; head = head->next)
	{
		printf("%dth entry's key=%d\n", count++, head->key);
	}

	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);
	printf("Time to hash: %3.1f ms\n", elapsedTime);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);

	free_table(table);
	cudaFree(dev_lock);
	cudaFree(dev_keys);
	cudaFree(dev_values);
	free(buffer);

}

int main() {

	hash_table_test();
}

The device properties are as follows:

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 880M”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 8192 MBytes (8589934592 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 993 MHz (0.99 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1, Device0 = GeForce GTX 880M
Result = PASS

First of all, I’m not able to reproduce the issue. I’ve tried the following combinations:

#define HASH_ENTRIES 1024
const int N =  100*1024*1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = 64;

and:

#define HASH_ENTRIES 1024
const int N =  2*1024*1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = 64;

and:

#define HASH_ENTRIES 101
const int N =  2*1024*1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = 64;

I am using a Tesla P100 on CUDA 10 on Linux. I don’t have a GTX 880M on CUDA 9 on Windows, or anything close to it. In all cases the program generates a plausible printout and terminates normally without errors.

Instead of saying the program “didn’t work”, it would be more helpful if you described the behavior exactly. Did it print anything? Did it terminate normally or hang? When you ran it under cuda-memcheck, did you observe any errors?

I don’t observe any runtime errors, but demonstrating proper CUDA error checking, especially when you are having trouble with a code, is good practice.

You appear to be running on Windows. If a kernel exceeds a particular duration, usually about 2 seconds, the WDDM watchdog will time out, halt the kernel, and reset the GPU. If you haven’t addressed this possibility, you should do so.
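
As a minimal sketch (just an illustration, not the only way to do it), something like the following catches both launch errors and asynchronous execution errors, including a kernel killed by the WDDM watchdog:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Minimal error-checking helper: wrap every runtime API call, and check the
// kernel launch with cudaGetLastError() followed by cudaDeviceSynchronize().
#define cudaCheck(call)                                                  \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);       \
            exit(1);                                                     \
        }                                                                \
    } while (0)

// usage after the kernel launch in hash_table_test():
//   add_to_table<<<blocksPerGrid, threadsPerBlock>>>(dev_keys, dev_values, table, dev_lock);
//   cudaCheck(cudaGetLastError());       // launch/configuration errors
//   cudaCheck(cudaDeviceSynchronize());  // execution errors, e.g. a watchdog timeout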

Finally, this sort of lock programming can be hazardous. At first glance you seem to have avoided the usual hazard, but I could probably write a paragraph or two on the next hazard your code faces. I’d rather not get into that now, as it’s not clear to me that it is the source of your problem, or what your problem is exactly.