Performance issue when calling cudaMallocManaged many times?


I am trying to build an array of unsigned char* pointers. So far I have been able to do that with the following host code:

for (std::size_t x = 0; x < seeds.size(); x++) { // Convert all seeds to bytes
    std::vector<unsigned char> tempStore(seeds[x].begin(), seeds[x].end());
    sizesOfBytes[x] = tempStore.size();

    // Two separate device allocations per seed: the seed bytes and a 64-byte hash buffer
    thrust::device_ptr<unsigned char> pStore = thrust::device_malloc<unsigned char>(tempStore.size());
    thrust::device_ptr<unsigned char> pHash = thrust::device_malloc<unsigned char>(64);

    // Copy the seed bytes from the host to the device
    thrust::copy(tempStore.begin(), tempStore.end(), pStore);
    // ... (rest of the loop body is not shown)
}

Later, I allocate an array of unsigned char* with cudaMallocManaged and copy all the pointers stored in hashDStore into it.

unsigned char** pHashes = nullptr;
cudaMallocManaged(&pHashes, hashDStore.size() * sizeof(unsigned char*));
thrust::copy(hashDStore.begin(), hashDStore.end(), pHashes);

This works and I am able to pass this pointer to device functions.

My problem is that repeatedly calling thrust::device_malloc takes a very long time.

How can I solve this problem? I am a beginner, so any guidance is appreciated.

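Make one large allocation of sufficient size, then compute the pointers to the individual chunks yourself using basic pointer arithmetic. Every device allocation has a fixed cost, so a single call is much cheaper than two calls per seed. This technique is called memory pooling.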
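Here is a minimal sketch of that idea, not a drop-in replacement for your code. It assumes seeds is a std::vector<std::string> (the question does not show its type), and the names PoolLayout, computeLayout, packSeeds, DevicePool and uploadPool are made up for illustration. The host part works out where each seed starts inside one contiguous buffer. The part guarded by __CUDACC__ is only compiled by nvcc and performs the single allocation, with error checking left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Where each seed lives inside one contiguous pool.
struct PoolLayout {
    std::vector<std::size_t> offsets; // offsets[i] = first byte of seed i in the pool
    std::vector<std::size_t> sizes;   // sizes[i]   = length of seed i in bytes
    std::size_t totalBytes = 0;       // sum of all seed lengths
};

// Running sum of the seed lengths gives the start offset of every chunk.
PoolLayout computeLayout(const std::vector<std::string>& seeds) {
    PoolLayout layout;
    for (const std::string& s : seeds) {
        layout.offsets.push_back(layout.totalBytes);
        layout.sizes.push_back(s.size());
        layout.totalBytes += s.size();
    }
    return layout;
}

// Pack all seeds back to back into one host staging buffer.
std::vector<unsigned char> packSeeds(const std::vector<std::string>& seeds,
                                     const PoolLayout& layout) {
    assert(layout.offsets.size() == seeds.size());
    std::vector<unsigned char> staging(layout.totalBytes);
    for (std::size_t i = 0; i < seeds.size(); ++i) {
        std::copy(seeds[i].begin(), seeds[i].end(),
                  staging.begin() + layout.offsets[i]);
    }
    return staging;
}

#ifdef __CUDACC__ // device part, compiled only by nvcc
#include <cuda_runtime.h>

struct DevicePool {
    unsigned char* seeds;  // seed i starts at seeds + layout.offsets[i]
    unsigned char* hashes; // hash i starts at hashes + 64 * i
};

// One cudaMalloc and one cudaMemcpy instead of two allocations per seed.
DevicePool uploadPool(const std::vector<unsigned char>& staging,
                      std::size_t seedCount) {
    DevicePool pool{nullptr, nullptr};
    const std::size_t hashBytes = 64 * seedCount; // 64 bytes per hash, as in the question
    cudaMalloc(&pool.seeds, staging.size() + hashBytes);
    cudaMemcpy(pool.seeds, staging.data(), staging.size(),
               cudaMemcpyHostToDevice);
    pool.hashes = pool.seeds + staging.size(); // hash area follows the seed bytes
    return pool;
}
#endif
```

With this layout a kernel no longer needs a table of pointers. It can take the two base pointers plus the offsets and sizes arrays (uploaded to the device the same way) and compute seeds + offsets[i] and hashes + 64 * i itself. Freeing everything is a single cudaFree(pool.seeds).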