cudaHostAlloc question

Hi,
I’m profiling the code below and I see that every odd-numbered allocation of 1MB of pinned memory takes 600-800us, while every even-numbered one takes 4-5us.
I’m compiling it like this and then running it under nvprof:

nvcc -m64  -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_60,code=compute_60 -lnvToolsExt host_malloc.cu  -o cudaHostAllocTest

Any idea why this is happening?

#include "cuda_runtime.h"
#include "nvToolsExt.h"
#include <stdio.h>
#include <cstdint>
#include <iostream>

using namespace std;
const uint32_t colors[] = { 0xff00ff00, 0xff0000ff, 0xffffff00, 0xffff00ff, 0xff00ffff, 0xffff0000, 0xffffffff };
const int num_colors = sizeof(colors)/sizeof(uint32_t);


int main()
{
	cout << "Main" << std::endl;
	const int N = 1000;
	const size_t ALLOCATION_SIZE = 1024 * 1024;
	int *data[N];
	
	// Just to warm up the CUDA runtime.
	int *pdummy;
	cudaMalloc((void **)&pdummy, 4);

	char name[100];
	nvtxEventAttributes_t eventAttrib = {0};
	// NVTX requires the version and size fields to be filled in.
	eventAttrib.version = NVTX_VERSION;
	eventAttrib.size = NVTX_EVENT_ATTRIB_SIZE;
	for (int i = 0; i < N; i++)
	{
		int color_id = i % num_colors;
		eventAttrib.colorType = NVTX_COLOR_ARGB;
		eventAttrib.color = colors[color_id];
		eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
		sprintf(name, "cudaHostAlloc.%d", i);
		eventAttrib.message.ascii = name;
		
		nvtxRangePushEx(&eventAttrib);
		cudaHostAlloc(&(data[i]), ALLOCATION_SIZE, cudaHostAllocPortable);
		nvtxRangePop();
	}

	for (int i = 0; i < N; i++)
	{
		data[i][i % 10] = i;
	}

	for (int i = 0; i < N; i++)
		cudaFreeHost(data[i]);
	
	cudaDeviceReset();
	
	return 0;
}

Just to add:

This is what nvprof reports for 2MB buffers:
Time(%)      Time  Calls       Avg       Min       Max  Name
 52.62%  349.47ms    500  698.94us  540.46us  983.58us  cudaHostAlloc

4MB:
Time(%)      Time  Calls       Avg       Min       Max  Name
 56.69%  531.37ms    500  1.0627ms  897.13us  1.3280ms  cudaHostAlloc

And for 64MB:
Time(%)      Time  Calls       Avg       Min       Max  Name
 47.39%  636.98ms     50  12.740ms  12.081ms  16.595ms  cudaHostAlloc

I wouldn’t be surprised if there is some kind of underlying allocation granularity of 2MB. The even/odd variability goes away at a 2MB allocation size, and if you switch to 0.5MB you see a repeating pattern of one long allocation followed by three short ones. The host operating system may pin a 2MB chunk and then carve allocations out of it until it is exhausted, or it may be CUDA that is doing this. You see a similar pattern in the cudaFreeHost calls as well, which suggests to me that it is the host OS, not CUDA, that is doing the quantization.
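If you want to eyeball that pattern without nvprof, something like the following host-side timing sketch should work. The size list, iteration count, and output format are just illustrative choices on my part, not anything from the test above; compile with -std=c++11 or later so <chrono> is available.

#include "cuda_runtime.h"
#include <chrono>
#include <cstdio>

int main()
{
	// Sizes chosen to bracket the suspected 2MB granularity.
	const size_t sizes[] = { 512 * 1024, 1024 * 1024, 2 * 1024 * 1024 };
	const int ITERS = 16;
	void *bufs[ITERS];

	// Warm up the CUDA runtime so context creation doesn't skew the first timing.
	void *pdummy;
	cudaMalloc(&pdummy, 4);

	for (size_t s : sizes)
	{
		printf("allocation size: %zu bytes\n", s);
		for (int i = 0; i < ITERS; i++)
		{
			// Time each cudaHostAlloc call individually with a host timer.
			auto t0 = std::chrono::high_resolution_clock::now();
			cudaHostAlloc(&bufs[i], s, cudaHostAllocPortable);
			auto t1 = std::chrono::high_resolution_clock::now();
			double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
			printf("  alloc %2d: %8.1f us\n", i, us);
		}
		for (int i = 0; i < ITERS; i++)
			cudaFreeHost(bufs[i]);
	}

	cudaFree(pdummy);
	cudaDeviceReset();
	return 0;
}

Timing each call individually makes the long/short alternation below 2MB, and its disappearance at the 2MB size, easy to see in the output.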

Thanks a lot Robert!