Why does cudaMalloc mysteriously change host memory?

Hi, I am writing a CUDA-MPI hybrid program and the external CUDA function does not work. I have a structure called our that holds host information, and inside this struct there is a variable called our_number_of_people. After I set this number to 50, I call cudaMalloc and the 50 gets reset to 0. Why is cudaMalloc touching host memory? I put two print statements around one of the cudaMalloc calls and they print different values…

void cuda_init(struct global_t *global, struct our_t *our, 
	struct stats_t *stats, struct cuda_t *cuda)
{
	cuda->our_size = sizeof(int) * our->our_number_of_people;
	cuda->their_size = sizeof(int) * global->total_number_of_people;
	cuda->our_states_size = sizeof(char) * our->our_number_of_people;

	// cuda memory allocation
	cudaMalloc((void**)&cuda->their_infected_x_locations_dev, cuda->their_size);
	cudaMalloc((void**)&cuda->their_infected_y_locations_dev, cuda->their_size);

	printf("Place 1 is %d \n", our->our_number_of_people);

	HANDLE_ERROR( cudaMalloc((void**)&cuda->our_x_locations_dev, cuda->our_size) );

	printf("Place 2 is %d \n", our->our_number_of_people);

	HANDLE_ERROR( cudaMalloc((void**)&cuda->our_y_locations_dev, cuda->our_size) );
	HANDLE_ERROR( cudaMalloc((void**)&cuda->our_states_dev, cuda->our_states_size) );
	HANDLE_ERROR( cudaMalloc((void**)&cuda->our_num_days_infected_dev, cuda->our_size) );

	cudaMalloc((void**)&cuda->our_num_susceptible_dev, sizeof(int));
	cudaMalloc((void**)&cuda->our_num_immune_dev, sizeof(int));
	cudaMalloc((void**)&cuda->our_num_dead_dev, sizeof(int));
	cudaMalloc((void**)&cuda->our_num_infected_dev, sizeof(int));

	cudaMalloc((void**)&cuda->our_num_infections_dev, sizeof(int));
	cudaMalloc((void**)&cuda->our_num_infection_attempts_dev, sizeof(int));
	cudaMalloc((void**)&cuda->our_num_deaths_dev, sizeof(int));
	cudaMalloc((void**)&cuda->our_num_recovery_attempts_dev, sizeof(int));

	cuda->our_num_infections_int = (int)stats->our_num_infections;
	cuda->our_num_infection_attempts_int = (int)stats->our_num_infection_attempts;
	cuda->our_num_deaths_int = (int)stats->our_num_deaths;
	cuda->our_num_recovery_attempts_int = (int)stats->our_num_recovery_attempts;

	cudaMemcpy(cuda->their_infected_x_locations_dev, global->their_infected_x_locations, cuda->their_size, cudaMemcpyHostToDevice);
	HANDLE_ERROR( cudaMemcpy(cuda->their_infected_y_locations_dev, global->their_infected_y_locations, cuda->their_size, cudaMemcpyHostToDevice) );
	HANDLE_ERROR( cudaMemcpy(cuda->our_x_locations_dev, our->our_x_locations, cuda->our_size, cudaMemcpyHostToDevice) );
	cudaMemcpy(cuda->our_y_locations_dev, our->our_y_locations, cuda->our_size, cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_states_dev, our->our_states, cuda->our_states_size, cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_num_days_infected_dev, our->our_num_days_infected, cuda->our_size, cudaMemcpyHostToDevice);

	cudaMemcpy(cuda->our_num_susceptible_dev, &our->our_num_susceptible, sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_num_immune_dev, &our->our_num_immune, sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_num_dead_dev, &our->our_num_dead, sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_num_infected_dev, &our->our_num_infected, sizeof(int), cudaMemcpyHostToDevice);

	cudaMemcpy(cuda->our_num_infections_dev, &cuda->our_num_infections_int, sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_num_infection_attempts_dev, &cuda->our_num_infection_attempts_int, sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_num_deaths_dev, &cuda->our_num_deaths_int, sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(cuda->our_num_recovery_attempts_dev, &cuda->our_num_recovery_attempts_int, sizeof(int), cudaMemcpyHostToDevice);

	// set up 1D array for cuda
	cuda->numThread = 128;
	int tempBlock = (our->our_number_of_people+cuda->numThread-1)/cuda->numThread;
	cuda->numBlock = (32 < tempBlock ? 32 : tempBlock);

	// set up cuda Random Number Generator
	cudaMalloc(&cuda->cuda_states, cuda->numThread * cuda->numBlock);
	time_t current_time;
	time(&current_time);
	rand_kernel<<<cuda->numBlock, cuda->numThread>>>(cuda->cuda_states, 
		(unsigned long)current_time);
}

cudaMalloc DOES change some host memory… but only 64 bits of it. The first argument is a pointer to a pointer, and cudaMalloc writes the address of the allocated device memory through it.

But how are your structs defined? If you mis-declared a member type, you might have “their_infected_x_locations_dev” as a 32-bit int or something instead of a 64-bit pointer. Then cudaMalloc would overwrite not just that 32-bit int, but the NEXT structure member too.
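
To make that concrete, here is a minimal sketch of the failure mode (the struct here is invented for illustration, it is not your actual code):

// Hypothetical example only: a device-pointer member wrongly declared
// as a 32-bit int. cudaMalloc writes a full 64-bit device pointer
// through its first argument, so the extra 4 bytes land in whatever
// member happens to sit next to it in the struct.
struct bad_t {
	int our_buffer_dev;          // WRONG: should be `int *our_buffer_dev`
	int our_number_of_people;    // innocent neighbor
};

struct bad_t bad;
bad.our_number_of_people = 50;
cudaMalloc((void**)&bad.our_buffer_dev, 50 * sizeof(int));
printf("%d\n", bad.our_number_of_people);   // may no longer print 50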

Below is the definition of the cuda_t struct… I don’t think the pointer “their_infected_x_locations_dev” is causing the problem. The interesting part is that the first two cudaMalloc calls work just fine, and it goes wrong from the third one onward. I also included the main function; the cudaMalloc calls happen inside “kernel_functions”.

struct cuda_t {
    #ifdef __CUDACC__
    // allocate pointers to objects on device memory
	int *their_infected_x_locations_dev; 
	int *their_infected_y_locations_dev;

	int *our_x_locations_dev;
	int *our_y_locations_dev; 
	int *our_num_days_infected_dev;
	char *our_states_dev;

	int *our_num_susceptible_dev;
	int *our_num_immune_dev;
	int *our_num_dead_dev;
	int *our_num_infected_dev;

	int *our_num_infections_dev;
	int *our_num_infection_attempts_dev;
	int *our_num_deaths_dev;
	int *our_num_recovery_attempts_dev;

	curandState *cuda_states;

	int our_num_infections_int;
	int our_num_infection_attempts_int;
	int our_num_deaths_int;
	int our_num_recovery_attempts_int;

	int our_size;
	int their_size;
	int our_states_size;

	int numThread;
	int numBlock;
	#endif
};
int main(int argc, char ** argv) {

	int our_current_day;

	struct global_t global;
	struct our_t our;
	struct const_t constant;
	struct stats_t stats;
	struct display_t dpy;
	struct cuda_t cuda;
	
	init(&global, &our, &constant, &stats, &dpy, &argc, &argv);

	/* ALG XIV: Each process starts a loop to run the simulation for the
	 *  specified number of days */
	
	for(our_current_day = 0; our_current_day <= constant.total_number_of_days; 
		our_current_day++)
	{

		find_infected(&global, &our);

		share_infected(&global, &our);

		share_location(&global, &our);

		do_display(&global, &our, &constant, &dpy);

		throttle(&constant);

		kernel_functions(&global, &our, &constant, &stats, &cuda);

	}

	show_results(&our, &stats);

	cleanup(&global, &our, &constant, &dpy);

	exit(EXIT_SUCCESS);
}

Yep, those are correctly defined as pointers! Which rules out that theory, and makes the problem more mysterious.

I also thought about structure alignment, but those members are host pointers so it should be OK.
Unless you’re on a 32-bit OS and you’re forcing 64-bit GPU code with nvcc -m64? That would be Bad, but it would explain what is happening. I’m not even sure nvcc would let you do that.

My next step would be to put some padding members before and after the pointers. Initialize them to 0xDEADBEEF or something, and see whether they’re changed by the malloc; they likely will be if this problem keeps its strange behavior (see the sketch below). Then I’d start simplifying down to as small a structure as possible until I had a tiny repro case to submit as an NVIDIA bug. Often during such simplification some OTHER problem is revealed, but in that case you have a new lead to explore.
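
For example (a sketch only; the canary members are of course not meant to stay in your real struct):

struct cuda_t {
	unsigned int canary_before;                // set to 0xDEADBEEF before the mallocs
	int *their_infected_x_locations_dev;
	int *their_infected_y_locations_dev;
	unsigned int canary_after;                 // set to 0xDEADBEEF before the mallocs
	/* ... rest of the members ... */
};

cuda->canary_before = 0xDEADBEEF;
cuda->canary_after  = 0xDEADBEEF;
HANDLE_ERROR( cudaMalloc((void**)&cuda->their_infected_x_locations_dev, cuda->their_size) );
if (cuda->canary_before != 0xDEADBEEF || cuda->canary_after != 0xDEADBEEF)
	printf("cudaMalloc wrote past the pointer members!\n");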

If you are on a Linux or OS X system, it is also worth running your code with valgrind to check for any obvious memory errors. Whenever I see truly bizarre behavior, like what you are describing, I first go make sure there isn’t an obvious memory access error lurking in my code.

Valgrind won’t catch subtle alignment issues like the one SPWorley is describing, so consider it a complementary debugging tool.

After thinking about it, I realized what the problem probably is. Are you compiling your code with multiple modules, one with nvcc and one without?

If so, those #ifdef __CUDACC__ guards you have will cause craziness, because in plain gcc (or whatever your host compiler is) you’ll have a cuda_t structure of size 0. But the nvcc-compiled code will write into that structure, thinking it’s big, and will likely overwrite data randomly in stack memory.

Get rid of the __CUDACC__ guards entirely and see what happens.
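
Here is the effect in miniature (the file names are just for illustration):

/* shared header, seen by both compilers */
struct cuda_t {
	#ifdef __CUDACC__
	int *their_infected_x_locations_dev;
	/* ... many more members ... */
	#endif
};

/* main.c, compiled by the host compiler: __CUDACC__ is NOT defined, so
   sizeof(struct cuda_t) is essentially zero and main() reserves almost
   no stack space for `struct cuda_t cuda;`. */

/* cuda_init.cu, compiled by nvcc: __CUDACC__ IS defined, so
   sizeof(struct cuda_t) is large, and cuda_init() writes dozens of
   bytes through the pointer it was handed -- straight over whatever
   else lives on main()'s stack, such as your `our` struct. */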

Hi, SPWorley

I think you might be right about the __CUDACC__ macro… Since I am compiling a CUDA-MPI hybrid, part of the code is compiled with nvcc and the rest, including the main function, is compiled with mpicc. I think main() might be creating the cuda structure as an empty one, since mpicc does not see the part wrapped in __CUDACC__.

However, I would expect a bug like this to cause more damage than what I am seeing… I still don’t understand why the first two cudaMalloc calls work, or why cudaMalloc would change host memory…

The confusion is that cudaMalloc does not return a pointer the way malloc() does. cudaMalloc WRITES its pointer to the host memory location you specify, which is why there is that ugly (void **)&ptr cast.
The problem is that you’re giving cudaMalloc a pointer into memory you don’t own, because you never allocated it: you told the mpicc-compiled code that the structure has no size.

It’s a simple fix though: get rid of the __CUDACC__ guards completely. That will make the structure seen by nvcc and the structure seen by mpicc match, and you should be OK.
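
A quick way to confirm the two halves agree afterwards (just a sanity check, not part of your program): print the structure’s size from one translation unit of each kind and compare.

/* in a .c file compiled by mpicc */
printf("mpicc sees sizeof(struct cuda_t) = %zu\n", sizeof(struct cuda_t));

/* in a .cu file compiled by nvcc */
printf("nvcc sees sizeof(struct cuda_t) = %zu\n", sizeof(struct cuda_t));

With the guards removed both lines should print the same number; with the guards in place the mpicc line reports a (near-)zero size.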

Hi, SPWorley

Thank you so much for your reply. I got rid of the __CUDACC__ guards and the error went away immediately.

Many other bugs have surfaced after this change, but I will try to tackle them later…

Thanks

Yu