How to release register

Hi, I’m new to CUDA programming. I’m working on a monolithic kernel function with 300+ lines of code and several device functions called by this kernel; the code running on each thread adds up to 700+ lines. My graphics card is a GeForce GTX 1060 3GB, which has 9 SMs and 65536 registers per SM. I noticed that the number of threads I can launch at the same time is limited to around 3000, while I need to parallelize 8000+ tasks. So I split my 8000+ tasks into 10 segments in a for loop. I thought registers would be released automatically once the kernel returns and I synchronize the device, but I still observe register spilling on the 4th loop iteration, which seems equivalent to launching 3000 threads in one stroke. Please tell me how I can avoid register spilling without resetting the whole device.

The compiler does this for you: it will automatically release and reuse registers. If you have register pressure, it’s because the compiler believes that level of register usage is best for performance.

There isn’t anything you need to do (or can do directly) to “release” registers. A number of your statements are confusing to me, but this is often the case in my experience when there is no code to look at.
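
If you do want to influence the compiler’s register allocation (as opposed to “releasing” registers), the usual knobs are the `-maxrregcount` nvcc flag (per compilation unit) or a `__launch_bounds__` qualifier (per kernel). A minimal sketch, with a hypothetical kernel:

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells the compiler to cap per-thread register usage so that at least
// minBlocksPerMultiprocessor blocks of up to maxThreadsPerBlock threads
// can be resident on one SM. Tightening this can itself introduce
// spills, so it is a trade-off, not a free lunch.
__global__ void __launch_bounds__(256, 2)
my_kernel(const float *in, float *out)   // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}
```

The whole-compilation-unit equivalent is `nvcc -maxrregcount=64 ...`; either way, if the cap is too tight the compiler decides which values to spill to local memory.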

Hi Robert,

Thank you for the prompt reply! Part of my code is as follow

  1. cudasafe(cudaMalloc((void**)&coef_det_d, spline_x * spline_y * spline_z * num_coef_per_pix * sizeof(float)), "Mem alloc for PSF_det failed.", __LINE__);

  2. cudasafe(cudaMalloc((void**)&coef_exc_d, spline_z * num_coef_per_pix_axial * sizeof(float)), "Mem alloc for PSF_exc failed.", __LINE__);

  3. cudasafe(cudaMalloc((void**)&data_d, seg_size * seg_size * slice_num * emitter_num * sizeof(float)), "Mem alloc for seg_data failed.", __LINE__);

  4. cudasafe(cudaMalloc((void**)&offset_map_d, cam_map_size * cam_map_size * sizeof(float)), "Mem alloc for offset_map failed.", __LINE__);

  5. cudasafe(cudaMalloc((void**)&var_map_d, cam_map_size * cam_map_size * sizeof(float)), "Mem alloc for var_map failed.", __LINE__);

  6. cudasafe(cudaMalloc((void**)&gain_map_d, cam_map_size * cam_map_size * sizeof(float)), "Mem alloc for gain_map failed.", __LINE__);

  7. cudasafe(cudaMalloc((void**)&map_ptr_x_d, emitter_num * sizeof(float)), "Mem alloc for LUT_x failed.", __LINE__);

  8. cudasafe(cudaMalloc((void**)&map_ptr_y_d, emitter_num * sizeof(float)), "Mem alloc for LUT_y failed.", __LINE__);

  9. cudasafe(cudaMalloc((void**)&fitting_para_d, fit_para_num * emitter_num * sizeof(float)), "Mem alloc for fitting_parameters failed.", __LINE__);

  10. cudasafe(cudaMalloc((void**)&CRLBs_d, fit_para_num * emitter_num * sizeof(float)), "Mem alloc for CRLB failed.", __LINE__);
    
  11. cudasafe(cudaMalloc((void**)&LogLikelihood_d, emitter_num * sizeof(float)), "Mem alloc for log_likelihood failed.", __LINE__);
    
  12. cudasafe(cudaMalloc((void**)&device_debug_d, emitter_num * 100 * sizeof(float)), "Mem alloc for device_debug failed.", __LINE__);
    
  13. cudasafe(cudaMalloc((void**)&para_config_d, 3*sizeof(int)), "Mem alloc for num_para failed.", __LINE__);
    
  14. cudasafe(cudaMemcpy(coef_det_d, coef_det_h, spline_x * spline_y * spline_z * num_coef_per_pix * sizeof(float), cudaMemcpyHostToDevice), "Memory for PSF_det copy failed", __LINE__);

  15. cudasafe(cudaMemcpy(coef_exc_d, coef_exc_h, spline_z * num_coef_per_pix_axial * sizeof(float), cudaMemcpyHostToDevice), "Memory for PSF_exc copy failed", __LINE__);

  16. cudasafe(cudaMemcpy(data_d, data_h, seg_size * seg_size * slice_num * emitter_num * sizeof(float), cudaMemcpyHostToDevice), "Memory for seg_data copy failed", __LINE__);

  17. cudasafe(cudaMemcpy(offset_map_d, offset_map_h, cam_map_size * cam_map_size * sizeof(float), cudaMemcpyHostToDevice), "Memory for offset_map copy failed", __LINE__);

  18. cudasafe(cudaMemcpy(var_map_d, var_map_h, cam_map_size * cam_map_size * sizeof(float), cudaMemcpyHostToDevice), "Memory for var_map copy failed", __LINE__);

  19. cudasafe(cudaMemcpy(gain_map_d, gain_map_h, cam_map_size * cam_map_size * sizeof(float), cudaMemcpyHostToDevice), "Memory for gain_map copy failed", __LINE__);

  20. cudasafe(cudaMemcpy(map_ptr_x_d, map_ptr_x_h, emitter_num * sizeof(float), cudaMemcpyHostToDevice), "Memory for LUT_x copy failed", __LINE__);

  21. cudasafe(cudaMemcpy(map_ptr_y_d, map_ptr_y_h, emitter_num * sizeof(float), cudaMemcpyHostToDevice), "Memory for LUT_y copy failed", __LINE__);

  22. cudasafe(cudaMemset(fitting_para_d, 0, fit_para_num * emitter_num * sizeof(float)), "Failed cudaMemset on fitting_parameters.", __LINE__);

  23. cudasafe(cudaMemset(CRLBs_d, 0, fit_para_num * emitter_num * sizeof(float)), "Failed cudaMemset on CRLB.", __LINE__);

  24. cudasafe(cudaMemset(LogLikelihood_d, 0, emitter_num * sizeof(float)), "Failed cudaMemset on log_likelihood.", __LINE__);

  25. cudasafe(cudaMemset(device_debug_d, 0, emitter_num * 100 * sizeof(float)), "Failed cudaMemset on device_debug.", __LINE__);
    
  26. // cuda_kernel start 
    
  27. dim3 dimBlock = block_size;  //256 threads per block   index from 0 to 255
    
  28. dim3 dimGrid; 
    
  29. for (int iter = 0; iter < calc_seg; iter++)
    
  30. {
    
  31. 	int emitter_ini = 1;
    
  32. 	for (int j = 0; j < iter; j++)
    
  33. 		emitter_ini += calc_seg_length[j];
    
  34. 	int cur_seg_size = calc_seg_length[iter];
    
  35. 	*(para_config_h + 1) = emitter_ini;
    
  36. 	*(para_config_h + 2) = cur_seg_size;
    
  37. 	dimGrid = ceil((float)cur_seg_size / (float)block_size);
    
  38. 	*para_config_h = fit_para_num;
    
  39. 	cudasafe(cudaMemcpy(para_config_d, para_config_h, 3 * sizeof(int), cudaMemcpyHostToDevice), "Memory for num_para copy failed", __LINE__);    // LS offset estimate and initialize fitting parameter
    
  40. 	cuda_fitting(dimGrid, dimBlock, para_config_d, coef_det_d, coef_exc_d, data_d, offset_map_d, var_map_d, gain_map_d, map_ptr_x_d, map_ptr_y_d, fitting_para_d, CRLBs_d, LogLikelihood_d, device_debug_d);
    
  41. 	cudasafe(cudaDeviceSynchronize(), "sync failed", __LINE__);
    
  42. 	*para_config_h = fit_para_num - 1;
    
  43. 	cudasafe(cudaMemcpy(para_config_d, para_config_h, 3*sizeof(int), cudaMemcpyHostToDevice), "Memory for num_para copy failed", __LINE__);
    
  44. 	cudasafe(cudaMemset(device_debug_d, 0, emitter_num * 100 * sizeof(float)), "Failed cudaMemset on device_debug.", __LINE__);   // fine localization
    
  45. 	cuda_fitting(dimGrid, dimBlock, para_config_d, coef_det_d, coef_exc_d, data_d, offset_map_d, var_map_d, gain_map_d, map_ptr_x_d, map_ptr_y_d, fitting_para_d, CRLBs_d, LogLikelihood_d, device_debug_d);
    
  46. 	cudasafe(cudaDeviceSynchronize(), "sync failed", __LINE__);
    
  47. }
    
  48. cudasafe(cudaMemcpy(fitting_para_h, fitting_para_d, fit_para_num * emitter_num * sizeof(float), cudaMemcpyDeviceToHost),"cudaMemcpy failed for fitting_parameters.", __LINE__);
    
  49. cudasafe(cudaMemcpy(CRLBs_h, CRLBs_d, fit_para_num * emitter_num * sizeof(float), cudaMemcpyDeviceToHost), "cudaMemcpy failed for CRLB.", __LINE__);
    
  50. cudasafe(cudaMemcpy(LogLikelihood_h, LogLikelihood_d, emitter_num * sizeof(float), cudaMemcpyDeviceToHost), "cudaMemcpy failed for log_likelihood.", __LINE__);
    
  51. cudasafe(cudaMemcpy(device_debug_h, device_debug_d, emitter_num * 100 * sizeof(float), cudaMemcpyDeviceToHost), "cudaMemcpy failed for device_debug.", __LINE__);
    
  52. // cuda_kernel end
    

From line 1 to line 25 I allocate global memory.
Lines 26 to 47 are the main loop that calls my kernel function cuda_fitting, an external C wrapper invoking a __global__ function defined in a .cu file. The grid dimension is calculated dynamically and then passed to cuda_fitting. In each cycle cuda_fitting is invoked twice, with around 800 threads launched each time. I assumed the compiler automatically releases and reuses registers every time cuda_fitting returns, but in reality after 3 cycles I run into the register spilling problem in exactly the same way as if I had launched 2400 threads at the same time!

Questions seeking assistance with debugging or optimization should include a minimal complete example that reproduces the issue of interest. In other words, code that others can cut from the forum post, paste into their favorite editor or IDE, compile and run to locally reproduce whatever the issue at hand is.

Think about collaborative internet debugging/tuning like getting your car fixed. A car mechanic needs access to the entire vehicle in order to reproduce the customer’s issue before they can diagnose and fix the problem. Showing up at the mechanic’s with a quarter of a car (“I think the problem is in the driver-side front of the car, so I brought that portion”) and describing the rest of the car to them generally won’t work.

I second what njuffa said. The above statement doesn’t make sense to me. Register spills are observable/detectable/identifiable at compile time. Whether or not registers are spilled has nothing to do with when or how often you call kernels. If a kernel is going to spill registers, you know that when you compile it, and it will spill them every time you call it, assuming the relevant code paths are executed.
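
You can see this for yourself at compile time: building with `nvcc -Xptxas -v` makes ptxas print per-kernel register counts and spill statistics. A sketch, with the kernel and the numbers purely illustrative (exact output format varies by toolkit version):

```cuda
// Build with:  nvcc -arch=sm_61 -Xptxas -v my_code.cu
// ptxas then prints, for each __global__ function, something like:
//
//   ptxas info : Function properties for _Z14example_kernelPf
//       0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
//   ptxas info : Used 64 registers, 376 bytes cmem[0]
//
// Nonzero "spill stores" / "spill loads" is the compile-time signature
// of register spilling; the numbers are fixed at compile time and do
// not change with how many times, or how often, the kernel is launched.
__global__ void example_kernel(float *out)   // hypothetical kernel
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}
```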

Perhaps I didn’t really understand what register spilling means and used the jargon wrongly. My problem is that if I launch more than 2400 threads at the same time, the kernel doesn’t execute and the host code terminates without any error message. Only when I decrease the number of threads does the code execute successfully. So I segmented my data and processed each segment (800 threads per segment) stepwise in a for loop. But my code is still forced to quit after 2400 calculations, whether I launch 800 threads stepwise three times or 2400 threads in one go. I have worked around this by resetting the device and re-allocating global memory at the end of each loop, but there must be a more elegant solution. The problem is highly device dependent: on a more powerful GPU it doesn’t occur. That is why I suspected register spilling in the first place. Do you have any clue about this problem?

No I don’t have any clue without seeing a complete example. When posting code, please don’t use a numbered list like you have shown. A simple set of instructions is as follows:

  • copy and paste your code into the edit window
  • select the code
  • click the </> button at the top of the edit pane
  • save your edits
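
In the meantime, one thing worth ruling out: a kernel launch that fails (for example because the requested configuration exceeds a resource limit) is silent unless you query the error state afterwards. A minimal sketch of checking both launch and execution errors, with a hypothetical kernel standing in for your wrapper:

```cuda
#include <cstdio>

__global__ void my_kernel(float *out)   // hypothetical kernel
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    float *out_d;
    cudaMalloc(&out_d, 2400 * sizeof(float));

    my_kernel<<<10, 240>>>(out_d);

    // cudaGetLastError catches launch errors (invalid configuration,
    // too many resources requested) that occur before the kernel runs.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // cudaDeviceSynchronize surfaces errors raised during execution.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(out_d);
    return 0;
}
```

Checking after every launch (not only after the cudaDeviceSynchronize calls) would tell you whether the “silent termination” is actually a reported launch failure going unread.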