Can I use Independent Thread Scheduling and Cooperative Groups with CUDA 9 + Pascal?


As the title says: can I use Independent Thread Scheduling and Cooperative Groups with CUDA 9 on Pascal, or are they Volta-specific hardware features?

CUDA 9 RC is available starting today for download to registered developers. It includes an updated C programming guide which addresses these topics:

  1. Independent Thread Scheduling depends on Volta hardware features. Refer to the CUDA 9 RC programming guide, p. 247, sec. H.6.2:

“The Volta architecture introduces Independent Thread Scheduling among threads in a warp,…”

Also refer to section 3.1.4 for additional details on opting in or out of this feature on Volta.

  2. Cooperative Groups is not limited to Volta. Some aspects of it work on Pascal and earlier architectures. Other features depend on specific hardware support, and the programming guide describes how to query the underlying hardware for that support. I suggest reading appendix C of the new programming guide.
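To make the distinction above concrete, here is a minimal sketch, not taken from the programming guide. It assumes the runtime attribute cudaDevAttrCooperativeLaunch is the relevant capability to query for grid-wide synchronization; the helper names supportsCooperativeLaunch and tileSum are made up for illustration. Block-level and tile-level groups, as used in tileSum, should not need any such check.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical helper: returns true if device `dev` reports support for
// cooperative (grid-synchronizing) launches, false otherwise or on any error.
bool supportsCooperativeLaunch(int dev) {
    int supported = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    return err == cudaSuccess && supported != 0;
}

// Hypothetical kernel using only block- and tile-level groups.
// Each 32-thread tile sums its values with shuffles, then the first
// thread of the tile adds the partial sum to *out.
__global__ void tileSum(const int* in, int* out, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? in[i] : 0;  // out-of-range threads contribute 0

    // Tree reduction within the tile.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, v);
}
```

A caller would check supportsCooperativeLaunch(dev) before attempting a grid-wide sync, and could launch tileSum with an ordinary <<<...>>> launch (block size a multiple of 32) regardless of the result.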

First, where can I find the updated toolkit documentation? All links bring me to the 8.0 version.

Second, on the topic of the OP: am I understanding correctly that only active threads can be synchronized across different blocks? That is, if I can’t fit all my blocks on the SMs of my GPU at the same time, then I cannot synchronize every thread created by a kernel?

The documentation is installed with the CUDA toolkit. For EA and RC releases, NVIDIA does not host a public HTML copy, so you must use the PDF or HTML docs that are installed with the toolkit. On a standard Linux install, they will be in /usr/local/cuda-9.0/doc.

For your second question, please read the doc.

My understanding at this time: that feature is available in a “cooperative launch”, which requires special setup and supporting hardware. It does require co-residency of all threads/blocks participating in the synchronization, but in the case of concurrent kernels, this only needs to be “possible” for the kernel in question. The cooperative launch mechanism will bring about co-residency if needed to support the grid-wide sync. Read section C.3 in the doc.
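As a rough illustration of the co-residency requirement, the sketch below estimates how many blocks of a kernel can be resident on the device at once, using the occupancy API. This is my own sketch under stated assumptions, not the documented procedure verbatim: myKernel is a placeholder kernel and maxCoResidentBlocks is a hypothetical helper name.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel to be launched cooperatively.
__global__ void myKernel(int* data) {
    (void)data;
}

// Hypothetical helper: estimates the largest grid size (in blocks) for which
// all blocks of myKernel can be resident on device `dev` at the same time.
// Returns 0 on any error.
int maxCoResidentBlocks(int dev, int threadsPerBlock) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return 0;
    if (cudaSetDevice(dev) != cudaSuccess) return 0;

    // Blocks of myKernel that fit on one SM at this block size,
    // with no dynamic shared memory.
    int blocksPerSm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSm, myKernel, threadsPerBlock, 0) != cudaSuccess)
        return 0;

    return blocksPerSm * prop.multiProcessorCount;
}
```

A grid no larger than this value should be able to have all of its blocks resident together; a larger grid cannot, which is the situation the question above describes.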

This is a brand new feature, so it’s possible that I may need to revise my comments.

I am trying to use a cooperative kernel with CUDA 9.0 on Pascal. However, I cannot compile the following code.

#include <cooperative_groups.h>

__global__ void kernel(int* array, int* array2, int scalar) {
	cooperative_groups::grid_group g = cooperative_groups::this_grid();
	int tid = g.thread_rank();
	// some code 
	g.sync(); // Sync whole grid
	// some code 
}

int main() {
	// some code 
	int N = 65536;
	cudaDeviceProp deviceProp;
	cudaGetDeviceProperties(&deviceProp, dev);
	// Each entry of args must be the address of the corresponding kernel argument
	void *args[] = { &array1, &array2, &N };
	// Launch one block per SM so that all blocks can be resident at once
	cudaLaunchCooperativeKernel((void*)kernel, deviceProp.multiProcessorCount, 128, args);
}