Dynamic creation of streams and register reduction.

Hello,

I need some help.

I want to split a matrix into chunks and then process each chunk in a different stream. I don’t know the width of the matrix in advance; I read a file and get the width from it at run time.
When I compile the code I get the message: “error: expression must have a constant value”

I could define a large number of streams and use only as many as I need, but that seems wasteful.

Is there any way to do this?

Code:

//Open the file.
//Get the width of the matrix and save it in the variable k.

cudaStream_t stream[k]; //
for (int i = 0; i < k; ++i)
cudaStreamCreate(&stream[i]);

The other thing I want to ask about is techniques for reducing the number of registers used.
My kernel uses around 47 registers per thread, which seems crazy. I have defined only 10 variables in my kernel;
only two of them are floats, and the rest are unsigned short.

Thanks in advance.

This

cudaStream_t stream[k];

where ‘k’ is a variable, is known as a variable-length array (VLA). As far as I know, VLAs are supported by C99 but not by C++. CUDA does not support VLAs, hence the message: “error: expression must have a constant value”

As for the surprisingly high register usage, the compiler may be creating temporary registers (for example, induction variables in loops), or some of the variables in the code may require two registers, for example pointers. If this happens with the CUDA 3.2 toolchain, and the code is fairly small, feel free to post a self-contained example and I can take a look, because 47 registers does sound a bit high for ten variables. [Later] Another reason for higher-than-expected register usage could be the use of operations that are implemented via software routines, such as integer or floating-point division, or calls to library functions.

Hello,

Thank you for your answer. I understand that I can’t use the variable k to size the array of streams. I have tried with pointers and nothing happens. Is there any way to set the number of streams during program
execution, instead of having to give a constant value before compilation?

My code is quite long, so I am posting only the instructions I think are critical.


    float temp, minx, sigma=998, resolution=2;
short i, j, k, l, m, n, basis;

unsigned short jmpcol = (dd.width/gridDim.x) * (1+blockIdx.x);
unsigned short indy= blockIdx.x *(dd.width/gridDim.x) + threadIdx.x;

unsigned short jmprow = (dd.height/gridDim.y) * (1+blockIdx.y);	
unsigned short indx= (blockIdx.y * (dd.height/gridDim.y)) + threadIdx.y;

unsigned short thnum = (BLOCK_SIZE*threadIdx.y)+threadIdx.x;

float2 var;

__shared__ float sdg[2][BLOCK_SIZE*BLOCK_SIZE];
__shared__ int widthg;
__shared__ int widthd;
__shared__ int widthr;

n= widthg/(BLOCK_SIZE*BLOCK_SIZE);

for(l=0;l<n;l++)
{
	sdg[0][thnum]=dg.elements[(BLOCK_SIZE*BLOCK_SIZE*l)+thnum];
	sdg[1][thnum]=dg.elements[(BLOCK_SIZE*BLOCK_SIZE*l)+thnum+widthg];

	m = NGPSC - (BLOCK_SIZE*BLOCK_SIZE*l);

	if(m>a)
		m=a;

	for(i=indx;i<jmprow;i+=BLOCK_SIZE)
	{
		for(j=indy;j<jmpcol;j+=BLOCK_SIZE)
		{
			var=dd.elements[i*widthd+j];
			for (k=0;k<m;k++)
			{
					basis=temp/resolution;
					di.elements[i*widthd+j]+=(dr.elements[(basis*widthr)+k]		
					+dr.elements[(basis+1)*widthr+k])/2;

Thank you.

Have you tried this:
cudaStream_t *stream = (cudaStream_t *)malloc(sizeof(cudaStream_t)*k);
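A minimal sketch of how that allocation might be used end to end, assuming k has already been read from the file. The helper names createStreams and destroyStreams are hypothetical and not part of the CUDA API; cudaStreamCreate and cudaStreamDestroy are the runtime calls doing the actual work.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: create k streams, where k is only known at run time.
// Returns NULL if k is not positive, if allocation fails, or if any
// stream cannot be created.
cudaStream_t *createStreams(int k)
{
    if (k <= 0) return NULL;

    // Heap allocation replaces the variable-length array "cudaStream_t stream[k]".
    cudaStream_t *streams = (cudaStream_t *)malloc(sizeof(cudaStream_t) * k);
    if (streams == NULL) return NULL;

    for (int i = 0; i < k; ++i) {
        if (cudaStreamCreate(&streams[i]) != cudaSuccess) {
            // Roll back the streams created so far, then give up.
            for (int j = 0; j < i; ++j) cudaStreamDestroy(streams[j]);
            free(streams);
            return NULL;
        }
    }
    return streams;
}

// Hypothetical helper: destroy the k streams and release the array.
void destroyStreams(cudaStream_t *streams, int k)
{
    if (streams == NULL) return;
    for (int i = 0; i < k; ++i) cudaStreamDestroy(streams[i]);
    free(streams);
}
```

Each chunk i would then be launched with streams[i] as the fourth launch parameter, and destroyStreams(streams, k) called once all work has finished.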

In order to have a look at the register usage, I would need code that is self-contained and can be compiled to a binary. To avoid a lot of code overcrowding a post, simply attach a file to it.

From looking at the snippet posted, two of the possible drivers of register pressure that I mentioned would appear to apply:

(1) There are nested loops, likely giving rise to the creation of multiple induction variables. These would be addresses / pointers, each of which would take up two registers on a 64-bit platform. Are you building for a 32-bit or a 64-bit platform?
(2) There is a floating-point division, which in its IEEE-compliant version (default for sm_2x compilation) maps to a software subroutine. I assume this code is being compiled for Fermi (sm_2x). What happens to the register pressure when the code is compiled with -ftz=true -prec-div=false -prec-sqrt=false?
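To make the two points above concrete, here is a small sketch. It is not the kernel from the post: binByResolution and binIndex are hypothetical names, and the kernel only mimics the "basis = temp / resolution" step. It uses a single integer loop index and hoists the division out of the loop as a reciprocal multiply. Whether this actually lowers the register count depends on the compiler and target, so the numbers reported by ptxas are what to compare.

```cuda
#include <cuda_runtime.h>

// To see the register count per kernel, and to try the relaxed
// floating-point flags mentioned above (file name is hypothetical):
//   nvcc -Xptxas -v -ftz=true -prec-div=false -prec-sqrt=false kernel.cu

// Bin index computed with a multiply instead of a divide. For a
// power-of-two resolution such as 2, the reciprocal is exact; for other
// values the result may differ from a true division by rounding.
__host__ __device__ inline int binIndex(float value, float invResolution)
{
    return (int)(value * invResolution);
}

// Hypothetical kernel: one integer induction variable per thread,
// and no division inside the loop body.
__global__ void binByResolution(const float *in, int *out, int n, float resolution)
{
    // The only division, performed once per thread.
    const float invResolution = 1.0f / resolution;

    // Grid-stride loop over the n input elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        out[i] = binIndex(in[i], invResolution);
    }
}
```

Comparing the ptxas output with and without the flags, and with and without the hoisted reciprocal, would show which of the two factors dominates in a given build.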

Hello,

Sorry for the delay, but I was away all week.

Thank you, Charley. I hadn’t tried that; now it works.

I am attaching the code that uses the kernel as a TXT file; I don’t know if that is what you meant.

Thanks in advance.

Code.txt (4.82 KB)