cuCtxSynchronize failed in a very simple function (pyCUDA)

I am new to CUDA programming. I have recently started learning it in Python using the pyCUDA library.

I want to write a function that computes Wt*d, where Wt is the transpose of a matrix W and d is a vector, without actually transposing W. It looks simple, but my code does not work.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

if __name__ == '__main__':

	nn = 64 # W rows
	nw = 40 # W columns

	w = np.random.randn(nw, nn).astype(np.float32)
	d = np.random.randn(nn).astype(np.float32)

	md = SourceModule("""
	#define BLOCK_SIZE 512

	__global__ void dt(float *w, float *d, float *z, int rows, int cols)
	{
		unsigned int bx = blockIdx.x;
		unsigned int tx = threadIdx.x;

		unsigned int idx = bx*BLOCK_SIZE + tx;

		float ep = 0;
		for(unsigned int i=0; i<rows; i++)
			ep += w[idx+i*cols]*d[i];
		z[idx] = ep;
	}
	""")

	dt = md.get_function("dt")

	z = np.zeros(nw)
	dt(cuda.In(w), cuda.In(d), cuda.Out(z), np.uint16(nn), np.uint16(nw), block=(512,1,1), grid=(1,1) )

And the error is:

Traceback (most recent call last):
  File "./t04.py", line 39, in <module>
	dt(cuda.In(w), cuda.In(d), cuda.Out(z), np.uint16(nn), np.uint16(nw), block=(512,1,1), grid=(1,1) )
  File "H:\opt\python26\lib\site-packages\pycuda-0.93rc4-py2.6-win32.egg\pycuda\driver.py", line 138, in function_call
	Context.synchronize()
pycuda._driver.LaunchError: cuCtxSynchronize failed: launch failed

I don’t understand why this happens. Very similar code computing W*d (with W untransposed) works just fine.

Does anyone have any idea?

If you are using a 9000-series or older card, I don't think you can launch 512 threads per block. Try 256.

That is incorrect.
Since G80 the maximum number of threads per block has been 512.
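
In any case, rather than guessing from the model number, the per-block limit can be queried at run time. Here is a small sketch using pyCUDA's device attribute API (assuming pycuda.autoinit is used, as in the code above):

import pycuda.autoinit
import pycuda.driver as cuda

dev = pycuda.autoinit.device  # the device selected by pycuda.autoinit
print(dev.name())
print(dev.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK))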

That’s not it. I lowered the number of threads to 256 and it still crashes with the same error. Furthermore, I have written other functions that use 512 threads per block and they work fine.

DeviceQuery says:

Device 0: “GeForce 9600 GT”
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536543232 bytes
Number of multiprocessors: 8
Number of cores: 64
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.80 GHz
Concurrent copy and execution: Yes

I have tried with other functions and it seems that the problem is related to iterating over a 2-dimensional array when I want to access the elements of a column. The problematic fragment of code is:

for(unsigned int i=0; i<rows; i++) ep += w[idx+i*cols]*d[i];

For example, given the array:

[1 2 3 4 5]
[6 7 8 9 0]
[A B C D E]

Thread 0 reads elements 1, 6 and A.
Thread 1 reads elements 2, 7 and B.
And so on.
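
In NumPy terms this access pattern is just a transposed matrix-vector product; a small host-side reference sketch (M here is only a name for the flat w buffer viewed with the rows/cols values the kernel receives, i.e. nn and nw):

# Reference for the column walk above: the kernel reads the flat w buffer as a
# (rows, cols) row-major matrix M and computes z[idx] = sum_i M[i, idx] * d[i].
M = w.reshape(nn, nw)   # reinterpret the same buffer as (rows, cols)
z_ref = M.T.dot(d)      # length nw; thread idx would produce z_ref[idx]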

This error occurs every time I try to iterate through a 2-D array this way.

Is there any solution to this problem?

A very old question, but I have just run into it as well.

The problem is in line 20 and line 39 of the script:

Line 20: __global__ void dt(float *w, float *d, float *z, int rows, int cols)

Line 39: dt(cuda.In(w), cuda.In(d), cuda.Out(z), np.uint16(nn), np.uint16(nw), block=(512,1,1), grid=(1,1) )

The data types do not match: np.uint16 is 16 bits, while int is 32 bits. You can change int to short in the kernel to fix it, or change np.uint16 to np.uint32 in the call.
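
For reference, a minimal sketch with matching 32-bit arguments (np.int32 matches the kernel's signed int; np.uint32 as suggested above also has the right size). Two further changes here are my own additions rather than something diagnosed in this thread: z is allocated as float32, since np.zeros defaults to float64 while the kernel writes 32-bit floats, and the kernel guards idx against cols, since the block launches 512 threads while there are only nw outputs. The w array is also built as (nn, nw) so its NumPy layout matches the (rows, cols) indexing inside the kernel.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

nn, nw = 64, 40                                  # rows and cols as passed to the kernel
w = np.random.randn(nn, nw).astype(np.float32)   # (rows, cols) layout, matching the kernel's indexing
d = np.random.randn(nn).astype(np.float32)

md = SourceModule("""
#define BLOCK_SIZE 512
__global__ void dt(float *w, float *d, float *z, int rows, int cols)
{
    unsigned int idx = blockIdx.x*BLOCK_SIZE + threadIdx.x;
    if (idx < cols)                          // guard: only cols outputs exist
    {
        float ep = 0;
        for (unsigned int i = 0; i < rows; i++)
            ep += w[idx + i*cols]*d[i];      // walk down column idx
        z[idx] = ep;
    }
}
""")
dt = md.get_function("dt")

z = np.zeros(nw, dtype=np.float32)               # float32 output buffer to match the kernel's float*
dt(cuda.In(w), cuda.In(d), cuda.Out(z),
   np.int32(nn), np.int32(nw),                   # 32-bit ints match `int rows, int cols`
   block=(512, 1, 1), grid=(1, 1))

print(np.allclose(z, w.T.dot(d), atol=1e-4))     # host check: z should equal W^T * d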