I’m not having any success either. Can anyone spot the problem with this?
Currently, as a small test, I have a region of memory that I’m trying to copy to the device, process, and return to the host asynchronously. The kernel I’m running is a simple *2 mapping, so there should be no problems there, but I’m getting unspecified launch failures and errors like this:
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efe20..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efdd0..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efdd0..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efdc8..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efdc8..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe20..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe04..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe04..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe00..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efde8..
Question: when I launch the kernel in each stream, should the grid size be the total number of blocks, or the total number of blocks divided by the number of streams?
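To make the question concrete, here is the arithmetic as I understand it, written as a small host-side C++ sketch. It assumes one thread per element, which is my assumption about how a mapping kernel like this would normally be launched. `blocksFor` and the `k...` constants are hypothetical names for illustration, not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Sizes taken from the post (counted in floats, not bytes).
constexpr std::size_t kMemSize         = 8192;  // whole buffer
constexpr std::size_t kChunkSize       = 1024;  // slice handled by one stream
constexpr std::size_t kThreadsPerBlock = 64;    // block size used in the launch

// Hypothetical helper: smallest block count such that
// blocks * threadsPerBlock >= elements (assumes one thread per element).
constexpr std::size_t blocksFor(std::size_t elements, std::size_t threadsPerBlock) {
    return (elements + threadsPerBlock - 1) / threadsPerBlock;
}

// Grid covering the whole buffer in a single launch.
constexpr std::size_t kBlocksWholeBuffer = blocksFor(kMemSize, kThreadsPerBlock);

// Grid covering only the chunk that one stream's launch receives.
constexpr std::size_t kBlocksPerStream = blocksFor(kChunkSize, kThreadsPerBlock);

static_assert(kBlocksWholeBuffer == 128, "8192 floats / 64 threads per block");
static_assert(kBlocksPerStream == 16, "1024 floats / 64 threads per block");
```

With these numbers, a launch of 16 blocks of 64 threads covers exactly one chunk, i.e. the total number of blocks divided by the number of streams. Since each launch is only handed a pointer to its own chunk, that is what I would expect to be correct, but I would like confirmation.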
#define MEM_SIZE 8192
#define CHUNK_SIZE 1024
#define NUM_CHUNKS 8
cutilSafeCall(cudaMallocHost((void**)&textStream, sizeof(float)*MEM_SIZE));
// allocate device memory
float* d_idata;
cutilSafeCall( cudaMalloc( (void**) &d_idata, mem_size));
cudaMemset(d_idata,0,mem_size);
float* h_idata = textStream;
// allocate device memory for result
float* d_odata;
cutilSafeCall( cudaMalloc( (void**) &d_odata, mem_size));
cudaMemset(d_odata,0,mem_size);
// allocate mem for the result on host side
float* h_odata;// = (float*) malloc( mem_size);
cutilSafeCall(cudaMallocHost((void**)&h_odata, mem_size));
memset(h_odata,0,mem_size);
// ... some code to load data into textStream here (NB: it's not a stream, just a float array) ...
cudaStream_t stream[NUM_CHUNKS];
for (int i = 0; i < NUM_CHUNKS; i++) {
    cutilSafeCall(cudaStreamCreate(&stream[i]));
}
int size = CHUNK_SIZE * sizeof(float);
for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaMemcpyAsync(d_idata + i*size, h_idata + i*size, size, cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < NUM_CHUNKS; i++) {
    // grid, threads, shared mem, stream
    stringMatch<<< 16, 64, 256, stream[i] >>>(d_idata + i*size, d_odata + i*size);
}
for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaMemcpyAsync(h_odata + i*size, d_odata + i*size, size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaThreadSynchronize();
for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaStreamDestroy(stream[i]);
}
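One thing I have not ruled out is the offset arithmetic in the loops above. `size` is a byte count, but `d_idata`, `d_odata`, `h_idata` and `h_odata` are all `float*`, and adding an integer to a `float*` advances by that many floats, not bytes. The host-side sketch below checks which chunk offsets stay inside the buffer under each interpretation. The helper names are hypothetical, and it assumes `mem_size` equals `MEM_SIZE * sizeof(float)`, which the post does not show.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Sizes taken from the post (counted in floats).
constexpr std::size_t kMemSize   = 8192;  // assumed buffer length in floats
constexpr std::size_t kChunkSize = 1024;  // floats per chunk
constexpr std::size_t kNumChunks = 8;

// Offset (in floats) produced by `ptr + i*size` when `size` is in bytes,
// as in the posted loops.
constexpr std::size_t byteStyleOffset(std::size_t i) {
    return i * kChunkSize * sizeof(float);
}

// Offset (in floats) produced by `ptr + i*CHUNK_SIZE`.
constexpr std::size_t elementOffset(std::size_t i) {
    return i * kChunkSize;
}

// True if a chunk starting at `offset` lies entirely inside the buffer.
constexpr bool chunkInBounds(std::size_t offset) {
    return offset + kChunkSize <= kMemSize;
}

// Element offsets keep even the last chunk inside the buffer...
static_assert(chunkInBounds(elementOffset(kNumChunks - 1)), "last chunk fits");
// ...while byte-style offsets already run past the end at i == 2.
static_assert(!chunkInBounds(byteStyleOffset(2)), "chunk 2 is out of bounds");
```

If this is the culprit, the pointer offsets in the copies and the kernel launch would presumably need to be `i*CHUNK_SIZE`, while `size` in bytes would stay as the copy length passed to `cudaMemcpyAsync`. I have not verified that this is the only cause of the launch failures.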