Code not accelerated using acc kernels


I’m working on parallelizing the following code snippet, but I’m seeing no performance boost. Could anyone help me with it, please?

#pragma acc data copyin(bwt[:length+1], rbwt[:length+1], c[:ACCSEQ_ABC_SIZE], o[:o_len], ro[:o_len], qqs[:total_size], lengths[:sz], offsets[:sz])
	{
		cout << "Done copyin " << get_wall_time() - start << "s" << endl;
		for (int i = 0; i < qsize; i += set_size) {
			memset(ans, 0, ans_sz * sizeof(Range));
			cout << "Starting set " << i / set_size << endl;
			const int v = i;
#pragma acc kernels loop independent copy(ans[:ans_sz])
			for (int j = 0; j < set_size; ++j) {
				if (v + j < qsize) {
					const u8_mt *query = qqs + offsets[v + j];
					int q_len = lengths[v + j];
					inexact_dfs_iter_search(query, q_len, q_len / 20, c, o, length + 1, bwt, ro,
											rbwt, ans + (v + j) * EST_RS_PER_QUERY, EST_RS_PER_QUERY);
				}
			}
			cout << "Done with set " << i / set_size << endl;
			parse_result(ans, ans_sz, res);
		}
	}

The problem is that I have to handle a very large qsize (81,400,000 or more), and each query can produce up to 200 results. If I computed them all at once, I would need 81,400,000 * 200 result slots on the device, and the GPU does not have that much memory. So instead I process 65536 queries at a time (set_size is 65536); each pass computes 65536 queries and only needs 65536 * 200 slots on the GPU.
Another issue is that when the outer loop runs its next iteration, the previous batch’s results are still in the buffer, so I have to zero that memory before starting a fresh batch.
At the end of each batch I want the data copied back to the host, where I parse the results and store them in a vector. That’s the general idea of the snippet.

I’ve run it with the -ta=nvidia,time flag; part of the output is below:

Full running time: 601.051s

Accelerator Kernel Timing data
  main  NVIDIA  devicenum=0
    time(us): 600,382,335
    133: data region reached 2 times
        133: data copyin transfers: 10
             device time(us): total=11,441 max=2,120 min=6 avg=1,144
    142: compute region reached 1 time
        144: kernel launched 1 time
            grid: [512]  block: [128]
             device time(us): total=600,330,898 max=600,330,898 min=600,330,898 avg=600,330,898
            elapsed time(us): total=600,333,012 max=600,333,012 min=600,333,012 avg=600,333,012
    142: data region reached 2 times
        142: data copyin transfers: 10
             device time(us): total=19,882 max=2,123 min=803 avg=1,988
        155: data copyout transfers: 10
             device time(us): total=20,114 max=2,230 min=822 avg=2,011

It seems the kernel is computing the whole loop sequentially? I don’t understand why it takes so long on the GPU when the same code runs in 40s on the CPU.

Part of the compiler output with the -Minfo=all flag is shown below:

    132, Generating copyin(bwt[:length+1],c[:4],lengths[:sz],o[:o_len],offsets[:sz],qqs[:total_size],rbwt[:length+1],ro[:o_len])
    137, Generating copy(ans[:ans_sz])
    139, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        139, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Note that the line numbers have changed since the earlier -ta=nvidia,time run…

Also, when I run this code with OpenACC on my own laptop (NVIDIA GTX 960M; the previous results were from a Tesla K80), I get the following:

Start loading queries
Done loading queries: 8.79765e-05s
Start loading indexing files
Length: 18
BWT: 19
RBWT: 19
C: 4
O: 4
RO: 4
Start alignment (set_size: 65536)...
Done copyin 0.183781s
Starting set 0
total/free CUDA memory: 2099642368/222101504
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 5.0, threadid=1
host:0x209c310 device:0x704040000 size:19 presentcount:1+0 line:132 name:bwt
host:0x209c330 device:0x704040200 size:19 presentcount:1+0 line:132 name:rbwt
host:0x209c350 device:0x704040400 size:32 presentcount:1+0 line:132 name:c
host:0x209c380 device:0x704040600 size:32 presentcount:1+0 line:132 name:o
host:0x209c3b0 device:0x704040800 size:32 presentcount:1+0 line:132 name:ro
host:0x209c3e0 device:0x704040c00 size:4 presentcount:1+0 line:132 name:lengths
host:0x209c400 device:0x704040e00 size:8 presentcount:1+0 line:132 name:offsets
host:0x209c450 device:0x704040a00 size:5 presentcount:1+0 line:132 name:qqs
host:0x7fdfec96a010 device:0x704140000 size:157286400 presentcount:1+0 line:137 name:ans
allocated block device:0x704040000 size:512 thread:1
allocated block device:0x704040200 size:512 thread:1
allocated block device:0x704040400 size:512 thread:1
allocated block device:0x704040600 size:512 thread:1
allocated block device:0x704040800 size:512 thread:1
allocated block device:0x704040a00 size:512 thread:1
allocated block device:0x704040c00 size:512 thread:1
allocated block device:0x704040e00 size:512 thread:1
allocated block device:0x704140000 size:157286400 thread:1
call to cuLaunchKernel returned error 2: Out of memory

Accelerator Kernel Timing data
  main  NVIDIA  devicenum=0
    time(us): 28,498
    132: data region reached 1 time
        132: data copyin transfers: 8
             device time(us): total=53 max=12 min=5 avg=6
    137: data region reached 1 time
        137: data copyin transfers: 10
             device time(us): total=28,445 max=4,763 min=1,197 avg=2,844
    137: compute region reached 1 time
        139: kernel launched 0 times
            grid: [0]  block: [0]
             device time(us): total=0 max=0 min=0

From this output it looks like there should be enough free GPU memory, yet cuLaunchKernel still reports an out-of-memory error…

I’d really appreciate any help. Please let me know if you have questions about the problem.

It’s possible the code is slow because of my implementation; I’ll figure that part out myself.

Could someone help me fix the reported out-of-memory problem? I’d really appreciate it.


Could someone help me fix the reported out-of-memory problem? I’d really appreciate it.

This is a hardware limit, so to fix it you either need a device with more memory or you need to reduce the amount of memory your program uses on the device. To reduce it, you can use a smaller data set, split the domain across multiple devices, or block your algorithm so that only a smaller portion of the domain is resident on the device at any given time.