Beautifulsoup, Multiprocessing, and CUDA?


I am an amateur programmer, and have written a Python program which, in a loop, scans websites using BeautifulSoup and runs if/then statements on the HTML it returns.

I was directed to CUDA while looking at the 1950X, which promises to reduce the processing time of the loop considerably. However, I am looking for even more speed. With multiprocessing, where the program spawns separate processes (in this case, one for each website) until my CPU is maxed out, I need to know whether applying CUDA to my program would be valuable. I do not fully understand what I would be working with, but I am determined to learn.

Ultimately, what I need to know when building a dedicated rig for this application is whether or not I should put a graphics card on it at all, or if so, how many. Will CUDA let me slave GPUs to expand my pool of multi-processes? Let me know.


Have you run a CPU profiler on your existing Python-based solution to see where it spends the most CPU time? You’d need to run a profiling tool that also captures any overhead created by the Python interpreter and runtime environment. Is it maybe I/O or memory bound and not even maxing out the CPU’s processing capabilities?

I doubt you could feed data to the GPU fast enough for it to be worthwhile offloading the processing. And other parts of the pipeline would likely be the dominant time sinks (Amdahl’s law applies to your entire processing pipeline).

Unless you have identified the processing of the extracted data to be your main bottleneck, don’t even think about offloading it to the GPU ;)
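
For reference, a minimal sketch of what such profiling could look like using Python's built-in cProfile and pstats modules. The functions fetch_page and run_checks are hypothetical stand-ins for the download step and the if/then checks (they are not taken from the original program), and the sleep only simulates waiting on the network.

```python
import cProfile
import io
import pstats
import time


def fetch_page():
    # Hypothetical stand-in for the download; the sleep simulates network wait.
    time.sleep(0.05)
    return "<html><body><table class='table-1'></table></body></html>"


def run_checks(html):
    # Hypothetical stand-in for the if/then checks on the HTML.
    return "table-1" in html


def scan_once():
    html = fetch_page()
    return run_checks(html)


def profile_scan():
    # Profile one pass of the loop and return the result plus a text report.
    profiler = cProfile.Profile()
    profiler.enable()
    result = scan_once()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the slowest call chains appear first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


if __name__ == "__main__":
    result, report = profile_scan()
    print(report)
```

In the report, the cumulative time column shows how much of the wall-clock time sits under each function, so a download-dominated loop would show most of its time under the fetch step rather than the checks.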


Hey thanks for the reply,

I have not attempted CPU profiling; I’ve only gone so far as to check Task Manager, which tells me the CPU is at 100%. I will look into the suggestions in your second paragraph shortly, but to address your third: are you saying that the only way a GPU would be viable is if my if statements are a greater strain on the system than the act of fetching my HTML? If that is the case, would I want my CPU to get the HTML, then send it off to the GPU for the if checks?

What I am trying to say is that if it takes 99% of the total runtime to collect HTML and extract data, and only 1% is spent on the GPU to process the data - then it’s not going to provide much of a speedup to the overall system.

Your decisions based on the provided data would have to be arithmetically intensive so that they currently form the main bottleneck on the CPU. And the decision problem has to be embarrassingly parallel to benefit from the CUDA architecture. Only then is it really worthwhile to offload to the GPU.

Be aware that GPUs aren’t the most powerful string and text processors by design, so they would struggle to analyze raw HTML efficiently (parsing HTML is actually more of a sequential algorithm and not embarrassingly parallel).

As cbuncher1 pointed out, optimization efforts need to be targeted at specific identifiable performance bottlenecks, and profiling of the application is required to find those.
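
To put a number on the 99% / 1% example above, here is a small sketch of Amdahl's law in Python. The function name amdahl_speedup and the 100x figure are illustrative assumptions, not measurements of any real GPU.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when `accelerated_fraction` of the runtime
    is made `factor` times faster and the rest is unchanged."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)


if __name__ == "__main__":
    # Assumed split: 1% of runtime is the checks, 99% is collecting HTML.
    # Suppose (hypothetically) the GPU made the checks 100x faster.
    print(amdahl_speedup(0.01, 100.0))
    # Upper bound if the checks took no time at all.
    print(1.0 / (1.0 - 0.01))
```

Both values come out at roughly 1.01, meaning about a 1% improvement overall no matter how fast the offloaded 1% becomes.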

You would also want to keep in mind that GPUs are designed as high-throughput devices, not as low-latency devices. If an application requires primarily low latency (e.g. high frequency trading), a high-frequency CPU may well be the best hardware platform to use.

Certain parsing tasks can be somewhat parallelized by approaches that use more redundant operations, the result of many of which will be discarded at a later processing stage. This will lower efficiency, but can still be a win in practical terms if the raw throughput is much higher. Whether this scenario might also apply to processing web pages I do not know.

Thanks guys, I think your advice will give me what I need to move forward. Cheers.

I have no idea what beautifulsoup is, but the use of Python, as opposed to a somewhat lower-level programming language, suggests that there is room for performance improvements.

Well, I ran some tests and each process takes about 2.7 seconds, with the process of getting the HTML taking 2.65 seconds and the checks taking only 0.05 seconds total. As I understand it, the GPU isn’t effective at the part that dominates here, which is loading the entire HTML block into a variable.

Given that I am running 20 of these ~3 second processes at once, would I be better off dividing the total number of processes across multiple PCs to achieve the efficiency I am looking for?

What does “getting” mean here? Receiving the HTML data over a network? If so, that is an I/O bottleneck, and you might want to look into a faster internet connection (e.g. my internet provider offers multiple tiers from 10 Mbps up to 2 Gbps), or a faster LAN (e.g. 10G Ethernet) depending on where the data originates.

Pardon my terminology; I still have much to learn. Here is the Python code that ‘gets’ the HTML.

import urllib.request
import bs4 as bs  # assumed import; the code below uses the alias bs

# download the page at position y in my URL list l (both defined elsewhere)
source = urllib.request.urlopen(l[y]).read()

# parse the HTML
soup = bs.BeautifulSoup(source, 'lxml')

# search the HTML for a table
for row in soup.html.body.find_all('table', attrs={'class': 'table-1'}):
    # search the table for specific tags
    for i, j in zip(row.find_all('a'), row.find_all('td', attrs={'width': '130', 'align': 'right'})):
        pass  # run ifs

These lines of code take 2.65 seconds on average to run. All they do is pull URL number y from a list l, read it, and then find the tags within a table.
My internet speed is 90 Mbps, and when I have 20 programs simultaneously running the above code AND stress my internet connection by downloading a video game at 11 Mbps, the time it takes to run the entire loop suffers only a 15% slowdown. Given that 99% of the time taken in each process I spawn is spent receiving the HTML, would that not indicate that internet speed is not a factor, and that the 15% could have come from the work the CPU is doing handling the downloaded video game?
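
If most of the 2.65 seconds is spent waiting for each server to respond rather than on bandwidth (the numbers above hint at this but do not prove it), one common approach is to overlap the waiting with threads inside a single process. Below is a rough sketch under that assumption, using the standard library's ThreadPoolExecutor. The names fetch, fetch_all and fake_fetch, the URLs, and the 0.2 second delay are all hypothetical; fake_fetch only simulates a slow server so the sketch can be tried without a network.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    # Download one page; most of this time is spent waiting on the network.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=20):
    # Threads overlap the waiting; results come back in the same order as urls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))


def fake_fetch(url):
    # Hypothetical stand-in for a slow server (no real network access).
    time.sleep(0.2)
    return ("<html>" + url + "</html>").encode()


if __name__ == "__main__":
    urls = ["http://example.invalid/page%d" % i for i in range(20)]
    start = time.perf_counter()
    pages = fetch_all(urls, fetcher=fake_fetch)
    elapsed = time.perf_counter() - start
    print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

With the simulated delay, the pages are waited on together rather than one after another. Whether real downloads behave the same way depends on the servers and the connection, so it would need to be measured.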