PitchBook August 20, 2015
In an effort to shed light on the interesting and cutting-edge ways our research, data science and product teams are leveraging advanced computing and technology to hone our platform, we will be producing an ongoing series of more technical posts. The first post in our series explained how we are working to improve the keyword search experience. Please enjoy our second offering on the use of multi-processing and multi-threading to improve productivity:
Multi-processing and multi-threading are two great ways to significantly increase productivity by utilizing every ounce of processing power available to the computer. If you have a machine that contains a processor with multiple cores, you can take advantage of every single core through a few simple techniques.
Multi-processing and multi-threading are terms that are usually thrown together when discussing the idea of computations running in parallel. They are related and both fall within the realm of parallel processing; however, they are two separate concepts.
To truly understand how multi-processing works, we must first understand the process itself. When you turn on your computer, you are presented after a short time with a graphical environment known as a desktop. You fire up your email to begin checking important news, events, etc. Everything that happens from the moment you turn on your computer to opening an email is the result of various processes and tasks. These tasks range from starting the operating system and managing your emails to connecting to the internet and playing music through your speakers, and they’re all managed by the operating system installed on your computer (e.g. Windows, Mac, Linux). The operating system is smart enough to decide which tasks are most important and which tasks only need a small amount of attention. This allows you to do multiple things on your computer without having to wait for one task to complete.
Since we have an idea of what a process is and how an operating system helps us to have multiple processes running at once, we can now dive into threads. Threads are very similar to processes: they help to do multiple things at once. The main caveat is that threads run within a process. That means that a process can have multiple tasks running within itself! Let’s go back to the email example: say you are scrolling through your email, clicking through some great ads when a notification alerts you that you have a new message. How were you able to receive an email while loading an email to read? In the background, your email client started a thread that would check at a certain interval to see if you received any new messages. This allows you to browse your email and see new messages in real-time without having to invoke the action of checking for new messages yourself.
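To make the email example concrete, here is a minimal sketch of a background thread that checks for new messages at a fixed interval while the main thread stays free. The names poll_inbox and fake_check are hypothetical and chosen for illustration; fake_check only records that a check happened and does not talk to a real mail server.

```python
import threading
import time

def poll_inbox(check_for_mail, interval, stop_event):
    """Call check_for_mail() every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        check_for_mail()
        # wait() returns early if the event is set, so shutdown is prompt.
        stop_event.wait(interval)

# Hypothetical stand-in for a real mail check: it only records when it ran.
checks = []

def fake_check():
    checks.append(time.time())

stop_event = threading.Event()
poller = threading.Thread(
    target=poll_inbox, args=(fake_check, 0.05, stop_event), daemon=True
)
poller.start()

# The main thread is free to do other work (here, it simply sleeps briefly).
time.sleep(0.2)

stop_event.set()  # ask the background thread to finish
poller.join()     # wait until it has exited
print(f"inbox checked {len(checks)} times in the background")
```

Both the polling loop and the main program live inside one process; the thread is simply a second path of execution within it.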
Alright, now that we have an idea of multi-processing and multi-threading and how they relate to one another, let’s dive into some real-world examples where we could use one or both of these techniques to see some real performance improvements.
Basic Web Crawling
A great application for running multiple tasks is crawling the web. There are many use cases for crawling the web: search indexing, caching, gathering links, surveys, analytics, etc. The internet is a huge place with millions of links to follow. If we were to have a single process and a single thread, we would only be able to visit one link at a time. If we were to visit 1,000 links with an average response time of three seconds, it would take us 3,000 seconds to crawl all the links! Luckily, there are plenty of tools at our disposal to help reduce that to a more reasonable time.
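The arithmetic above can be written down as a small back-of-the-envelope helper. This is an idealized estimate only: the function name and the workers parameter are assumptions for illustration, and it ignores thread overhead, bandwidth limits and uneven response times.

```python
import math

def estimated_crawl_seconds(num_links, avg_response_seconds, workers=1):
    """Idealized estimate: each worker visits its share of links one after another."""
    links_per_worker = math.ceil(num_links / workers)
    return links_per_worker * avg_response_seconds

print(estimated_crawl_seconds(1000, 3))     # 3000 seconds, i.e. 50 minutes
print(estimated_crawl_seconds(1000, 3, 2))  # 1500 seconds with two workers
```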
Just for fun, let’s see how long it will take to gather response codes from 1,000 websites:
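The original listing is not reproduced here, so what follows is only a sketch of what a single-threaded crawler might look like, assuming Python 3 and the standard urllib.request module. The helper names get_status, crawl and main are hypothetical, and the list of links is something you would supply yourself.

```python
import time
import urllib.error
import urllib.request

def get_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None      # DNS failure, timeout, refused connection, malformed URL

def crawl(urls, fetch=get_status):
    """Visit each URL one at a time and collect (url, status) pairs."""
    results = []
    for url in urls:
        results.append((url, fetch(url)))
    return results

def main(urls):
    """Crawl urls sequentially, print each status code and the total time."""
    start = time.time()
    for url, status in crawl(urls):
        print(status, url)
    print(f"Elapsed: {time.time() - start:.2f} seconds")

# Example (performs real network requests):
# main(["https://www.python.org", "https://www.example.com"])
```

Because crawl() does not request the next link until the previous response has arrived, the total time is roughly the sum of all the individual response times.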
That took quite a while to complete.
If we step through the code, we’ll notice that it visits one link at a time and waits for each response before moving on to the next.
Google indexes, crawls and caches the internet daily, gathering enormous amounts of data in a relatively quick fashion. They are able to achieve this feat by using multiple tools. We can get a little closer to Google speed by introducing multi-threading into our application.
Before we spawn a ton of threads and crawl the entire internet, we will need to understand how to create and start a very basic thread:
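As a minimal sketch of creating and starting two threads with Python's threading module (the greet function and the thread names are made up for illustration):

```python
import threading

def greet(name, results):
    """Work done by each thread: record a message for the given name."""
    results.append(f"hello from {name}")

results = []

# Create Thread objects, giving each a target function and its arguments.
first = threading.Thread(target=greet, args=("thread-1", results))
second = threading.Thread(target=greet, args=("thread-2", results))

# Start them; each runs greet() independently of the main thread.
first.start()
second.start()

# Wait for both to finish before reading the results.
first.join()
second.join()

print(sorted(results))
```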
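This is a very simple way of creating two independent threads. The threading module provided by Python contains very useful functions for starting and manipulating threads. The basic process of creating a thread is as follows: define the function the thread should run, create a threading.Thread object with that function as its target, and call the object's start() method.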
Now that we have a basic understanding of threads and how to create them in Python, we can apply this knowledge to our 1,000 link crawler:
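One way to sketch the two-thread crawler is to split the list of links in half and give each half to its own thread. The names crawl_into, crawl_with_two_threads and fake_fetch are hypothetical; fake_fetch sleeps to mimic network latency so the example runs without touching the network, and a real fetch function built on an HTTP library would take its place.

```python
import threading
import time

def crawl_into(urls, fetch, results):
    """Fetch each URL in turn and store its status under the URL."""
    for url in urls:
        results[url] = fetch(url)

def crawl_with_two_threads(urls, fetch):
    """Split urls into two halves and crawl each half in its own thread."""
    middle = len(urls) // 2
    results = {}
    first = threading.Thread(target=crawl_into, args=(urls[:middle], fetch, results))
    second = threading.Thread(target=crawl_into, args=(urls[middle:], fetch, results))
    first.start()
    second.start()
    # join() blocks until each thread has worked through its half of the list.
    first.join()
    second.join()
    return results

# Hypothetical stand-in for a real HTTP request: sleeps to mimic network latency.
def fake_fetch(url):
    time.sleep(0.05)
    return 200

urls = [f"https://example.com/page/{n}" for n in range(10)]
start = time.time()
statuses = crawl_with_two_threads(urls, fake_fetch)
print(f"{len(statuses)} links in {time.time() - start:.2f} seconds")
```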
Wow! What a difference threading makes! We were able to reduce our crawling time from about 10 minutes to just under 5 minutes!
This program looks a lot like our basic thread example that creates two threads. In this example you will notice two calls to the join() method. This method tells the program to wait for the thread to finish execution before continuing. By implementing joins within our threads, we can wait for specific events to happen before continuing (e.g. waiting on calculations, asynchronous code). In this example the joins will wait for the threads to finish crawling all the links in the lists.
By splitting up our task into two independent threads of execution, we saw a 2x speed increase in completion time. By adding more threads, we can further reduce the processing time needed to crawl 1,000 websites.
We could continue creating more and more thread objects to do the crawling; however, Python provides a few simple functions to easily run a task within multiple threads:
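A minimal sketch of this approach, assuming the thread-backed Pool from the standard multiprocessing.dummy module. As before, fake_fetch is a hypothetical stand-in that sleeps instead of making a real request.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

# Hypothetical stand-in for a real HTTP request: sleeps to mimic network latency.
def fake_fetch(url):
    time.sleep(0.05)
    return 200

urls = [f"https://example.com/page/{n}" for n in range(20)]

pool = ThreadPool(4)                   # four worker threads
statuses = pool.map(fake_fetch, urls)  # one call per URL, results in input order
pool.close()                           # no more work will be submitted
pool.join()                            # wait for the workers to exit

print(f"fetched {len(statuses)} links")
```

The pool takes care of dividing the list among its threads, so there is no need to split the links by hand or to create and join each thread individually.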
Wait one second, you said we were going to create more threads? Why is the multiprocessing module imported and not threading? Don’t worry, we are still using threads. The multiprocessing.dummy module has a few helpful functions that will work with threads.
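In the above code we are introduced to a new object and a new helper function: the Pool object, which manages a group of worker threads, and its map() function, which hands each item in a list to a thread in the pool and collects the results.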
By creating a pool of four threads, we saw an even greater speed increase. If we continue to add more threads to the pool, we can see faster and faster crawl times:
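The original timings are not reproduced here. As a sketch of how such a comparison could be measured, the snippet below times the same batch of simulated requests with several pool sizes. The time_crawl helper, the fake_fetch stand-in and the chosen pool sizes are assumptions for illustration, and real numbers will depend on the network and the machine.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

# Hypothetical stand-in for a real HTTP request: sleeps to mimic network latency.
def fake_fetch(url):
    time.sleep(0.02)
    return 200

def time_crawl(urls, pool_size):
    """Return the seconds needed to fetch every URL with pool_size threads."""
    pool = ThreadPool(pool_size)
    start = time.perf_counter()
    pool.map(fake_fetch, urls)
    elapsed = time.perf_counter() - start
    pool.close()
    pool.join()
    return elapsed

urls = [f"https://example.com/page/{n}" for n in range(40)]
timings = {size: time_crawl(urls, size) for size in (1, 2, 4, 8)}

for size, seconds in timings.items():
    print(f"{size:>2} threads: {seconds:.2f} seconds")
```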
As you can see from the time differences, the crawl times decrease as we continue to add more threads. Past a certain point, however, each additional thread yields a smaller improvement. Every computer has a “sweet spot” when it comes to efficiency with thread pools. For this author’s machine, the most efficient thread pool size for crawling websites in this scenario is around 25 threads.
With multi-threading we can take a repetitive task and apply it to thousands of things all while efficiently using the computer’s processing power. The computer is a very powerful device and multi-threading can help unleash its full potential.
*Note on Python and multi-threading: CPython is limited by what is known as the Global Interpreter Lock (GIL). The GIL allows only one thread at a time to execute Python bytecode, so threads doing CPU-bound work do not run in parallel. The GIL is released, however, while a thread waits on a blocking operation such as I/O. We were able to use threads to crawl the web because every call is a network I/O operation that blocks until a response is returned, leaving other threads free to run in the meantime. For more information on the GIL visit https://wiki.python.org/moin/GlobalInterpreterLock.