Use Python to Download Multiple Files (or URLs) in Parallel | by Konrad Hafen | Sep, 2023
We live in a world of big data. Often, big data is organized as a large collection of small datasets (i.e., one large dataset comprised of many files). Obtaining these data can be frustrating because of the download (or acquisition) burden. Fortunately, with a little code, there are ways to automate and speed up file download and acquisition.
Automating file downloads can save a lot of time. There are several ways to automate file downloads with Python. The easiest way to download files is to use a simple Python loop that iterates through a list of URLs. This serial approach can work well for a few small files, but if you are downloading many files or large files, you'll want to use a parallel approach to maximize your computational resources.
With a parallel file download routine, you can make better use of your computer's resources to download multiple files simultaneously, saving you time. This tutorial demonstrates how to develop a generic file download function in Python and apply it to download multiple files with serial and parallel approaches. Aside from the third-party requests package, the code in this tutorial uses only modules available from the Python standard library.
For this example, we only need the requests and multiprocessing Python modules to download files in parallel. The multiprocessing module is part of the Python standard library, but requests is a third-party package; if you don't already have it, you'll need to install it first (e.g., pip install requests).
We'll also import the time module to keep track of how long it takes to download individual files and compare performance between the serial and parallel download routines. The time module is also part of the Python standard library.
import requests
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool
I'll demonstrate parallel file downloads in Python using gridMET NetCDF files that contain daily precipitation data for the United States.
Here, I specify the URLs to four files in a list. In other applications, you may programmatically generate the list of files to download.
urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1982.nc']
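If you do want to generate the list programmatically, the gridMET precipitation files follow a simple naming pattern by year, so a sketch along these lines would produce the same list (the year range here simply mirrors the four files above):

# Build the URL list from a range of years rather than hardcoding it;
# the gridMET precipitation files are named pr_<year>.nc.
years = range(1979, 1983)
urls = [f'https://www.northwestknowledge.net/metdata/data/pr_{y}.nc' for y in years]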
Each URL must be associated with its download location. Here, I'm downloading the files to the Windows 'Downloads' directory. I've hardcoded the filenames in a list for simplicity and transparency. Depending on your application, you may want to write code that parses the input URL and downloads it to a specific directory; a sketch of this follows the list below.
fns = [r'C:\Users\konrad\Downloads\pr_1979.nc',
       r'C:\Users\konrad\Downloads\pr_1980.nc',
       r'C:\Users\konrad\Downloads\pr_1981.nc',
       r'C:\Users\konrad\Downloads\pr_1982.nc']
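If you would rather derive the download filenames from the URLs instead of hardcoding them, a minimal sketch might look like the following; download_dir is a placeholder for whatever directory you use:

import os
from urllib.parse import urlparse

download_dir = r'C:\Users\konrad\Downloads'  # placeholder: substitute your own directory

# Use the last path segment of each URL (e.g., 'pr_1979.nc') as the local filename.
fns = [os.path.join(download_dir, os.path.basename(urlparse(url).path)) for url in urls]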
Multiprocessing requires parallel functions to have only one argument (there are some workarounds, but we won't get into that here). To download a file we'll need to pass two arguments, a URL and a filename. So we'll zip the urls and fns lists together to get a list of tuples. Each tuple in the list will contain two elements: a URL and the download filename for that URL. This way we can pass a single argument (the tuple) that contains two pieces of information. Note that the zip object is materialized as a list: in Python 3, zip returns an iterator that is exhausted after one pass, and we'll want to reuse inputs for both the serial and parallel examples.
inputs = list(zip(urls, fns))
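(For the curious, one such workaround is ThreadPool.starmap, which unpacks each tuple into separate arguments; the two-argument function below is a hypothetical variant, not the one used in the rest of this tutorial.)

# Hypothetical two-argument variant for use with ThreadPool.starmap,
# which unpacks each (url, fn) tuple into separate arguments.
def download_url_two_args(url, fn):
    r = requests.get(url)
    with open(fn, 'wb') as f:
        f.write(r.content)

# Usage: ThreadPool(4).starmap(download_url_two_args, zip(urls, fns))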
Now that we have specified the URLs to download and their associated filenames, we need a function to download the URLs (download_url).
We'll pass one argument (args) to download_url. This argument will be an iterable (list or tuple) where the first element is the URL to download (url) and the second element is the filename (fn). The elements are assigned to variables (url and fn) for readability.
Now create a try statement in which the URL is retrieved and written to the file after the file is created. When the file is written, the URL and download time are returned. If an exception occurs, a message is printed.
The download_url function is the meat of our code. It does the actual work of downloading and file creation. We can now use this function to download files in serial (using a loop) and in parallel. Let's go through those examples.
def download_url(args):
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
        return (url, time.time() - t0)
    except Exception as e:
        print('Exception in download_url():', e)
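One caveat with this function (an aside, not part of the original walkthrough): requests.get as written loads the entire file into memory before writing it to disk. For the large files this tutorial targets, you may prefer to stream the response in chunks. A minimal sketch, assuming the same (url, fn) tuple convention:

def download_url_streamed(args):
    # Variant of download_url that writes the response in chunks rather than
    # holding the whole file in memory.
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()  # surface HTTP errors instead of saving an error page
            with open(fn, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
        return (url, time.time() - t0)
    except Exception as e:
        print('Exception in download_url_streamed():', e)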
To download the list of URLs to the associated files, loop through the iterable (inputs) that we created, passing each element to download_url. After each download is complete, we'll print the downloaded URL and the time it took to download.
The total time to download all URLs will print after all downloads have been completed.
t0 = time.time()
for i in inputs:
    result = download_url(i)
    print('url:', result[0], 'time:', result[1])
print('Total time:', time.time() - t0)
Output:
url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time: 16.381176710128784
url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time: 11.475878953933716
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time: 13.059367179870605
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time: 12.232381582260132
Total time: 53.15849542617798
It took between 11 and 16 seconds to download the individual files. The total download time was a little less than one minute. Your download times will vary based on your specific network connection.
Let's compare this serial (loop) approach to the parallel approach below.
To start, create a function (download_parallel) to handle the parallel download. The function (download_parallel) will take one argument, an iterable containing URLs and associated filenames (the inputs variable we created earlier).
Next, get the number of CPUs available for processing. This will determine the number of threads to run in parallel.
Now use the multiprocessing ThreadPool to map the inputs to the download_url function. Here we use the imap_unordered method of ThreadPool and pass it the download_url function and the input arguments to download_url (the inputs variable). The imap_unordered method will run download_url concurrently for the specified number of threads (i.e., parallel download).
Thus, if we have four files and four threads, all four files can be downloaded at the same time instead of waiting for one download to finish before the next starts. This can save a considerable amount of processing time.
In the final part of the download_parallel function, the downloaded URLs and the time required to download each URL are printed, along with the total time for all downloads.
def download_parallel(args):
    cpus = cpu_count()
    t0 = time.time()
    results = ThreadPool(cpus - 1).imap_unordered(download_url, args)
    for result in results:
        print('url:', result[0], 'time (s):', result[1])
    print('Total time:', time.time() - t0)
Once inputs and download_parallel are defined, the files can be downloaded in parallel with a single line of code.
download_parallel(inputs)
Output:
url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time (s): 14.641696214675903
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time (s): 14.789752960205078
url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time (s): 15.052601337432861
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time (s): 23.287317752838135
Total time: 23.32273244857788
Notice that it took longer to download each individual file with this approach. This may be a result of changing network speed, or of the overhead required to map the downloads to their respective threads. Even though the individual files took longer to download, the parallel method resulted in more than a 50% decrease in total download time.
You can see how parallel processing can greatly reduce the processing time for multiple files. As the number of files increases, you will save much more time by using a parallel download approach.
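As a closing aside (not part of the original walkthrough), the standard library's concurrent.futures module offers a similar thread-based interface; a minimal sketch, reusing the download_url function and inputs list from above:

from concurrent.futures import ThreadPoolExecutor

# Map download_url over the (url, filename) tuples using a pool of worker threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(download_url, inputs):
        print('url:', result[0], 'time (s):', result[1])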
Automating file downloads in your development and analysis routines can save you a lot of time. As demonstrated by this tutorial, implementing a parallel download routine can greatly decrease file acquisition time if you require many files or large files.