Python Requests Web Scraping



Web Scraping With Python Requests

Requests is a Python library used to easily make HTTP requests. Generally, Requests has two main use cases: making requests to an API and getting raw HTML content from websites (i.e., scraping).

Learning Outcomes

  • Understand the benefits of using async + await compared to plain web scraping with the requests library.
  • Learn how to create an asynchronous web scraper from scratch in pure Python using asyncio and aiohttp.
  • Practice downloading multiple webpages using aiohttp + asyncio and parsing the HTML content of each URL with BeautifulSoup.

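This tutorial uses the aiohttp, beautifulsoup4 and nest_asyncio packages, each installable with pip. The install commands are written for a Jupyter Notebook (for example, !pip install aiohttp); if you are using a command line, simply drop the leading ! symbol.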

Note: We only use nest_asyncio because this tutorial is written in a Jupyter notebook, which already runs its own event loop. If you write the same web scraper in a Python file, you do not need to install nest_asyncio or call nest_asyncio.apply().

Why Use Asynchronous Web Scraping?

Synchronous web scrapers are easier to write and the code is less complex; however, they become slow as the number of URLs grows.

This is because the requests run one by one: each request must wait for the previous one to finish, so only one request can be in flight at any given time.

In contrast, asynchronous requests do not have to wait for earlier requests in a queue or for loop to finish. They run concurrently: while one request is waiting on the network, the others can make progress.
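
To make the difference concrete, here is a small sketch that needs no network access. It stands in for real HTTP requests with sleeps (time.sleep for the blocking version, asyncio.sleep for the asynchronous one). The function names, the number of requests and the 0.1 second delay are illustrative assumptions, not part of the original tutorial.

```python
import asyncio
import time

DELAY = 0.1      # pretend every "request" takes this many seconds
N_REQUESTS = 5


def fake_request_sync(i):
    """Blocking stand-in for a request: nothing else runs while it sleeps."""
    time.sleep(DELAY)
    return i


async def fake_request_async(i):
    """Non-blocking stand-in: the event loop runs other work during the sleep."""
    await asyncio.sleep(DELAY)
    return i


def run_sync():
    start = time.perf_counter()
    results = [fake_request_sync(i) for i in range(N_REQUESTS)]
    return results, time.perf_counter() - start


async def run_async():
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_request_async(i) for i in range(N_REQUESTS))
    )
    return results, time.perf_counter() - start


sync_results, sync_elapsed = run_sync()
async_results, async_elapsed = asyncio.run(run_async())

print(f"synchronous:  {sync_elapsed:.2f}s")   # roughly N_REQUESTS * DELAY
print(f"asynchronous: {async_elapsed:.2f}s")  # roughly DELAY
```

Both versions return the same results; only the total waiting time differs, because the asynchronous sleeps overlap instead of queuing up.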

How Is Asynchronous Web Scraping Different From Using Python Requests?

Instead of thinking about a for loop that makes N requests one after another, you need to think about creating an event loop. Node.js, for example, executes in a single-threaded event loop by design.

In Python, however, we create the event loop ourselves with asyncio.

Inside your event loop, you can set up a number of tasks to be completed, and each task will be created and executed asynchronously.
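
As a rough illustration of tasks inside an event loop, the sketch below uses only the standard library. The job coroutine and its delays are hypothetical stand-ins for real work such as network requests.

```python
import asyncio

completion_order = []


async def job(name, delay):
    """Wait for `delay` seconds, then record that this job finished."""
    await asyncio.sleep(delay)
    completion_order.append(name)
    return name


async def main():
    # create_task() schedules each coroutine on the running event loop
    # right away; the tasks then make progress concurrently.
    slow = asyncio.create_task(job("slow", 0.2))
    fast = asyncio.create_task(job("fast", 0.05))
    # Awaiting a task waits for it to finish and returns its result.
    return [await slow, await fast]


# asyncio.run() creates the event loop, runs main() to completion,
# and closes the loop afterwards.
results = asyncio.run(main())
print(results)           # follows the order in which the tasks were awaited
print(completion_order)  # follows the delays, not the creation order
```

The task created second finishes first, which shows that neither task has to wait for the other.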

How To Web Scrape A Single Web Page Using Aiohttp

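First, we define a client session with aiohttp, using async with aiohttp.ClientSession() as session.

Then, with our session, we send a GET request to a single URL, using async with session.get(url) as response.

Third, notice that we put the await keyword in front of response.text(), as in html = await response.text(). Reading the response body is itself an asynchronous operation, so it has to be awaited.

Also, note that every asynchronous function definition starts with async def rather than def.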
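
A small standard-library sketch of what async def changes: calling such a function does not run its body. Instead it returns a coroutine object, which only runs once it is awaited or handed to the event loop. The function name greet is made up for illustration.

```python
import asyncio
import inspect


async def greet(name):
    """A coroutine function: defined with `async def`, so it may use `await`."""
    await asyncio.sleep(0)  # hand control back to the event loop once
    return f"hello {name}"


coro = greet("aiohttp")                   # the body has not run yet
is_coroutine = inspect.iscoroutine(coro)  # we only hold a coroutine object
message = asyncio.run(coro)               # the event loop runs it to completion

print(is_coroutine)
print(message)
```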

Finally, we run asyncio.run(main()), which creates an event loop and executes all tasks within it.

Once all of the tasks have completed, the event loop is closed automatically.
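
Putting the steps of this section together, a minimal single-page scraper might look like the sketch below. The URL, the function names fetch_page and main, the 10 second timeout and the error handling are all assumptions added for illustration; this is not the tutorial's original code.

```python
import asyncio

import aiohttp

URL = "https://example.com"  # placeholder URL, swap in the page you want


async def fetch_page(session, url):
    """Send a GET request with the shared session and return the body as text."""
    async with session.get(url) as response:
        response.raise_for_status()   # raise on 4xx/5xx responses
        return await response.text()  # reading the body must be awaited


async def main():
    timeout = aiohttp.ClientTimeout(total=10)
    # Step 1: define a client session.
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            # Steps 2 and 3: request the URL and await the HTML.
            html = await fetch_page(session, URL)
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            print(f"Request failed: {exc!r}")
            return None
    print(f"Downloaded {len(html)} characters")
    return html


if __name__ == "__main__":
    # Creates the event loop, runs main(), then closes the loop.
    asyncio.run(main())
```

Keeping the request in its own fetch_page function makes it straightforward to reuse the same function for many URLs later.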

How To Web Scrape Multiple Pages Using Aiohttp

When scraping multiple pages with asyncio and aiohttp, we create one task per URL and run them all concurrently within a single asyncio event loop.

To start, we create an empty list. Then, for every URL, we append a coroutine object to the list, created by calling our async fetching function with the aiohttp session and the URL. Creating the coroutine does not start the request; it only runs once it is awaited or scheduled.

asyncio.gather(*tasks) schedules all of these coroutines to run concurrently and waits until every one of them has completed. It returns a list with the same length and order as the coroutines passed in; a coroutine that returns nothing contributes None to the list. By default, an exception raised in any coroutine propagates to the caller, unless return_exceptions=True is passed.
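
The behaviour of asyncio.gather can be checked without any network access. In this sketch the coroutine names and delays are hypothetical, and asyncio.sleep stands in for a slow request.

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in for a download: wait, then return a string naming the URL."""
    await asyncio.sleep(delay)
    return f"html for {url}"


async def returns_nothing():
    """A coroutine with no return value still occupies a slot in the results."""
    await asyncio.sleep(0)


async def fails():
    raise ValueError("simulated failure")


async def main():
    tasks = [
        fake_fetch("page-1", 0.2),   # slowest, but listed first
        fake_fetch("page-2", 0.05),
        returns_nothing(),
    ]
    # Results come back in the order the awaitables were passed in.
    results = await asyncio.gather(*tasks)

    # With return_exceptions=True, exceptions are placed in the result
    # list instead of being raised.
    mixed = await asyncio.gather(
        fake_fetch("page-3", 0), fails(), return_exceptions=True
    )
    return results, mixed


results, mixed = asyncio.run(main())
print(results)
print(mixed)
```

The first list keeps three entries in input order, with None for the coroutine that returned nothing; the second list holds the ValueError instance rather than raising it.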

Now that we know how to create and execute multiple tasks, let’s see this in action:
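
One way this could look in practice is sketched below. The URLs are placeholders, and the names fetch_page and main, the 10 second timeout and the choice to return None for failed requests are assumptions of this sketch rather than the tutorial's original code.

```python
import asyncio

import aiohttp

# Placeholder URLs, replace them with the pages you want to download.
URLS = [
    "https://example.com",
    "https://example.org",
    "https://example.net",
]


async def fetch_page(session, url):
    """Download one page; return None instead of raising if the request fails."""
    try:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
        print(f"{url} failed: {exc!r}")
        return None


async def main(urls):
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = []
        for url in urls:
            # Calling fetch_page() creates a coroutine object; nothing
            # is downloaded until gather() schedules it.
            tasks.append(fetch_page(session, url))
        # Run all downloads concurrently; results keep the order of `urls`.
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    pages = asyncio.run(main(URLS))
    for url, html in zip(URLS, pages):
        size = "no response" if html is None else f"{len(html)} characters"
        print(f"{url}: {size}")
```

Handling errors inside fetch_page means one failing URL does not abort the whole batch, at the cost of having to check for None afterwards.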

Adding HTML Parsing Logic To The Aiohttp Web Scraper

As well as collecting the HTML response from multiple webpages, parsing each page can be useful for SEO and HTML content analysis.

Therefore, let’s create a second function that parses the HTML page and extracts the title tag.
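
A possible shape for that parsing function, assuming BeautifulSoup (the beautifulsoup4 package) with Python's built-in html.parser. The name parse_title and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup


def parse_title(html):
    """Return the text of the page's <title> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


sample_html = (
    "<html><head><title> Example Domain </title></head><body></body></html>"
)
title = parse_title(sample_html)
print(title)
```

Inside the scraper, parse_title can be applied to each HTML string returned by asyncio.gather, skipping any entries for pages that failed to download.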

Conclusion

Asynchronous web scraping is more suitable when you have a larger number of URLs that need to be processed quickly.

Also, notice how easy it is to add an HTML parsing function with BeautifulSoup, allowing you to extract specific elements on a per-URL basis.