Web Scraping on dynamic website(CSR) with python and selenium

Introduction

There are many ways for web scraping nowadays whether with Javascript(which I think is the easiest one) or other programming languages. In this article, I will focus on Python because you know this is the age of Data Science and Data scientist love Python also I’m recently obsessed with Data analyst field.

There are 3 popular libraries to work on Web scraping with Python which are Selenium, Beautiful Soup, and Scrapy (How they’re different you can watch this youtube video). This content I will show you how to work with Selenium and Beautiful Soup which is intuitive and easy to implement.

Prerequisites

Install selenium and Beautiful Soup packages

                                            pip install selenium
pip install BeautifulSoup4

Any text editor but I would recommend Jupiter notebook or google colab because It easier to debug
Selenium web driver. This is another one important thing because selenium required a web driver to create a new webpage that allows selenium to control. I will use Chrome driver for this tutorial. You can download Chrome driver here. **Before download Chrome driver please check your google chrome version whether it compatible with your Chrome version or not.

Let's Start

I will scrape the whole reverb guitar search page source as an example.

                                        from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver = webdriver.Chrome('directory/to/web_driver')
driver.get("https://reverb.com/marketplace/electric-guitars?make=fender&page=1")
time.sleep(3)
html_source = driver.page_source
soup = BeautifulSoup(html_source,'lxml')

As you can see only 5 lines of code, We have a whole HTML page source from a specific URL. I will explain line by line from the above chunk of code.

1.Execute web driver (The one downloaded) and store to driver variable

driver = webdriver.Chrome('directory/to/web_driver')

2. Tell web driver to open URL

driver.get("https://reverb.com/marketplace/electric-guitars?make=fender&page=1")

3. This line is very important we need to set the sleep time and wait for the web page to fully loaded because It’s a Dynamic website or some call Client Side Rendered(CSR). You can read more about CSR and other types here.

time.sleep(3)

4. Get page source

html_source = driver.page_source

5. Call Beautiful Soup and parse HTML source in any parse you want. You can finf parser options from Beautiful Soup documents here.

soup = BeautifulSoup(html_source,'lxml')

So yeah, That’s pretty much it. How to web scraping in a very simple way.

Bonus

What if you want to know all the guitar products name and prices. Basically, After we scraped page source which means we have every information on that specific page. We can use Beautiful Soup to get those particular informations.

The concept is very easy. Search for tag, class, or id that store product name or whatever you want to know. (we will use those tag, class, or id as an arguments for Beautiful Soup). You can used inspect element feature in google chrome to find out what tag, class, or id by press cmd+opt+i in mac.

Product name

For product name after I inspected the HTML source, I found that product name is in ‘h4’ tag and class ‘grid-card__title’

I’ve already stored HTML source in soup variable so I can call a function call ‘find_all’ and pass 2 arguments the first one is tag name and class name to find a product name.

all_product_name = soup.find_all('h4', class_='grid-card__title')

Here’s the result below, You can see that we have all specific h4 tag and class grid-card__title.

So what if we want only the text inside those tag ?? just use .text function and loop through this list.

[i.text for i in all_product_name]

BAM! We got what we want.

Conclusion

Scraping page source

Install Selenium
Install Beautiful Soup
Download web driver
Get specific information after scraped with Beautiful Soup

And that’s all a basic of web scraping on dynamic website in just 5 lines of code. :)

Github : Source code

Github Gist : Source code