Scraping Amazon web page using Selenium

Introduction

In today's world, data is an essential resource, particularly with the current rise of artificial intelligence. The ability to gather data that can be analyzed to draw insights is critical. Web scraping offers data analysts and data scientists an automated way to collect data from websites. While there are other ways to obtain data from websites, such as using APIs or gathering it manually, web scraping is often the preferred method when those options are unavailable. In this article, we will walk through a step-by-step guide to scraping the Amazon website using Selenium and Python.

Selenium is an open-source tool that automates web browsers. It is widely used for web scraping and automated testing of web applications. By using Selenium, we can interact with the web page, retrieve data, and automate repetitive tasks. Amazon has taken measures to block scraping bots, so we need to ensure that our code avoids detection.

Requirements

To follow this tutorial, you will need the following:

  1. Python 3.7 or later installed on your system (Selenium 4, used in this tutorial, requires Python 3.7+).

  2. Jupyter Notebook installed on your system (or any other IDE of your choice). You can download Jupyter Notebook from their official website.

  3. Basic knowledge of Python programming language.

  4. Basic knowledge of HTML.

Installing necessary packages

The first step is to install the necessary packages for scraping Amazon using Selenium. We need to install Selenium using the command below:

!pip install selenium

Also, we need to download the web driver for the browser we are going to use. For this article, we will use Chrome. Download the Chrome web driver from the official ChromeDriver downloads page, and save it in a folder whose path you know.

# Import webdriver
from selenium import webdriver
# Import time
import time
# import service
from selenium.webdriver.chrome.service import Service
# Import NoSuchElementException
from selenium.common.exceptions import NoSuchElementException
# Import By
from selenium.webdriver.common.by import By

# Path to your Chrome driver (adjust this to where you saved it)
CHROME_DRIVER_PATH = '/Users/user/Downloads/chromedriver_mac64/chromedriver'

# Initialize the Selenium WebDriver using a Service object (Selenium 4 syntax)
driver = webdriver.Chrome(service=Service(executable_path=CHROME_DRIVER_PATH))

To manage our web driver, we will use webdriver-manager. Install it using the command below.

!pip install webdriver-manager

Run the following code to configure the web driver.

# Import webdriver manager
from webdriver_manager.chrome import ChromeDriverManager 

# Create object ChromeOptions()
chrome_options = webdriver.ChromeOptions()            
chrome_options.add_argument('--headless')           
chrome_options.add_argument('--no-sandbox')                             
chrome_options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=chrome_options, service=Service(ChromeDriverManager().install()))

In the code above, we configured the options for the Chrome web driver to run in headless mode. This means that the browser will run in the background without a visible user interface. We also disabled the Chrome sandbox and Chrome's use of /dev/shm (a shared memory segment on Linux), which helps when running in constrained environments such as containers.
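As mentioned in the introduction, Amazon tries to detect scraping bots. One optional precaution is to present a regular browser user-agent string by adding a line like the following to the options block above, before the driver is created. This is a minimal sketch; the user-agent string below is only an illustrative example.

# Optional: present a regular browser user-agent string to the site
# (the string below is an illustrative example; substitute a current one)
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
)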

NB: The webdriver-manager package can be used with different browsers. Check out the documentation to see the browsers available.
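For instance, here is a minimal sketch of the equivalent setup for Firefox, assuming Firefox is installed on your system:

# Import the Firefox driver manager and the Firefox Service class
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.service import Service as FirefoxService

# Create a Firefox driver, managed the same way as the Chrome one above
firefox_driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))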

Accessing the Amazon webpage

The next step is to access the Amazon webpage from our IDE. Enter the code below to open the browser:

# assign your website to scrape
URL = 'https://www.amazon.com'
# Opening the URL
driver.get(URL)

The code above opens the Amazon webpage using the driver.get() method.

Accessing HTML elements on the webpage

Now that we have successfully opened the Amazon webpage, our next step is to search for the product we want to extract. Just as you would normally use the search box to look for specific products, we will do the same from our IDE.

To extract information, we need to right-click on the Amazon search results page, then click on Inspect and navigate to the Elements tab. Look for an HTML tag that is unique to each of the items. This will ease the extraction process.

Selenium provides many methods for selecting page elements. We can select elements by: ID, class name, XPath, name, tag name, link text, and CSS Selector. You can also use relative locators to target page elements relative to other elements.
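For illustration, here are a few of those strategies side by side; the selectors below are made-up examples, not elements from the Amazon page:

# A few of the locator strategies Selenium supports (illustrative selectors)
element = driver.find_element(By.ID, 'some-id')                         # by element ID
element = driver.find_element(By.CLASS_NAME, 'some-class')              # by class name
element = driver.find_element(By.XPATH, '//div[@class="some-class"]')   # by XPath
element = driver.find_element(By.CSS_SELECTOR, 'div.some-class')        # by CSS selector
elements = driver.find_elements(By.TAG_NAME, 'a')                       # all matching elements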

Here is a function that extracts specific details from the Amazon webpage:

def scrape_amazon(search_term, max_pages=10):
    # Find the search box and input the search term
    search_field = driver.find_element(By.ID, 'twotabsearchtextbox')
    search_field.send_keys(search_term)
    # Click the search button
    search_button = driver.find_element(By.ID, 'nav-search-submit-button')
    search_button.click()
    # Set a timer delay
    time.sleep(5)

Basically, we are writing a function that takes our search_term and the number of pages we want to scrape. To do this, we first identified the element ID of the search box, 'twotabsearchtextbox', on the Amazon webpage. We use this ID to enter the search term (for example, "iPhone") into the search box. To initiate the search, we identified the ID of the search icon, 'nav-search-submit-button', and used the click() function to click on it. We also added a 5-second delay to avoid triggering any detection mechanisms the website may have in place.

Now that we’ve hit the search button, we want to be able to recognize the search results and store them in a list.

NB: All the code given from here on is part of a single function, broken down for explanation purposes. The complete function is given at the end of the article.

continuation of the function:

# Initialize an empty list to store the results
results = []

# Loop through the specified number of pages
for i in range(max_pages):
    # Wait for the page to load
    time.sleep(5)
    # Find all of the search result elements on the page
    search_results = driver.find_elements(By.XPATH, '//div[contains(@class, "s-result-item s-asin")]')

In this code, we have identified the element that holds each search result in the webpage's HTML. We then create a loop that iterates over the specified number of pages, collecting the items displayed on each one. For instance, if you search for iPhones, the results are spread across several pages, each listing a number of iPhones with details such as names, prices, and specifications. This code scans through all of those pages so that each displayed item can be collected.

However, we do not need every piece of information, so we pull out only specific details from each item. Say we need the titles, prices, and ratings; here is what our code would look like.

continuation of the function:

# Loop through each search result and extract the product information
for result in search_results:
    # Getting the product title
    try:
        title = result.find_element(By.CSS_SELECTOR, "h2 a span").text
    except NoSuchElementException:
        title = "N/A"
    # Getting the product price
    try:
        whole_price = result.find_elements(By.XPATH, './/span[@class="a-price-whole"]')
        fraction_price = result.find_elements(By.XPATH, './/span[@class="a-price-fraction"]')
        price = '.'.join([whole_price[0].text, fraction_price[0].text])
    except (NoSuchElementException, IndexError):
        price = "N/A"
    # Getting the product ratings
    try:
        ratings_box = result.find_elements(By.XPATH, './/div[@class="a-row a-size-small"]/span')
        ratings = ratings_box[0].get_attribute('aria-label')
    except (NoSuchElementException, IndexError):
        ratings = "N/A"

The code above loops through all the search results and extracts the product title, price, and rating using CSS selectors and XPath expressions. If a product does not have a listed price or rating, we store 'N/A' instead. In the complete function, these values are then appended to the results list.

Two points to notice. First, the price is made up of two elements, the whole and the fraction; we identified both and combined them into a single price (for example, a whole part of '999' and a fraction of '99' combine to '999.99'). Because find_elements() returns a list rather than raising NoSuchElementException, indexing an empty list raises IndexError, which is why we catch both exceptions. Second, we pulled just a single attribute from the rating element using get_attribute(), as the element contains several details.

Finally, we need to include in our function a clause that clicks the "Next" button on the Amazon webpage, giving us access to the next page as we loop through the search results. Here is what the code looks like:

continuation of the function:

# Click the "Next" button
try:
    next_button = driver.find_element(By.CSS_SELECTOR, "a.s-pagination-next")
    next_button.click()
except NoSuchElementException:
    break

# Close the browser window
driver.quit()

The code first tries to find the element with the CSS selector of the next button. If the element is found, the code clicks on it. If the element is not found, the code breaks out of the loop. It then closes the browser window.

Here is what the complete function looks like:

# Writing a function that scrapes information out of Amazon
def scrape_amazon(search_term, max_pages):
    '''
    This function scrapes each product's title, price, and rating. It takes two arguments:
    search_term: the item we want to search for and scrape
    max_pages: the number of pages to loop through on the Amazon webpage.
    '''
    # Find the search box and input the search term
    search_field = driver.find_element(By.ID,'twotabsearchtextbox')
    search_field.send_keys(search_term)

    # Click the search button
    search_button = driver.find_element(By.ID,'nav-search-submit-button')
    search_button.click()

    # set a timer delay
    time.sleep(5)

    # Initialize an empty list to store the results
    results = []

    # Loop through the specified number of pages
    for i in range(max_pages):

        # Wait for the page to load
        time.sleep(5)

        # Find all of the search result elements on the page
        search_results = driver.find_elements(By.XPATH, '//div[contains(@class, "s-result-item s-asin")]')

        # Loop through each search result and extract the product information
        for result in search_results:

            # Extracting the product's title
            try:
                title = result.find_element(By.CSS_SELECTOR, "h2 a span").text
            except NoSuchElementException:
                title = "N/A"

            # Extracting the product's price
            try:
                whole_price = result.find_elements(By.XPATH, './/span[@class="a-price-whole"]')
                fraction_price = result.find_elements(By.XPATH, './/span[@class="a-price-fraction"]')
                price = '.'.join([whole_price[0].text, fraction_price[0].text])
            except (NoSuchElementException, IndexError):
                price = "N/A"

            # Extracting the product's ratings
            try:
                ratings_box = result.find_elements(By.XPATH, './/div[@class="a-row a-size-small"]/span')
                ratings = ratings_box[0].get_attribute('aria-label')
            except (NoSuchElementException, IndexError):
                ratings = "N/A"

            # Add the product information to the list of results
            results.append({
                "title": title,
                "price": price,
                "rating": ratings
            })

        # Click the "Next" button
        try:
            next_button = driver.find_element(By.CSS_SELECTOR,"a.s-pagination-next")
            next_button.click()
        except NoSuchElementException:
            break

    # Close the browser window
    driver.quit()
    return results  # Return the list of results

In summary, this function takes two arguments: the item we want to find (search_term) and the number of pages we want to scrape (max_pages). Once called, it loops through the pages and extracts the title, price, and rating of each item displayed. These details are stored in the results list, which is returned at the end.

Notice the NoSuchElementException clause (together with IndexError for the price and ratings, since find_elements() returns an empty list rather than raising an exception); it handles the case where a piece of information isn't available on the page. The time.sleep() function delays the loop for the stated number of seconds.
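As a side note, fixed time.sleep() delays work, but Selenium also provides explicit waits that pause only until a condition is met. Here is a minimal sketch of how the page-load delay could be replaced, assuming the same result-item XPath used in the function:

# Import Selenium's explicit-wait helpers
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one search result to be present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "s-result-item s-asin")]'))
)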

NB: As of the time of writing, this function runs. However, if Amazon updates their page structure, the HTML elements used here may change and break the function; the selectors would then need to be updated.

Using the function

If you want to search for iPhones, for instance, and scrape the data from the first five pages, here is how to use the function to achieve that.

search_term = 'iPhone'
max_pages = 5
results = scrape_amazon(search_term, max_pages)

Here is what your result looks like: a list of dictionaries, one per product.
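The exact output depends on what Amazon lists at the time you run the function; the entries below are illustrative values, not real scraped data:

[{'title': 'Apple iPhone 13, 128GB, Midnight', 'price': '499.99', 'rating': '4.5 out of 5 stars'},
 {'title': 'Apple iPhone 12, 64GB, Black', 'price': '379.00', 'rating': '4.6 out of 5 stars'},
 ...]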

And it’s a wrap, we have successfully scraped Amazon’s web page using Selenium.

Conclusion

Getting data from certain web pages can be a challenge, especially if they do not provide APIs for that data. Selenium offers a way to successfully scrape data from such pages. In this article, we covered the basics of setting up Selenium to access a webpage (the Amazon webpage) and writing a basic function to pull data out of it.