Amazon is one of the world’s largest online marketplaces, with millions of products listed on the platform. Many customers rely on reviews to make informed buying decisions. As an Amazon seller, you might want to scrape reviews to gather feedback on your product, monitor competitors, or perform market research.
In this tutorial, we will show you how to scrape Amazon reviews using Python and proxies. Proxies are necessary because Amazon might block your IP address if you make too many requests in a short period of time. By using proxies, you can distribute your requests across different IP addresses to avoid detection.
Prerequisites
Before we start, make sure you have the following installed on your system:
- Python 3
- requests library
- BeautifulSoup library
- csv module (part of Python's standard library, so no installation is needed)
You can install the two third-party libraries using the following commands in your command prompt:
pip install requests
pip install beautifulsoup4
Step 1: Import Libraries
import requests
from bs4 import BeautifulSoup
import csv
import random
import time
- requests: for making HTTP requests to Amazon
- BeautifulSoup: for parsing HTML of the review pages
- csv: for writing the scraped data to a CSV file
- random: for randomly choosing a proxy and user-agent string
- time: for delaying between requests to avoid detection
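As a rough illustration of the delay mentioned in the list above, the sketch below pauses for a random interval between requests. The helper name `polite_sleep` and the default range of 2 to 5 seconds are assumptions made for illustration; they are not values recommended by Amazon.

```python
import random
import time


def polite_sleep(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random duration between min_seconds and max_seconds.

    Hypothetical helper: the name and the default range are illustrative
    assumptions. Returns the delay that was used, in seconds.
    """
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Calling `polite_sleep()` after each request spaces the requests out unevenly, which looks less like an automated client than a fixed interval would.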
Step 2: Set up Proxies
Now, you need to set up a list of proxies that you will use to scrape Amazon reviews. You can find free proxy lists online. Here is an example list; the addresses are placeholders from a private network range, so replace them with working proxies:
proxies = [
    'http://10.10.1.10:3128',
    'https://10.10.1.11:1080',
    'http://10.10.1.10:80',
]
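The requests library selects a proxy based on the scheme of the target URL, so a proxy registered only under the `http` key is not used for `https://` pages. The sketch below shows one way to pick a random proxy and register it for both schemes. The helper name `pick_proxy_config` is hypothetical, and the addresses are the placeholder values from the example list.

```python
import random

# Placeholder addresses from the example list; replace with working proxies.
proxies = [
    'http://10.10.1.10:3128',
    'https://10.10.1.11:1080',
    'http://10.10.1.10:80',
]


def pick_proxy_config(proxy_list):
    """Pick one proxy at random and map it to both URL schemes.

    Hypothetical helper: the returned dictionary has the shape expected by
    the `proxies` argument of requests.get.
    """
    proxy_url = random.choice(proxy_list)
    return {'http': proxy_url, 'https': proxy_url}
```

The result can be passed directly to a request, for example `requests.get(url, proxies=pick_proxy_config(proxies))`.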
Step 3: Set up User-Agent
Amazon might block requests that carry the default user-agent of the requests library. To avoid this, send a browser-like user-agent string instead. You can keep a list of such strings and choose one at random for each request. Here is an example list of user-agent strings:
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393',
]
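Choosing a user-agent at random for each request can be wrapped in a small function. The sketch below is one possible shape; the name `random_headers` is hypothetical, and the list is shortened to two of the example strings to keep the block self-contained.

```python
import random

# Two entries copied from the example list, to keep the sketch short.
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
]


def random_headers(agent_list):
    """Return a headers dictionary with a randomly chosen User-Agent.

    Hypothetical helper: the result is meant for the `headers` argument
    of requests.get.
    """
    return {'User-Agent': random.choice(agent_list)}
```

Building the headers in one place makes it easy to add further headers later without touching the request code.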
Step 4: Scrape Reviews
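Now you can start scraping Amazon reviews. Here is an example function that loops through the review pages of a product listing and writes the text, rating, and date of each review to a CSV file. Note that the num_reviews argument sets how many pages are requested, not how many individual reviews are collected: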
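def scrape_reviews(product_url, num_reviews):
    # num_reviews is the number of review pages to request
    # Open CSV file for writing
    with open('reviews.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Review', 'Rating', 'Date'])
        # Loop through pages of reviews
        for i in range(1, num_reviews + 1):
            # Choose a random proxy and user-agent; register the proxy for both
            # schemes so that HTTPS requests are routed through it as well
            proxy_url = random.choice(proxies)
            proxy = {'http': proxy_url, 'https': proxy_url}
            headers = {'User-Agent': random.choice(user_agents)}
            # Make request to review page (adjust the path if your review URLs differ)
            response = requests.get(product_url + f'/product-reviews/{i}',
                                    proxies=proxy, headers=headers, timeout=30)
            # Pause between requests to avoid detection (illustrative range)
            time.sleep(random.uniform(2, 5))
            if response.status_code != 200:
                continue  # skip pages that were blocked or failed to load
            soup = BeautifulSoup(response.content, 'html.parser')
            # Extract review information and save to CSV file
            reviews = soup.find_all('div', {'class': 'a-section review aok-relative'})
            for review in reviews:
                text_tag = review.find('span', {'class': 'a-size-base review-text review-text-content'})
                rating_tag = review.find('span', {'class': 'a-icon-alt'})
                date_tag = review.find('span', {'class': 'a-size-base a-color-secondary review-date'})
                if text_tag is None or rating_tag is None or date_tag is None:
                    continue  # skip reviews that are missing a field
                date_text = date_tag.text.strip()
                # Drop a leading "on " if present; otherwise keep the full date text
                date = date_text[3:] if date_text.startswith('on ') else date_text
                writer.writerow([text_tag.text.strip(), rating_tag.text[:3], date])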
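Once reviews.csv exists, the data can be summarized with the csv module alone. The sketch below computes the average rating, assuming the file has the 'Review', 'Rating', 'Date' header written by the function above and that the rating column holds values such as 4.0. The helper name `average_rating` is hypothetical.

```python
import csv


def average_rating(lines):
    """Return the mean of the 'Rating' column, or None if no rating parses.

    `lines` is any iterable of CSV text lines, such as an open file.
    Hypothetical helper, assuming a header row with a 'Rating' column.
    """
    ratings = []
    for row in csv.DictReader(lines):
        try:
            ratings.append(float(row['Rating']))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric rating
    return sum(ratings) / len(ratings) if ratings else None
```

To use it on the scraped file, open reviews.csv with `newline=''` and `encoding='utf-8'` and pass the file object to `average_rating`.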
Alternative Solution
If you don’t know how to use Python, you can use a web scraper to scrape Amazon reviews. A web scraper is a tool that extracts data from websites automatically. There are many web scraping tools available, some of which are free and some of which are paid.
There are many Amazon scrapers available online. One popular web scraper is ParseHub, a free tool that allows you to extract data from websites using a point-and-click interface. Here’s how you can use ParseHub to scrape Amazon reviews:
- Go to the ParseHub website and create an account.
- Install the ParseHub browser extension.
- Open Amazon and go to the product page you want to scrape.
- Click on the ParseHub browser extension and select “Create New Project”.
- Use the point-and-click interface to select the review text, rating, and date.
- Run the scraper and download the data as a CSV file.
Note that scraping Amazon reviews without permission may violate Amazon’s terms of service. Use this method at your own risk and always make sure to follow ethical and legal guidelines.
Conclusion
Scraping Amazon reviews can be a useful way for sellers to monitor feedback on their products and gain insights into market trends. However, it is important to use proxies to avoid detection and reduce the chance that Amazon blocks your requests.
In this tutorial, we showed you how to scrape Amazon reviews using Python and proxies. We used the requests library to make HTTP requests, BeautifulSoup to parse the HTML of the review pages, and csv to write the scraped data to a CSV file.
We also showed you how to set up a list of proxies and randomly choose one for each request to avoid detection by Amazon. Additionally, we used a list of user-agent strings to prevent Amazon from blocking our requests.
Remember that scraping Amazon reviews without permission may violate Amazon’s terms of service and could lead to legal consequences. Use this tutorial at your own risk and always make sure to follow ethical and legal guidelines.