In this guide, we will show you how to scrape any website using an AI chatbot such as ChatGPT! We will also discuss what web scraping is, whether it is legal, and whether AI-generated web scraping is viable. We will end with some alternatives that do a great job of scraping large amounts of data.
If you have been keeping up with recent news and trending posts, you probably have heard of the chatbot designed by OpenAI called ChatGPT. People have been using ChatGPT to do their homework, and some people have been building apps with it. Yes, ChatGPT can generate code, and that is the method we are going to use today to web scrape any website!
What is web scraping exactly and what is it good for? Web scraping is any method of harvesting data from one or more web pages. Most businesses use web scraping for market research, sales, lead generation, price monitoring, SEO strategy and much more. Enterprise businesses can legally scrape large amounts of financial data, and even use web scraping for enterprise-scale machine learning, which is how the models behind ChatGPT were themselves trained!
Is web scraping even legal? Web scraping is generally legal as long as the information being scraped is publicly available, although the rules vary by jurisdiction and a site’s terms of use may restrict it. You should also consider the ethics of web scraping, and ensure you are running efficient scripts that do not slow down a website’s server.
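One way to act on the "do not slow down the server" advice is to check a site's robots.txt and pause between requests. The sketch below is a minimal illustration using only Python's standard library. The robots.txt content, the user agent name "my-scraper" and the example.com URLs are all made up for the example, and following robots.txt is a courtesy, not a legal guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str,
                 default_seconds: float = 1.0) -> float:
    """Use the site's Crawl-delay if it sets one, else fall back to a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("my-scraper", "https://example.com/restaurants")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = polite_delay(parser, "my-scraper")

print(allowed, blocked, delay)
```

In a real script you would call time.sleep(delay) between requests and skip any URL for which can_fetch returns False.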
Okay, web scraping is useful and generally legal, so let’s start web scraping using AI!
How to Web Scrape Any Website Using ChatGPT
It would be great if AI could automatically scrape any website for you from within a chat box. However, web scraping, especially enterprise web scraping, requires a lot of resources to work. Unlike a math problem, it needs a program and/or a server to fetch and harvest the data. This is a bummer, but it makes sense: ChatGPT is a pre-trained language model and, at the time of writing, cannot fetch data from the internet.
Here is what happens when you try to get ChatGPT to directly web scrape for you:
We asked ChatGPT:
“Can you scrape the top 100 rated restaurants in Toronto, with their contact details?”
ChatGPT replies:
“I am not able to scrape the web for you, but I can give you some general information about how to do it. To scrape the top 100 rated restaurants in Toronto, you will need to use a web scraping tool or library, such as BeautifulSoup or Scrapy, to extract the information from the website of a restaurant rating service, such as Yelp or TripAdvisor. You will also need to have some knowledge of programming and the structure of web pages in order to effectively extract the information you need. Additionally, scraping data from a website without permission is usually against the site’s terms of use and can be illegal in some jurisdictions.”
Okay, at least ChatGPT gave us some direction. BeautifulSoup and Scrapy are Python libraries, so we can ask ChatGPT to write Python code that uses them.
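To give a feel for what BeautifulSoup does before involving a real website, here is a minimal sketch that parses a small HTML snippet held in a string. The markup, class names and phone numbers are invented for illustration; real pages are far messier.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
SAMPLE_HTML = """
<ul>
  <li class="restaurant"><a href="/r/1">Pasta Place</a><span class="phone">416-555-0101</span></li>
  <li class="restaurant"><a href="/r/2">Noodle Bar</a><span class="phone">416-555-0102</span></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

restaurants = []
# find_all returns every <li> whose class list contains "restaurant".
for item in soup.find_all("li", class_="restaurant"):
    restaurants.append({
        "name": item.find("a").get_text(strip=True),
        "phone": item.find("span", class_="phone").get_text(strip=True),
    })

print(restaurants)
```

The same two calls, find_all to locate repeated blocks and find to pick fields out of each block, are what most BeautifulSoup scrapers are built from.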
Our new ChatGPT Query:
“Can you write me code to scrape restaurant details from: https://www.yelp.com/search?find_desc=Restaurants&find_loc=Toronto%2C+Ontario?”
ChatGPT Replied:
“Sure, here is an example of how you could use the Python library BeautifulSoup to scrape restaurant details from the Yelp website for Toronto:” and generated the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Toronto%2C+Ontario"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
restaurant_divs = soup.find_all("div", class_="lemon--div__373c0__1mboc margin-t3__373c0__1l90z margin-b3__373c0__q1tGH border--top__373c0__3gXLy border-color--default__373c0__3-ifU")
for restaurant_div in restaurant_divs:
    name = restaurant_div.find("a", class_="lemon--a__373c0__IEZFH link__373c0__29943 link-color--blue-dark__373c0__1mhJo link-size--inherit__373c0__2JXk5").text
    address = restaurant_div.find("address").text
    phone = restaurant_div.find("div", class_="lemon--div__373c0__1mboc arrange-unit__373c0__o3tjT arrange-unit-fill__373c0__3Sfw1 border-color--default__373c0__3-ifU").text
    print(f'Name: {name}\nAddress: {address}\nPhone: {phone}\n---')
Let’s try the code in Python!
- Firstly, remember to install the requests and BeautifulSoup libraries on your system, using a command such as: “pip install requests beautifulsoup4”
- Now let’s add the code to our favourite Python IDE.
- Click Run and, hopefully, restaurant names, addresses and phone numbers will be printed!
Hmm, the application ran, but nothing was printed.
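This is probably because websites change their code and layouts all the time. The class names in ChatGPT’s code (such as “lemon--div__373c0__1mboc”) look auto-generated, and names like these tend to change whenever a site is updated. The code ChatGPT generated was therefore probably based on an outdated version of the page that it learned from a while ago.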
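The effect of changing class names is easy to reproduce on a toy example. In the sketch below, both HTML snippets and their class names are hypothetical, written to imitate auto-generated names. An exact class match that worked on the old markup finds nothing on the new markup, so a loop over the results simply never runs and nothing is printed. Matching on a prefix is shown as one possible workaround, on the assumption that the prefix stays stable, which a real site does not promise.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical "before" and "after" markup; only the generated suffix differs.
OLD_HTML = '<div class="card__373c0__1mboc"><a href="/biz/1">Pasta Place</a></div>'
NEW_HTML = '<div class="card__09f24__x7Qp2"><a href="/biz/1">Pasta Place</a></div>'


def count_exact(html: str, class_name: str) -> int:
    """Count <div> elements that carry exactly this class name."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_=class_name))


def count_by_prefix(html: str, prefix: str) -> int:
    """Count <div> elements with a class that starts with the given prefix."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile("^" + re.escape(prefix))
    return len(soup.find_all("div", class_=pattern))


print(count_exact(OLD_HTML, "card__373c0__1mboc"))  # matches the old markup
print(count_exact(NEW_HTML, "card__373c0__1mboc"))  # finds nothing after the change
print(count_by_prefix(NEW_HTML, "card__"))          # prefix match still finds the div
```

Printing the number of matched elements before looping over them is a quick way to tell "the selector is stale" apart from "the page has no data".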
So is there a way to fix ChatGPT’s code? There is, and we tried. It’s not easy to write web scraping code for a website such as Yelp. We tried changing the target divs, but they seem to change dynamically, and that is only one of many enterprise web scraping obstacles. Even if the code works perfectly, websites still deter scripts from scraping their data using methods such as Captchas and IP blocking, which ChatGPT even warned us about (and both of which ParseHub can bypass).
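If you do keep experimenting with hand-written code, two defensive habits help with debugging: never assume find() returned an element, and check whether the response looks like a block page before parsing it. The helpers below are a hedged sketch, not a fix for Yelp. The names safe_text and looks_blocked are our own, and the status codes and the "captcha" keyword check are rough heuristics chosen for illustration.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


def safe_text(parent: Tag, name: str, **attrs) -> Optional[str]:
    """Return the stripped text of the first matching child, or None if absent."""
    found = parent.find(name, **attrs)
    return found.get_text(strip=True) if found is not None else None


def looks_blocked(status_code: int, html: str) -> bool:
    """Rough heuristic (an assumption): 403/429 or a captcha mention suggests a block."""
    return status_code in (403, 429) or "captcha" in html.lower()


# Hypothetical markup with no <address> element, to show the None fallback.
card = BeautifulSoup("<div><a>Pasta Place</a></div>", "html.parser").find("div")
name = safe_text(card, "a")
address = safe_text(card, "address")

print(name, address)
print(looks_blocked(429, ""))
```

With requests, you would pass page.status_code and page.text to looks_blocked right after requests.get(url), and report the problem instead of silently printing nothing.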
Is ChatGPT Web Scraping Viable?
With all the ChatGPT hype, it’s disappointing to learn that it cannot web scrape for you, or even reliably generate working web scraping code. Ian Bogost of The Atlantic says to treat ChatGPT like a toy, not a tool, and it’s true. It was designed for chatting with, not yet for replacing programmers and scraping tools. So is ChatGPT web scraping viable? It’s only viable for teaching you the basics of programming and web scraping. You can ask ChatGPT about web scraping and its legality, and have it generate sample web scraping code. However, there are too many obstacles in web scraping, especially enterprise-level scraping. In fact, there are many cons to web scraping with Python or any other coding language. Standalone, visual software is much easier to scrape with and doesn’t require extensive coding knowledge.
ChatGPT Web Scraping Alternative: ParseHub
ParseHub is the leading no-code, visual web scraping tool on the market. You can easily scrape Yelp, or any website, without the need for any programming or chatbot shenanigans. ParseHub allows you to point and click on the data you wish to scrape, and it comes with extensive functionality that replaces any methods or loops you would otherwise need to program in your script. ChatGPT only shows you a sample of code. It does not show you how to page through results, bypass blocks, scrape additional nested data or do anything else beyond a simple example. Even for a simple use case, ChatGPT was not able to produce working code. Although ParseHub is scraping software and not an AI, it actually works! We recommend you try ParseHub for free, no credit card required, to get all your web scraping done. Here is how you can easily web scrape any website for free.
Bonus: Managed Enterprise Web Scraping
If you found this article, chances are you would like more big data and an easy web scraping solution. Who wouldn’t like an AI to scrape all their online data? Since ChatGPT cannot scrape the essential data your business requires at the moment, we have the next best solution. ParseHub Plus is the enterprise, managed version of ParseHub. This means all your data sourcing, scraping and validating is done by a managed web scraping team. You will not even need to use the ParseHub software, and therefore you can pretend all the big data received is from AI! Many corporations use large-scale web scraping, some of which are our managed clients.
Looking to scrape large amounts of data with minimal effort? Book a call with ParseHub Plus, and get your free sample data export!
Happy Scraping!