The Internet is a rich repository of data and resources. Every webpage we interact with is primarily built using HTML (HyperText Markup Language), a standard markup language that structures the content displayed on the web. In various scenarios, such as web scraping, testing, or debugging, you might find the need to download a webpage’s HTML. This comprehensive guide will help you understand how to do this effectively and ethically.
Also see: Download All Files From a Website Directory Using Wget in Windows 11 or 10
Before proceeding, it’s important to emphasize that whenever you download HTML or scrape a website, you should respect the site’s robots.txt file and its terms of service. Ensure that your actions comply with the website’s rules and the laws of your jurisdiction.
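If you later automate any of the methods below, checking robots.txt programmatically is straightforward. Here is a minimal sketch using Python’s standard-library urllib.robotparser; the URL is a placeholder, and the "*" user agent simply means “any crawler”:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain)
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

url = "https://example.com/some-page"
if robots.can_fetch("*", url):
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)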
What is HTML?
HTML, an abbreviation for HyperText Markup Language, is the backbone of web content. It is the standard markup language used for creating webpages that can be displayed in a web browser. HTML works in conjunction with technologies like Cascading Style Sheets (CSS) and scripting languages such as JavaScript to enhance the structure, style, and interactive functionalities of a webpage.
An HTML document is essentially a text file composed of HTML elements, denoted by tags. These tags label parts of the content, defining them as “heading”, “paragraph”, “table”, and so forth.
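For example, a very small HTML document might look like this (a simplified illustration rather than the source of any real webpage):

<!DOCTYPE html>
<html>
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph of text.</p>
  </body>
</html>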
Related resource: How to Convert HTML to PDF in Windows 11/10
How to download HTML file from a website
There are several methods to download HTML from a website, and the right one for you may depend on your use case. We’ll explore a few popular ones: using a web browser, using the command line, and using a programming language like Python.
Using a Web Browser
For the purpose of this article, we will use Google Chrome as the browser of choice. However, the process is quite similar in other modern browsers.
Downloading HTML source code from a website
If you want to download only the HTML file from a website without any associated resource files, follow these steps:
- Navigate to the Webpage: Open Google Chrome and navigate to the webpage whose HTML you wish to download.
- View Page Source: Right-click anywhere on the webpage whose HTML you want to download and select “View Page Source”. This action will open a new tab that displays the webpage’s HTML content.
- Save HTML: Within this new tab, right-click again and choose “Save As”. Select the desired location on your computer where you wish to save the HTML file, and click “Save”. You now have only the webpage’s HTML (without any of the resource files) downloaded on your computer.
Useful tip: How to Run HTML Code in Notepad Windows 11
Downloading HTML with pictures, CSS and JS from a website
If you want to download the full webpage, including its resource files such as pictures, CSS, and JS files, follow these steps:
- Navigate to the Webpage: Open Google Chrome and navigate to the webpage whose HTML you wish to download. Within the webpage, right-click anywhere on the page and choose “Save As”.
- Save HTML: A dialog box will open, prompting you to select the location on your computer where you wish to save the HTML file. Click “Save”. You now have the webpage’s HTML downloaded on your system.
When you save a webpage as HTML this way, using Chrome or any other web browser, the browser actually downloads two things:
- HTML File: This is the .html file that contains the HTML of the webpage. This file includes the structure of the webpage along with the content. The HTML file has references to any external resources the webpage uses, such as images, CSS files, and JavaScript files.
- Resources Folder: Along with the HTML file, the browser also downloads a folder with the same name as the HTML file, but with ‘_files’ appended to the name. This folder contains all the resources that the webpage uses which are downloaded and stored locally. These resources can include images, CSS files, JavaScript files, and more. These files are necessary for the webpage to display correctly offline.
The browser then adjusts the HTML file’s links to these resources so that they point to the local copies in the resources folder, allowing the webpage to load correctly even without an Internet connection.
However, it’s important to note that not all parts of the webpage might work offline. For example, if the webpage uses a server-side script (like PHP), that script won’t run when viewing the page offline. Similarly, if the webpage loads resources dynamically using JavaScript, those resources might not be downloaded.
Recommended guide: How to Check When a Web Page Was Last Updated
Why doesn’t Chrome save the HTML as a single file?
The primary reason is to maintain the correct structure and functionality of the webpage. Webpages are often complex, comprising not just HTML but also linked resources like CSS for styling and JavaScript for interactivity. These resources are often stored in separate files and linked from the HTML file.
If Chrome were to consolidate everything into a single HTML file, it would need to inline all these resources, i.e., insert the contents of these resource files directly into the HTML file. While technically possible, this method is generally uncommon due to disadvantages associated with readability, editability, functionality, and performance.
Therefore, to maintain the structure, functionality, and performance of the webpage, Chrome chooses to download the HTML and linked resources as separate files rather than merging everything into a single HTML file.
See also: Create Website or Application Shortcut on Desktop using Chrome
Using the Command Line
If you are working on a Unix-like system such as Linux or macOS (or have the tools installed on Windows), you can use the wget or curl command to download HTML from a website.
- Using wget:
wget https://example.com
This command downloads the HTML of the webpage and saves it in a file (for this URL, index.html) in your current directory.
- Using curl:
curl https://example.com > example.html
This command fetches the HTML of the webpage and redirects it into a file named “example.html”.
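Both commands above fetch only the page’s raw HTML. If you also want the images, CSS, and JavaScript the page references (similar to the browser’s “Save As” behaviour described earlier), wget can download those too, and curl can follow redirects and write straight to a chosen filename. A short sketch of both, with a placeholder URL:

wget -p -k -E https://example.com
curl -L -o example.html https://example.com

Here -p (--page-requisites) tells wget to also download images, CSS, and JavaScript, -k (--convert-links) rewrites links in the saved HTML to point at the local copies, and -E (--adjust-extension) saves files with an .html extension where needed. For curl, -L follows redirects and -o writes the output to the named file.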
Using Python
Python is a versatile programming language with several libraries for web scraping, such as requests and beautifulsoup4. Below is a simple script to download HTML:
import requests

url = "https://example.com"
response = requests.get(url)

with open('example.html', 'w') as file:
    file.write(response.text)
This script sends a GET request to the webpage, receives the HTML in the response, and writes it to a file named ‘example.html’.
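In practice, you may also want to verify that the request succeeded before writing the file. A minimal sketch building on the same requests script; the URL, timeout value, and filename are illustrative:

import requests

url = "https://example.com"
response = requests.get(url, timeout=10)  # give up if the server doesn't respond within 10 seconds
response.raise_for_status()               # raise an error for 4xx/5xx responses instead of saving an error page

with open('example.html', 'w', encoding=response.encoding or 'utf-8') as file:
    file.write(response.text)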
If you’re new to Python, you might face errors like “ModuleNotFoundError”, which means the module you’re trying to import is not found. For instance, if Python can’t find the requests module, you need to install it. Python uses pip, a package manager, to install modules. Open your command line and type:
pip install requests
This installs the requests module. If you get a similar error for beautifulsoup4 or another module, install it the same way.
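For example, the scraping library used in the next section is installed the same way:

pip install beautifulsoup4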
Advanced Usage: Web Scraping
Once you have the HTML, you might want to extract specific data from it, a practice known as web scraping. Python’s beautifulsoup4 is a library that makes this task easier. It parses the HTML, making the tree-like structure of the HTML tags navigable.
Here’s a brief example:
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Let's say we want to extract all the headers in the page
headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

for header in headers:
    print(header.text)
In this script, BeautifulSoup is used to parse the HTML, and the find_all function finds every header tag. The script then prints the text of each header tag.
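The same approach works for other elements. For instance, here is a short sketch (again using a placeholder URL) that collects every link on the page:

from bs4 import BeautifulSoup
import requests

url = "https://example.com"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Find every anchor tag with an href attribute and print its target
for link in soup.find_all('a', href=True):
    print(link['href'])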
Dealing with Dynamic Content
Many modern websites load their content dynamically using JavaScript. This means that the initial HTML of the page doesn’t contain all the data visible on the webpage. In such cases, tools like wget, curl, or requests won’t be able to download the complete HTML as seen in the browser.
For such cases, you can use tools like Selenium, Puppeteer, or Playwright. These tools open a browser instance, allow the JavaScript to execute, and then provide access to the final HTML. Note, however, that these tools have a higher learning curve and may require more resources to run.
Here is a basic example of how to use Selenium with Python to download HTML:
from selenium import webdriver

# Make sure the chromedriver is in your PATH
driver = webdriver.Chrome()
driver.get('https://example.com')

html = driver.page_source

with open('example.html', 'w') as file:
    file.write(html)

driver.quit()
This script opens a Chrome browser, navigates to the webpage, then saves the source HTML. It’s important to note that this method can be slower and more resource-intensive than the previous methods, but it allows you to interact with the JavaScript on the page and download the resulting HTML.
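If the data you need appears only after the page’s scripts have run, Selenium can wait for a specific element before saving the source. A minimal sketch; the element ID used here is hypothetical and would need to match something on the page you’re working with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Wait up to 10 seconds for a (hypothetical) element with id "content" to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)

with open('example.html', 'w') as file:
    file.write(driver.page_source)

driver.quit()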
Conclusion
Whether you’re a programmer, tester, hobbyist, or someone interested in the structure of webpages, knowing how to download HTML from a website can be a valuable skill. There are various methods to achieve this, and the best one depends on your specific needs and the tools you have at your disposal. Always remember to respect the website’s terms of service and robots.txt file, and only download HTML when it’s appropriate and legal to do so.