3 ways to scrape CSS from a website

Bart Veneman in blog

Over the years I have explored several ways to get hold of a website’s CSS. So far I’ve found these three options, each with their pros and cons. None of these methods checks whether the CSS is actually used; they only collect as much CSS as they possibly can.

Table of contents

  1. Option 1: Scrape the HTML
  2. Option 2: Use CSSOM
  3. Option 3: Use CSSCoverage API
  4. Summary

Option 1: Scrape the HTML

Project Wallace scrapes websites by fetching the HTML and then going through all HTML elements to grab bits of CSS out of them. This is a fast and cheap way to scrape websites because it does not involve a headless browser, only a handful of dependencies and some clever thinking. More specifically it looks like this (a code sketch follows the list):

  1. Fetch the HTML document belonging to the URL you’ve entered (this works best in a NodeJS environment but in some cases is also possible in the browser)
  2. Parse the HTML into an AST
    1. Tip: use DOMParser.parseFromString() if you’re in a browser environment or linkedom if you’re in a JavaScript engine like NodeJS
  3. Walk the AST and grab every <style> element
    1. Each <style>’s contents can be added to our CSS as-is
  4. Walk the tree and grab every <link rel~="stylesheet">
    1. Grab the href from the <link>
    2. Fetch the href’s contents
    3. Add the contents to our CSS
  5. Walk the tree and grab every [style] element
    1. The style contents of <div style="color: red; margin: 0"> can be taken as-is
    2. Make up a selector and rule for the single element (like div in this example), or one selector and rule for all elements with inline styles (inline-styles { color: red, etc. })
    3. Add the inline CSS to the rule
    4. Add the rule(s) to our CSS
  6. Recursively scrape any CSS @import
    1. Parse the CSS into an AST
    2. Walk the tree and take each import atrule
    3. Take the url() of the import
    4. Download the contents of the URL
    5. Add to our CSS
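
Here’s a minimal sketch of steps 1 through 5, assuming a NodeJS 18+ environment (for the global fetch) with linkedom. Error handling, timeouts and the recursive @import step (step 6) are left out, the function name is made up, and the [x-inline-style] wrapper is one arbitrary way to do step 5.2:

import { parseHTML } from 'linkedom'

async function scrape_css_from_html(url) {
	let html = await (await fetch(url)).text() // 1. fetch the HTML document
	let { document } = parseHTML(html) // 2. parse the HTML into an AST

	let css = ''

	// 3. every <style> element's contents can be added as-is
	for (let style of document.querySelectorAll('style')) {
		css += style.textContent
	}

	// 4. fetch the contents of every <link rel~="stylesheet">
	for (let link of document.querySelectorAll('link[rel~="stylesheet"]')) {
		let href = new URL(link.getAttribute('href'), url) // resolve relative hrefs
		css += await (await fetch(href)).text()
	}

	// 5. wrap all inline styles in a single made-up rule
	for (let element of document.querySelectorAll('[style]')) {
		css += `[x-inline-style] { ${element.getAttribute('style')} }\n`
	}

	return css
}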

Pros and cons

✅ Pros:

  - Cheap to run on a server
  - Returns the CSS as it was sent to the browser / as authored
  - Can be run in your browser or other JavaScript runtimes

❌ Cons:

  - A lot of work to manage state, timeouts, error handling and data flows
  - Does not easily fit in a bookmarklet
  - Does not find adoptedStyleSheets or CSS injected with runtime CSS-in-JS

Option 2: Use CSSOM

The CSSOM is a collection of APIs that can be used to manipulate CSS from JavaScript. Part of this is the document.styleSheets property that we can use to grab all the CSS from a webpage. It’s such a small task that I’ll put the entire script here:

CSSOM Example

function scrape_css() {
	let css = ''

	for (let stylesheet of document.styleSheets) { // [1]
		try {
			for (let rule of stylesheet.cssRules) { // [2]
				css += rule.cssText // [3]
			}
		} catch (error) {
			// accessing the rules of a cross-origin stylesheet throws, see [3]
		}
	}

	return css
}

Explanation

  1. Go over all the stylesheets of document.styleSheets
  2. Take the cssRules of each styleSheet
  3. Read the cssText property from each CSSRule and add it to our css string. Accessing the rules of a cross-origin stylesheet throws a SecurityError, so you may want to wrap that in a try-catch block (as shown above).

Pros and cons

✅ Pros:

  - Much simpler than HTML scraping
  - Fits in a bookmarklet easily
  - Can be run in your browser or any JavaScript runtime that supports running (headless) browsers

❌ Cons:

  - Requires a browser (‘real’ or headless), making it more expensive than HTML scraping to run on a server
  - Does not return the CSS in the format that it was authored in (it changes color notations, etc.)
  - Does not scrape inline styles
  - Cross Origin errors sometimes happen and are hard to solve
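
Since this approach fits in a bookmarklet, here’s a minimal sketch of what that packaging could look like; this is my own assumption of how you’d wrap it, not an official snippet. Save the javascript: URL below as a bookmark and clicking it logs the page’s CSS to the console:

javascript:(() => {
	let css = ''
	for (let stylesheet of document.styleSheets) {
		try {
			for (let rule of stylesheet.cssRules) css += rule.cssText
		} catch (error) {
			// skip cross-origin stylesheets that throw here
		}
	}
	console.log(css)
})()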

Option 3: Use CSSCoverage API

Chromium-based browsers, headless or not, have the CSSCoverage API, which can be used to detect which parts of your CSS are actually used and which parts aren’t. A great API in itself, but we can also use it to find all the CSS.

CSSCoverage Example

import { chromium } from 'playwright' // or 'puppeteer'

async function scrape() {
	let browser = await chromium.launch() // [1a]
	let page = await browser.newPage() // [1b]

	await page.coverage.startCSSCoverage() // [2]
	await page.goto('https://example.com') // [3]
	let coverage = await page.coverage.stopCSSCoverage() // [4]

	await browser.close() // clean up so the browser process doesn't linger

	let css = ''
	for (let entry of coverage) { // [5]
		css += entry.text
	}

	return css
}

Explanation

  1. Create a new browser and page
  2. Tell the browser to prepare to collect some coverage information. This must be done before going to a URL if you want to know all the CSS on the page after it loads
  3. Go to the actual URL you want to scrape
  4. Stop recording and collect the coverage report from the browser
  5. Go over the coverage report and extract the CSS

Pros and cons

✅ Pros:

  - Much simpler than HTML scraping
  - Can be run in any JavaScript runtime that supports running (headless) browsers
  - CSSCoverage can also be collected between opening a page, doing interactions and navigating to other pages (see the sketch below)

❌ Cons:

  - Requires a browser (‘real’ or headless), making it more expensive than HTML scraping to run on a server
  - Does not run in a bookmarklet
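
To illustrate that last pro, here’s a sketch of collecting coverage across interactions and navigations, assuming Playwright. The resetOnNavigation option keeps coverage from being cleared on each page.goto(); the URLs and selector are made up for illustration:

await page.coverage.startCSSCoverage({ resetOnNavigation: false }) // keep coverage across navigations
await page.goto('https://example.com')
await page.click('nav a') // hypothetical interaction that may trigger more CSS
await page.goto('https://example.com/about')
let coverage = await page.coverage.stopCSSCoverage() // includes entries from both pages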

Summary

Each of these methods has its pros and cons, so which one you’ll end up using really depends on your use case.

                        HTML Scraper            CSSOM    CSSCoverage API
Leaves CSS intact       ✅                      ❌       ✅
Cost to run on server   💰                      💰💰     💰💰
Complexity              100                     10       30
Runs in bookmarklet     ✅ (a big bookmarklet)  ✅       ❌
Scrape inline styles    ✅                      ❌       ❌

Hope this was helpful. Did I miss anything? Let me know!
