3 ways to scrape CSS from a website
Over the years I have explored several ways to get hold of a website’s CSS. So far I’ve found these three options, each with its own pros and cons. Note that none of these methods check whether the CSS is actually used; they only collect as much CSS as they possibly can.
Option 1: Scrape the HTML
Project Wallace scrapes websites by fetching the HTML and then going through all HTML elements to grab bits of CSS out of them. This is a fast and cheap way to scrape websites because it does not involve a headless browser, only a handful of dependencies and some clever thinking. More specifically it looks like this (a rough code sketch follows the list):
- Fetch the HTML document belonging to the URL you’ve entered (this works best in a NodeJS environment but in some cases is also possible in the browser)
- Parse the HTML into an AST
  - Tip: use `DOMParser.parseFromString()` if you’re in a browser environment or linkedom if you’re in a JavaScript engine like NodeJS
- Walk the AST and grab every `<style>` element
  - Each `<style>`’s contents can be added to our CSS as-is
- Walk the tree and grab every `<link rel~="stylesheet">`
  - Grab the `href` from the `<link>`
  - Fetch the `href`’s contents
  - Add the contents to our CSS
- Walk the tree and grab every `[style]` element
  - The `style` contents of `<div style="color: red; margin: 0">` can be taken as-is
  - Make up a selector and rule for the single element (like `div` in this example), or one selector and rule for all elements with inline styles (`inline-styles { color: red, etc. }`)
  - Add the inline CSS to the rule
  - Add the rule(s) to our CSS
- Recursively scrape any CSS `@import`
  - Parse the CSS into an AST
  - Walk the tree and take each `@import` atrule
  - Take the `url()` of the import
  - Download the contents of the URL
  - Add to our CSS
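To make that list a bit more concrete, here is a minimal sketch of the flow, assuming Node 18+ (for the global `fetch`) and linkedom for parsing. The function name is made up, error handling is left out, and the recursive `@import` step is omitted, so treat it as an illustration rather than Project Wallace’s actual implementation.

```js
import { parseHTML } from 'linkedom'

async function scrape_css_from_html(url) {
  let html = await (await fetch(url)).text()
  let { document } = parseHTML(html)
  let css = ''

  // <style> elements: their contents can be added as-is
  for (let style of document.querySelectorAll('style')) {
    css += style.textContent
  }

  // <link rel~="stylesheet">: resolve the href and fetch its contents
  for (let link of document.querySelectorAll('link[rel~="stylesheet"]')) {
    let href = new URL(link.getAttribute('href'), url)
    css += await (await fetch(href)).text()
  }

  // [style] attributes: make up a rule per element with inline styles
  for (let element of document.querySelectorAll('[style]')) {
    let selector = element.tagName.toLowerCase()
    css += `${selector} { ${element.getAttribute('style')} }\n`
  }

  // Recursively resolving @import is left out of this sketch
  return css
}
```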
Pros and cons
| ✅ Pros | ❌ Cons |
|---|---|
| Cheap to run on a server | A lot of work to manage state, timeouts, error handling and data flows |
| Returns the CSS as it was sent to the browser / as authored | Does not easily fit in a bookmarklet |
| Can be run in your browser or other JavaScript runtimes | Does not find `adoptedStyleSheets` or CSS injected with runtime CSS-in-JS |
Option 2: Use CSSOM
The CSSOM is a collection of APIs that can be used to manipulate CSS from JavaScript. Part of this is the `document.styleSheets` property that we can use to grab all the CSS from a webpage. It’s such a small task that I’ll put the entire script here:
CSSOM Example
```js
function scrape_css() {
  let css = ''

  for (let stylesheet of document.styleSheets) { // [1]
    for (let rule of stylesheet.cssRules) { // [2]
      css += rule.cssText // [3]
    }
  }

  return css
}
```
Explanation
1. Go over all the stylesheets of `document.styleSheets`
2. Take the `cssRules` of each `styleSheet`
3. Read the `cssText` property from each `CSSRule` and add it to our `css` string. This sometimes causes Cross Origin issues so you may want to wrap that in a try-catch block (see the sketch below).
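For completeness, here is what that try-catch variant could look like. It is a sketch of the same loop, not code from the post; the function name and the `console.warn` are mine.

```js
function scrape_css_safe() {
  let css = ''

  for (let stylesheet of document.styleSheets) {
    try {
      for (let rule of stylesheet.cssRules) {
        css += rule.cssText
      }
    } catch (error) {
      // Accessing cssRules of a cross-origin stylesheet throws a SecurityError
      console.warn(`Skipped ${stylesheet.href}`, error)
    }
  }

  return css
}
```

Wrapped in `javascript:(() => { /* … */ })()` this also fits nicely in a bookmarklet.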
Pros and cons
| ✅ Pros | ❌ Cons |
|---|---|
| Much simpler than HTML scraping | Requires a browser (‘real’ or headless), making it more expensive than HTML scraping to run on a server |
| Fits in a bookmarklet easily | Does not return the CSS in the format that it was authored in (it changes color notations, etc.) |
| Can be run in your browser or any JavaScript runtime that supports running (headless) browsers | Does not scrape inline styles |
| | Cross Origin errors sometimes happen and are hard to solve |
Option 3: Use CSSCoverage API
Headless browsers and Chromium-based browsers have the CSSCoverage API, which can be used to detect which parts of your CSS are actually used and which parts aren’t. A great API in itself, but we can also use it to find all the CSS.
CSSCoverage Example
```js
import { chromium } from 'playwright' // or 'puppeteer'

async function scrape() {
  let browser = await chromium.launch() // [1a]
  let page = await browser.newPage() // [1b]
  await page.coverage.startCSSCoverage() // [2]
  await page.goto('https://example.com') // [3]
  let coverage = await page.coverage.stopCSSCoverage() // [4]
  await browser.close()

  let css = ''

  for (let entry of coverage) {
    css += entry.text // [5]
  }

  return css
}
```
Explanation
- Create a new browser and page
- Tell the browser to prepare to collect some coverage information. This must be done before going to a URL if you want to know all the CSS on the page after it loads
- Go to the actual URL you want to scrape
- Stop the coverage and collect the report from the browser
- Go over the coverage report and extract the CSS
Pros and cons
| ✅ Pros | ❌ Cons |
|---|---|
| Much simpler than HTML scraping | Requires a browser (‘real’ or headless), making it more expensive than HTML scraping to run on a server |
| Can be run in any JavaScript runtime that supports running (headless) browsers | Does not run in a bookmarklet |
| CSSCoverage can also be collected between opening a page, doing interactions and navigating to other pages (see the sketch below) | |
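As a sketch of that last point: Playwright and Puppeteer accept a `resetOnNavigation` option when starting CSS coverage, so you can keep coverage running while visiting several pages. The URLs and the function name below are placeholders, the interaction step is only hinted at in a comment, and this works in Chromium only.

```js
import { chromium } from 'playwright'

async function scrape_many(urls) {
  let browser = await chromium.launch()
  let page = await browser.newPage()

  // Keep collecting coverage across navigations instead of resetting per page
  await page.coverage.startCSSCoverage({ resetOnNavigation: false })

  for (let url of urls) {
    await page.goto(url)
    // You could also click around or scroll here to trigger more CSS
  }

  let coverage = await page.coverage.stopCSSCoverage()
  await browser.close()

  let css = ''

  for (let entry of coverage) {
    css += entry.text
  }

  return css
}

let css = await scrape_many(['https://example.com', 'https://example.com/about'])
```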
Summary
Each of these methods has its pros and cons, so which one you end up using really depends on your use case.
| | HTML Scraper | CSSOM | CSSCoverage API |
|---|---|---|---|
| Leaves CSS intact | ✅ | ❌ | ✅ |
| Cost to run on server | 💰 | 💰💰 | 💰💰 |
| Complexity | 100 | 10 | 30 |
| Runs in bookmarklet | ✅ (a big bookmarklet) | ✅ | ❌ |
| Scrapes inline styles | ✅ | ❌ | ❌ |
Hope this was helpful. Did I miss anything? Let me know!