How to clone a website (download HTML, CSS, JavaScript, fonts and images) using Website Scraper in Node.js

Learn how to download all the resources of a website to create a fully functional local copy of it in Node.js.

How many times, as frontend developers, have we decided to implement in our own website a copy of some feature that a third party website has? I do this a lot, especially when there's no open source alternative for the feature and I don't want to write it from scratch, because it would take a lot of time to end up with something decent. The easiest way to do this is by reading the source code of the web page (using Ctrl + U in Chrome, for example) and reading its JavaScript files as well, as long as they're not minified (in websites that sell templates, for example).

The only uncomfortable part is the code highlighting, as you can't compare the syntax highlighting offered by the browser with the one offered by your favorite editor or IDE, like Visual Studio Code, NetBeans etc. So, it would be great if you could download a copy of the code and resources of the website to manipulate it locally, right? Thanks to a pretty useful script that uses Puppeteer, this can be done within minutes (and within seconds once it is set up). In this article, we will explain how to easily implement your own website cloner with Node.js.

1. Install Website Scraper Puppeteer

One of the most valuable advantages of using a script that is based on Puppeteer, a library that drives a headless Chromium browser, is that you will not only be able to copy static websites that use plain JavaScript or jQuery, but you will also be able to download the content and resources generated by dynamic pages that use React or Angular. Install website-scraper together with the website-scraper-puppeteer plugin using npm in your terminal:

npm install website-scraper website-scraper-puppeteer

For more information about this project, please visit the official repository on GitHub. Basically, this plugin starts Chromium in headless mode, opens the page and waits until the entire page is loaded. This is far from ideal, because you may need to wait until some specific resource is loaded, click a button or log in first.

2. Create download script

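After installing the plugin in your working directory with npm, you just need to create a JavaScript file with the code that will download the website. Create the index.js file and place the following code inside. Note that this example loads the packages with require, so it assumes releases of both packages that still support CommonJS (as far as we know, their newer major versions are published as ES modules only and need import instead):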

// index.js
const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');
const path = require('path');

scrape({
    // Provide the URL(s) of the website(s) that you want to clone
    // In this example, you can clone the Our Code World website
    urls: ['https://ourcodeworld.com/'],
    // Specify the path where the content should be saved
    // In this case, in the current directory inside the ourcodeworld dir
    directory: path.resolve(__dirname, 'ourcodeworld'),
    // Load the Puppeteer plugin
    plugins: [ 
        new PuppeteerPlugin({
            launchOptions: { 
                // If you set this to false, the browser window will show up on screen
                headless: true
            }, /* optional */
            scrollToBottom: {
                timeout: 10000, 
                viewportN: 10 
            } /* optional */
        })
    ]
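})
    // scrape returns a promise, so report success or failure explicitly
    .then(() => console.log('Website cloned successfully'))
    .catch((err) => console.error('Cloning failed:', err));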

The previous code imports the installed libraries and the Node.js path helper (to create an absolute path based on the current directory). We call the scrape method, providing as first argument an object with the configuration required to start cloning the website. The most important options are the following. The urls property expects an array of strings, where every item is the URL of a page of the website that you want to clone. The directory option corresponds to the local directory path where the website content should be placed. The plugins option loads the Puppeteer plugin for the regular scraper, in order to properly clone dynamic websites.
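Besides these options, you may want to limit which of the discovered URLs get downloaded. The following is only a minimal sketch, assuming the urlFilter option of website-scraper (a function that receives a URL and returns true when it should be downloaded). The sameHostFilter helper is a hypothetical name introduced for this example; it is not part of either library:

```javascript
// Hypothetical helper (not part of website-scraper): builds a filter that
// only accepts absolute URLs served from the same host as the cloned site.
function sameHostFilter(siteUrl) {
    const host = new URL(siteUrl).hostname;

    return function (url) {
        try {
            return new URL(url).hostname === host;
        } catch (err) {
            // The value is not an absolute URL that can be parsed: skip it
            return false;
        }
    };
}

// These options could be merged into the object passed to scrape()
const filterOptions = {
    urls: ['https://ourcodeworld.com/'],
    // Assumption: urlFilter is called with every URL found by the scraper
    urlFilter: sameHostFilter('https://ourcodeworld.com/')
};

console.log(filterOptions.urlFilter('https://ourcodeworld.com/articles')); // true
console.log(filterOptions.urlFilter('https://www.google-analytics.com/analytics.js')); // false
```

Keep in mind that such a strict filter also skips resources served from a CDN or another subdomain (fonts or images, for example), so you may need to loosen the comparison for your target website.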

3. Run script

Finally, open up your terminal, switch to the directory of the script you've just written and execute it:

node index.js

This will clone the desired website, in this case Our Code World. Once it finishes, you will find, in the same directory as the script, the new directory with all the HTML, CSS, JavaScript, fonts and images of the website. For example, cloning Our Code World will generate a structure similar to:

./
├── CK7DK53I.json
├── CK7I4KJM;CK7IP27L
├── css/
│   ├── bootstrap.css
│   ├── cookieconsent.min.css
│   ├── custom.css
│   ├── font-awesome.min.css
│   ├── magnific-popup.css
│   ├── simple-line-icons.css
│   ├── slick.css
│   └── style-dark.css
├── favicon.ico
├── fonts/
│   ├── Simple-Line-Icons.eot
│   ├── Simple-Line-Icons.svg
│   ├── Simple-Line-Icons.ttf
│   ├── Simple-Line-Icons.woff
│   ├── Simple-Line-Icons.woff2
│   ├── fontawesome-webfont.eot
│   ├── fontawesome-webfont.svg
│   ├── fontawesome-webfont.ttf
│   ├── fontawesome-webfont.woff
│   ├── fontawesome-webfont.woff2
│   ├── fontawesome-webfont_1.eot
│   └── sprite.svg
├── images/
│   ├── articleocw-5c5a2906da73d.jpg
│   ├── articleocw-5c8fb0eab08af.png
│   ├── articleocw-5c8fe4b534e04.jpg
│   ├── articleocw-5cb14f9ea4cfa.png
│   ├── articleocw-5cb1f4b2bd76b.png
│   ├── articleocw-5cdc8d904e6b9.jpg
│   ├── articleocw-5cdde93a04430.png
│   ├── articleocw-5ce040e91c1a8.png
│   ├── articleocw-5d040d2ec3975.png
│   ├── articleocw-5d200b5b06504.png
│   ├── articleocw-5d45c528f0103.webp
│   ├── articleocw-5d69328bac9a2.webp
│   ├── articleocw-5da07b41aa587.png
│   ├── articleocw-5db79e7faa2c5.webp
│   ├── articleocw-5de93b3040ac4.webp
│   ├── articleocw-5e1df41d2e35b.webp
│   ├── articleocw-5e3caa198aab8.webp
│   ├── articleocw-5e3d7b2a01256.webp
│   ├── articleocw-5e3dbabbcff04.webp
│   ├── articleocw-5e3dd0faa3106.webp
│   ├── articleocw-5e4162f1d2db6.webp
│   ├── articleocw-5e418ee7e81b4.png
│   ├── bestfreehtmlcsstemplates_banner_quad.png
│   ├── graph-bg.png
│   ├── hero-slide-1.jpg
│   ├── hero-slide-2.jpg
│   ├── hero-slide-3.jpg
│   ├── home_bg.jpg
│   ├── jobble.png
│   ├── login_register_bg.jpg
│   ├── login_register_bg_1.jpg
│   ├── main-news-banner__bg.jpg
│   ├── mini_ads.png
│   ├── mini_bestfreehtmlcsstemplates.png
│   ├── mini_wrapbootstrap.png
│   ├── ocw_logo_255.png
│   ├── page-heading-pattern.png
│   ├── page-heading.jpg
│   ├── rosterv3_player_01-bg.png
│   ├── single-post-img5.jpg
│   └── team-roster-slider-bg.jpg
├── index.html
├── js/
│   ├── adsbygoogle.js
│   ├── analytics.js
│   ├── bootstrap.bundle.js
│   ├── bsa.js
│   ├── cookie.js
│   ├── cookieconsent.min.js
│   ├── core.js
│   ├── custom.js
│   ├── init.js
│   ├── integrator.js
│   ├── integrator_1.js
│   ├── jquery.min.js
│   ├── monetization.js
│   ├── osd.js
│   ├── pro.js
│   ├── s_83085e49dfeedee6628ee5a7d7cd7359.js
│   └── show_ads_impl_fy2019.js
├── js_1/
├── raw_83a93c31c68198a3762e2237ff33e529.html
├── raw_b54f5852f835e7a023fcacceb1b6473c.html
└── zrt_lookup.html
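If you run the script a second time, keep in mind that, as far as we know, website-scraper expects the output directory not to exist yet and fails otherwise. A possible way to handle this is to delete the previous clone before scraping again. The following sketch assumes Node.js 14.14 or newer (for fs.rmSync), and resetOutputDirectory is a hypothetical helper name used only for this example:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: removes a previous clone (if any) so that the
// scraper can create the directory again from scratch.
function resetOutputDirectory(directory) {
    // force: true avoids an error when the directory does not exist yet
    fs.rmSync(directory, { recursive: true, force: true });

    return directory;
}

// Possible usage in index.js, before calling scrape():
// const directory = resetOutputDirectory(path.resolve(__dirname, 'ourcodeworld'));
```

Be careful with the path that you pass to this helper, as everything inside that directory is deleted without confirmation.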

Be sure to remove all the JavaScript related to ads, videos and analytics that you may find in some websites, as these scripts usually throw exceptions when they run in the cloned website.
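This cleanup can be done by hand, but it could also be automated. The following is only a rough sketch based on a regular expression, not a full HTML parser: it removes the external script tags whose src contains one of the given file names. The removeBlockedScripts helper is hypothetical, and the file names are taken from the js directory of the structure shown in this article:

```javascript
// Hypothetical helper: removes <script src="..."> tags whose src contains
// one of the blocked file names. Inline scripts (without src) are kept.
function removeBlockedScripts(html, blockedNames) {
    const scriptTag = /<script\b[^>]*\bsrc=["']([^"']*)["'][^>]*>[\s\S]*?<\/script>\s*/gi;

    return html.replace(scriptTag, (tag, src) =>
        blockedNames.some((name) => src.includes(name)) ? '' : tag
    );
}

const sampleHtml = [
    '<script src="js/jquery.min.js"></script>',
    '<script async src="js/adsbygoogle.js"></script>',
    '<script src="js/analytics.js"></script>'
].join('\n');

// Only the jQuery script tag should remain in the output
console.log(removeBlockedScripts(sampleHtml, ['adsbygoogle.js', 'analytics.js']));
```

In a real clone, you would read index.html with the fs module, pass its content through the helper and write the result back to the file.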

Happy coding!