How to clone a website (download HTML, CSS, JavaScript, Fonts and Images) using Website Scraper in Node.js

How many times, as frontend developers, have we decided to replicate on our own website a feature that we saw on a third-party website? I do this a lot, especially when there's no open source alternative for the feature and I don't want to write it from scratch, because it would take a long time to end up with something decent. The easiest way to do this is to read the source code of the web page (using Ctrl + U in Chrome, for example) along with its JavaScript files, as long as they're not minified (as on websites that sell templates, for example).

This is only uncomfortable when it comes to code highlighting, as the syntax highlighting offered by the browser can't compare with the one offered by your favorite editor or IDE, such as Visual Studio Code or NetBeans. So it would be great if you could download a copy of the code and resources of the website to work on it locally, right? Thanks to a pretty useful library that uses Puppeteer, this can be done within minutes (and within seconds once the script is in place). In this article, we will explain how to easily implement your own website cloner with Node.js.

1. Install Website Scraper Puppeteer

One of the most valuable advantages of using a script based on Puppeteer, a Node.js library that controls a headless version of Chromium, is that you will not only be able to copy static websites that use plain JavaScript or jQuery, but you will also be able to download the content and resources generated by dynamic pages that use React or Angular. Install the website-scraper library and its website-scraper-puppeteer plugin using npm in your terminal:

npm install website-scraper website-scraper-puppeteer

For more information about this project, please visit the official repository on GitHub. The plugin basically starts Chromium in headless mode, opens the page and waits until the entire page is loaded. This is far from ideal, because you will probably need to wait until some specific resource is loaded, click some button or log in.

2. Create download script

After installing the plugin in your working directory with npm, you just need to create a JavaScript file with the code that will download a website. Create the index.js file and place the following code inside:

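// index.js
// Note: require() assumes a CommonJS release of these packages (website-scraper v4);
// newer major versions are published as ES modules and must be loaded with import
const scrape = require('website-scraper');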
const PuppeteerPlugin = require('website-scraper-puppeteer');
const path = require('path');

scrape({
    // Provide the URL(s) of the website(s) that you want to clone
    // In this example, you can clone the Our Code World website
    urls: ['https://ourcodeworld.com/'],
    // Specify the path where the content should be saved
    // In this case, in the current directory inside the ourcodeworld dir
    directory: path.resolve(__dirname, 'ourcodeworld'),
    // Load the Puppeteer plugin
    plugins: [ 
        new PuppeteerPlugin({
            launchOptions: { 
                // If you set this to false, the browser window will show up on screen
                headless: true
            }, /* optional */
            scrollToBottom: {
                timeout: 10000, 
                viewportN: 10 
            } /* optional */
        })
    ]
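}).then(() => {
    // scrape() returns a promise that resolves once the download has finished
    console.log('Website successfully cloned');
}).catch((err) => {
    console.error('An error occurred while cloning the website:', err);
});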

The previous code imports the installed libraries and the Node.js path helper (to build an absolute path from the current directory). We call the scrape function, providing as its first argument an object with the configuration required to start cloning the website. The most important option is urls, which expects an array of strings where every item is the URL of a page that you want to clone. The directory option is the local directory path where the website content should be placed. The plugins option loads the Puppeteer plugin for the regular scraper in order to properly clone dynamic websites.
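
As a rough sketch of how this configuration could be extended, the helper below builds an options object that also limits the download to the site's own origin. The name buildCloneOptions and its parameters are hypothetical and only used for illustration. The recursive, maxRecursiveDepth and urlFilter options are documented by website-scraper, but you should check them against the version that you have installed. The helper also refuses to reuse an existing directory, because website-scraper expects to create the target directory itself:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of website-scraper): builds the options
// object that would be passed to scrape().
function buildCloneOptions(siteUrl, targetDir, plugins = []) {
    const origin = new URL(siteUrl).origin;
    const directory = path.resolve(targetDir);

    // website-scraper creates the directory itself and expects it not to exist
    if (fs.existsSync(directory)) {
        throw new Error(`Target directory already exists: ${directory}`);
    }

    return {
        urls: [siteUrl],
        directory,
        // Follow links found in the pages, but only one level deep
        recursive: true,
        maxRecursiveDepth: 1,
        // Skip resources hosted on other origins (ads, trackers, CDNs...)
        urlFilter: (url) => {
            try {
                return new URL(url).origin === origin;
            } catch (err) {
                // Not an absolute URL, so it is skipped as well
                return false;
            }
        },
        plugins
    };
}
```

It could then be used as scrape(buildCloneOptions('https://ourcodeworld.com/', path.resolve(__dirname, 'ourcodeworld'), [new PuppeteerPlugin({ launchOptions: { headless: true } })])). Keep in mind that filtering by origin also skips fonts or stylesheets served from a CDN, so relax the filter if the cloned page looks incomplete.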

3. Run script

Finally, open your terminal, switch to the directory of the script you've just written and execute it:

node index.js

This will clone the desired website, in this case Our Code World. Once it finishes, you will find the new directory, with all the JavaScript, HTML and CSS of the website, in the same directory as the script. For example, cloning Our Code World will generate a structure similar to:

./
├── CK7DK53I.json
├── CK7I4KJM;CK7IP27L
├── css/
│   ├── bootstrap.css
│   ├── cookieconsent.min.css       
│   ├── custom.css
│   ├── font-awesome.min.css        
│   ├── magnific-popup.css
│   ├── simple-line-icons.css       
│   ├── slick.css
│   └── style-dark.css
├── favicon.ico
├── fonts/
│   ├── Simple-Line-Icons.eot       
│   ├── Simple-Line-Icons.svg       
│   ├── Simple-Line-Icons.ttf       
│   ├── Simple-Line-Icons.woff      
│   ├── Simple-Line-Icons.woff2     
│   ├── fontawesome-webfont.eot     
│   ├── fontawesome-webfont.svg     
│   ├── fontawesome-webfont.ttf     
│   ├── fontawesome-webfont.woff    
│   ├── fontawesome-webfont.woff2   
│   ├── fontawesome-webfont_1.eot   
│   └── sprite.svg
├── images/
│   ├── articleocw-5c5a2906da73d.jpg
│   ├── articleocw-5c8fb0eab08af.png
│   ├── articleocw-5c8fe4b534e04.jpg
│   ├── articleocw-5cb14f9ea4cfa.png
│   ├── articleocw-5cb1f4b2bd76b.png
│   ├── articleocw-5cdc8d904e6b9.jpg
│   ├── articleocw-5cdde93a04430.png
│   ├── articleocw-5ce040e91c1a8.png
│   ├── articleocw-5d040d2ec3975.png
│   ├── articleocw-5d200b5b06504.png
│   ├── articleocw-5d45c528f0103.webp
│   ├── articleocw-5d69328bac9a2.webp
│   ├── articleocw-5da07b41aa587.png
│   ├── articleocw-5db79e7faa2c5.webp
│   ├── articleocw-5de93b3040ac4.webp
│   ├── articleocw-5e1df41d2e35b.webp
│   ├── articleocw-5e3caa198aab8.webp
│   ├── articleocw-5e3d7b2a01256.webp
│   ├── articleocw-5e3dbabbcff04.webp
│   ├── articleocw-5e3dd0faa3106.webp
│   ├── articleocw-5e4162f1d2db6.webp
│   ├── articleocw-5e418ee7e81b4.png
│   ├── bestfreehtmlcsstemplates_banner_quad.png
│   ├── graph-bg.png
│   ├── hero-slide-1.jpg
│   ├── hero-slide-2.jpg
│   ├── hero-slide-3.jpg
│   ├── home_bg.jpg
│   ├── jobble.png
│   ├── login_register_bg.jpg
│   ├── login_register_bg_1.jpg
│   ├── main-news-banner__bg.jpg
│   ├── mini_ads.png
│   ├── mini_bestfreehtmlcsstemplates.png
│   ├── mini_wrapbootstrap.png
│   ├── ocw_logo_255.png
│   ├── page-heading-pattern.png
│   ├── page-heading.jpg
│   ├── rosterv3_player_01-bg.png
│   ├── single-post-img5.jpg
│   └── team-roster-slider-bg.jpg
├── index.html
├── js/
│   ├── adsbygoogle.js
│   ├── analytics.js
│   ├── bootstrap.bundle.js
│   ├── bsa.js
│   ├── cookie.js
│   ├── cookieconsent.min.js
│   ├── core.js
│   ├── custom.js
│   ├── init.js
│   ├── integrator.js
│   ├── integrator_1.js
│   ├── jquery.min.js
│   ├── monetization.js
│   ├── osd.js
│   ├── pro.js
│   ├── s_83085e49dfeedee6628ee5a7d7cd7359.js
│   └── show_ads_impl_fy2019.js
├── js_1/
├── raw_83a93c31c68198a3762e2237ff33e529.html
├── raw_b54f5852f835e7a023fcacceb1b6473c.html
└── zrt_lookup.html
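
If you want a quick overview of what was downloaded without browsing the whole tree, a small script along these lines could help. This is only a sketch: countByExtension is a hypothetical helper that relies solely on the Node.js fs and path modules, and it counts the files of a directory grouped by extension:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: walks a directory recursively and counts its files,
// grouped by (lowercased) file extension.
function countByExtension(dir, counts = {}) {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
        const fullPath = path.join(dir, entry.name);
        if (entry.isDirectory()) {
            countByExtension(fullPath, counts);
        } else {
            // Files without an extension are grouped under "(none)"
            const ext = path.extname(entry.name).toLowerCase() || '(none)';
            counts[ext] = (counts[ext] || 0) + 1;
        }
    }
    return counts;
}
```

For the example of this article, you could call console.table(countByExtension(path.resolve(__dirname, 'ourcodeworld'))) once the cloning has finished.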

Be sure to remove the JavaScript for ads, videos and analytics that you may find in some websites, as those scripts may throw exceptions when they run in the cloned website.
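
One possible way to automate this cleanup is sketched below. The stripScripts function is a hypothetical helper, and the list of blocked names is an assumption based on the file names of the example structure. It uses a regular expression instead of a real HTML parser, which is usually enough for a quick local cleanup, and it only removes external scripts (those with a src attribute), so inline snippets have to be removed by hand:

```javascript
// Hypothetical helper: removes <script> tags whose src attribute contains
// any of the given substrings. Tags without a src attribute are left as is.
function stripScripts(html, blockedNames) {
    const scriptTag = /<script\b[^>]*\bsrc\s*=\s*["']([^"']*)["'][^>]*>[\s\S]*?<\/script>/gi;

    return html.replace(scriptTag, (tag, src) =>
        blockedNames.some((name) => src.includes(name)) ? '' : tag
    );
}
```

To apply it to the cloned page, read ourcodeworld/index.html with fs.readFileSync, pass its content to stripScripts together with a list such as ['adsbygoogle', 'analytics', 'show_ads'], and write the result back with fs.writeFileSync.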

Happy coding!
