Learn how to prevent XSS attacks and remove disallowed HTML nodes from parsed Markdown when using a PHP Markdown parser library such as Parsedown.

How to prevent XSS attacks and disallow specific tags in HTML generated by a PHP Markdown parser

Markdown offers a very user-friendly formatting syntax that accomplishes the same thing as HTML or rich text formatting, and parsing it on the server side is easy with the right tools. However, if you don't handle the HTML generated by your favorite Markdown parser correctly, you will be in trouble as soon as someone discovers that your application is vulnerable to an XSS attack. An XSS (cross-site scripting) attack is a code injection attack in which an attacker gets malicious scripts to run in a website or web application. An XSS vulnerability can only be exploited if the payload (the malicious script) that the attacker inserts gets parsed as HTML in the victim's browser.

You have surely heard of PHP functions such as htmlentities and htmlspecialchars, which encode the characters that have a special meaning in HTML (and of their counterparts that decode them), but handling every case by hand is tedious and error-prone when you are short on time. Ideally your parser should take care of this for you, but parsers also need to stay flexible, so depending on the characteristics of your project the default behavior may not be completely safe. For example, the following markdown:

# Hello World

I am a programmer and I want to write code. I don't write code with bad intentions, I just share my knowledge

```html
<script>
    alert("First alert");
</script>
```

<script>alert("second alert");</script>

<img src="http://url.to.file.which/not.exist" onerror=alert(document.cookie);>

Parsed into HTML with the Markdown parser by cebe and rendered in the browser, this will not show "First alert", but it will show "second alert". That's because the parser is smart enough to convert everything inside a code block into its respective HTML entities automatically. However, the script tag outside of any code block is still interpreted by the browser as JavaScript, which is obviously a problem, and the onerror handler of the img element, which points to a non-existent file, will be triggered too.
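To see why the content of the code block is harmless, here is a minimal sketch of the kind of entity encoding described above, done with PHP's own htmlspecialchars. The $payload variable is only an illustrative stand-in for the script tag of the example, and the parser's internal implementation may well differ:

```php
<?php

// Illustrative payload, mirroring the script tag from the example
$payload = '<script>alert("second alert");</script>';

// Encode the characters that have a special meaning in HTML
$escaped = htmlspecialchars($payload, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Prints: &lt;script&gt;alert(&quot;second alert&quot;);&lt;/script&gt;
echo $escaped, PHP_EOL;
```

The browser displays the encoded string as plain text instead of executing it.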

In this article we'll show you how to prevent JavaScript from being injected through Markdown.

Solution with Parsedown

Parsedown makes things really easy for you, because it can escape all the markup inside the Markdown that you ask it to parse. However, this isn't enabled by default, and it isn't enough to prevent every kind of XSS attack. For this reason, if you want a secure version of Parsedown, we recommend the secureparsedown package. The changes made by Aidan Woods to the original Parsedown library provide a safe mode that protects the generated HTML against XSS.

At the time of writing, version 1.7.0 of Parsedown, which includes Aidan's changes, has not been published yet, so for now you can get the secure version of Parsedown from the following package:

composer require aidantwoods/secureparsedown

The advantage of this package is that it builds on the latest version of Parsedown and adds the safe mode on top of it. If you prefer, add the dependency manually to your composer.json file and then run composer install:

{
    "require": {
        "aidantwoods/secureparsedown": "^1.0"
    }
}

Once installed, use Parsedown as you usually do, and don't forget to call the setMarkupEscaped and setSafeMode methods so that the resulting HTML is safe to render:

<?php

// Composer's autoloader (adjust the path to your project)
require __DIR__ . '/vendor/autoload.php';

use Aidantwoods\SecureParsedown\SecureParsedown;

$markdown = "# Title <script>alert('XSS Attack ...');</script>";

$Parsedown = new SecureParsedown;

// Escape any raw HTML found in the markdown instead of passing it through
$Parsedown->setMarkupEscaped(true);

// Enable safe mode on top of the escaping
$Parsedown->setSafeMode(true);

// Secure HTML
echo $Parsedown->text($markdown);

This comes in handy because any text inside the Markdown that would be interpreted as HTML is converted to its harmless HTML entity representation, which prevents XSS attacks. However, this feature is only available with this library. If you only want to prevent some specific tags from appearing, use the solution for other parser libraries as well.
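Escaping markup alone does not cover every case because a Markdown link such as [click me](javascript:alert(1)) contains no HTML at all, yet it still produces a dangerous href. The following is a minimal sketch of a scheme allowlist that illustrates the idea. The function name hasAllowedScheme and the list of schemes are assumptions made for this example, it is not how any particular library implements its safe mode, and it is not a complete filter:

```php
<?php

// Hypothetical helper: accepts relative URLs and URLs whose scheme is on the allowlist
function hasAllowedScheme(string $url, array $allowed = ['http', 'https', 'mailto']): bool
{
    // Remove whitespace and control characters, which browsers tolerate inside a scheme
    $url = preg_replace('/[\x00-\x20]+/', '', $url);

    // No scheme at all: a relative URL such as "/docs" or "#top"
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $match)) {
        return true;
    }

    // Compare the scheme case-insensitively against the allowlist
    return in_array(strtolower($match[1]), $allowed, true);
}

var_dump(hasAllowedScheme('https://example.com'));   // bool(true)
var_dump(hasAllowedScheme('JavaScript:alert(1)'));   // bool(false)
```

In a real project, prefer the safe mode of your parser or a dedicated sanitizer over a hand-written check like this one.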

Solution with other parser libraries

If you are using another Markdown parser in PHP, it probably won't offer the same feature as Parsedown. In this case, you will need to install the HTML Purifier library. HTML Purifier is a standards-compliant HTML filter library written in PHP. It not only removes all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it also makes sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

To install HTML Purifier in your project with Composer, run the following command in your terminal:

composer require "ezyang/htmlpurifier":"dev-master"

Once the library has been installed in your project with Composer, it is easy to use. The easiest way to remove script tags from the HTML that your parser generates from the Markdown is to forbid specific tags with the HTML.ForbiddenElements option (where every tag is an item of the array passed as the second argument):

<?php 

// The source of your markdown
$markdown = "# Title <script>alert('Super unsafe markdown ...');</script>";

// Create an instance of your parser if available ...
$parser = new YourFavoriteMarkdownParserExample();

// This contains the generated HTML from your markdown
$parsedMarkdown = $parser->parse($markdown);

// Initialize a config object of html purifier
$config = HTMLPurifier_Config::createDefault();

// The HTML nodes that you want to prevent from being rendered
// as second argument within an array
$config->set('HTML.ForbiddenElements', array('script','applet'));

// Initialize html purifier
$purifier = new HTMLPurifier($config);

// Purify the generated HTML and
// Use this safe HTML to display in the browser !
$HTMLWithoutForbiddenTags = $purifier->purify($parsedMarkdown);

Alternatively, if you want total control over what gets rendered, you can define exactly which HTML tags and which attributes are allowed:

<?php 

// The source of your markdown
$markdown = "# Title <script>alert('Super unsafe markdown ...');</script>";

$parser = new YourFavoriteMarkdownParserExample();
        
$config = \HTMLPurifier_Config::createDefault();

// Allow Text without tag e.g P or DIV (plain text, obviously necessary for markdown)
$config->set('Core.LexerImpl', 'DirectLex');

// Define manually which elements can be rendered
// In this example, we allow (almost) all the basic elements that are converted with markdown
$config->set('HTML.Allowed', 'h1,h2,h3,h4,h5,h6,br,b,i,strong,em,a,pre,code,img,tt,div,ins,del,sup,sub,p,ol,ul,table,thead,tbody,tfoot,blockquote,dl,dt,dd,kbd,q,samp,var,hr,li,tr,td,th,s,strike');
// The attributes are up to you
$config->set('HTML.AllowedAttributes', 'img.src,*.style,*.class, code.class,a.href,*.target');

// Create an instance of the purifier with the configuration
$purifier = new \HTMLPurifier($config);

// Print the purified HTML
echo $purifier->purify($parser->parse($markdown));
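A long allowlist is easier to review when it is kept as arrays and only joined into the comma-separated strings at configuration time. The following is a small sketch of that idea. The shortened element and attribute lists are hypothetical and should be adapted to what your own Markdown parser produces:

```php
<?php

// Elements the application is willing to render (shortened, hypothetical list)
$allowedElements = ['h1', 'h2', 'h3', 'p', 'a', 'em', 'strong', 'code', 'pre', 'ul', 'ol', 'li', 'blockquote', 'img'];

// Attributes, written as element.attribute
$allowedAttributes = ['a.href', 'img.src', 'img.alt', 'code.class'];

// Join the arrays into the comma-separated form used by the directives
$allowedElementsDirective = implode(',', $allowedElements);
$allowedAttributesDirective = implode(',', $allowedAttributes);

// These strings would then be passed to the configuration, e.g.
// $config->set('HTML.Allowed', $allowedElementsDirective);
// $config->set('HTML.AllowedAttributes', $allowedAttributesDirective);
echo $allowedElementsDirective, PHP_EOL;
echo $allowedAttributesDirective, PHP_EOL;
```

Keeping one entry per array item also makes it obvious in a code review when someone adds a risky element or attribute.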

Happy coding!


Senior Software Engineer at Software Medico. Interested in programming since he was 14 years old, Carlos is a self-taught programmer and founder and author of most of the articles at Our Code World.
