How to keep UTF-8 special characters and Emojis in your HTML manipulated by PHP's class DOMDocument

How to keep UTF-8 special characters and Emojis in your HTML manipulated by PHP's class DOMDocument

A couple of months ago, I had to modify dynamically every article of the blog to include the lazy loading feature in images of articles that were written a couple of years ago. I used the hardly criticized DOMDocument class of PHP to do the job and ended up discovering why it's so hated. Besides not parsing correctly HTML5 elements, you will have a problem with the character encoding in most of the cases with special characters and emojis, for example, check the following PHP code that should simply parse the given HTML and print it again, without doing absolutely nothing:

<?php

$html = <<<EOF
<html>
    <body>
        <p>Entwickeln Sie mit Vergnügen.</p>
        <p>
            Герой Лентул
            любит лежебочить;
            Зато ни в чём другом нельзя его порочить:
            Не зол, не сварлив он, отдать последне рад
            И если бы не лень, в мужьях он был бы клад;
            Приветлив и учтив, при том и не невежа,
            Рад сделать всё добро, да только бы лишь лежа.
        </p>
        <pre><code class="language-treeview">
            C:.
            └───Depix
                ├───depixlib
                │   └───__pycache__
                ├───docs
                │   └───img
                ├───example
                └───images
                    ├───searchimages
                    └───testimages
        </code></pre>
        <p>Happy coding ❤️!</p>
    </body>
</html>
EOF;

// Create a DOMDocument instance
$doc = new DOMDocument();

// Load your plain HTML
$doc->loadHTML($html);

// Imagine that you modify some stuff of the HTML and so on ...
// And finally decide to export it using saveHTML
$finalHTML = $doc->saveHTML();

// Print the output in the browser
echo $finalHTML;

The content printed by the last line, extracted from the saveHTML method of the DOMDocument will mess up the output in the document, just like the first image of the article. If you test it as well, it won't look at all like the HTML we provided at the beginning, isn't it? In this article, I will show you some options that you may use to export properly the parsed HTML by the DOMDocument class of PHP.

A. utf8_decode the documentElement

In this approach, you will use the same code as usual, however, be sure to pass as the first argument of the saveHTML method, the current document model, and use utf8_decode to convert the string with ISO-8859-1 characters encoded with UTF-8:

<?php

$html = <<<EOF
<html>
    <body>
        <p>Entwickeln Sie mit Vergnügen.</p>
        <p>
            Герой Лентул
            любит лежебочить;
            Зато ни в чём другом нельзя его порочить:
            Не зол, не сварлив он, отдать последне рад
            И если бы не лень, в мужьях он был бы клад;
            Приветлив и учтив, при том и не невежа,
            Рад сделать всё добро, да только бы лишь лежа.
        </p>
        <pre><code class="language-treeview">
            C:.
            └───Depix
                ├───depixlib
                │   └───__pycache__
                ├───docs
                │   └───img
                ├───example
                └───images
                    ├───searchimages
                    └───testimages
        </code></pre>
        <p>Happy coding ❤️!</p>
    </body>
</html>
EOF;

// Create a DOMDocument instance
$doc = new DOMDocument();

// Load your plain HTML
$doc->loadHTML($html);

// utf8_decode converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1
// We will provide as first argument of the saveHTML method, the current documentElement
// Imagine that you modify some stuff of the HTML and so on ...
// And finally decide to export it using saveHTML
$finalHTML = utf8_decode($doc->saveHTML($doc->documentElement));

// Print the output in the browser!
echo $finalHTML;

B. Use the SmartDocument hack

The SmartDocument class is a class that overcomes a few common annoyances with the DOMDocument class of PHP. The source code of the project can be found in this repository at Github. You won't need to include the class in your document, but just as the code of the SmartDocument specifies while loading the HTML, before loading it into the DOMDocument, the HTML encoding needs to be changed using mb_convert_encoding just like we'll do in the following example:

<?php

$html = <<<EOF
<html>
    <body>
        <p>Entwickeln Sie mit Vergnügen.</p>
        <p>
            Герой Лентул
            любит лежебочить;
            Зато ни в чём другом нельзя его порочить:
            Не зол, не сварлив он, отдать последне рад
            И если бы не лень, в мужьях он был бы клад;
            Приветлив и учтив, при том и не невежа,
            Рад сделать всё добро, да только бы лишь лежа.
        </p>
        <pre><code class="language-treeview">
            C:.
            └───Depix
                ├───depixlib
                │   └───__pycache__
                ├───docs
                │   └───img
                ├───example
                └───images
                    ├───searchimages
                    └───testimages
        </code></pre>
        <p>Happy coding ❤️!</p>
    </body>
</html>
EOF;

// Create a DOMDocument instance
$doc = new DOMDocument();

// Load your plain HTML, however with the 
// Converts the character encoding of $html to HTML-ENTITIES from optionally UTF-8. 
$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

// Imagine that you modify some stuff of the HTML and so on ...
// And finally decide to export it using saveHTML
$finalHTML = $doc->saveHTML();

// Print the output in the browser!
echo $finalHTML;

Happy coding ❤️!

This could interest you

Become a more social person