How to convert PDF to Text (extract text from PDF) with PHP in Symfony 3

If your system works with Portable Document Format (PDF) files, your users may want to extract all the text from a PDF. Instead of making them select the text with the mouse and copy it by hand, you can automate the extraction. This can be done in the browser with JavaScript, but if you'd rather not put that work on the client because you care about the user experience, you can do it on the server side.

In this article you will learn how to extract the text from a PDF on the server side with PHP in your Symfony 3 project using the PDF Parser library. There are other libraries that can help you extract the text, like pdf-to-text by @spatie, which works like a charm too. However, PDF Parser is the better way to proceed here, as it's very easy to install and use and doesn't have any software dependency. The pdf-to-text library by spatie is a wrapper for the pdftotext utility, so you would need to install pdftotext on your machine to use it.

Let's get started!

1. Install PDF Parser

PDF Parser is a standalone PHP library that provides various tools to extract data from a PDF file. Some of its features are:

  • Load/parse objects and headers
  • Extract metadata (author, description, ...)
  • Extract text from ordered pages
  • Support of compressed PDFs
  • Support of Mac OS Roman charset encoding
  • Handling of hexadecimal and octal encoding in text sections
  • PSR-0 compliant (autoloader)
  • PSR-1 compliant (code styling)

You can even test how the library works on this page. The only limitation of this parser is that it can't handle secured documents.

The preferred way to install this library is via Composer. Open a new terminal, switch to the directory of your project and execute the following command on it:

composer require smalot/pdfparser

If you don't like to install new libraries directly with the terminal on your project, you can still modify the composer.json file and add the dependency manually:

{
    "require": {
        "smalot/pdfparser": "*"
    }
}

Save the changes and then execute composer install in your terminal. Once the installation finishes, you will be able to extract the text from a PDF easily.

If you need more information about the PDF Parser library, please visit the official repository on GitHub or the project's website.

2. Extracting the text

Extracting text with PDF Parser is pretty easy. Create an instance of the Smalot\PdfParser\Parser class and load the PDF file from its absolute or relative path with the parseFile method. Store the parsed document in a variable, as this object lets you work with the PDF page by page. You can extract all the text from the entire PDF at once or separately for each page.

Check out the following examples:

Note

As we are working with Symfony, you can retrieve the path of the /web folder of the project with $this->get('kernel')->getRootDir() . '/../web', as long as you are within a controller.
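If the name of the PDF comes from outside your code (a route parameter, for example), it may be worth building the path in one place. The following is a minimal sketch, assuming a hypothetical helper named buildWebPdfPath that is not part of Symfony or PDF Parser. It keeps only the file name so the resulting path stays inside the /web folder:

```php
<?php

/**
 * Build the path to a PDF stored in the /web folder of the project.
 *
 * Hypothetical helper for illustration only.
 * $rootDir is the value returned by $this->get('kernel')->getRootDir().
 */
function buildWebPdfPath($rootDir, $fileName)
{
    // basename() drops any directory part, so "../../secret.pdf" becomes "secret.pdf"
    $safeName = basename($fileName);

    // Same layout as in the note above: <root dir>/../web/<file>
    return rtrim($rootDir, '/') . '/../web/' . $safeName;
}
```

Within a controller you could call it as buildWebPdfPath($this->get('kernel')->getRootDir(), 'example.pdf').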

Extract all the text from all pages

You can extract all the text from a PDF using the getText method available in the PDF instance:

<?php

namespace AppBundle\Controller;

use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

/**
 * Import the PDF Parser class
 */
use Smalot\PdfParser\Parser;


class DefaultController extends Controller
{
    /**
     * @Route("/", name="homepage")
     */
    public function indexAction(Request $request)
    {
        // The relative or absolute path to the PDF file
        $pdfFilePath = $this->get('kernel')->getRootDir() . '/../web/example.pdf';

        // Create an instance of the PDFParser
        $PDFParser = new Parser();

        // Create an instance of the PDF with the parseFile method of the parser
        // this method expects as first argument the path to the PDF file
        $pdf = $PDFParser->parseFile($pdfFilePath);
        
        // Extract ALL text with the getText method
        $text = $pdf->getText();

        // Send the text as response in the controller
        return new Response($text);
    }
}
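The controller above assumes that the file exists and can be parsed. As mentioned earlier, the parser can't handle secured documents, so in a real project you may want to guard the call. The following is a minimal sketch, assuming a hypothetical helper named extractTextOrNull and assuming that parsing failures surface as exceptions thrown by parseFile:

```php
<?php

/**
 * Extract the text of a PDF, or return null when it cannot be read or parsed.
 *
 * Hypothetical helper for illustration only. $parser is expected to be a
 * Smalot\PdfParser\Parser instance (any object with a compatible
 * parseFile method works as well).
 */
function extractTextOrNull($parser, $pdfFilePath)
{
    // Fail early on a wrong path instead of letting the parser try to read it
    if (!is_readable($pdfFilePath)) {
        return null;
    }

    try {
        return $parser->parseFile($pdfFilePath)->getText();
    } catch (\Exception $e) {
        // Assumption: a file that cannot be parsed (e.g. a secured PDF)
        // makes the parser throw an exception
        return null;
    }
}
```

In the controller you could then check whether the result is null and return a Response with an error message instead of the text.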

Iterate through every page of the PDF and extract text

If you want to handle every page of the PDF separately, you can iterate through the array of pages that you can retrieve with the getPages method of the PDF instance:

<?php

namespace AppBundle\Controller;

use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

/**
 * Import the PDF Parser class
 */
use Smalot\PdfParser\Parser;


class DefaultController extends Controller
{
    /**
     * @Route("/", name="homepage")
     */
    public function indexAction(Request $request)
    {
        // The relative or absolute path to the PDF file
        $pdfFilePath = $this->get('kernel')->getRootDir() . '/../web/example.pdf';

        // Create an instance of the PDFParser
        $PDFParser = new Parser();

        // Create an instance of the PDF with the parseFile method of the parser
        // this method expects as first argument the path to the PDF file
        $pdf = $PDFParser->parseFile($pdfFilePath);

        // Retrieve all pages from the pdf file.
        $pages  = $pdf->getPages();

        // Retrieve the number of pages by counting the array
        $totalPages = count($pages);

        // Set the current page as the first (a counter)
        $currentPage = 1;

        // Create an empty variable that will store the final text
        $text = "";
         
        // Loop over each page to extract the text
        foreach ($pages as $page) {

            // Add an HTML separator per page e.g. Page 1/14
            $text .= "<h3>Page $currentPage/$totalPages</h3> <br>";

            // Concatenate the text
            $text .= $page->getText();

            // Increment the page counter
            $currentPage++;
        }
 
        // Send the text as response in the controller
        return new Response($text);
    }
}
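Because the example above mixes the extracted text with HTML, characters like < or & in the PDF could break the markup of the response. The following is a minimal sketch of one way to handle this, assuming a hypothetical helper named renderPagesAsHtml that receives the text of every page as an array of strings:

```php
<?php

/**
 * Render the text of each page as HTML, escaping special characters.
 *
 * Hypothetical helper for illustration only. $pageTexts is an array of
 * strings, e.g. the result of calling getText() on every page.
 */
function renderPagesAsHtml(array $pageTexts)
{
    $totalPages = count($pageTexts);
    $html = '';

    foreach (array_values($pageTexts) as $index => $pageText) {
        // Page separator, e.g. Page 1/14
        $html .= '<h3>Page ' . ($index + 1) . '/' . $totalPages . '</h3>';

        // Escape <, >, & and quotes, then turn line breaks into <br /> tags
        $html .= '<p>' . nl2br(htmlspecialchars($pageText, ENT_QUOTES, 'UTF-8')) . '</p>';
    }

    return $html;
}
```

Inside the loop of the controller you would collect $page->getText() into an array and pass that array to the helper before creating the Response.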

You can retrieve the text from a page in array format (where every item in the array is a new line) using the getTextArray method instead of getText.
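The array format is handy when you want to process the text line by line. The following is a minimal sketch, assuming a hypothetical helper named cleanLines that receives the array returned by getTextArray, trims every item and discards the empty ones:

```php
<?php

/**
 * Trim every line and remove the ones that are empty.
 *
 * Hypothetical helper for illustration only. $lines is an array of
 * strings, e.g. the value returned by $page->getTextArray().
 */
function cleanLines(array $lines)
{
    // Remove leading and trailing whitespace from every item
    $trimmed = array_map('trim', $lines);

    // Keep only the items that still contain text
    $nonEmpty = array_filter($trimmed, function ($line) {
        return $line !== '';
    });

    // Reindex the array so the keys are consecutive again
    return array_values($nonEmpty);
}
```

For example, cleanLines($page->getTextArray()) would give you a compact list of lines that you can loop over or join with implode.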

Extract the text from a specific page in the PDF

Although there's no method to access a page directly by its number, you can retrieve the array of pages with the getPages method of the PDF instance and read the page from it. The array follows the page order of the PDF (index 0 corresponds to page #1), so page number N is stored at index N - 1.

Note that you need to verify that the index exists in the pages array before using it, otherwise accessing it triggers an undefined offset error:

<?php

namespace AppBundle\Controller;

use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

/**
 * Import the PDF Parser class
 */
use Smalot\PdfParser\Parser;


class DefaultController extends Controller
{
    /**
     * @Route("/", name="homepage")
     */
    public function indexAction(Request $request)
    {
        // The relative or absolute path to the PDF file
        $pdfFilePath = $this->get('kernel')->getRootDir() . '/../web/example.pdf';

        // Create an instance of the PDFParser
        $PDFParser = new Parser();

        // Create an instance of the PDF with the parseFile method of the parser
        // this method expects as first argument the path to the PDF file
        $pdf = $PDFParser->parseFile($pdfFilePath);

        // Get all the pages of the PDF
        $pages = $pdf->getPages();
        
        // Let's extract the text of the page #2 of the PDF
        $customPageNumber = 2;

        // If the page exists, then extract the text
        // Array indexes start at 0, so page #2 is stored at index 1 (subtract 1)
        if(isset($pages[$customPageNumber - 1])){
          
            // Retrieve the page from the array using the zero-based index
            $pageNumberTwo = $pages[$customPageNumber - 1];

            // Extract the text of the page #2
            $text = $pageNumberTwo->getText();

            // Send the text as response in the controller
            return new Response($text);

        }else{
            return new Response("Sorry the page #$customPageNumber doesn't exist");
        }
    }
}
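The conversion from a page number to an array index is easy to get wrong, so you may prefer to keep it in a single place. The following is a minimal sketch, assuming a hypothetical helper named getPageByNumber that receives the array returned by getPages and a page number that starts at 1:

```php
<?php

/**
 * Return the page with the given number, or null if it doesn't exist.
 *
 * Hypothetical helper for illustration only. $pages is the array
 * returned by $pdf->getPages() and $pageNumber starts at 1.
 */
function getPageByNumber(array $pages, $pageNumber)
{
    // Page #1 is stored at index 0, page #2 at index 1 and so on
    $index = $pageNumber - 1;

    return isset($pages[$index]) ? $pages[$index] : null;
}
```

In the controller, a null result means that the requested page doesn't exist. Otherwise you can call getText on the returned page as usual.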

Happy coding!
