How to retrieve machine readable zones from a passport image with Python using the PassportEye library

Learn how to retrieve the MRZ data from an image of a passport with Python using the PassportEye library.

A machine-readable passport (MRP) is a machine-readable travel document (MRTD) with the data on the identity page encoded in optical character recognition format. Most travel passports worldwide are MRPs. The machine-readable zone (MRZ) of such a document can have 2 or 3 lines of machine-readable data. The method described here can process MRZs written in accordance with ICAO Document 9303 (endorsed by the International Organization for Standardization and the International Electrotechnical Commission as ISO/IEC 7501-1). Some applications need to read this data in some way, and one of the easiest methods is to recognize it from an image file.

In this article, we'll show you how to retrieve the MRZ information from a picture of a passport using the PassportEye library for Python.

Requirements

You will need the Tesseract OCR engine installed on your system and available from the PATH. You can install this tool easily on any system (Unix or Windows); check the official Tesseract repository on GitHub for installation instructions.

You can also check that tesseract is available from the command line with the following command:

tesseract --help
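
The same check can be done from Python before calling PassportEye. The following is a minimal sketch that uses only the standard library; the helper names find_tesseract and tesseract_version are hypothetical and are not part of Tesseract or PassportEye:

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on the PATH."""
    return shutil.which(executable)


def tesseract_version(executable="tesseract"):
    """Return the first line printed by `tesseract --version`, or None if unavailable."""
    path = find_tesseract(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version on stderr instead of stdout
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("Tesseract was not found on the PATH")
    else:
        print("Found:", version)
```

If the script reports that Tesseract was not found, fix the PATH before continuing, as PassportEye relies on the tesseract executable.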

1. Install PassportEye

PassportEye is a Python library for image processing of identification documents that use the machine readable travel format. This package provides a kit of tools for recognizing machine readable zones (MRZ) from scanned identification documents. The documents may be located rather arbitrarily on the page: the code tries to find anything resembling an MRZ and parse it from there. The recognition procedure may be rather slow, taking around 10 or more seconds for some documents. Its precision is not perfect, yet seemingly decent as far as the test documents available to the developer were concerned: in around 80% of the cases, whenever there is a clearly visible MRZ on a page, the system will recognize it and extract the text to the best of the abilities of the underlying OCR engine (Google Tesseract).

You can install this library using the following command:

pip install PassportEye

The installation process will take a while. For more information about this library, please visit its official repository on GitHub.

2. Using PassportEye from the CLI

The PassportEye library exposes the mrz command globally. This tool processes a given filename, extracts the MRZ information it finds, and prints it out in tabular form. Running mrz --json <filename> will output the same information in JSON. Running mrz --save-roi <roi.png> will, in addition, extract the detected MRZ ("region of interest") into a separate PNG file for further exploration. Note that the tool provides limited support for PDF files: it attempts to extract the first DCT-encoded image from the PDF and applies the recognition to it. This seems to work fine with most scanner-produced one-page PDFs, but has not been tested extensively.

The most basic usage of the command is the following:

mrz image.jpg

However, the output won't be well formatted, as each key and value are separated by a single space. Instead, you can print it in JSON format, which is easier to read and can be processed by any programming language. To generate the output in this format, add the --json flag to the command:

mrz image.jpg --json

For a sample passport image, this command outputs data like the following:

{
  "mrz_type": "TD3",
  "valid_score": 62,
  "type": "P<",
  "country": "PRT",
  "number": "1700044<<",
  "date_of_birth": "740407",
  "expiration_date": "220616",
  "nationality": "PRT",
  "sex": "F",
  "names": "INES",
  "surname": "GARCAO DE MAGALHAES",
  "personal_number": "99999999<<<<<<",
  "check_number": "9",
  "check_date_of_birth": "6",
  "check_expiration_date": "1",
  "check_composite": "0",
  "check_personal_number": "8",
  "valid_number": false,
  "valid_date_of_birth": true,
  "valid_expiration_date": true,
  "valid_composite": false,
  "valid_personal_number": true,
  "method": "direct",
  "walltime": 2.2025797367095947,
  "filename": "image.jpg"
}
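
The check_* and valid_* fields relate to the check digits defined by ICAO Document 9303: each character is mapped to a number (digits keep their value, A to Z map to 10 to 35, and the < filler counts as 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. The following sketch recomputes the check digits for the sample output above. Note that mrz_check_digit is a hypothetical helper written for illustration, not a PassportEye function, and it is an assumption that the library derives its valid_* flags in the same way:

```python
def mrz_check_digit(field):
    """Compute the ICAO 9303 check digit of an MRZ field as a one-character string."""
    weights = (7, 3, 1)
    total = 0
    for index, char in enumerate(field):
        if char in "0123456789":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        elif char == "<":
            value = 0
        else:
            raise ValueError(f"Unexpected MRZ character: {char!r}")
        total += value * weights[index % 3]
    return str(total % 10)


# (field value, check digit read by the OCR) pairs copied from the sample JSON
sample_fields = {
    "number": ("1700044<<", "9"),
    "date_of_birth": ("740407", "6"),
    "expiration_date": ("220616", "1"),
    "personal_number": ("99999999<<<<<<", "8"),
}

for name, (value, read_digit) in sample_fields.items():
    computed = mrz_check_digit(value)
    status = "matches" if computed == read_digit else "does not match"
    print(f"{name}: computed {computed}, read {read_digit} ({status})")
```

For the sample data, the computed digit matches for the two dates and the personal number, but not for the document number (computed 0, read 9), which agrees with valid_number being false in the output. A mismatch like this usually means that the OCR misread at least one character of that field.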

3. Using the Python API

If you want to integrate this tool within your Python code, the logic is pretty simple. Import the read_mrz function from the PassportEye library and provide as first argument the image that you want to process (it can be either a path to a file on disk, or a byte stream containing image data). After obtaining the result, call its to_dict method to obtain the MRZ data as a dictionary and access the values by key (use the keys shown in the JSON string of the previous step):

# Import PassportEye
from passporteye import read_mrz

# Process image
mrz = read_mrz("passport_image.jpg")

# read_mrz returns None when no MRZ could be detected
if mrz is None:
    raise SystemExit("No MRZ detected in the image")

# Obtain the MRZ data as a dictionary
mrz_data = mrz.to_dict()

print(mrz_data['country'])
print(mrz_data['names'])
print(mrz_data['surname'])
print(mrz_data['type'])
# And so on with the rest of the properties shown in the previous JSON string
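
The values in the dictionary are raw MRZ strings: some fields may still contain the < filler character and dates are stored as YYMMDD. The following sketch shows one possible way to tidy them up. The helpers strip_filler, parse_mrz_date and summarize are hypothetical names that are not part of PassportEye, and the century of each date is a guess, because the MRZ stores only two digits of the year:

```python
from datetime import date


def strip_filler(value):
    """Replace the '<' filler characters of an MRZ field and trim the result."""
    return value.replace("<", " ").strip()


def parse_mrz_date(yymmdd, latest_year):
    """Convert a YYMMDD string into a date.

    Assumption: the right year is the most recent one ending in YY
    that is not after latest_year.
    """
    year = (latest_year // 100) * 100 + int(yymmdd[0:2])
    if year > latest_year:
        year -= 100
    return date(year, int(yymmdd[2:4]), int(yymmdd[4:6]))


def summarize(mrz_data, today=None):
    """Build a small, cleaned-up dictionary from the output of to_dict()."""
    today = today or date.today()
    return {
        "full_name": f"{mrz_data['names']} {mrz_data['surname']}",
        "document_number": strip_filler(mrz_data["number"]),
        # A date of birth cannot be in the future
        "date_of_birth": parse_mrz_date(mrz_data["date_of_birth"], today.year),
        # Assumption: a document expires at most 20 years from now
        "expiration_date": parse_mrz_date(
            mrz_data["expiration_date"], today.year + 20
        ),
    }


# Values copied from the sample JSON output of the previous step
sample = {
    "names": "INES",
    "surname": "GARCAO DE MAGALHAES",
    "number": "1700044<<",
    "date_of_birth": "740407",
    "expiration_date": "220616",
}

print(summarize(sample))
```

In real code you would pass mrz.to_dict() instead of the sample dictionary. Keep in mind that an OCR misread can produce an impossible date, in which case parse_mrz_date raises a ValueError that the caller should handle.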

Happy coding!