How to efficiently read and parse a huge CSV file line by line in PHP

If you work for a company that offers development services to other companies, even within the same industry, you may have been handed the "wonderful" task of importing a client's huge "database" into your company's preferred database engine. For example, at our company we work with MySQL, and a client came to us with a CSV file of approximately 25 GB and ~7.5M rows.

What?

Obviously, because of the logic the project required, we couldn't simply import the file into the database with a tool like phpMyAdmin, as every row of the CSV had to be modified to fit our new database design.

In this article, we will explain our approach for efficiently reading a huge CSV file in PHP.

1. Split your file into smaller chunks

To get started: when we talk about huge files, we don't mean files with 50K or 70K rows, we mean millions of rows, as in this example with a 25 GB CSV file. So the correct approach for such cases is not to work with the file directly, but with smaller files.

The smaller the file, the easier it is to achieve optimal performance and keep control over your script, not only from a performance perspective but from a logic perspective as well. We previously wrote an article on how to split huge CSV datasets into smaller chunks using CSV Splitter, a tool for Windows 10. You can of course achieve the same with another approach, but you get the idea, right? Split the file into smaller chunks that your scripts can easily process later.
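If you prefer to split the file in PHP itself rather than with an external tool, the same streaming idea applies: read one line at a time and write it to a rotating set of output files. The following is a minimal sketch; the function name, the chunk size, and the file name prefix are all illustrative, not from the article.

```php
<?php

/**
 * Split a CSV file into smaller chunk files of $rowsPerChunk rows each.
 * Chunk files are named "{$outputPrefix}1.csv", "{$outputPrefix}2.csv", ...
 * Returns the number of chunk files written.
 */
function splitCsv(string $inputFile, int $rowsPerChunk, string $outputPrefix): int
{
    $handle = fopen($inputFile, "r");
    if ($handle === false) {
        throw new RuntimeException("Unable to open $inputFile");
    }

    $chunkIndex = 0;
    $rowCount = 0;
    $chunkHandle = null;

    while (($line = fgets($handle)) !== false) {
        // Start a new chunk file every $rowsPerChunk rows
        if ($rowCount % $rowsPerChunk === 0) {
            if ($chunkHandle !== null) {
                fclose($chunkHandle);
            }
            $chunkIndex++;
            $chunkHandle = fopen("{$outputPrefix}{$chunkIndex}.csv", "w");
        }
        fwrite($chunkHandle, $line);
        $rowCount++;
    }

    if ($chunkHandle !== null) {
        fclose($chunkHandle);
    }
    fclose($handle);

    return $chunkIndex;
}
```

Because only one line is held in memory at a time, this works even when the input file is far larger than the available RAM.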

2. Implementing the read and iteration script

To read the file, we will use PHP's fopen function. This built-in function opens a file from a local path or URL and binds a resource to a stream. It expects as its second argument the mode in which we'll operate, in this case just reading, with the r identifier. The function returns a file pointer on success, or false on failure.

We will open the file with this function and store the pointer in the $handle variable. Create as well a variable to hold the current line number, since we'll iterate over the rows with an entry-controlled loop (while). With the while loop, we iterate over every single row of the file for as long as fgets keeps returning content.

PHP's fgets function returns one line from a file opened with fopen, and returns false when there is nothing left to read. Inside the while loop, you can then parse the raw CSV string with the str_getcsv function. With this basic structure in place, you'll be ready to modify the script to do whatever you need with the best possible performance in PHP:

// Read a CSV file
$handle = fopen("my_huge_csv_file.csv", "r");
if ($handle === false) {
    die("Unable to open the file");
}

// Optionally, keep track of the line number
// the loop is currently iterating over
$lineNumber = 1;

// Iterate over every line of the file
while (($raw_string = fgets($handle)) !== false) {
    // Parse the raw CSV string "1,a,b,c" (trimming the
    // trailing newline that fgets keeps on the line)
    $row = str_getcsv(rtrim($raw_string, "\r\n"));

    // into an array: ['1', 'a', 'b', 'c']
    // And do what you need to do with every line
    var_dump($row);
    
    // Increase the current line number
    $lineNumber++;
}

fclose($handle);

The advantages of this approach are:

  • You are not reading the entire file into memory, as file_get_contents does, so the maximum amount of memory the script needs depends only on the longest line in the input data.
  • Quite easy to read and understand.
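As a variation on the same pattern, PHP's built-in fgetcsv function reads and parses one CSV row per call, combining the fgets and str_getcsv steps. The sketch below sums the first column of a file to show the shape of the loop; the function name and file contents are hypothetical examples.

```php
<?php

/**
 * Sum the first column of a CSV file, reading one row at a time
 * with fgetcsv, which returns an array of fields or false at EOF.
 */
function sumFirstColumn(string $file): int
{
    $handle = fopen($file, "r");
    if ($handle === false) {
        throw new RuntimeException("Unable to open $file");
    }

    $total = 0;
    while (($row = fgetcsv($handle)) !== false) {
        // $row is already an array of fields, e.g. ['1', 'a', 'b', 'c']
        $total += (int) $row[0];
    }
    fclose($handle);

    return $total;
}
```

This keeps the same line-by-line memory profile while saving the explicit str_getcsv call; whether you prefer it over the fgets approach is mostly a matter of taste, though fgetcsv also handles quoted fields that span multiple lines.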

Happy coding ❤️!
