Using regular expressions versus string operations

Mar 5, 2019 by Jeroen Deviaene

I have been working on a new project: a pure PHP library for connecting to an IRC server. To get some peer review, I showed the project to a few friends, and the feedback was that I shouldn’t use regex to parse the messages because regexes are really slow!
I was aware that regexes can be slower than parsing with plain string operations, but I always assumed the difference would be insignificant. So, to clear this up, I decided to run some benchmarks and find out whether I should use fewer regular expressions.

The setup

I wanted the benchmark to reflect the use case that prompted it. So the first thing I did was write a random IRC command generator. All generated commands were stored in a file, one per line.
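A generator along these lines could look like the sketch below. It is not the generator used for the benchmark: the function names randomIrcLine and writeDataFile, the nick and channel pools, and the four message shapes are all made up for illustration.

```php
<?php
// Hypothetical generator sketch; names and message shapes are assumptions.

function randomIrcLine(): string
{
    $nicks = ['alice', 'bob', 'carol'];
    $channels = ['#php', '#irc', '#test'];

    $nick = $nicks[array_rand($nicks)];
    $channel = $channels[array_rand($channels)];
    $prefix = ":$nick!$nick@example.org";

    // Pick one of a few common message shapes at random.
    switch (mt_rand(0, 3)) {
        case 0:
            return "$prefix PRIVMSG $channel :Hello from $nick";
        case 1:
            return "$prefix JOIN $channel";
        case 2:
            return "$prefix PART $channel :Bye";
        default:
            return 'PING :irc.example.org';
    }
}

function writeDataFile(string $path, int $count): void
{
    $file = fopen($path, 'w');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    // One generated command per line.
    for ($i = 0; $i < $count; $i++) {
        fwrite($file, randomIrcLine() . "\n");
    }

    fclose($file);
}
```

Calling writeDataFile('data.txt', 10000) would produce a file in the one-command-per-line layout described above.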

I took the existing IrcMessage class from my repository. I also made a second class with the exact same interface that parses the message using mostly strpos and substr instead of a regex. To make sure both produced correct results, I ran the unit tests of the IRC client library against both classes.
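To give an idea of what the two approaches look like side by side, here is a simplified sketch. It is not the code from the library: the function names parseWithRegex and parseWithStringOps, the returned array layout, and the pattern itself are assumptions, and edge cases such as repeated spaces are ignored.

```php
<?php
// Simplified sketch of both approaches; not the library's actual parsers.

function parseWithRegex(string $line): ?array
{
    $line = rtrim($line, "\r\n");
    // [:prefix ]command[ params][ :trailing]
    $pattern = '/^(?::(\S+) )?(\S+)(?: (?!:)(.+?))?(?: :(.*))?$/';
    if (!preg_match($pattern, $line, $m, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }

    return [
        'prefix'   => $m[1] ?? null,
        'command'  => $m[2],
        'params'   => isset($m[3]) ? explode(' ', $m[3]) : [],
        'trailing' => $m[4] ?? null,
    ];
}

function parseWithStringOps(string $line): ?array
{
    $line = rtrim($line, "\r\n");
    $prefix = null;
    $trailing = null;

    // Optional prefix: a leading ":" up to the first space.
    if ($line !== '' && $line[0] === ':') {
        $space = strpos($line, ' ');
        if ($space === false) {
            return null;
        }
        $prefix = substr($line, 1, $space - 1);
        $line = substr($line, $space + 1);
    }

    // Optional trailing parameter: everything after the first " :".
    $colon = strpos($line, ' :');
    if ($colon !== false) {
        $trailing = substr($line, $colon + 2);
        $line = substr($line, 0, $colon);
    }

    if ($line === '') {
        return null;
    }

    // What is left is the command followed by space-separated parameters.
    $params = explode(' ', $line);
    $command = array_shift($params);

    return [
        'prefix'   => $prefix,
        'command'  => $command,
        'params'   => $params,
        'trailing' => $trailing,
    ];
}
```

Both functions should return the same array for well-formed input such as ":nick!user@host PRIVMSG #channel :Hello world", which makes it easy to check one against the other before timing them.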

The testing script was the simplest part. It reads the data file line by line and passes each line to the message parser. Once all lines are parsed, it prints the elapsed time.

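// Swap the commented-out require to benchmark the other implementation.
// Both files are expected to define a class named IrcMessage.
//require_once 'IrcMessageRegex.php';
require_once 'IrcMessageParser.php';

// Start the timer. Note that the measurement also includes reading the file.
$time = microtime(true);
$counter = 0;
$file = fopen('data.txt', 'r');
if ($file === false) {
    exit("Could not open data.txt\n");
}

// Parse every line; the parsed message itself is discarded.
while (($line = fgets($file)) !== false) {
    $message = new IrcMessage($line);
    $counter++;
}

fclose($file);

$time = round(microtime(true) - $time, 6);
// number_format uses a dot as thousands separator here, e.g. 10.000
echo 'Parsed ' . number_format($counter, 0, ',', '.') . " messages in $time seconds.\n";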

The results

To no one’s surprise, the regex was slower than the string operations. But to my surprise, the difference was substantial: manually parsing the commands was about 1.8 times as fast as the regex parsing.

Regex:   Parsed 10.000 messages in 0.097482 seconds.
Manual:  Parsed 10.000 messages in 0.054246 seconds.

Regex:   Parsed 1.000.000 messages in 9.643481 seconds.
Manual:  Parsed 1.000.000 messages in 5.466605 seconds.
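To put these numbers in perspective, the small script below converts them to a cost per message and a speedup factor. The timings are copied from the output above; the summarize helper is only an illustration. It works out to just under 10 microseconds per message for the regex and about 5.5 microseconds for the manual parser.

```php
<?php
// Timings copied from the results above: [messages, regex seconds, manual seconds].
$runs = [
    [10000, 0.097482, 0.054246],
    [1000000, 9.643481, 5.466605],
];

// Hypothetical helper: per-message cost in microseconds and regex/manual ratio.
function summarize(int $messages, float $regex, float $manual): array
{
    return [
        'regex_us'  => round($regex / $messages * 1e6, 2),
        'manual_us' => round($manual / $messages * 1e6, 2),
        'speedup'   => round($regex / $manual, 2),
    ];
}

foreach ($runs as [$n, $regex, $manual]) {
    $s = summarize($n, $regex, $manual);
    echo "$n messages: regex {$s['regex_us']} us, manual {$s['manual_us']} us per message, "
        . "manual is {$s['speedup']}x as fast\n";
}
```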

So today I learned my lesson. If the string isn’t too complex to parse manually, maybe it shouldn’t be done using a regex. After fiddling a bit with the code, the manual version of the IRC message parser was actually no more than 25 lines of code.

In the next few days, I will be removing all regexes from the message parsers in my IRC client. And in the future, I shall think twice before reaching for a regular expression to solve a simple string parsing problem.