In a previous post on Babble's development, I mentioned that the SubstituteAndReturn plugin was slow. Part of this is because the PPR grammar can find the substitution/transliteration operators but not provide the information about where in an expression they are which is needed to determine if the operation is contextual on the $_ variable.

However before I jump into optimising that, it is good to profile the code to find where it is being the slowest. I fired up Devel::NYTProf and found that the code was taking the longest under Perl v5.26 with the CORE:regcomp subroutine (Perl's internal regexp compilation). Usually, when a regexp does not contain any interpolated variables this is done at compile time; however, since Babble constructs regexps at runtime, this CORE:regcomp is called many times. So I added caching of the compiled regex in two places where the profiler output indicated the most time was being spent. This immediately led to a 30%-70% reduction in execution time for tests depending on how heavily the code generates grammar regexps (with the greatest reduction for the postfixderef and substituteandreturn tests). Processing Dist::Zilla and its dependencies went from 15 minutes down to 9 minutes, but this result is not using the cache to its full potential because I currently process each file in separate interpreters so the cache is per-file instead of over the entire run.

Returning to the SubstituteAndReturn plugin, the approach used in Babble-0.090008 was looping over the document and inserting a $_ =~ explicit binding until there were no more places a binding could be inserted. Upon looking more closely at how the code ran, I realized that the combination of each_match_within(Expression => ... ) with the pattern that I used could only find one position to place that explicit binding per iteration; it was doing too much and this was slowing it down. So I replaced that with each_match_of( Expression => ... and extract the substitution/transliteration position myself in this PR. This reduces the processing of Dist::Zilla from 9 minutes to 6 minutes.

I then created a continuous integration setup to make sure that it worked across Perl versions. This is is when I discovered something shocking! The tests were running up to 10 times longer on Perl 5.22–5.36 relative to 5.10–5.18 (40 versus 4 seconds overall). I knew that this slowdown is documented as a PPR limitation but I was not timing just how much slower it was until now. I ran the profiler again and I saw that the slowdown was entirely attributable to regex compilation not matching speed.

This means I should be able to use the caching that I added to make up for the difference as long as all files are modified in a single process. I rewrote the bash script that uses GNU Parallel in a single-file-per-process fashion to a Perl script that uses MCE which uses fork(2) by default on Unix-likes. While this does introduce IPC overhead, it turned out to be faster than repeatedly spawning processes. Using fork(2) opened up another option for optimisation: I could warm up the cache prior to the fork(2) and use that cache in the child processes — now we're cooking. I added cases to warm up the cache until the child processes were no longer reporting cache misses.

When running Babble across many files, it quickly becomes apparent that most files are unchanged by the plugins so you can gain a speed up by adding a small check for the syntax being targeted by the plugin. I added these as "early" and "late" bail outs which can be turned off using an environment variable. These are meant to be fast checks that look for short matches without using the full PPR grammar. Early bail outs look over the entire document while late bail outs are used inside the transformation. For example, the SubstituteAndReturn early bail out looks for m/ \b (?: s|y|tr ) \b /xs in the entire document and the PostfixDeref late bail out looks for m/ \s* -> \s* [\@%\$] /xs inside of matches for the PerlTerm rule.

Any purported optimisation needs to be benchmarked. So how do all these optimisations interact? I could run the various options for caching, Perl version, bailing out, etc. multiple times myself and find their mean, but it would be nicer to script that. So I used Permute::Named::Iter to record elapsed time for various options. The speed ups from enabling caching and bailing out are so great and preliminary tests showed significance so what this left me with is testing (Perl version) x (warming cache) x (number of workers). While I know that the older Perl version will run faster serially, I want to compare to see if I did not inadvertently slow it down and if my optimisations bring the elapsed time over newer Perl versions close to that of old Perl versions. I get the following data sorted by mean elapsed time:

> model         <- elapsed ~ version * workers * ( cache * warm_cache + bail_out_early + bail_out_late )
> agg <- aggregate(model, mean, data = data )
> agg.sort.elapsed <- agg[order(agg$elapsed),c('version','warm_cache','workers','elapsed')]
> print(agg.sort.elapsed)
              version warm_cache workers elapsed
11 perl-5.18.4@babble       TRUE       8  36.446
5  perl-5.18.4@babble      FALSE       8  37.170
9  perl-5.18.4@babble       TRUE       4  41.960
3  perl-5.18.4@babble      FALSE       4  42.490
12 perl-5.34.0@babble       TRUE       8  47.266
10 perl-5.34.0@babble       TRUE       4  53.192
7  perl-5.18.4@babble       TRUE       2  55.320
1  perl-5.18.4@babble      FALSE       2  56.364
4  perl-5.34.0@babble      FALSE       4  58.852
6  perl-5.34.0@babble      FALSE       8  59.772
8  perl-5.34.0@babble       TRUE       2  68.158
2  perl-5.34.0@babble      FALSE       2  72.652

and the following boxplot:

Box plot comparing elapsed time versus number of workers with presence/absence of cache warming.

Since the data does not fit the assumptions for using ANOVA (elapsed time in the data has non-equal group variances), Aligned Rank Transform (ART) is used (I would have liked to do more runs and to more carefully control the environment for each run, but I didn't like waiting as they have to run serially on the same computer in order to be independent samples!):

> model.reduced <- elapsed ~ version * workers * ( warm_cache )
> art.reduced <- art( model.reduced, data )
> fit.reduced <- anova( art.reduced )
> print(fit.reduced)

Analysis of Variance of Aligned Rank Transformed Data

Table Type: Anova Table (Type III tests)
Model: No Repeated Measures (lm)
Response: art(elapsed)

                             Df Df.res F value     Pr(>F)
1 version                     1     48 151.892 < 2.22e-16 ***
2 workers                     2     48 202.148 < 2.22e-16 ***
3 warm_cache                  1     48 148.590 2.6388e-16 ***
4 version:workers             2     48  56.041 2.7902e-13 ***
5 version:warm_cache          1     48 151.381 < 2.22e-16 ***
6 workers:warm_cache          2     48 101.621 < 2.22e-16 ***
7 version:workers:warm_cache  2     48 123.130 < 2.22e-16 ***
Signif. codes:   0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In terms of main effects, the number of workers is the greatest predictor of elapsed time. Therefore with the current implementation, warming the cache helps a little bit, but less than adding more workers which makes sense as those workers will each fill the cache themselves as soon as they reach input that can be transformed. Warming the cache might help more when there are many more files processed (currently the code uses a plugin on one distribution at a time instead of all files across all distributions).

All these changes are available as:

Appendix: List of issues and pull requests participated in along the way