Is there a performance problem when using non-greedy regular expressions and match variables on long strings? See output and code below. Tested on Win/XP (Tcl 8.4.13 & 8.5.7) and FreeBSD (Tcl 8.4.13).
The time shown is when the match encompasses the whole buffer.
If match variables are _not_ used then there is no performance problem with non-greedy matching! Seems strange that just because the regexp has match vars then the performance goes down the tubes.
Has anyone come across this before and discovered a solution?
I'm the verge of going back to using greedy reg exps and iterating backwards through each match to simulate the non-greedy matching (I can do this because of the patterns of match var I am looking for)
Output from script :
= = =
71640 start regexp end regexp - non greedy, with matchVar 29677 millis end regexp - non greedy, no matchVar 0 millis end regexp - greedy, with matchVar 5 millis end regexp - greedy, no matchVar 0 millis
= = =
Here is the script:
= = =
puts [string length $longString] puts "start regexp" set now [clock clicks] regexp (.*?)($reTerminator) $longString wholeMatch message terminator puts "end regexp - non greedy, with matchVar [expr ([clock clicks] - $now) / 1000 ] millis" set now [clock clicks] regexp (.*?)($reTerminator) $longString puts "end regexp - non greedy, no matchVar [expr ([clock clicks] - $now) / 1000 ] millis" set now [clock clicks] regexp (.*)($reTerminator) $longString wholeMatch message terminator puts "end regexp - greedy, with matchVar [expr ([clock clicks] - $now) / 1000 ] millis" set now [clock clicks] regexp (.*)($reTerminator) $longString puts "end regexp - greedy, no matchVar [expr ([clock clicks] - $now) / 1000 ] millis"
= = =
There may be some performance quirks in the RE behavior, but before determining that, here are a few tips to provide better numbers. REs should whenever possible be static or single scalars, rather than the concatenated values like "(.*?)($reTerminator)". There is an underlying Tcl_ObjType for REs, but you should create an obj like:
set myRE "(.*?)($reTerminator)"
regexp $myRE $longString ...
in order for that to be cached with the scalar obj.
Also, a better method of timing that removes variance due to other system workload and caching spikes is to use the Tcl [time] command with a value of at least 100 to get better results, like:
set res [time {regexp $myRE $longString ...} 100]
puts $res
Similar results specifying the RE as a variable:
Using the following code (I chickened out of using 100 iterations because the regexp takes 100% of the CPU and I didnt fancy locking up my single core PC for that length of time):
puts [string length $longString] puts "start regexp" set myRE (.*?)($reTerminator) set res [time {regexp $myRE $longString wholeMatch message terminator} 10] puts "end regexp - non greedy, with matchVar = $res" set res [time {regexp $myRE $longString} 10] puts "end regexp - non greedy, no matchVar = $res" set myRE (.*)($reTerminator) set res [time {regexp $myRE $longString wholeMatch message terminator} 10] puts "end regexp - greedy, with matchVar = $res" set res [time {regexp $myRE $longString} 10] puts "end regexp - greedy, no matchVar = $res"One key piece of info missing is what $reTerminator is and how large $longString is. You will find various metrics for REs at http://wiki.tcl.tk/1611 that indicate different speed costs for difference parameters. When no match vars are requested, there is an optimization passed to the RE saying only yes/no is necessary.
The longString is 71640 bytes. In this timing example reTerminator is "#> " (but for the actual application, the value of reTerminator can change to be a genuine RE that's why I'm using regexp instead of "string first").
The last three bytes of the longString are the first match for the reTerminator.