ActiveState Community

non-greedy regexp with matchVars : performance issue?

Posted by robert.parker on 2009-06-08 11:18
OS: All / Any

Is there a performance problem when using non-greedy regular expressions and match variables on long strings? See output and code below. Tested on Win/XP (Tcl 8.4.13 & 8.5.7) and FreeBSD (Tcl 8.4.13).

The time shown is when the match encompasses the whole buffer.

If match variables are _not_ used then there is no performance problem with non-greedy matching! Seems strange that just because the regexp has match vars then the performance goes down the tubes.

Has anyone come across this before and discovered a solution?
I'm the verge of going back to using greedy reg exps and iterating backwards through each match to simulate the non-greedy matching (I can do this because of the patterns of match var I am looking for)

Output from script :
= = =

71640
start regexp
end regexp - non greedy, with matchVar 29677 millis
end regexp - non greedy, no matchVar 0 millis
end regexp - greedy, with matchVar 5 millis
end regexp - greedy, no matchVar 0 millis

= = =

Here is the script:
= = =

puts [string length $longString]

puts "start regexp"

set now [clock clicks]
regexp (.*?)($reTerminator) $longString wholeMatch message terminator
puts "end regexp - non greedy, with matchVar [expr ([clock clicks] - $now) / 1000 ] millis"

set now [clock clicks]
regexp (.*?)($reTerminator) $longString
puts "end regexp - non greedy, no matchVar [expr ([clock clicks] - $now) / 1000 ] millis"

set now [clock clicks]
regexp (.*)($reTerminator) $longString wholeMatch message terminator
puts "end regexp - greedy, with matchVar [expr ([clock clicks] - $now) / 1000 ] millis"

set now [clock clicks]
regexp (.*)($reTerminator) $longString
puts "end regexp - greedy, no matchVar [expr ([clock clicks] - $now) / 1000 ] millis"

= = =

jeffh | Mon, 2009-06-08 13:40

There may be some performance quirks in the RE behavior, but before determining that, here are a few tips to provide better numbers. REs should whenever possible be static or single scalars, rather than the concatenated values like "(.*?)($reTerminator)". There is an underlying Tcl_ObjType for REs, but you should create an obj like:
set myRE "(.*?)($reTerminator)"
regexp $myRE $longString ...
in order for that to be cached with the scalar obj.

Also, a better method of timing that removes variance due to other system workload and caching spikes is to use the Tcl [time] command with a value of at least 100 to get better results, like:
set res [time {regexp $myRE $longString ...} 100]
puts $res

robert.parker | Tue, 2009-06-09 05:55

Similar results specifying the RE as a variable:

71640
start regexp
end regexp - non greedy, with matchVar = 28800271.8 microseconds per iteration
end regexp - non greedy, no matchVar = 739.9 microseconds per iteration
end regexp - greedy, with matchVar = 5416.1 microseconds per iteration
end regexp - greedy, no matchVar = 750.3 microseconds per iteration

Using the following code (I chickened out of using 100 iterations because the regexp takes 100% of the CPU and I didnt fancy locking up my single core PC for that length of time):

puts [string length $longString]

puts "start regexp"

set myRE (.*?)($reTerminator)

set res [time {regexp $myRE $longString wholeMatch message terminator} 10]
puts "end regexp - non greedy, with matchVar = $res"

set res [time {regexp $myRE $longString} 10]
puts "end regexp - non greedy, no matchVar = $res"

set myRE (.*)($reTerminator)

set res [time {regexp $myRE $longString wholeMatch message terminator} 10]
puts "end regexp - greedy, with matchVar = $res"

set res [time {regexp $myRE $longString} 10]
puts "end regexp - greedy, no matchVar = $res"

jeffh | Tue, 2009-06-09 14:15

One key piece of info missing is what $reTerminator is and how large $longString is. You will find various metrics for REs at http://wiki.tcl.tk/1611 that indicate different speed costs for difference parameters. When no match vars are requested, there is an optimization passed to the RE saying only yes/no is necessary.

robert.parker | Wed, 2009-06-10 01:34

The longString is 71640 bytes. In this timing example reTerminator is "#> " (but for the actual application, the value of reTerminator can change to be a genuine RE that's why I'm using regexp instead of "string first").

The last three bytes of the longString are the first match for the reTerminator.