Friday, February 27, 2009

Searching the easy, but hard, way

The Find dialog.

Accessible via CTRL+F.

Shrouded in mystery (because someone forgot to give it a title).

Contains a checkbox that may have no meaning for some users:
Use Regular Expressions

For many people, Regular Expressions may simply refer to some form of toilet humour. For others, however, they are a very powerful tool in the search arsenal.

Regular Expressions are a method of describing powerful pattern-matching rules using text. I will not attempt to give a tutorial in this blog. Rather, I hope to encourage you to investigate further so that you can utilise the power of regular expressions to make your packaging more productive.

As always (well, often), the Wikipedia article on regular expressions is a good place to start.

In InstEd, when you check the Use Regular Expressions checkbox, the Find text is interpreted not as a literal string to find, but rather as a "regular expression" that describes a pattern to find.

In the simple case, the pattern can be the literal text. For example, searching on "InstEd" will find the same results regardless of whether regular expressions are used. This is because the string "InstEd" is a regular expression for the literal text "InstEd". Confused?

Perhaps a more complicated case will be useful. This string "^InstEd$" is a regular expression that will only find entries where no other text than "InstEd" is in the field. Specifically, "^" matches the start of the field and "$" matches the end of the field. So, the regular expression indicates that it will only match:

Start of field, followed by "InstEd", followed by end of field.
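The difference between the two searches can be tried outside InstEd. The following is a minimal sketch in Python, whose `re` module is not the engine InstEd uses but treats "^" and "$" the same way for single-line text. The field values are made up for illustration.

```python
import re

# Hypothetical field values, invented for this illustration.
fields = ["InstEd", "InstEdIt", "TryThisInstEd", "Built with InstEd"]

# Unanchored: finds every field that contains the text "InstEd".
contains = [f for f in fields if re.search(r"InstEd", f)]

# Anchored: finds only fields that hold "InstEd" and nothing else.
exact = [f for f in fields if re.search(r"^InstEd$", f)]

print(contains)
print(exact)
```

The first search returns all four fields; the second returns only the field that is exactly "InstEd".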

Suppose you want to find all the fields that contain the word "InstEd" but exclude terms such as "InstEdIt" or "TryThisInstEd". You could check the Match whole word checkbox. But under the hood, that checkbox builds a regular expression to do the heavy lifting. The regular expression would be "(^|\s)InstEd($|\s)". (Actually it's a bit more complicated than that but the detail is not necessary here.)

Now you will recognise the ^ and $ characters from before. They match the start and end of the field. The | character has added an alternative, an "OR" if you like. And the \s is shorthand for whitespace (spaces, tabs etc). So the regular expression now indicates:

Match the start of the field OR whitespace, followed by "InstEd", followed by the end of the field OR whitespace.

In other words, only match when InstEd is the complete word, excluding things such as "InstEdIt" or "TryThisInstEd".
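Here is a small sketch of the simplified whole-word expression, again in Python rather than InstEd's own engine. The helper name `has_whole_word` and the sample values are assumptions made for this example, and the real expression InstEd builds is more involved.

```python
import re

# The simplified whole-word expression from the text above.
whole_word = re.compile(r"(^|\s)InstEd($|\s)")

def has_whole_word(field):
    """Return True if "InstEd" appears as a whitespace-delimited word."""
    return whole_word.search(field) is not None

# Hypothetical field values, invented for this illustration.
samples = ["InstEd", "Open InstEd now", "InstEdIt", "TryThisInstEd"]
matches = [s for s in samples if has_whole_word(s)]

print(matches)
```

Only "InstEd" and "Open InstEd now" are found. Because this simplified form only recognises whitespace and the field boundaries, a value such as "InstEd." with trailing punctuation would not match.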

Note that if you explicitly wrote "(^|\s)InstEd($|\s)" AND checked the Match whole word checkbox, you would miss most of the matches you expect, because under the hood the search regular expression would become "(^|\s)(^|\s)InstEd($|\s)($|\s)". Since "^" and "$" match a position rather than a character, that expression can still match at the very start or end of the field, but anywhere else it would need two whitespace characters on that side of the word.

Similarly, if you wrote "(^|\s)InstEd($|\s)" and forgot to check the Use Regular Expressions checkbox, you would be lucky to find a field that contains such an arcane string.

This has just scratched the surface of the power of regular expressions, but I warn you, they can become awfully complicated.

For example, the regular expression InstEd uses to find references to properties, components, files etc in Formatted fields is:

That is not pretty, but trying to write code to find such fields would be way more complicated than using that regular expression.

(On a technical note, it's a little more permissive than required, but works fine.)

Internally, InstEd uses the Boost regular expression engine, and the syntax is described here.

Some things that regular expressions are useful for finding:
  • File.FileName entries that don't have short filenames: "^[^\|]*$" (the anchors make sure the whole field contains no "|"; remember to use the Table and Column filter dropdowns in the Find dialog).
  • Strings that span other strings:

    "Microsoft.*97": would find references to Excel, Word, Outlook etc

    "((SOFTWARE\\Classes\\CLSID)|(CLSID))\\{MyGuid}": would find references to {MyGuid} in registry keys only in "SOFTWARE\Classes\CLSID" (HKLM) or "CLSID" (HKCR)

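
The patterns in this list can be tried out with a short Python sketch. Python's `re` module is not the Boost engine, but it handles these particular constructs the same way. The `find` helper and all sample values are hypothetical, and the "no short filename" pattern is written with "^" and "$" on the assumption that the intent is for the whole field to contain no "|".

```python
import re

def find(pattern, values):
    """Return the values in which the pattern is found anywhere."""
    rx = re.compile(pattern)
    return [v for v in values if rx.search(v)]

# Hypothetical filename values in "short|long" form.
filenames = ["README~1.TXT|ReadMe First.txt", "setup.exe"]
no_short = find(r"^[^|]*$", filenames)

# Hypothetical product names: ".*" spans whatever lies in between.
products = ["Microsoft Excel 97", "Microsoft Word 97", "Microsoft Word 2003"]
office97 = find(r"Microsoft.*97", products)

# Hypothetical registry keys; "\\" in the pattern matches one backslash.
keys = [
    r"SOFTWARE\Classes\CLSID\{MyGuid}",
    r"CLSID\{MyGuid}",
    r"SOFTWARE\Classes\Interface\{MyGuid}",
]
clsid = find(r"((SOFTWARE\\Classes\\CLSID)|(CLSID))\\{MyGuid}", keys)

print(no_short)
print(office97)
print(clsid)
```

With these sample values, the first search finds only "setup.exe", the second finds the two 97 products, and the third finds the two CLSID keys but not the Interface key.
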
If you have other useful regular expressions, please post them here as a comment.
