Consider the following regexes, numbered for readability:
1 f?oo
2 [f]?oo
3 [f?]oo
4 (f?)oo
5 (f)?oo
Do they behave as expected? Do they behave in the same way? Are they equivalent?
To answer whether they behave as expected, we first need to decide what our expectations are. Looking at the regexes, it’s pretty clear that we want to match “oo” prefixed with an optional “f”. Now that we know what we expect, let’s throw some test cases at them: “oo” and “foo” seem appropriate.
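We run our tests and discover that all except #3 produce the expected result: a successful match for both test cases. Sneaky #3 correctly matches “foo” but fails to match “oo”. Inside square brackets the “?” is a literal character rather than a quantifier, so #3 requires exactly one “f” or “?” before the “oo”.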
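Here is a minimal sketch of that test run, using Python's re module as a stand-in for whatever engine you prefer. The PATTERNS table and the run_tests helper are names invented for this illustration, and the unanchored re.search call is an assumption about how the tests are run.

```python
import re

# The five patterns from the list above, keyed by their number.
PATTERNS = {
    1: r"f?oo",
    2: r"[f]?oo",
    3: r"[f?]oo",
    4: r"(f?)oo",
    5: r"(f)?oo",
}


def run_tests(inputs=("oo", "foo")):
    """Return {pattern number: {input: matched?}} using an unanchored search."""
    return {
        num: {text: re.search(pattern, text) is not None for text in inputs}
        for num, pattern in PATTERNS.items()
    }


if __name__ == "__main__":
    for num, outcome in run_tests().items():
        print(num, PATTERNS[num], outcome)
```

Adding “?oo” to the inputs is a quick way to confirm what #3 is really doing: it is the only other three-character string that pattern accepts.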
Now on to the next question, “Do they behave the same way?” You might be tempted to say “yes” because they appear to return the same result given the same input. But, if you’re pedantic about regexes, you’ll remember that parentheses indicate that matched text should be stored in a capture group. I wrote a little program to print “$+” after regex matching, and got the following:
Input: foo
1 f?oo MATCH last capture:
2 [f]?oo MATCH last capture:
3 [f?]oo MATCH last capture:
4 (f?)oo MATCH last capture: f
5 (f)?oo MATCH last capture: f
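The listing above can be approximated with a short Python sketch. This is not the original program: it assumes Python's Match.lastindex is a reasonable analogue of “$+”, and last_capture is a hypothetical helper name. It also tries the input “oo”, where options 4 and 5 stop agreeing: the group in 4 takes part in the match and captures an empty string, while the group in 5 does not take part at all.

```python
import re

PATTERNS = [r"f?oo", r"[f]?oo", r"[f?]oo", r"(f?)oo", r"(f)?oo"]


def last_capture(pattern, text):
    """Return the text of the last capture group that took part in the match.

    Returns None when nothing matched, the pattern has no groups,
    or no group participated in the match.
    """
    m = re.search(pattern, text)
    if m is None or m.lastindex is None:
        return None
    return m.group(m.lastindex)


if __name__ == "__main__":
    for text in ("foo", "oo"):
        print("Input:", text)
        for number, pattern in enumerate(PATTERNS, start=1):
            print(number, pattern, "last capture:", repr(last_capture(pattern, text)))
```

Whether an empty string and “no capture” count as the same thing depends entirely on the code that consumes the result, which is exactly the kind of side effect worth checking.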
When you’re evaluating different ways to code something, don’t just look at inputs and outputs. Side effects have the potential to be very dangerous, especially because they’re hard to debug. In this case we’re probably fine, but of course it depends on what your application is doing and who is calling you and how often…
So now we’ve looked at functionality and some of the side effects of our examples. Is it safe to say that options 1 and 2 are the same, and options 4 and 5 are the same? Not yet. Here is where the intangibles come in. For example, it can be argued that options 2, 4, and 5 are more readable, and hence better choices. Also consider that requirements might change: what if I wanted to change my regex so that it would match “boo” and “foo” but not “oo”? Then option 2 makes sense because the square brackets indicate a character set, so I would only have to add the letter “b” in there (and drop the “?”, since the prefix is no longer optional) and I’m done. Okay, how about “boo” and “foo” and “taboo”? Uh-oh. Now we can’t use the character set because we’ve introduced a prefix longer than a single character. Maybe 4 or 5 is the way to go…
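To make the changing requirements concrete, here is a small sketch in Python. The constant names and the matches helper are made up for this example, and it assumes we want the whole input to match (re.fullmatch), since an unanchored search would find “boo” inside “taboo” anyway.

```python
import re

# Option 2 extended: a character set handles single-character prefixes only.
CHARACTER_SET = r"[bf]oo"

# Option 4/5 style: alternation inside a group allows longer prefixes.
ALTERNATION = r"(b|f|tab)oo"


def matches(pattern, text):
    """True when the entire string matches the pattern (an assumption of this sketch)."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for word in ("oo", "boo", "foo", "taboo"):
        print(word, matches(CHARACTER_SET, word), matches(ALTERNATION, word))
```

The character set stays the shortest thing that works until the first multi-character prefix shows up, at which point the group is the only one of the two that can grow.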
This is my favorite part of the code reviewing process. It’s easy to tell if the code is working*, it’s moderately easy to tell if it’s not doing anything bad, but figuring out if it’s extensible, maintainable, scalable? That’s more art than science.
BTW, if we’re deciding between 4 and 5, my pick is 5, since it makes it clear that the “?” is acting on the entire group of options, and not just on something inside it.
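A quick sketch of why the position of the “?” matters once the group holds several options. The patterns reuse the hypothetical “b”, “f”, “tab” prefixes from earlier, the names are invented for this illustration, and whole-string matching via re.fullmatch is again an assumption.

```python
import re

# Option 5 style: the "?" follows the group, so the whole prefix is optional.
WHOLE_GROUP_OPTIONAL = r"(b|f|tab)?oo"

# Option 4 style: the "?" sits inside the group and only makes the
# final "b" of "tab" optional.
INNER_OPTIONAL = r"(b|f|tab?)oo"


def accepts(pattern, text):
    """True when the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for word in ("oo", "foo", "taboo", "taoo"):
        print(word, accepts(WHOLE_GROUP_OPTIONAL, word), accepts(INNER_OPTIONAL, word))
```

With the “?” inside the group, “oo” is rejected and “taoo” is accepted, which is probably not what anyone reading the pattern quickly would guess.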
* Especially if you have tests!