Regular Expressions for Humans

Adding SMS support to the Auto Responder changes a few things in my program that affect the users. One of those things is the registration process. With email, the user is expected to send the word “team” followed by their team number to register@mucow.net. The code sees that destination address and then parses the message to register the person. That process can’t work for the SMS text messaging part of the system because all text messages are sent to the exact same destination phone number.

For all other message handling, the user sends a two-word message to x@mucow.net (every message other than registration has exactly two words). The code is not picky about the address: any single-letter email address like b@mucow.net will work. The parser expects the first word to be the answer to a question and the second word to be the keyword. If the puzzle says “Send the number of cows to ‘cows’” then the user would send a message with contents like “27 cows” to n@mucow.net. The system is also built to handle “27” sent to cows@mucow.net. In fact, that was the original way this worked until I realized that using separate addresses was inconvenient for the user.
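To make the two delivery styles concrete, here is a minimal sketch in Python. It is not the Auto Responder's actual code: the function name parse_message and the rule "a single-letter local part means the keyword is in the body" are my assumptions for illustration.

```python
def parse_message(to_address: str, body: str):
    """Return (keyword, answer), or None if the message cannot be parsed.

    Hypothetical helper: assumes a single-letter local part (x@, n@, b@)
    means the body holds "<answer> <keyword>", and any longer local part
    is itself the keyword with the whole body as the answer.
    """
    local_part = to_address.split("@", 1)[0].lower()
    words = body.split()
    if len(local_part) == 1:
        # Two-word form: first word is the answer, second is the keyword.
        if len(words) != 2:
            return None
        answer, keyword = words
    else:
        # Original form: the address names the keyword.
        keyword, answer = local_part, " ".join(words)
    return keyword.lower(), answer
```

With this sketch, “27 cows” sent to n@mucow.net and “27” sent to cows@mucow.net both come out as the keyword “cows” with the answer “27”.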

It would be easy to add code to check for a message of the form “register team 2” or “team 6 register” but I want to write good code that doesn’t have special cases when not needed. So this is where the title of the post comes in: how do I write a regular expression parser that lets the administrator specify a wider variety of incoming message formats while keeping the expression syntax easy to remember?

Regular Expression on Wikipedia

I do not want a regular human using this system to need to remember how to use any regular expression. I also want them to be able to use the simplest form possible if a regular expression is needed. That’s why the current expressions use the asterisk * and the question mark ? as the only variable parts of the expression. Some people may remember that these can be used on most computers to get a listing of files in a directory (or Folder to young Windows users) that are a specific subset of the total set of files.

> dir *.pdf

These days, even hoping that a user knows that the above statement results in a list of all PDF files being shown is a stretch of the imagination. Still, it’s the simplest regular expression syntax that I know. These characters are often called “wildcard” characters and are not referred to as parts of a regular expression.
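For anyone who has not met these wildcards, here is a small sketch of how they behave, written in Python with the standard fnmatch module. The file names are made up for illustration. Note that fnmatch also understands bracket expressions like [abc], which go beyond the two characters I plan to support.

```python
from fnmatch import fnmatchcase

# Made-up directory listing for illustration.
files = ["report.pdf", "notes.txt", "map.PDF", "a.pdf"]

# "*" matches any run of characters; lowercasing the name first makes
# the comparison case insensitive.
pdfs = [name for name in files if fnmatchcase(name.lower(), "*.pdf")]

# "?" matches exactly one character, so only one-letter names qualify.
short_pdfs = [name for name in files if fnmatchcase(name, "?.pdf")]
```

Here pdfs holds the three PDF files regardless of case, while short_pdfs holds only a.pdf.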

The current configuration file contains entries that have keyword/answer pairs that are used to match incoming messages to the expected response for those messages. Here is what I am thinking:

I will still use only the asterisk and question mark. I already have working parsing code for them.

All parsing will be case insensitive.

I will allow the wildcard characters to appear in what is currently called the keyword. For instance, “register team *” would be a valid keyword value.

All characters matched by the wildcard characters will be combined into a single word, and that word becomes the answer text used for the final answer comparison.

Wildcard characters are allowed in the answer setting as they are now.

If the keyword contains no wildcard characters, an asterisk followed by a space is assumed in front of the keyword text. For example, the keyword “cows” is treated as “* cows”.

This should work and will allow me to use the multi-word registration keyword/answer with less special-case code than I have now.
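As a rough sketch of the rules above, and not the code I have running, the keyword could be translated into an ordinary regular expression behind the scenes. The names compile_keyword and extract_answer are hypothetical, and I am assuming that "combined into a single word" means the matched pieces are simply joined with nothing between them.

```python
import re

def compile_keyword(keyword: str) -> re.Pattern:
    """Translate a keyword using only * and ? into a compiled regex."""
    if "*" not in keyword and "?" not in keyword:
        # No wildcard: assume a leading asterisk and a space.
        keyword = "* " + keyword
    parts = []
    for ch in keyword:
        if ch == "*":
            parts.append("(.*)")   # any run of characters, captured
        elif ch == "?":
            parts.append("(.)")    # exactly one character, captured
        else:
            parts.append(re.escape(ch))  # everything else is literal
    # All parsing is case insensitive.
    return re.compile("".join(parts), re.IGNORECASE)

def extract_answer(keyword: str, message: str):
    """Return the text matched by the wildcards, or None if no match."""
    match = compile_keyword(keyword).fullmatch(message.strip())
    if match is None:
        return None
    # Assumption: join the wildcard matches with no separator.
    return "".join(match.groups())
```

Under these assumptions, the keyword “register team *” turns “Register Team 2” into the answer “2”, and the plain keyword “cows” still turns “27 cows” into “27”, so registration needs no special case. The extracted answer would then go through the existing answer comparison.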

Now back to real work. Hopefully, I can get to coding this new system tonight after dinner.