Assumptions / घारणा

In the team I recently started working with, I have been advocating for code reviews, best practices, tests and CI/CD. I feel that in the Python ecosystem it is hard to ship reliable code without these practices.

I was assigned an issue to extract features from text, which meant extracting money figures. I started searching for existing libraries (humanize, spacy, numerize, advertools, price-parser). Each of these libraries hit an edge case with my requirements. I drafted an OpenAI prompt and got a decent regex pattern that covered most of my requirements. I made a few improvements to the pattern and wrote unit and integration tests to confirm that the logic covered everything I wanted. So far so good. I got the PR approved, merged and deployed, only to find that the code didn't work: it was breaking in the staging environment.

As the prevailing wisdom about regular expressions goes, based on a chestnut from 1997:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

I have avoided regular expressions, and here I am.

I was getting the following stack trace:

File "/lib/number_parser.py", line 19, in extract_numbers
pattern = re.compile("|".join(monetary_patterns))
File "/usr/local/lib/python3.10/re.py", line 251, in compile
return _compile(pattern, flags)
File "/usr/local/lib/python3.10/re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/local/lib/python3.10/sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "/usr/local/lib/python3.10/sre_parse.py", line 955, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/local/lib/python3.10/sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "/usr/local/lib/python3.10/sre_parse.py", line 672, in _parse
raise source.error("multiple repeat",
re.error: multiple repeat at position 11

Very confusing. It seemed there was an issue with my regex pattern. But the logic worked; I had tested it. The pattern would fail for a certain type of input and work for others. What gives? I shared the regex pattern with a colleague and he promptly identified the issue: there was a redundant + in my pattern. I wanted to match whitespace and had used the wrong pattern, r'\s*+'.
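The failure can be reproduced in isolation. The snippet below is a minimal sketch, not the code from number_parser.py; check_pattern is a hypothetical helper that tries to compile a pattern and returns the error message, if any. One fact worth keeping in mind while reading the output: Python 3.11 added possessive quantifiers to the re module, so \s*+ compiles there, while Python 3.10 (the version in the stack trace) rejects it as a multiple repeat.

```python
import re
import sys


def check_pattern(pattern):
    """Return None if `pattern` compiles, otherwise the re.error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# The pattern with the redundant + removed: zero or more whitespace characters.
print(check_pattern(r"\s*"))

# The faulty pattern. On Python 3.10 and older this is a "multiple repeat"
# error; on 3.11 and newer it compiles as a possessive quantifier.
print(sys.version_info[:2], check_pattern(r"\s*+"))
```

Running the same two lines under different interpreters is a quick way to see whether a pattern is valid everywhere the code will be deployed.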

I understand that Python is an interpreted, dynamically typed language, and that's why I wrote those tests (unit AND integration) and containerized the application to avoid the wat. And here I was, despite all the measures and all the preaching of best practices, facing such a bug for the first time. I assumed that the interpreter would do its job and complain about the buggy regex pattern, and that my tests would fail. Thanks to Punch, we dug further into this behavior here.
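One way to make the interpreter complain early is a test that compiles every pattern on its own, so a failure names the offending pattern instead of a position inside the joined string. This is only a sketch under assumptions: the two patterns in monetary_patterns are made-up stand-ins (the real list is not shown in this post), and find_broken_patterns is a hypothetical helper. A check like this only vouches for the interpreter that runs it, so it has to run on the same Python version as staging (3.10 in the stack trace).

```python
import re

# Hypothetical stand-ins; the real monetary_patterns list is not shown here.
monetary_patterns = [
    r"\$\s*\d+(?:\.\d+)?",
    r"\d+\s*(?:USD|EUR|INR)",
]


def find_broken_patterns(patterns):
    """Compile each pattern separately and collect (index, pattern, error)."""
    broken = []
    for index, pattern in enumerate(patterns):
        try:
            re.compile(pattern)
        except re.error as exc:
            broken.append((index, pattern, str(exc)))
    return broken


def test_monetary_patterns_compile():
    # Fails with the list of offending patterns if any of them is invalid
    # on the interpreter running the tests.
    assert find_broken_patterns(monetary_patterns) == []
```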

A friend of mine, Tejaa, had shared a regex resource, https://regex-vis.com/, which draws a visual representation (a state diagram of sorts) of the grammar. I tested my faulty regex pattern with \s*+ and the site reported: Error: nothing to repeat. This is better: the error is similar to what I was seeing in my stack trace. I also tested the fixed pattern and the site showed a correct representation of what I wanted.

Always confirm your assumptions.

assumption is the mother of all mistakes (fuckups)