The Surprising Difficulty of a Simple String Problem

The Setup

This story revolves around a feature I added to a recent side project, one which checks formatting of code comments based on opinionated rules. The feature was to add a heuristic check to see if a comment occurred after some code, because one of these rules is that comments should be on their own line. However, the main situation we're trying to avoid with this check is accidentally flagging a line as containing a trailing comment when the comment marker actually occurs within a string:

x = "# This is fine"
x = "" # This is not

Thus began my effort to reach the current implementation of is_likely_in_str():

def is_likely_in_str(line, index):
    """
        Return `True` if the `index` is likely to be within a string,
        according to the handling of strings in typical programming
        languages. Index `0` is considered to be outside a string.
    """

This seemed to have a fairly trivial implementation to start off with:

def is_likely_in_str(line, i):
    cur_str_char = None
    for char in line[:i]:
        if char in ['\'', '"', '`']:
            cur_str_char = char if cur_str_char is None else None
    return cur_str_char is None

And it worked! It purposefully doesn't handle some obvious cases such as comment markers in multi-line strings, but running this heuristic on the moderately large codebase that I was testing with actually yielded zero false positives.

That was a nice start, but the main issue I wanted to account for for the future was escaped delimiters:

char* eg = "Quoting \"http://localhost\" in C would be a bad idea here";

Handling this was a little trickier than I expected, so I decided to do up some unit tests to experiment with different implementations.

Simple Unit Testing

Writing up the unit tests was a straightforward affair. I opted for a simple, table-driven approach where I would have a source string paired with a list of booleans, where each boolean represented whether is_likely_in_str() should return True for that index:

TESTS = [
    (
        "x = 'abc'",
        [False, False, False, False, False, True, True, True, True, False],
    ),

Straightforward and simple, but borderline unusable; writing such definitions by hand would be tedious and error prone, and figuring out what those lists should look like in the case of a bug would be likely to just add oil to a flame. Even a more compact representation doesn't quite do it, since the misalignment between boundary indices versus actual indices still obscures the intent:

    (
        "x = 'abc'",
        "0000011110",
    ),

So... Abstract it! We can process the test case with a function to produce our source string and our list of target booleans. With that we can make it very clear at a glance what sections we expect to be inside a string and what ones should be outside:

    ["x = '", "abc", "'"],

And here we arrive at the first simple string problem that turned out to be surprisingly difficult:

def is_in_strs(substrs):
    """
        Return a list containing a boolean for each character boundary in
        `''.join(substrs)`, where each element is `True` if it's "inside a
        string" and `False` otherwise.
    """

The First Hurdle

Due to the representation, it seemed straightforward that I'd just toggle the state of in_str between each section and append a copy of the state for each character. I started with the boundary condition of [False] since it's assumed that a line starts off by not being within a string:

def is_in_strs(substrs):
    in_str = False
    in_strs = [in_str]
    for substr in substrs:
        for char in substr:
            in_strs.append(in_str)
        in_str = not in_str
    return in_strs

Ah, but how naive programming at 9:00 PM on a Thursday can make you when you're no longer in college:

is_in_strs(["x = '", "abc", "'"])
# We get this: 0000001110
# But we want: 0000011110
#                   ^

Moreover, is_in_strs(["x = '", "", "'"]) will give us all Falses! So I focused on this case with the empty string. It became apparent that there were a few subtleties to this problem that I should take into consideration:

In this representation some character boundaries theoretically get represented twice. For instance, in ["x = '", "1"], the boundary between ' and 1 gets represented once by the end of the first substring and again by the start of the second substring.
There is a bit of asymmetry in how delimiters are handled. In the ["x = '", "abc", "'"] sequence the first ' signals that a change in state occurs after ', but the second ' signals that that change should have occurred before the '. Changing the representation to something like ["x = ", "'abc", "'"] may have helped make things more consistent for processing, but at the cost of being subjectively less readable.

The next iteration focused on the issue with the empty substring:

    in_strs = [in_str]
    for substr in substrs:
        if len(substr) == 0:
            in_strs[-1] = not in_strs[-1]
        else:
            in_strs += [in_str] * len(substr)

One small improvement here was simplifying for char in substr: in_strs.append(in_str) into in_strs += [in_str] * len(substr), but other than that this solution didn't present much progress - it handled the specific case of an empty substring (poorly), but nothing else.

So, I tried breaking the problem down further on paper.

  _________   _   _
  x   =   '   a   '       len=5
F   F   F   T   T   F

I represented the desired output for this in sequences of Ts and Fs. The above would be Fx3 + Tx2 + Fx1 for a string with substrings of length 3 1 1, and the earlier example with an empty string of lengths 3 0 1 would give Fx3 + Tx1 + Fx1. I then separated out the boundary values in the target output, and plotted out different examples, such as the following:

_____   _        3     0     1
x = '   ' = F + Fx2 + Tx1 + Fx0 + F

_____ _ _        3     1     1
x = ' a ' = F + Fx2 + Tx2 + Fx0 + F

_ _____ _____ _____ _        1     3     3     3     1
' x = 1 ' = ` a = 2 ` = F + Fx0 + Tx4 + Fx2 + Tx4 + Fx0 + F

At this point I just eyeballed the different examples to see if a pattern emerged. Mercifully it did, and it boiled down to: "if a substring represents a string, add 1 to the number of Ts it has, otherwise subtract 1 from its Fs." This gave me:

def is_in_strs(substrs):
    in_str = False
    in_strs = [in_str]
    for substr in substrs:
        mod = 1 if in_str else -1
        in_strs += [in_str] * (len(substr) + mod)
        in_str = not in_str
    return in_strs

Fairly straightforward, but luckily a test alerted me to an edge case that I overlooked:

is_in_strs(["x = '"])
# We get this: 00000
# But we want: 000001
#                   ^

I'll admit it though; my fix for this was just the first thing to come to my head that I thought would get tests passing:

    if in_str:
        in_strs += [False]
    return in_strs

Thankfully (or, suspiciously?) this passed all the tests. I ignored the "seat-of-the-pants" approach that I used to arrive at this answer for now. The implementation that I actually added to my project also contained some special string splitting logic to let me mark expected substrings in a more readable fashion. With that, I was finally able to make a readable list of test cases with which to validate my is_likely_in_str() function (the trailing comments here are only presented for clarification and aren't in the actual source code!):

"x = '(abc)'"   # `abc`  is in a string
"x = '(ab)'c"   # `ab`   is in a string
"x = '(ab\'c"   # `ab'c` is in a string
"x = '(ab\\)'c" # `ab\`  is in a string

Out of the Woods?

With the unit tests finally giving results that could be manually verified, I moved on to the more straightforward implementation of the original is_likely_in_str() function:

def is_likely_in_str(line, index):
    cur_str_char = None
    in_str = lambda: cur_str_char is not None

    for i, char in enumerate(line[:index]):
        if char in ['\'', '"', '`']:
            if in_str():
                escaped = i > 0 and line[i-1] == '\\'
                double_escaped = i > 1 and line[i-2] == '\\'
                if char == cur_str_char:
                    if escaped:
                        if double_escaped:
                            cur_str_char = None
                    else:
                        cur_str_char = None
            else:
                cur_str_char = char
    return in_str()

I felt this to be a fairly neat implementation, and the logic matched my intuition. There wasn't even much added to account for escaping - if we find our opening delimiter then we make sure that it isn't escaped, and if it is escaped then we further make sure that the escape isn't escaped. Makes sense and reads logically, the only real problem with it is that it doesn't work.

Thankfully I had a nice little suite of test cases to remind me of how brazen it was of me to think that I could write a line-processing function on my first try. One test case was very helpful in pointing out the fact that I had completely overlooked tracking of the current escaped state in the loop:

    "x = '\\\\\\'",

Ignoring the escaping that was added for Python's sake, this string is stored as x = '\\\'. In this test case, the first pair of backslashes should "cancel out", and the third backslash should escape the final '. As such, the string is never closed, and so the final boundary index of this string should be considered to be "in a string". However, the current is_likely_in_str() implementation gives us the opposite result in this case. So the next thing is to actually keep a record of whether the current character is escaped or not:

def is_likely_in_str(line, index):
    cur_str_char = None
    in_str = lambda: cur_str_char is not None
    escaped = False

    for i, char in enumerate(line[:index]):
        if char in ['\'', '"', '`']:
            if in_str():
                if char == cur_str_char and not escaped:
                    cur_str_char = None

                if char == '\\':
                    escaped = not escaped
                else:
                    escaped = False
            else:
                cur_str_char = char
                escaped = False

    return in_str()

Three more failures...

    "x = '\\'",
    "x = '\n\\'",
    "x = '\\\\\\'",

The end of all of these test cases should be considered to be in a string, but each returned False. Naturally so, as this implementation was only checking whether the current character was \ after it had already confirmed that it was a string delimiter.

Also, my notes start giving up somewhat around here. I guess it's difficult to keep a log of how badly you're being beaten by what could conceivably appear in the warm-up round of a programming competition. But there is a happy ending to this story in that, as bruised as my ego was, I did finally arrive at the simple solution that I had set out to achieve, by taking the escape check out of the delimiter check:

def is_likely_in_str(line, index):
    cur_str_char = None
    escaped = False

    def in_str():
        return cur_str_char is not None

    for i, char in enumerate(line[:index]):
        if char in ['\'', '"', '`']:
            if in_str():
                if char == cur_str_char and not escaped:
                    cur_str_char = None
            else:
                cur_str_char = char

        escaped = in_str() and char == '\\' and not escaped

    return in_str()

Finally, my tests were figuratively green. The only thing left to do was to write a blog post to lend a little inspiration to the next generation - to say, yes, if you study hard at college, work at a variety of software development companies and invest significant portions of your free time working on hobby code and learning new programming languages and technologies, you too can spend the better part of a week failing to write 20-line string processing function for a hobby project.

But that's not to say that I didn't have a lot of fun doing it.

Postscript

Why `is_in_strs()` works

The last bit of experimentation on is_in_strs() seemed to result in a somewhat magical solution that "just worked" according to my test cases. That's fine during research but it's always better to know why code is working rather than leaving it up to chance. Examining this implementation a bit more closely, the logic of why it works follows quite naturally.

First of all, the initial False we added to our result list accounts for the fact that we're working with boundary indices and not regular indices, so we want to end up with len(str) + 1 elements in our list of booleans.

Second of all, this procedure effectively performs the string balancing that was observed earlier. That is, if we had an input like ["x = ", "'abc", "'"], then the final is_in_strs() could be simplified to the following:

def is_in_strs(substrs):
    in_str = False
    in_strs = [in_str]
    for substr in substrs:
        in_strs += [in_str] * len(substr)
        in_str = not in_str
    return in_strs

Or even the following, at the expense of readability:

def is_in_strs(substrs):
    groups = [[i % 2 == 1] * len(s) for i, s in enumerate(substrs)]
    return [False] + [v for group in groups for v in group]

If we then consider the desired representation (["x = '", "abc", "'"]) in relation to this "logical" layout, we can think of it as moving the opening delimiter from the "inside a string" substring to the "outside a string" one. At this point we account for that fact by modifying the length of the different substrings. We also append an extra False in the case that we end outside a string (confusingly checked using if in_str: in the given solution), because we'll have removed the final delimiter from the length of the string and this will need to be accounted for.

Testing `is_in_strs()`

The is_in_strs() function that I was going to use for testing ended up being surprisingly complicated, so realistically it needed its own tests. I used a small set of doctests for this.