Notebook

Match and replace text.

Let's say someone gave you a repo where a class A has a property A.bb, but someone also decided to put in A.b and A.bbb.

You need to refactor some names and rename bb to c.

The built-in text.replace(old, new, count) doesn't really play well with code like this:

In [1]:

text = """
A.b
A.bb
A.bbb
"""

In [2]:

print(text.replace('bb', 'c'))

A.b
A.c
A.cb

As you can see, the property A.bbb became A.cb, which defeats the purpose.

You could of course search for the full A.bb instead of just bb, but that just plays on the specific example I've been able to come up with.
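In this particular example, widening the search string to the full A.bb is not enough anyway, because A.bbb also starts with A.bb. A quick check, using nothing but str.replace (the variable name result is only for illustration):

```python
text = """
A.b
A.bb
A.bbb
"""

# 'A.bbb' begins with 'A.bb', so the longer property is still rewritten.
result = text.replace('A.bb', 'A.c')
print(result)
```

The last line of the output is still A.cb, so a longer search string alone does not protect A.bbb.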

A better option: pass in the namespace, i.e. the list of names that the target must not be confused with.

Let's start with a very generic test case:

In [3]:

text = '111011010110111'

We want to replace 11 in the text with 2, but only where 11 is not part of 111, so it becomes:

In [4]:

text = '1110201020111'

We could expand the test case to include replacements which are shorter than, equal to, and longer than the original 11 we are looking for:

In [5]:

def test_1():
    name_space = ['111', '11', '1']
    text = '111011010110111'
    old = '11'

    expected = "1110{n}010{n}0111"
    for new in ['2', '22', '222']:
        output = replace(text, old, new, name_space)
        assert output == expected.format(n=new)

To satisfy the test case, we will only need the replace function:

In [6]:

def replace(text, old, new, name_space):
    if not old:
        raise ValueError("old must be a non-empty string")
    if name_space and old not in name_space:
        raise ValueError(f"{old} is not in the namespace")

    # Keep only the names that contain `old`; the others can never shadow it.
    reduced_name_space = [n for n in name_space if old in n]
    # Build a bytemap over the text: 0 = unclaimed, 1 = part of a longer name
    # that contains `old` (must not be overwritten), 2 = a genuine match.
    bytemap = [0] * len(text)
    # Longest names first, so that they claim their characters before `old`.
    names = sorted(reduced_name_space, key=len, reverse=True)
    for name in names:
        value = 2 if name == old else 1

        index = text.find(name)
        while index != -1:
            end = index + len(name)
            if not any(bytemap[index:end]):
                # The whole span is unclaimed: claim it and continue after it.
                bytemap[index:end] = [value] * len(name)
                index = text.find(name, end)
            else:
                # Part of the span is already claimed: step one character on.
                index = text.find(name, index + 1)
    # At this point the only spans marked with 2 are genuine matches of `old`.
    if 2 not in bytemap:
        raise ValueError(f"{old} not found")

    new_text = []
    start = 0
    while True:
        try:
            end = bytemap.index(2, start)
        except ValueError:
            # No more matches: keep the remainder of the text.
            new_text.append(text[start:])
            break
        new_text.append(text[start:end])
        new_text.append(new)
        start = end + len(old)
    return "".join(new_text)

Now the test will work:

In [7]:

name_space = ['111', '11', '1']
text = '111011010110111'
old = '11'

expected = "1110{n}010{n}0111"
for new in ['2', '22', '222']:
    output = replace(text, old, new, name_space)
    assert output == expected.format(n=new), output
    print(old, "-->", new, "=", output)

11 --> 2 = 1110201020111
11 --> 22 = 111022010220111
11 --> 222 = 11102220102220111
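To close the loop on the opening A.b / A.bb / A.bbb example, here is a compact alternative sketch built on the standard re module. It is not the replace function from this notebook: the name replace_with_regex is hypothetical, and it scans left to right with a longest-first alternation instead of building a bytemap, so it may behave differently on overlapping names. It does agree on the two examples used here.

```python
import re

def replace_with_regex(text, old, new, name_space):
    """Hypothetical regex-based variant; not the notebook's implementation."""
    if old not in name_space:
        raise ValueError(f"{old} is not in the namespace")
    # Longest names first, so the alternation prefers e.g. 'bbb' over 'bb'.
    names = sorted((n for n in name_space if old in n), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    # Swap in `new` only when the matched name is exactly `old`;
    # any longer name is written back unchanged.
    return pattern.sub(lambda m: new if m.group(0) == old else m.group(0), text)

text = """
A.b
A.bb
A.bbb
"""
result = replace_with_regex(text, 'bb', 'c', ['b', 'bb', 'bbb'])
print(result)
```

This time A.bb becomes A.c while A.b and A.bbb are left alone.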

Some comments:

The code is reasonably performant, as the name_space is first reduced to the names which contain the old string.
As the search starts with the longest name that contains old, longer names claim their characters first, so old is not replaced inside a longer name.
Updating the bytemap (one list entry per character, so not a true bitmap) is also cheap, as we only write to it when we have a match.
The replacement operates on "chunks": each slice copies only the unchanged text between two matches, so nothing is rebuilt character by character.
Finally we use .join to generate the output string in one go.
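To make the bytemap from the comments above concrete, the following sketch prints it for the test text. The helper mark is a hypothetical, simplified stand-in for the marking step inside replace, written only for illustration: 0 means unclaimed, 1 means part of a longer name, and 2 means a genuine match of old.

```python
def mark(text, name_space, old):
    """Hypothetical helper: return the bytemap for `old` within `text`."""
    bytemap = [0] * len(text)
    names = [n for n in name_space if old in n]
    for name in sorted(names, key=len, reverse=True):
        value = 2 if name == old else 1
        index = text.find(name)
        while index != -1:
            end = index + len(name)
            if not any(bytemap[index:end]):
                # Unclaimed span: claim it and search on after it.
                bytemap[index:end] = [value] * len(name)
                index = text.find(name, end)
            else:
                # Already claimed by a longer name: move one character on.
                index = text.find(name, index + 1)
    return bytemap

text = '111011010110111'
rendered = "".join(str(v) for v in mark(text, ['111', '11', '1'], '11'))
print(text)
print(rendered)
```

The second printed line is 111022000220111: the two runs of 2 are the only spans that get replaced, while the 111 at either end is protected by the 1s.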

For my tooling needs, I'm satisfied.