Hands-on Session: Text Replacements in Multiple Files

First published — Nov 02, 2023
Last updated — Nov 02, 2023
#hands-on #perl #git #shell

Making automated replacements in text files. Shows Perl, regular expressions, Git, diff, and other Unix shell commands. Quiz.

Introduction

Recently I was working with OpenAPI Generator. The task at hand was to take an OpenAPI v3 spec file and run the generator on it to produce an OpenAPI client for Python.

The generator also produced test files. However, the actual lines in the test files were commented out (disabled), so to make the tests run, the comment characters had to be removed.

As there were ~350 files, I was certainly going to run a one-liner for this job.

This article describes a very simple and powerful way to perform these (or arbitrary other) text replacements in multiple files.

The Problem

OpenAPI Generator has produced around 350 test files, all with the same general content. Here is an example of one, in which only the structure is important:

# coding: utf-8

"""
    API for managing a service

    The version of the OpenAPI document: 1.0
    Generated by OpenAPI Generator (https://openapi-generator.tech)

    Do not edit the class manually.
"""  # noqa: E501


import unittest
import datetime

from the_sdk.the_api.models.action import Action  # noqa: E501

class TestAction(unittest.TestCase):
    """Action unit test stubs"""

    def setUp(self):
        pass

    def tearDown(self):
        pass

    def make_instance(self, include_optional) -> Action:
        """Test Action
            include_option is a boolean, when False only required
            params are included, when True both required and
            optional params are included """
        # uncomment below to create an instance of `Action`
        """
        model = Action()  # noqa: E501
        if include_optional:
            return Action(
                value = 'GET'
            )
        else:
            return Action(
                value = 'GET',
        )
        """

    def testAction(self):
        """Test Action"""
        # inst_req_only = self.make_instance(include_optional=False)
        # inst_req_and_optional = self.make_instance(include_optional=True)

if __name__ == '__main__':
    unittest.main()

(If you want to follow the tutorial hands-on, feel free to save this file locally and even make multiple copies of it, to simulate having more than one file.)
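For readers who prefer a scripted setup, here is a minimal sketch of such a sandbox. The file names (test_action.py and so on) and the shortened stub content are made up for illustration; the real generated files are longer, but have the same structure.

```shell
# Create a throwaway sandbox with three copies of a reduced test stub.
# All names below are hypothetical; only the structure of the file matters.
sandbox=$(mktemp -d)
mkdir "$sandbox/tests"
for name in action account address; do
  cat > "$sandbox/tests/test_$name.py" <<'EOF'
# coding: utf-8

import unittest

class TestStub(unittest.TestCase):
    def make_instance(self, include_optional):
        # uncomment below to create an instance
        """
        model = None
        """

    def testStub(self):
        """Test stub"""
        # inst_req_only = self.make_instance(include_optional=False)
        # inst_req_and_optional = self.make_instance(include_optional=True)

if __name__ == '__main__':
    unittest.main()
EOF
done
ls "$sandbox/tests"
```

Working in a directory created by mktemp keeps the experiment away from any real project files.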

As we can see, the important lines in testAction that call make_instance are commented out, so they would not execute.

So the primary task was to go through all the files and uncomment those lines by removing the leading # from them.

The Solution

Preparation

When designing a solution for problems of this type, the steps are usually:

  1. Determine the general class of problem we are solving. (In our case, it is “making automated replacements in multiple files at once”)

  2. Decide which tool or approach to use for the actual text file edits. (In our case, we are going to show using Perl)

  3. Decide how to select the files which should be changed. (Simple in this case – just tests/*.py)

  4. Figure out how to review the changes, undo them, and try the improved procedure again if needed. (Explained in more detail below)

  5. Carry out the task

Tools

  1. Perl is a scripting language unsurpassed in its ability to perform complex text filtering and editing tasks in only a couple of lines of code, often in a single line, called a oneliner. So we are going to use Perl for editing. (We could also use old-school Unix languages like AWK or sed, but Perl can do all they can do and much more.)

  2. To select the files which should be transformed, we don’t even have to think about it – since we want to apply the transformation on all Python files in the directory tests/, we are simply going to specify tests/*.py.

  3. And in terms of verifying our changes, we are going to place the directory tests/ under temporary Git revision control, and then we will be able to run git diff later to inspect the changes. In the case of incorrect edits, we will be able to just run git reset --hard to undo the changes and try again.

Action

We will enter the directory tests/ and quickly add it under local/temporary Git version control:

cd tests
git init
git add .
git commit -am "Initial"

Then we are going to run a Perl oneliner to carry out replacements in the text.

perl -pi -e's/^(\s*)# /$1/g' *.py

Running this line already does most of the work, which shows how powerful Unix text editing capabilities are. The review will show that it also makes a few unintended edits, so we are going to continue discussing it and go through 2 more iterations to refine it.

First, a few explanations of the above line:

  1. Option -p causes Perl to run the script (either a script file or an inline oneliner) on every record (which by default is every line) of the input file(s). At the end of the script (that is, after processing every line), it implicitly prints out the resulting line. (For the same behavior but without printing, one would use option -n instead.)

  2. Option -i instructs Perl to do in-place edits of files. That means we will be reading from files, and thanks to the mentioned option -p we will automatically be writing transformed lines back to the files.

  3. Option -e specifies the Perl script directly on the command line, without requiring it to be in a separate file.

  4. And the part 's/^(\s*)# /$1/g' is our actual script. It uses regular expressions and says:

    1. Substitute all occurrences (s///g). Because the pattern is anchored with ^, it can match at most once per line, so the g modifier is harmless here but not strictly needed

    2. Of zero or more whitespace characters at the beginning of a line, followed by a literal # and a space (^(\s*)# )

    3. With only that whitespace (to preserve it), but without the # and the space ($1)

  5. And do so for all files matching glob pattern *.py

Review

The above line has executed instantly, and we need to review the results for correctness.

We can review the changes simply by running git diff. It will show us a series of changes in a format called unified diff, where lines beginning with - are lines removed from the old version, and lines beginning with + are the lines added in their place:

git diff

@@ -1,4 +1,4 @@
-# coding: utf-8
+coding: utf-8

 """
     API for managing a service
@@ -32,7 +32,7 @@
             include_option is a boolean, when False only required
             params are included, when True both required and
             optional params are included """
-        # uncomment below to create an instance of `Action`
+        uncomment below to create an instance of `Action`
         """
         model = Action()  # noqa: E501
         if include_optional:
@@ -47,8 +47,8 @@

     def testAction(self):
         """Test Action"""
-        # inst_req_only = self.make_instance(include_optional=False)
-        # inst_req_and_optional = self.make_instance(include_optional=True)
+        inst_req_only = self.make_instance(include_optional=False)
+        inst_req_and_optional = self.make_instance(include_optional=True)

 if __name__ == '__main__':
     unittest.main()

Right off the bat we see that our script has unexpectedly removed the # in front of lines like # coding: utf-8. We did not intend to modify those lines.

We notice that this pattern is present in all files right at the beginning of the line (without any whitespace preceding it), while the edits we actually want to make are always prefixed with some whitespace.

So we are going to revert our changes and repeat the replacements, but this time expecting not “zero or more whitespace” but “one or more whitespace” in front of # .

Action

We reset the files to their original:

git reset --hard

And we specify “one or more” instead of “zero or more” in regular expressions by changing * to +:

perl -pi -e's/^(\s+)# /$1/g' *.py

Review

Edits have again been completed instantly, and we need to review them.

git diff

@@ -32,7 +32,7 @@
             include_option is a boolean, when False only required
             params are included, when True both required and
             optional params are included """
-        # uncomment below to create an instance of `Action`
+        uncomment below to create an instance of `Action`
         """
         model = Action()  # noqa: E501
         if include_optional:
@@ -47,8 +47,8 @@

     def testAction(self):
         """Test Action"""
-        # inst_req_only = self.make_instance(include_optional=False)
-        # inst_req_and_optional = self.make_instance(include_optional=True)
+        inst_req_only = self.make_instance(include_optional=False)
+        inst_req_and_optional = self.make_instance(include_optional=True)

 if __name__ == '__main__':
     unittest.main()

The original problem has been fixed, but in the output we now see one more unintended edit: the line # uncomment below to create an instance of `Action` has lost its # as well.

That line is a genuine Python comment. It sits just before a block of code that the generator disabled by wrapping it in a triple-quoted string ("""...""").

With the # removed, the line is no longer a comment and is not valid Python, so the file would fail to parse. Apart from that, the edit is also doubling the size of our diff, so we want to avoid making those changes.

We need another iteration.

Action

We reset the files to their original:

git reset --hard

And we modify our oneliner to only make the replacements if the text after # does not begin with the word “uncomment”.

We do this by using a regular expression feature called a “negative lookahead assertion”, written as (?!...). It specifies text that must not follow the matched part:

perl -pi -e's/^(\s+)# (?!uncom)/$1/g' *.py

Review

We do another git diff and verify that the results look exactly like we wanted:

git diff

@@ -47,8 +47,8 @@

     def testAction(self):
         """Test Action"""
-        # inst_req_only = self.make_instance(include_optional=False)
-        # inst_req_and_optional = self.make_instance(include_optional=True)
+        inst_req_only = self.make_instance(include_optional=False)
+        inst_req_and_optional = self.make_instance(include_optional=True)

 if __name__ == '__main__':
     unittest.main()

At this point we can save the diff to a file if needed (git diff > activate-tests.patch) and remove the local/temporary Git directory we have created in the directory tests/ (rm -rf .git).

The patch file can then be copied and applied to any tree of unmodified tests/ files. This would be done with a command such as patch -p1 < activate-tests.patch, run from within the directory tests/.

Quiz

A “diff” or a difference between two files always needs two states (old and new) to show the actual differences.

We have conveniently used Git for this purpose – for each file, git diff showed us a comparison between the last committed state in Git (in .git/) and the current/actual content of the file on disk.

But could we have used something else?

Yes. Instead of using Git and git diff to show diffs between the state saved in the temporary .git/ directory and the latest state on disk, we could have made a copy of the tests/ directory, such as cp -a tests tests,orig.

Then we could have checked for differences between all files in that directory and files in our changed/updated directory with a command such as diff -ruNP ../tests,orig/ ./.

(Note that diff shown here is a standalone program, and not an option/subcommand of git. In fact, diff is the original program; git diff only mimics the functionality and output of diff -u.)

If we used that approach, what would the undo procedure look like?

In that case, we could undo the changes by deleting our directory tests/ and copying it from tests,orig/ again (cd .. && rm -rf tests && cp -a tests,orig tests && cd tests).

Another creative option would be to produce the diff output and apply it, but in the reverse direction (from new to old, instead of from old to new). We would use a command such as diff -ruNP ../tests,orig/ ./ | patch -Rp1 for that.

Finally, is there an alternative way in which we could have avoided modifying the # coding: utf-8 lines?

Yes. If you recall from the explanations above, we have identified that # coding: utf-8 is always found at the beginning of line, while all our intended edits are prefixed by some whitespace. So our solution was to simply require that one or more whitespace characters were found before the # .

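However, we could have also noticed that # coding: utf-8 is always the first line in the file. So instead of the approach we used, we could have told Perl to only do the replacements if the current line number in the file is greater than 1 ($. > 1 in Perl notation, or $INPUT_LINE_NUMBER > 1 with use English).

One caveat is that Perl does not reset $. when it moves on to the next file given on the command line, so by default the line count keeps growing across files and only the first file would be protected. To get per-file line numbers, we close the input handle at the end of each file with close ARGV if eof. The complete oneliner then places a condition on the line number in each file: perl -pi -e's/^(\s*)# (?!uncom)/$1/g if $. > 1; close ARGV if eof' *.py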

Automatic Links

The following links appear in the article:

1. AWK - https://en.wikipedia.org/wiki/AWK
2. Regular Expressions - https://en.wikipedia.org/wiki/Regular_expressions
3. Sed - https://en.wikipedia.org/wiki/Sed
4. OpenAPI Generator - https://openapi-generator.tech/
5. Negative Lookahead Assertion - https://perldoc.perl.org/perlre#Lookaround-Assertions
6. Unified Diff - https://www.gnu.org/software/diffutils/manual/html_node/Unified-Format.html
7. Perl - https://www.perl.org/