GNU/Linux includes many utilities for working with text files through the shell. In this post we take a quick look at accessing and manipulating text files in a “column-wise” mode.
Suppose you have the following two files, each with two columns separated by the TAB character.
$cat file1
Alice Paris
Bob Tokyo
Mary London
John New York
$cat file2
13 May Orange
19 Oct Blue
11 Nov Black
29 Feb Red
The data in the two files are in fact related, i.e. file2 contains the date of birth and favourite colour of the people mentioned in file1 (assuming also that the files are sorted correctly). It would make sense to combine the two files together so that each row has the full data for each person. The paste
command does just that.
$paste file1 file2 > file3
$cat file3
Alice Paris 13 May Orange
Bob Tokyo 19 Oct Blue
Mary London 11 Nov Black
John New York 29 Feb Red
Suppose that we are only interested in the name and date of birth of each person, and we can discard the hometown and favourite colour information. The cut
command is what we shall use:
$cut file3 -f 1,3 > file4
$cat file4
Alice 13 May
Bob 19 Oct
Mary 11 Nov
John 29 Feb
Our next and final requirement is to reorder the columns differently. Instead of having the name followed by date of birth, suppose we want to have the columns the other way round. Unfortunately cat -f 3,1
produces exactly the same output as cut -f 1,3
, so the cut
command will not be sufficient. We have to use sed
instead.
$sed -e 's/([^t]*)t([^t]*)/2t1/' file4 > file5
$cat file5
13 May Alice
19 Oct Bob
11 Nov Mary
29 Feb John
How does that work? Well ([^t]*) is a “named expression” which matches all characters except TAB. The search pattern looks for two of them, separated by TAB (t). In the replace-with part, they are referred to as 2 and 1, again separated by t.
Of course if file5 was what we ultimately wanted from the beginning as our output, we could have simply piped commands together:
$paste file1 file2 | cut -f 1,3 | sed -e 's/([^t]*)t([^t]*)/2t1/' > file5
or alternatively
$paste file1 file2 | sed -e 's/([^t]*)t([^t]*)t([^t]*)t([^t]*)/3t1/' > file5