Linux three swordsmen: efficient data analysis

Keywords: Data Analysis

1. What are the Linux three swordsmen

First of all, let's find out what the Linux three swordsmen are.

  1. The first tool is grep (global regular expression print). It searches each file for lines that match a pattern, that is, a regular expression, and prints the matching lines.
  2. The second tool is awk, named after the initials of its three authors (Aho, Weinberger, Kernighan). It splits each located line into fields and processes them.
  3. The third tool is sed, short for stream editor. It filters text from its input and can add, delete, modify, and print the selected lines.

Used together, these three tools cover many data analysis scenarios in the shell, so people collectively call them the Linux three swordsmen.

2. What are the Linux three swordsmen used for

Next, we compare the three swordsmen with SQL to see what they can do.

  1. grep is roughly equivalent to SQL select * from table where column like '%XX%'. It searches for and locates rows of data.
  2. awk is roughly equivalent to SQL select columns from table. It slices rows into columns.
  3. sed is roughly equivalent to SQL select columns from table where columns = XX and update table set columns=new where columns=old. It can query data conditionally or modify it.

As you can see, grep and awk can be combined to find and slice data, and grep and sed can be combined to find and modify data. All three can also be chained to complete a series of operations, somewhat like MapReduce in big data processing. Let's look at how to use them.
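As a rough sketch of the SQL comparison, the commands below run one query of each kind against a small three-line sample file. The file contents and the replacement word school are assumptions made for illustration, and the SQL in the comments is only an approximate analogy, not an exact translation.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

# Roughly: select * from test where line like '%seven%'
rows=$(grep 'seven' test.txt)

# Roughly: select column3 from test
cols=$(awk '{print $3}' test.txt)

# Roughly: update test set column3 = 'school' where column3 = 'hogwarts'
# (sed prints the modified text; the file itself is not changed)
updated=$(sed 's/hogwarts$/school/' test.txt)

printf '%s\n---\n%s\n---\n%s\n' "$rows" "$cols" "$updated"
```

Each command reads the same file and prints to standard output, which is what makes it possible to chain them with pipes later on.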

3. How to use Linux three swordsmen

We first create a file that contains three lines of data and then use its contents for the demonstrations.

# Method 1: vim test.txt (create a new file in the current directory and enter the following three lines)
hello from hogwarts
hello from sevenriby
hello from testerhome
# Method 2: use the echo command, where the -e option enables interpretation of backslash escapes such as \n
echo -e 'hello from hogwarts\nhello from sevenriby\nhello from testerhome' > test.txt

# -------------------------Check-------------------------
#rosaany@Rosefinch:~$ cat test.txt
#hello from hogwarts
#hello from sevenriby
#hello from testerhome
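echo -e is not handled the same way by every shell (dash, for example, prints the -e itself), so a more portable way to create the same file is printf. This is only an alternative sketch; the scratch directory is an assumption added so the example does not overwrite an existing test.txt.

```shell
# Work in a scratch directory so an existing test.txt is not overwritten.
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT

# printf interprets \n in its format string in every POSIX shell.
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

# Check: count the lines with awk (NR in the END block is the total number of records).
line_count=$(awk 'END {print NR}' test.txt)
echo "lines: $line_count"
cat test.txt
```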

3.1 grep

grep finds the lines that match a regular expression pattern and prints them.

  • Check whether the word hogwarts appears in the file
rosaany@Rosefinch:~$ grep hogwarts test.txt
hello from hogwarts
# The word hogwarts matches, so the whole line is printed
  • Find lines that begin with he
rosaany@Rosefinch:~$ grep '^he' test.txt
hello from hogwarts
hello from sevenriby
hello from testerhome
  • Find lines that contain the letter i or y
rosaany@Rosefinch:~$ grep '[iy]' test.txt
hello from sevenriby
  • Add the -v option to invert the match and print only the lines that do not match
rosaany@Rosefinch:~$ grep -v '[iy]' test.txt
hello from hogwarts
hello from testerhome
  • The -o option prints only the part of each line that matches
rosaany@Rosefinch:~$ grep -o '[iy]' test.txt
i
y
  • The -E option enables extended regular expressions

The regular expressions (patterns) that grep accepts fall into two categories. The first category is basic regular expressions, which include the typical metacharacters:

  1. ^ indicates the beginning of a line;
  2. $ indicates the end of a line;
  3. [] matches any one of the characters listed inside the brackets;
  4. * matches zero or more of the preceding item;
  5. . matches any single character.
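The metacharacters not shown in the earlier examples can be tried on the same three-line file. The patterns below are assumptions chosen for illustration so that each one matches a predictable line.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

ends_with_s=$(grep 's$' test.txt)           # $ anchors the match at the end of the line
any_char=$(grep 'h.g' test.txt)             # . stands for exactly one character (the o in hog)
zero_or_more=$(grep 'hogx*warts' test.txt)  # x* also matches zero occurrences of x
te_count=$(grep -c 'te[rs]' test.txt)       # [rs] matches r or s; -c counts the matching lines

printf '%s\n%s\n%s\n%s\n' "$ends_with_s" "$any_char" "$zero_or_more" "$te_count"
```

The first three commands each print the hogwarts line, and the last one reports that a single line contains tes or ter.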

The second category is extended regular expressions, which extend the basic ones to support richer syntax and more complex conditions:

  1. ? matches zero or one of the preceding item (in grep -E it does not mean non-greedy matching);
  2. + matches one or more of the preceding item;
  3. () indicates grouping;
  4. {} specifies a repetition count or range;
  5. | matches any one of several alternative expressions.
  • Add the -E option to find lines that contain seven or home
rosaany@Rosefinch:~$ grep -E "(seven|home)" test.txt
hello from sevenriby
hello from testerhome
  • Without the -E option, find lines that contain seven or home
rosaany@Rosefinch:~$ grep  "\(seven\|home\)" test.txt
hello from sevenriby
hello from testerhome
# The same effect is achieved by escaping the metacharacters with \ (note that \| in a basic expression is a GNU grep extension)
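The remaining extended operators, ?, + and {}, can be sketched in the same way. The patterns are assumptions picked to give distinct results on the sample file, and -o is combined with -E here only to show exactly which text each quantifier consumed.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

optional=$(grep -E 'hogwarts?$' test.txt)           # s? : the final s may be present or absent
one_or_more=$(grep -Eo 'l+' test.txt)               # l+ : each run of one or more l characters
long_words=$(grep -Eo '[a-z]{9,}' test.txt)         # {9,} : nine or more lowercase letters in a row
grouped=$(grep -Ec 'from (seven|tester)' test.txt)  # () and | : count lines with either alternative

printf '%s\n%s\n%s\n%s\n' "$optional" "$one_or_more" "$long_words" "$grouped"
```

Here l+ yields ll once per line (from hello), and {9,} keeps sevenriby and testerhome but skips the eight-letter hogwarts.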

3.2 awk

awk is a text-processing language with its own interpreter. It is very powerful and has the features of a complete programming language, such as variables, conditions, loops, and functions. It can also run external commands, and gawk can even make network requests.

Next, let's look at the syntax of awk, which takes the form awk 'pattern{action}'. The pattern is the matching condition, and the action is the processing to perform on each record that matches.

  • Write a regular expression between two slashes (/.../) to match lines; the effect is the same as the earlier grep -E example
rosaany@Rosefinch:~$ awk  "/(seven|home)/" test.txt
hello from sevenriby
hello from testerhome

The pattern syntax can replace grep to some extent, but it is less concise.

  • Print the data in row 3
rosaany@Rosefinch:~$ awk 'NR==3' test.txt
hello from testerhome
# NR is the number of the current record (line); NR>=3 would print line 3 and every line after it

The pattern syntax is very rich, and you can practice it yourself after class. awk also has several standard built-in variables.
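To give a feel for that pattern syntax, here is a small sketch with a few common pattern forms. The conditions are assumptions chosen for the three-line sample file, not an exhaustive list.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

exact=$(awk 'NR==2' test.txt)                                # comparison: only record 2
range=$(awk 'NR>=2 && NR<=3 {print NR ": " $3}' test.txt)    # compound condition with an action
by_field=$(awk '$3 ~ /^s/ {print $3}' test.txt)              # regular expression applied to one field
long_names=$(awk 'length($3) > 8 {print $3}' test.txt)       # built-in function used in the pattern
total=$(awk 'END {print NR}' test.txt)                       # END runs once after the last record

printf '%s\n%s\n%s\n%s\n%s\n' "$exact" "$range" "$by_field" "$long_names" "$total"
```

When a pattern has no action, as in the first command, awk prints the whole matching record.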

  1. FS is the input field separator
  2. OFS is the output field separator
  3. RS is the input record separator
  4. ORS is the output record separator
  5. NF is the number of fields in the current record
  6. NR is the number of records read so far
  • Print the record number and the number of fields for each line
rosaany@Rosefinch:~$ awk '{print NR,NF}' test.txt
1 3
2 3
3 3
# Whitespace is the default separator, so each of the three lines has 3 fields
  • Print the record number and field count again, this time with the letter o as the separator (set with the -F option)
rosaany@Rosefinch:~$ awk -Fo '{print NR,NF,$1,$2,$3,$4}' test.txt
1 4 hell  fr m h gwarts
2 3 hell  fr m sevenriby
3 4 hell  fr m testerh me
# $1 through $n hold the fields of the current record; a field that does not exist, such as $4 on line 2, prints as empty
  • You can also set the separator in a BEGIN block
rosaany@Rosefinch:~$ awk  'BEGIN{FS="o"}{print NR,NF,$1,$2,$3,$4}' test.txt
1 4 hell  fr m h gwarts
2 3 hell  fr m sevenriby
3 4 hell  fr m testerh me
# The FS variable specifies the input field separator
  • Set | as the field separator for the output data
rosaany@Rosefinch:~$ awk  'BEGIN{OFS="|"}{print NR,NF,$1,$2,$3,$4}' test.txt
1|3|hello|from|hogwarts|
2|3|hello|from|sevenriby|
3|3|hello|from|testerhome|
  • The output field separator | can also be set by using the OFS assignment as the pattern; this works because the assignment evaluates to a non-empty string, which counts as true
rosaany@Rosefinch:~$ awk  'OFS="|"{print NR,NF,$1,$2,$3,$4}' test.txt
1|3|hello|from|hogwarts|
2|3|hello|from|sevenriby|
3|3|hello|from|testerhome|
  • awk also supports simple arithmetic functions
rosaany@Rosefinch:~$ awk 'BEGIN{print 10/3}'
3.33333
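RS and ORS from the list of built-in variables do not appear in the examples so far. A minimal sketch of both follows; the comma-separated input a,b,c and the ; output separator are assumptions for illustration. The -v option is a standard awk option that assigns a variable before the program starts.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

# ORS: end each printed record with ; instead of a newline, joining the output on one line.
joined=$(awk -v ORS=';' '{print $3}' test.txt)

# RS: treat a comma, not a newline, as the end of each input record.
split=$(printf 'a,b,c' | awk -v RS=',' '{print NR ": " $0}')

printf '%s\n%s\n' "$joined" "$split"
```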

Beyond these, awk also supports associative arrays (dictionaries), which are useful for counting features of the data. They are similar to a Java HashMap or a Python dictionary. awk's syntax is very flexible, and reading its documentation carefully after class will make you more comfortable with data analysis later on.
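A minimal sketch of such counting is a word count over the sample file. The array name count is an assumption, and the final sort is added only because awk does not guarantee any order when looping over an array.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

# count[word] is created on first use; every field of every record is tallied.
counts=$(awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
              END { for (w in count) print count[w], w }' test.txt |
         LC_ALL=C sort -k1,1nr -k2)

printf '%s\n' "$counts"
```

The result lists from and hello with a count of 3, followed by the three words that occur once.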

3.3 sed

The common usage of sed is as follows:

  1. A sed script has the form [addr]X[options], where [addr] is an optional line address or range, X is the command to run, and options are the flags that the command accepts.
  2. -e specifies an expression (script) to run; it can be given more than once.
  3. sed -n '2p' prints only the second line: -n suppresses the default output and 2p prints line 2.
  4. s means find and replace.
  5. -i modifies the source file directly.
  6. -E enables extended regular expressions.
  • Use s to find the first pattern and replace it with the second
rosaany@Rosefinch:~$ sed 's#test#testing#' test.txt
hello from hogwarts
hello from sevenriby
hello from testingerhome
# Either # or / can be used as the delimiter
# testerhome becomes testingerhome
  • Replace every t followed by any two characters with xxx
rosaany@Rosefinch:~$ sed 's/t../xxx/g' test.txt
hello from hogwarts
hello from sevenriby
hello from xxxxxxhome
# The g flag in s/../../g replaces every match on a line, not just the first one
# testerhome becomes xxxxxxhome

To restrict the replacement to a specific line or range of lines, put an address in front of the command:

rosaany@Rosefinch:~$ sed '3,$ s/t../xxx/g' test.txt # The address 3,$ covers line 3 through the last line ($ means the last line)
hello from hogwarts
hello from sevenriby
hello from xxxxxxhome
  • Delete a specified line (here, line 1)
rosaany@Rosefinch:~$ sed '1 d' test.txt
hello from sevenriby
hello from testerhome
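The -n, -e and -i options listed at the start of this section can be sketched as follows. The file name copy.txt and the .bak backup suffix are assumptions; the in-place edit is done on a copy so that test.txt stays intact.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

second=$(sed -n '2p' test.txt)                           # -n suppresses default output; 2p prints line 2
multi=$(sed -e 's/hello/hi/' -e 's/from/at/' test.txt)   # two expressions, applied in order to each line

cp test.txt copy.txt
sed -i.bak '1d' copy.txt                                 # edit copy.txt in place; the original is kept as copy.txt.bak
remaining=$(awk 'END {print NR}' copy.txt)

printf '%s\n%s\n%s\n' "$second" "$multi" "$remaining"
```

Writing the suffix directly after -i keeps a backup and is accepted by both GNU and BSD sed, which is why it is used in this sketch.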

awk focuses more on data extraction, while sed focuses more on data modification. The main role of sed is to add, delete, modify, and query data, using commands such as:

  1. d deletes lines
  2. p prints lines
  3. s finds and replaces
  4. \1, \2 and so on refer back to the groups captured by the match, so the replacement can reuse parts of the matched text
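Group references are easiest to see in a small example. The sketch below rearranges each line of the sample file; the replacement text "says" is an assumption for illustration. The same substitution is written once with -E and once in basic syntax, where the parentheses must be escaped.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT
printf 'hello from hogwarts\nhello from sevenriby\nhello from testerhome\n' > test.txt

# Extended syntax: (...) captures a group; \1 and \3 reuse groups 1 and 3 in the replacement.
extended=$(sed -E 's/^(hello) (from) (.*)$/\3 says \1/' test.txt)

# Basic syntax: the same groups, written as \( ... \).
basic=$(sed 's/^\(hello\) \(from\) \(.*\)$/\3 says \1/' test.txt)

printf '%s\n' "$extended"
```

Both commands produce the same three lines, starting with hogwarts says hello.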

4. Pipeline combination

The pipe symbol | means that the output of the previous command automatically becomes the input of the next command in the shell.

  • Filter lines with awk by field count, then replace text in the result with sed
rosaany@Rosefinch:~$ awk 'NF<2' test.txt | sed 's/t../xxx/g'
# No line has fewer than 2 fields, so awk passes nothing to sed and nothing is printed
rosaany@Rosefinch:~$ awk 'NF>2' test.txt | sed 's/t../xxx/g'
hello from hogwarts
hello from sevenriby
hello from xxxxxxhome
  • Combination of grep, awk and sed
rosaany@Rosefinch:~$ cat test.txt | grep hogwarts | awk '{print $3}' | sed 's/h../xxx/g'
# -----------------Output-----------------
xxxwarts
# cat prints the file, grep keeps only the line that contains hogwarts, awk prints the third field, and sed replaces h plus the next two characters with xxx.
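The same filter, extract, rewrite chain also works on more realistic data. The sketch below assumes a hypothetical access.log whose columns are method, path, status code and response size; the file name, the format and the values are all made up for illustration. grep filters the rows, awk groups and sums them, and sed tidies the result.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
trap 'rm -rf "$work"' EXIT

# Hypothetical log: method path status bytes
cat > access.log <<'EOF'
GET /index.html 200 12
GET /login 500 340
POST /login 200 95
GET /index.html 200 15
GET /missing 404 8
EOF

# grep: keep successful requests; awk: sum bytes per path; sed: drop the leading slash.
report=$(grep ' 200 ' access.log |
         awk '{ sum[$2] += $4 } END { for (p in sum) print p, sum[p] }' |
         sed 's#^/##' |
         LC_ALL=C sort)

printf '%s\n' "$report"
```

The sort at the end is there only to make the order of the awk output predictable.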

With pipes, the Linux three swordsmen become far more capable. Chaining small commands makes many operations simple, handles fairly complex data processing work, and improves our work efficiency.

Reference articles

Partial quotation: 46 lectures on core technology of test and development -- three swordsmen of Linux, from lague Education
Reference: man sed, man grep, man awk

Posted by Apollo_Ares on Sat, 27 Nov 2021 21:14:55 -0800