Explain how awk tools are used

Original address: https://my.oschina.net/jarly/blog/898144

When you use the awk command to process one or more files on your computer, it reads the file line by line and processes each line in turn. If you give it no file, awk reads from standard input (stdin) by default. awk's executable script code is wrapped in a pair of single quotes. Inside that script code, a pair of curly braces marks an executable block of code, and several blocks can exist side by side. Each block can hold multiple instructions, separated by semicolons. awk is, in fact, a scripting language in its own right. After all that you are probably still confused. You guessed it: so far it is all talk. Don't worry, dear reader, please read on...
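To make the paragraph above a little more concrete, here is a minimal sketch. The two input lines are made up for illustration. No file is given, so awk reads what is piped in on standard input, and the single block holds two instructions separated by a semicolon.

```shell
# No file argument: awk reads the lines piped in on standard input.
# The block contains two instructions separated by a semicolon.
result=$(printf 'alpha\nbeta\n' | awk '{ n = length($0); print $0, n }')
printf '%s\n' "$result"
```

Each line is printed followed by its length, which is computed with awk's built-in length function.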

Basic format of awk command

awk [options] 'program' file

options are optional command-line options. Use them if you like; leave them out if you don't need them.
program is awk's executable script code, and it is required.
file is the file awk needs to process. Note that it should be a plain text file, not your mp3 or mp4.

Let's start with an awk example to warm up

$ awk '{print $0}' /etc/passwd

The executable script code of the awk command is enclosed in single quotation marks. Inside them is a pair of curly braces, and inside the curly braces are the instructions to run. Each time awk reads a line, it executes the code inside the curly braces. In the example above, $0 represents the current line, so the command prints each line of /etc/passwd in turn. You must be thinking: big deal, I can do that with the cat command. You're right! The command above is useless. Please read on.

awk custom separator

By default awk splits fields on spaces and tabs. We can specify a different separator with the -F option

$ awk -F ':' '{print $1}' /etc/passwd
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
operator
games
ftp
nobody

The command above splits each line of /etc/passwd into fields at the colon (:) and uses print to output the first field.
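As a small sketch of the same idea, the separator can also be set inside the program by assigning the built-in FS variable in a BEGIN block. The two sample lines below are hypothetical stand-ins for /etc/passwd entries, so the result does not depend on your machine.

```shell
# Two made-up lines in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin'

# Separator given on the command line with -F.
with_flag=$(printf '%s\n' "$sample" | awk -F ':' '{ print $1, $7 }')

# Same separator assigned to FS before any line is read.
with_fs=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = ":" } { print $1, $7 }')

printf '%s\n' "$with_flag"
```

Both forms should produce the same output: the user name followed by the login shell.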

How to specify multiple separators in awk at the same time

For example, there is now a file named some.log that reads as follows

Grape(100g)1980
raisins(500g)1990
plum(240g)1997
apricot(180g)2005
nectarine(200g)2008

Now we want to split each line of some.log into its three parts: fruit name, weight, and year

$ awk -F '[()]' '{print $1, $2, $3}' some.log
Grape 100g 1980
raisins 500g 1990
plum 240g 1997
apricot 180g 2005
nectarine 200g 2008

A pair of square brackets in the -F argument specifies multiple separators. When awk processes some.log, it splits each line at both "(" and ")".
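A quick way to check how a line was split is to print NF, the number of fields, next to the fields themselves. This is only a sketch using one line from some.log; the wording of the output is made up for illustration.

```shell
# One line taken from some.log.
line='Grape(100g)1980'

# Split at "(" and ")" and report how many fields came out.
result=$(printf '%s\n' "$line" | awk -F '[()]' '{ print NF ": " $1 " weighs " $2 ", year " $3 }')
printf '%s\n' "$result"
```

The line splits into exactly three fields: the name, the weight, and the year.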

Use of awk built-in variables

$0 represents the current line of text processing

$1 denotes the first field column after the text line is separated

$2 represents the second field column after the text row has been split

$3 represents the third field column after the text row has been split

$n denotes the nth field column after the text row has been split

NR represents the number of the current line (record). It counts every line awk has read so far, so it keeps increasing across files when several are processed.

NF represents the number of fields (columns) in the current line, similar to the number of fields in a record of a MySQL table.

FS represents the input field separator of awk. By default, fields are separated by runs of spaces and tabs. You can customize it.

OFS represents the output field separator of awk, which defaults to a single space, and you can customize it as well.

FILENAME denotes the name of the file currently being read. When several files are processed, it changes as awk moves from one file to the next.

For example, we have a text file fruit.txt, which I will use to show you how to use the awk command and, while we're at it, to liven up the awkward atmosphere.

peach    100   Mar  1997   China
Lemon    150   Jan  1986   America
Pear     240   Mar  1990   Janpan
avocado  120   Feb  2008   china

Let's look at a few dead-simple examples. The first prints every line of the file in full.

$ awk '{print $0}' fruit.txt
peach    100   Mar  1997   China
Lemon    150   Jan  1986   America
Pear     240   Mar  1990   Janpan
avocado  120   Feb  2008   china

The following prints the first column of each line of the file

$ awk '{print $1}' fruit.txt
peach
Lemon
Pear
avocado

The following prints columns 1, 2, and 3 of each line of the file

$ awk '{print $1, $2, $3}' fruit.txt
peach 100 Mar
Lemon 150 Jan
Pear 240 Mar
avocado 120 Feb

The comma between the arguments tells print to insert the output separator, which is a space by default.
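The difference the comma makes is easiest to see side by side. The sketch below uses a shortened, single-space version of the first line of fruit.txt; the "|" separator is just an arbitrary choice for illustration.

```shell
line='peach 100 Mar'

# With a comma, print inserts the output separator (OFS) between the values.
with_comma=$(printf '%s\n' "$line" | awk '{ print $1, $2 }')

# Without a comma, the two values are simply concatenated.
no_comma=$(printf '%s\n' "$line" | awk '{ print $1 $2 }')

# Changing OFS changes what the comma inserts.
custom_ofs=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "|" } { print $1, $2 }')

printf '%s\n' "$with_comma" "$no_comma" "$custom_ofs"
```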

Besides printing fields with print, you can also assign new values to the fields of each line.

$ awk '{$2 = "***"; print $0}' fruit.txt
peach *** Mar 1997 China
Lemon *** Jan 1986 America
Pear *** Mar 1990 Janpan
avocado *** Feb 2008 china

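The example above hides the second column of each line by reassigning $2, so asterisks are printed in its place. Because a field was modified, awk rebuilds $0 and joins the fields with the output separator, which is why the columns in the output are separated by single spaces.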
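Here is a small sketch of what assigning to a field does to the whole line, using the first line of fruit.txt. Printing $0 untouched keeps the original spacing, while printing it after an assignment to $2 gives a line rebuilt with single spaces.

```shell
line='peach    100   Mar  1997   China'

# $0 is printed exactly as it was read.
untouched=$(printf '%s\n' "$line" | awk '{ print $0 }')

# Assigning to $2 makes awk rebuild $0 from its fields, joined by OFS.
rebuilt=$(printf '%s\n' "$line" | awk '{ $2 = "***"; print $0 }')

printf '%s\n' "$untouched" "$rebuilt"
```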

Add strings or escape characters to the parameter list

$ awk '{print $1 "\t" $2 "\t" $3}' fruit.txt
peach   100     Mar
Lemon   150     Jan
Pear    240     Mar
avocado 120     Feb

As above, you can add strings or escape characters to the print parameter list to make the output format more beautiful, but remember to use double quotation marks.

The awk built-in NR variable represents the line number of each line

$ awk '{print NR "\t" $0}' fruit.txt
1   peach    100   Mar  1997   China
2   Lemon    150   Jan  1986   America
3   Pear     240   Mar  1990   Janpan
4   avocado  120   Feb  2008   china

The awk built-in NF variable represents the number of columns per row

$ awk '{print NF "\t" $0}' fruit.txt
5   peach    100   Mar  1997   China
5   Lemon    150   Jan  1986   America
5   Pear     240   Mar  1990   Janpan
5   avocado  120   Feb  2008   china

Use of the $NF variable in awk

$ awk '{print $NF}' fruit.txt
China
America
Janpan
china

The $NF above represents the last column of each line. NF holds the number of columns in the line, which is five in this file, so with the $ symbol in front it becomes $5, the fifth column.

$ awk '{print $(NF - 1)}' fruit.txt
1997
1986
1990
2008

Above, $(NF - 1) represents the second-to-last column, $(NF - 2) represents the third-to-last column, and so on.
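$NF is most handy when lines do not all have the same number of columns. The two input lines in this sketch are made up: one has three fields, the other has two.

```shell
# NF is 3 for the first line and 2 for the second,
# so $NF and $(NF - 1) point at different columns on each line.
result=$(printf 'a b c\nd e\n' | awk '{ print $NF, $(NF - 1) }')
printf '%s\n' "$result"
```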

Now, in addition to the fruit.txt file we just mentioned, we have a new file called company.txt, which reads as follows

yahoo   100 4500
google  150 7500
apple   180 8000
twitter 120 5000

Let's use fruit.txt and company.txt to show you how awk works when it processes multiple files at the same time.

$ awk '{print FILENAME "\t" $0}' fruit.txt company.txt
fruit.txt       peach    100   Mar  1997   China
fruit.txt       Lemon    150   Jan  1986   America
fruit.txt       Pear     240   Mar  1990   Janpan
fruit.txt       avocado  120   Feb  2008   china
company.txt     yahoo   100 4500
company.txt     google  150 7500
company.txt     apple   180 8000
company.txt     twitter 120 5000

When you give awk several files at once, it processes them one after another as a single stream of lines. The variable FILENAME holds the name of the file that the current line comes from.
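One detail worth seeing with multiple files is how line numbers behave. The sketch below introduces FNR, a standard awk variable that this article does not otherwise cover: NR keeps counting across files, while FNR restarts at 1 for each file. The file names a.txt and b.txt and their contents are hypothetical, created in a temporary directory.

```shell
# Create two small throwaway files.
dir=$(mktemp -d)
printf 'peach\nLemon\n' > "$dir/a.txt"
printf 'yahoo\n' > "$dir/b.txt"

# Print the file name, the overall line number, and the per-file line number.
result=$(cd "$dir" && awk '{ print FILENAME, NR, FNR }' a.txt b.txt)
printf '%s\n' "$result"

rm -r "$dir"
```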

Having read this far, doesn't awk feel ridiculously simple? Don't get too excited just yet. Please raise your hands and wave them with me... Oh, no! Please put your hands on the keyboard and try these examples on your computer.
You'll see that I didn't lie to you, because after all this explanation, even a fool could do it...

The Use of BEGIN Keyword

When the BEGIN keyword comes before a code block, awk runs that block once before it starts reading any file.
The block after BEGIN executes only once. If the program contains nothing but a BEGIN block, as in the example below, awk exits afterwards without reading the file.

$ awk 'BEGIN {print "Start read file"}' /etc/passwd
Start read file

An awk script can contain several curly-brace blocks, each running its own code, as follows

$ awk 'BEGIN {print "Start read file"} {print $0}' /etc/passwd
Start read file
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin

END keyword usage

The END keyword of awk is the exact opposite of BEGIN. After awk has read and processed every line of the file, it executes the code block that follows END.

$ awk 'END {print "End file"}' /etc/passwd
End file

Typing these commands on your own computer more often is good for you. A brain is a fine thing. Use it more.

$ awk 'BEGIN {print "Start read file"} {print $0} END {print "End file"}' /etc/passwd
Start read file
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
End file
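BEGIN and END earn their keep when the middle block collects something along the way. The following is only a sketch: it reuses the rows of company.txt, prints a made-up header in BEGIN, adds up the third column line by line, and prints the average in END.

```shell
# The four rows of company.txt.
data='yahoo   100 4500
google  150 7500
apple   180 8000
twitter 120 5000'

result=$(printf '%s\n' "$data" | awk '
  BEGIN { print "company salary" }        # runs once, before any line is read
  { total += $3; print $1, $3 }           # runs once per line
  END { print "average", total / NR }     # runs once, after the last line
')
printf '%s\n' "$result"
```

In the END block, NR still holds the number of lines read, so total / NR is the average of the third column.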

Using variables in awk

Variables can be declared and used in awk scripts

$ awk '{msg="hello world"; print msg}' /etc/passwd
hello world
hello world
hello world
hello world
hello world

A variable assigned in one curly-brace block can be used in any of the other blocks

$ awk 'BEGIN {msg="hello world"} {print msg}' /etc/passwd
hello world
hello world
hello world
hello world
hello world
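Two more things about variables are worth a quick sketch. First, a variable keeps its value from one line to the next, and a variable that was never assigned starts out as zero when used as a number. Second, a value can be passed in from the shell with the -v option, which is standard in awk but not otherwise covered in this article. The names greeting and count are made up for illustration.

```shell
# greeting is set from the command line with -v;
# count is never initialised, so it starts at 0 and grows by one per line.
result=$(printf 'a\nb\n' | awk -v greeting="hello" '{ count++; print greeting, count, $0 }')
printf '%s\n' "$result"
```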

Using Mathematical Operations in awk

Like other programming languages, awk supports basic mathematical operations

$ awk '{a = 12; b = 24; print a + b}' company.txt
36
36
36
36

The script above first assigns two variables, a = 12 and b = 24, and then prints the result of a plus b.

When you see the output above, you may be confused again: why is the same result printed four times? That is what happens when you don't study hard as a kid and end up working in IT.
Knowledge really does need to be put to use; it can dazzle other people. Okay, enough nonsense. Keep in mind that awk runs the script code inside the single quotation marks once for every line it reads.
A file with four lines means the code runs four times. The exception is the code after the BEGIN and END keywords, which runs only once.
If the file being processed is empty, the main block never runs at all...
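The rule above can be checked by counting how often each block runs. In this sketch the counters b, m, and e are hypothetical names for the BEGIN block, the main block, and the END block.

```shell
# Three input lines: BEGIN and END run once, the main block runs three times.
three_lines=$(printf 'a\nb\nc\n' | awk 'BEGIN { b++ } { m++ } END { e++; print b, m, e }')

# Empty input: the main block never runs, so m stays unset (m + 0 prints it as 0).
empty_input=$(awk 'BEGIN { b++ } { m++ } END { e++; print b, m + 0, e }' < /dev/null)

printf '%s\n' "$three_lines" "$empty_input"
```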

awk also supports other mathematical operators

+ addition operator

- subtraction operator

* multiplication operator

/ division operator

% remainder operator
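All five operators can be tried in one go. Since no file is needed for plain arithmetic, this sketch puts everything in a BEGIN block, so awk does not wait for any input. The numbers 7 and 2 are arbitrary.

```shell
# Addition, subtraction, multiplication, division, and remainder of 7 and 2.
result=$(awk 'BEGIN { print 7 + 2, 7 - 2, 7 * 2, 7 / 2, 7 % 2 }')
printf '%s\n' "$result"
```

Note that division is not integer division: 7 / 2 gives 3.5.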

Use conditional judgment in awk

For example, there is a file company.txt that reads as follows

yahoo   100 4500
google  150 7500
apple   180 8000
twitter 120 5000

The third column of the file is the average salary. We want to find the companies whose average salary is less than 5500 and print them.

$ awk '$3 < 5500 {print $0}' company.txt
yahoo   100 4500
twitter 120 5000

The result of the command above is the list of companies with an average salary below 5500. $3 < 5500 means the following {print $0} block is executed only when the third field is less than 5500.

$ awk '$1 == "yahoo" {print $0}' company.txt
yahoo   100 4500

awk has other comparison operators as follows

< less than

<= less than or equal to

== equal to

!= not equal to

> greater than

>= greater than or equal to

~ matches a regular expression

!~ does not match a regular expression
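A few of the operators in the list, sketched against the rows of company.txt. The ~ and !~ operators test a single field against a regular expression; the patterns /^g/ and /o/ are arbitrary choices for illustration.

```shell
# The four rows of company.txt.
data='yahoo   100 4500
google  150 7500
apple   180 8000
twitter 120 5000'

# Names that start with the letter g.
starts_with_g=$(printf '%s\n' "$data" | awk '$1 ~ /^g/ { print $1 }')

# Names that do not contain the letter o.
no_letter_o=$(printf '%s\n' "$data" | awk '$1 !~ /o/ { print $1 }')

# Rows whose third column is at least 5000.
at_least_5000=$(printf '%s\n' "$data" | awk '$3 >= 5000 { print $1 }')

printf '%s\n' "$starts_with_g" "$no_letter_o" "$at_least_5000"
```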

Using an if statement achieves the same effect as above

$ awk '{if ($3 < 5500) print $0}' company.txt
yahoo   100 4500
twitter 120 5000

The above means that if the third field is less than 5500, the following print $0 is executed. Looks a lot like the syntax of C and PHP, doesn't it?
Speaking of which, I don't know what to say except that PHP is the best language in the world... I may have had too much to drink.
But it suddenly occurs to me that I never drink.
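Since the if statement looks like C, it should be no surprise that it also takes an else branch. This is a minimal sketch on the rows of company.txt; the labels "below" and "at or above" are made up for illustration.

```shell
# The four rows of company.txt.
data='yahoo   100 4500
google  150 7500
apple   180 8000
twitter 120 5000'

# Label every company instead of filtering the rows.
result=$(printf '%s\n' "$data" | awk '{
  if ($3 < 5500) { print $1, "below" } else { print $1, "at or above" }
}')
printf '%s\n' "$result"
```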

Use regular expressions in awk

awk supports regular expressions. If you are not yet familiar with regular expressions, stop here and go search Google for them.

For example, now we have a file called poetry.txt, which contains all my poems. Don't ask me why I am so talented. The contents are as follows:

This above all: to thine self be true
There is nothing either good or bad, but thinking makes it so
There's a special providence in the fall of a sparrow
No matter how dark long, may eventually in the day arrival

Use a regular expression to match the string "There" and print the lines that contain it

$ awk '/There/{print $0}' poetry.txt
There is nothing either good or bad, but thinking makes it so
There's a special providence in the fall of a sparrow

Use a regular expression to match lines that contain the letter t, then exactly one character of any kind, then the letter e

$ awk '/t.e/{print $0}' poetry.txt
There is nothing either good or bad, but thinking makes it so
There's a special providence in the fall of a sparrow
No matter how dark long, may eventually in the day arrival

If you only want to match the literal string "t.e", write the regular expression as /t\.e/, with the dot escaped by a backslash.
That is because an unescaped dot (.) stands for any single character in a regular expression.
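The effect of escaping the dot shows up clearly on a tiny input. The three lines below are made up: only the second one contains a literal dot.

```shell
lines='the
t.e
tie'

# An unescaped dot matches any character, so all three lines match.
any_char=$(printf '%s\n' "$lines" | awk '/t.e/ { print $0 }')

# An escaped dot matches only a literal ".".
literal_dot=$(printf '%s\n' "$lines" | awk '/t\.e/ { print $0 }')

printf '%s\n' "$any_char" "$literal_dot"
```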

Use a regular expression to match all lines starting with the string "The"

$ awk '/^The/{print $0}' poetry.txt
There is nothing either good or bad, but thinking makes it so
There's a special providence in the fall of a sparrow

In regular expressions, ^ means the line must begin with the character or string that follows it.

Use regular expressions to match all lines ending with a "true" string

$ awk '/true$/{print $0}' poetry.txt
This above all: to thine self be true

In regular expressions, $ means the line must end with the character or string that precedes it.

Another example of regular expressions is as follows

$ awk '/m[a]t/{print $0}' poetry.txt
No matter how dark long, may eventually in the day arrival

The regular expression /m[a]t/ above matches lines that contain the character m, then the single character a from inside the square brackets, and finally the character t. In the output, only the word "matter" matches. Square brackets in a regular expression match any single character listed inside them, and here the only character listed is a.

Continue with a new example above as follows

$ awk '/^Th[ie]/{print $0}' poetry.txt
This above all: to thine self be true
There is nothing either good or bad, but thinking makes it so
There's a special providence in the fall of a sparrow

In this example, the regular expression /^Th[ie]/ matches lines that begin with "Thi" or "The", because the square brackets match any single one of the characters inside them.

Continue with the new usage above.

$ awk '/s[a-z]/{print $0}' poetry.txt
This above all: to thine self be true
There is nothing either good or bad, but thinking makes it so
There's a special providence in the fall of a sparrow

The regular expression /s[a-z]/ matches lines containing the character s followed by any single character from a to z, such as "se", "so", "sp", and so on.

There are other uses in square brackets of regular expressions such as the following

[a-zA-Z] matches a single character from lowercase a to z, or from uppercase A to Z.
[^a-z] uses the ^ symbol inside the square brackets to mean negation, that is, "not"; it matches any single character that is not in the range a to z.
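The two bracket forms above can be compared on a small made-up input: one line of letters, one line of digits, and one line that mixes both.

```shell
lines='abc
123
a1'

# Lines containing at least one letter.
has_letter=$(printf '%s\n' "$lines" | awk '/[a-zA-Z]/ { print $0 }')

# Lines containing at least one character that is NOT a lowercase letter.
has_non_lower=$(printf '%s\n' "$lines" | awk '/[^a-z]/ { print $0 }')

printf '%s\n' "$has_letter" "$has_non_lower"
```

The mixed line "a1" shows up in both results, because it contains a letter and also a character outside a to z.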

The usage of asterisk * and Plus + in regular expressions

$ awk '/go*d/{print $0}' poetry.txt
There is nothing either good or bad, but thinking makes it so

The pattern above matches a "g" and a "d" with the letter "o" between them appearing zero or more times, so "gd", "god", and "good" all qualify. The word "good" in the poem meets this requirement.
The plus sign + works much like the asterisk, except that it means one or more, that is, the character must occur at least once.
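The only difference between the asterisk and the plus sign is the zero-occurrence case, which this sketch isolates. The three input words are made up for illustration.

```shell
lines='gd
god
good'

# * allows zero occurrences of "o", so "gd" matches too.
star=$(printf '%s\n' "$lines" | awk '/go*d/ { print $0 }')

# + requires at least one "o", so "gd" is left out.
plus=$(printf '%s\n' "$lines" | awk '/go+d/ { print $0 }')

printf '%s\n' "$star" "$plus"
```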

The Use of Question Marks in Regular Expressions

$ awk '/ba?d/{print $0}' poetry.txt
There is nothing either good or bad, but thinking makes it so

The question mark in a regular expression means that the character in front of it may appear 0 or 1 times: it may be absent, but if it is present, it can appear only once.

Use of {} curly brackets in regular expressions

$ awk '/go{2}d/{print $0}' poetry.txt
There is nothing either good or bad, but thinking makes it so

Curly brackets {} specify how many times the character in front of them must appear. The /go{2}d/ above matches only the string "good", that is, the letter "o" in the middle must occur exactly twice.

There are some other uses of curly braces in regular expressions as follows

/go{2,3}d/ means the letter "o" must appear two or three times.
/go{2,10}d/ means the letter "o" must appear at least twice and at most 10 times.
/go{2,}d/ means the letter "o" must appear at least twice.
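One caveat, stated as my understanding rather than a guarantee: some older awk implementations (for example old mawk builds, or older gawk run without its --re-interval option) do not support the {n,m} syntax. If the curly-brace patterns above do not behave as described on your system, the same matches can be spelled out with ?, + and plain repetition. The sketch below uses only those portable operators; the input words are made up.

```shell
lines='god
good
goood
gooood'

# Same meaning as /go{2,}d/: two "o"s, then any number of extra ones.
at_least_two=$(printf '%s\n' "$lines" | awk '/goo+d/ { print $0 }')

# Same meaning as /go{2,3}d/: two "o"s, then an optional third.
two_or_three=$(printf '%s\n' "$lines" | awk '/gooo?d/ { print $0 }')

printf '%s\n' "$at_least_two" "$two_or_three"
```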

Use of parentheses in regular expressions

$ awk '/th(in){1}king/{print $0}' poetry.txt
There is nothing either good or bad, but thinking makes it so

Parentheses in a regular expression group several characters into a single unit. For example, /th(in){1}king/ means that the string "in" must appear once. Without the parentheses it becomes /thin{1}king/, which means that only the character "n" must appear once.
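Grouping matters most when it is combined with a repetition operator. This sketch uses + instead of curly brackets and two made-up input lines; the ^ and $ anchors force the whole line to match.

```shell
lines='abab
abb'

# (ab)+ repeats the whole group "ab".
grouped=$(printf '%s\n' "$lines" | awk '/^(ab)+$/ { print $0 }')

# ab+ repeats only the letter "b".
ungrouped=$(printf '%s\n' "$lines" | awk '/^ab+$/ { print $0 }')

printf '%s\n' "$grouped" "$ungrouped"
```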

Having read this far, if you know the lines in poetry.txt, you will have noticed that... What the hell! These poems were not written by me at all. That shows how important it is to read more books. I had the privilege of borrowing Shakespeare's lines to show you how to use regular expressions in awk. Now it's time to think about dinner: how about hot pot tonight?

Some summaries of using awk

Because awk is a programming language in its own right, it can do far more than what we covered above, and it has other, more complex features. In general, though, we don't recommend making awk programs too complicated. For more complex scenarios it is usually better to reach for other tools, such as shell scripts, Lua, and so on.

This document is licensed under CC-BY 4.0

Posted by nickcwj on Sat, 29 Jun 2019 14:48:28 -0700
