shell data filtering

Keywords: shell

Let's talk about data analysis:

  • Data collection: gathering the raw data
  • Data cleaning: filtering out what is not needed
  • Data analysis: computing on and sorting the data
  • Data display: outputting the results as charts or tables

shell script data processing

1) Data retrieval: grep tr cut

2) Data processing: uniq sort tee paste xargs

In earlier scripts we used grep, cut, tr, uniq, sort and similar commands, chained through pipes, to retrieve strings and then obtained results through further shell operations. You may have felt how laborious that process is, and how poorly it performs, but there was no alternative: the task still had to get done.

Disadvantages: complex command combinations

Multiple operations

It's hard to get started

The solution

After this lesson, all of that pain goes away. Today I want to introduce a more powerful command: awk. It lets you retrieve the data you need from an output stream without the long command combinations used before; a single awk command is enough. It can also process the data itself, with no extra shell operations.

Application scenarios of awk

String extraction

Data operation

For example, a memory-usage script

Processing an output stream in the shell: awk

1. awk introduction

In day-to-day system administration, a lot of data is written to the screen or to files, both standard output and standard error. By default all of it goes to the default output device, the screen, yet only a small part of that output is what we actually need to focus on. We must filter or extract the interesting pieces for later use. Previously we learned to filter this data with grep and to extract fields with cut and tr, but those commands cannot both extract and process data: we must filter first, then extract into variables, and only then operate on the variables. For example:

Steps for computing memory usage

1) Use free -m to extract the total memory and assign it to the variable memory_total

2) Use free -m to extract the used memory and assign it to the variable memory_use

3) Calculate the usage percentage with a mathematical operation
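The three steps above can be sketched as a small script. To keep the example reproducible, the free -m output below is a captured sample (a live run will show different numbers), and the variable names simply follow the steps above:

```shell
# Captured sample of `free -m` output (illustrative values; a live run will differ)
free_output="              total        used        free
Mem:           3935        1054        2881
Swap:          2047           0        2047"

# 1) extract total memory into memory_total
memory_total=$(echo "$free_output" | grep '^Mem:' | tr -s ' ' | cut -d ' ' -f 2)
# 2) extract used memory into memory_use
memory_use=$(echo "$free_output" | grep '^Mem:' | tr -s ' ' | cut -d ' ' -f 3)
# 3) compute the usage percentage with shell arithmetic
echo "memory usage: $(( memory_use * 100 / memory_total ))%"   # prints: memory usage: 26%
```

On a real system you would replace the sample with `free_output=$(free -m)`; the rest of the pipeline is unchanged.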

Getting the memory utilization takes multiple steps. Is there a single command that integrates filtering, extraction and arithmetic? Of course: the command I am introducing today, awk.

Related commands include gawk, pgawk and dgawk

awk is a powerful language for processing data and generating formatted reports. awk treats each line of a file as a record, with a newline as the separator between records, and each column as a field; by default the separator between fields is one or more spaces or tab characters.

awk works by reading the data one record (line) at a time, splitting each record into fields with the field separator, and then operating on the field values.
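As a preview of where this is going, the same memory computation from earlier collapses into one awk command: the /^Mem:/ pattern does the filtering, $2 and $3 are the total and used fields, and the arithmetic happens inside the same program (again run against a captured sample of free -m output):

```shell
# Captured sample of `free -m` output (illustrative values)
free_output="              total        used        free
Mem:           3935        1054        2881
Swap:          2047           0        2047"

# filter (/^Mem:/), extract ($2 total, $3 used) and compute, all in one awk program
echo "$free_output" | awk '/^Mem:/{printf "memory usage: %d%%\n", $3*100/$2}'
# prints: memory usage: 26%
```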

2. awk syntax

awk [options] '[BEGIN{commands}] {program} [END{commands}]' [FILENAME]

Common command options
-F fs        specifies the field separator used to split each line into fields; the default is whitespace
-f file      reads the awk program from the named file
-v var=value defines a variable (and its initial value) for use inside the awk program

Note: an awk program is delimited by a left and a right brace, and program commands must be placed between the two braces. Because the awk command line expects the program as a single text string, the program must also be wrapped in single quotes.
1) The program must be enclosed in curly braces
2) The program must be enclosed in single quotes

The awk blocks run in this order:
    1) BEGIN: executed before the data stream is processed; optional
    2) program: how the data stream is processed; required
    3) END: executed after the data stream is processed; optional

3. Basic application of awk

Goal: use awk fluently to extract rows, columns and strings from standard output.

Learning use cases
[root@zutuanxue ~]# cat test 
1 the quick brown fox jumps over the lazy cat . dog
2 the quick brown fox jumps over the lazy cat . dog
3 the quick brown fox         jumps over the lazy cat . dog
4 the quick brown fox jumps over the lazy cat . dog
5 the quick brown fox jumps over the lazy cat . dog

3.1) awk extraction of fields (columns)

Field extraction: extract a column of data in a text and print it out

Field related built-in variables

$0 represents the entire line of text

$1 represents the first data field in the text line

$2 represents the second data field in the text line

$N represents the nth data field in the text line

$NF represents the last data field in the text line

Read each line of test and print it
[root@zutuanxue ~]# awk '{print $0}' test 
1 the quick brown fox jumps over the lazy cat . dog
2 the quick brown fox jumps over the lazy cat . dog
3 the quick brown fox         jumps over the lazy cat . dog
4 the quick brown fox jumps over the lazy cat . dog
5 the quick brown fox jumps over the lazy cat . dog

Print the sixth field of test
[root@zutuanxue ~]# awk '{print $6}' test
jumps
jumps
jumps
jumps
jumps


Print the last field of test
[root@zutuanxue ~]# awk '{print $NF}' test
dog
dog
dog
dog
dog
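Two related sketches worth knowing alongside $NF: NF without the $ holds the field count itself, and $(NF-1) addresses the second-to-last field. Shown here on a single sample line so the example is self-contained:

```shell
# NF is the number of fields in the record; $(NF-1) is the second-to-last field
echo "1 the quick brown fox jumps over the lazy cat . dog" | awk '{print NF, $(NF-1)}'
# prints: 12 .
```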

3.2) detailed explanation of command options

-F: Specifies the separator between fields

When the field format of an output data stream is not awk's default, we can use the -F option to redefine the field separator. For example:

The file to process is /etc/passwd; print the first, third and last columns

[root@zutuanxue ~]# awk -F ':' '{print $1,$3,$NF}' /etc/passwd
root 0 /bin/bash
bin 1 /sbin/nologin
daemon 2 /sbin/nologin
adm 3 /sbin/nologin
lp 4 /sbin/nologin
sync 5 /bin/sync
shutdown 6 /sbin/shutdown
halt 7 /sbin/halt
mail 8 /sbin/nologin
operator 11 /sbin/nologin
games 12 /sbin/nologin
ftp 14 /sbin/nologin
nobody 99 /sbin/nologin
systemd-network 192 /sbin/nologin
dbus 81 /sbin/nologin
polkitd 999 /sbin/nologin
postfix 89 /sbin/nologin
chrony 998 /sbin/nologin
sshd 74 /sbin/nologin
ntp 38 /sbin/nologin
tcpdump 72 /sbin/nologin
nscd 28 /sbin/nologin
mysql 997 /sbin/nologin
www 996 /sbin/nologin
apache 48 /sbin/nologin
tss 59 /sbin/nologin
zabbix 995 /sbin/nologin
saslauth 994 /sbin/nologin
grafana 993 /sbin/nologin

As you can see, awk's default separator between output fields is also a space

-f file: if an awk job is repeated daily with little variation, you can write the program into a file and call it with -f each time, which is convenient and efficient.

[root@zutuanxue ~]# cat abc 
{print $1,$3,$NF}

[root@zutuanxue ~]# awk -f abc test
1 quick dog
2 quick dog
3 quick dog
4 quick dog
5 quick dog

-v defines variables. awk is designed as a language, and variables, the most essential element of any language, cannot be absent, so the -v command option lets you define them.

[root@zutuanxue ~]# awk -v name='baism' 'BEGIN{print name}'
baism

A variable name=baism is defined, then read back out.

3.3) awk extraction of records (lines)

Record extraction: extract a line in a text and print it out

There are two methods to extract records: a. by line number b. by regular matching

Record related built-in variables

NR: the current record (line) number

Extract the third row of test

Match on line number 3
[root@zutuanxue ~]# awk 'NR==3{print $0}' test 
3 the quick brown fox         jumps over the lazy cat . dog

Exact-match the first field of the line against the string 3
[root@zutuanxue ~]# awk '$1=="3"{print $0}' test 
3 the quick brown fox         jumps over the lazy cat . dog
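The second method listed above, regular matching, selects every record whose text matches a pattern. A minimal sketch against an inline copy of two test lines (written to a scratch file of our own, /tmp/awk_regex_demo):

```shell
# two sample records, written to our own scratch file
printf '%s\n' \
  "1 the quick brown fox jumps over the lazy cat . dog" \
  "3 the quick brown fox jumps over the lazy cat . dog" > /tmp/awk_regex_demo

# /pattern/ selects every record matching the regular expression
awk '/^3 /{print $0}' /tmp/awk_regex_demo
# prints: 3 the quick brown fox jumps over the lazy cat . dog
```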

3.4) awk string extraction

A string sits at the intersection of a record and a field

Print the sixth field of the third row of test
[root@zutuanxue ~]# awk 'NR==3{print $6}' test
jumps

4. Priority of awk program

Priority of awk code block

Regarding the execution order of an awk program: BEGIN is the block with the highest priority and runs before PROGRAM; it needs no data source, because it does not process any data and does not depend on the PROGRAM block. PROGRAM is the required, default block for the data stream, so a data source must be provided when it runs. END is what happens after the data stream has been processed; for an END block to do its work, the data must have been read first, so it cannot produce results on its own.

BEGIN: what to do before processing the data source; can run without one

PROGRAM: what to do with the data source [required by default]

END: what to do after processing the data source; requires the data source

Demonstrating the execution order
[root@zutuanxue ~]# awk 'BEGIN{print "hello zutuanxue"}{print $0}END{print "bye zutuanxue"}' test
hello zutuanxue
1 the quick brown fox jumps over the lazy cat . dog
2 the quick brown fox jumps over the lazy cat . dog
3 the quick brown fox         jumps over the lazy cat . dog
4 the quick brown fox jumps over the lazy cat . dog
5 the quick brown fox jumps over the lazy cat . dog
bye zutuanxue

Needs no data source, so it can be executed directly
[root@zutuanxue ~]# awk 'BEGIN{print "hello world"}'
hello world

No data stream is provided, so these produce no output (awk sits waiting for input on stdin)
[root@zutuanxue ~]# awk '{print "hello world"}'
[root@zutuanxue ~]# awk 'END{print "hello world"}'
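A practical use of END is summarizing once all records have been read: inside END, NR still holds the number of the last record, so counting lines takes one statement. A sketch with an inline data source (a scratch file of our own):

```shell
# five records in our own scratch file
printf '1\n2\n3\n4\n5\n' > /tmp/awk_end_demo

# inside END, NR equals the total number of records processed
awk 'END{print "lines:", NR}' /tmp/awk_end_demo
# prints: lines: 5
```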

5. awk advanced applications

awk is a language, so it has the usual language features: besides defining variables, it can define arrays, perform operations, and do flow control. Let's look at each.

5.1) awk define variables and arrays

Define variables

[root@zutuanxue ~]# awk -v name='baism' 'BEGIN{print name}'
baism

[root@zutuanxue ~]# awk 'BEGIN{name="baism";print name}'
baism

Array definition: array_name[index]=value

Define an array named array with two elements, 100 and 200, then print them.
[root@zutuanxue ~]# awk 'BEGIN{array[0]=100;array[1]=200;print array[0],array[1]}'
100 200

[root@zutuanxue ~]# awk 'BEGIN{a[0]=100;a[1]=200;print a[0]}'
100
[root@zutuanxue ~]# awk 'BEGIN{a[0]=100;a[1]=200;print a[1]}'
200
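awk arrays are associative, so they can also be walked with for (index in array). The traversal order is unspecified, which is why this sketch pipes the output through sort for a stable display:

```shell
# for (i in a) visits every index of the array; order is not guaranteed,
# so we sort the output for a deterministic display
awk 'BEGIN{a[0]=100;a[1]=200;for(i in a) print i, a[i]}' | sort
# prints:
# 0 100
# 1 200
```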

5.2) awk operation

  1. Assignment operation: =
  2. Comparison operations: > >= < <= == !=
  3. Mathematical operations: + - * / % ** ++ --
  4. Logical operations: && || !
  5. Matching operations: ~ (match) and !~ (no match)

a. Assignment operation: assigns values to variables or arrays, for example:

Variable assignment: name="baism"  school="zutuanxue"

Array assignment array[0]=100

[root@zutuanxue ~]# awk -v name='baism' 'BEGIN{print name}'
baism

[root@zutuanxue ~]# awk 'BEGIN{school="zutuanxue";print school}'
zutuanxue

[root@zutuanxue ~]# awk 'BEGIN{array[0]=100;print array[0]}'
100

b. Comparison operation. Strings are compared by their ASCII code order. A result of true is represented by 1 and false by 0

[root@zutuanxue ~]# awk 'BEGIN{print "a" >= "b" }'
0
[root@zutuanxue ~]# awk 'BEGIN{print "a" <= "b" }'
1

[root@zutuanxue ~]# awk '$1>4{print $0}' test
5 the quick brown fox jumps over the lazy cat . dog
[root@zutuanxue ~]# awk 'BEGIN{print 100 >= 1 }'
1
[root@zutuanxue ~]# awk 'BEGIN{print 100 == 1 }'
0
[root@zutuanxue ~]# awk 'BEGIN{print 100 <= 1 }'
0
[root@zutuanxue ~]# awk 'BEGIN{print 100 < 1 }'
0
[root@zutuanxue ~]# awk 'BEGIN{print 100 != 1 }'
1

c. Mathematical operation

[root@zutuanxue ~]# awk 'BEGIN{print 100+3 }'
103
[root@zutuanxue ~]# awk 'BEGIN{print 100-3 }'
97
[root@zutuanxue ~]# awk 'BEGIN{print 100*3 }'
300
[root@zutuanxue ~]# awk 'BEGIN{print 100/3 }'
33.3333
[root@zutuanxue ~]# awk 'BEGIN{print 100**3 }'
1000000
[root@zutuanxue ~]# awk 'BEGIN{print 100%3 }'
1
[root@zutuanxue ~]# awk -v 'count=0' 'BEGIN{count++;print count}'
1
[root@zutuanxue ~]# awk -v 'count=0' 'BEGIN{count--;print count}'
-1

d. Logical operation

AND operation: true only when both operands are true; otherwise false
[root@zutuanxue ~]# awk 'BEGIN{print 100>=2 && 100>=3 }'
1
[root@zutuanxue ~]# awk 'BEGIN{print 100>=2 && 1>=100 }'
0

OR operation: true when either operand is true; false only when both are false
[root@zutuanxue ~]# awk 'BEGIN{print 100>=2 || 1>=100 }'
1
[root@zutuanxue ~]# awk 'BEGIN{print 100>=200 || 1>=100 }'
0

NOT operation
[root@manage01 resource]# awk 'BEGIN{print ! (100>=2)}'
0

e. Matching operation

[root@zutuanxue ~]# awk -F ':' '$1 ~ "^ro" {print $0}' /etc/passwd
root:x:0:0:root:/root:/bin/bash

[root@zutuanxue ~]# awk -F ':' '$1 !~ "^ro" {print $0}' /etc/passwd
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
polkitd:x:999:997:User for polkitd:/:/sbin/nologin
postfix:x:89:89::/var/spool/postfix:/sbin/nologin
chrony:x:998:996::/var/lib/chrony:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
tcpdump:x:72:72::/:/sbin/nologin
nscd:x:28:28:NSCD Daemon:/:/sbin/nologin
mysql:x:997:995::/home/mysql:/sbin/nologin
www:x:996:994::/home/www:/sbin/nologin
apache:x:48:48:Apache:/usr/share/httpd:/sbin/nologin
tss:x:59:59:Account used by the trousers package to sandbox the tcsd daemon:/dev/null:/sbin/nologin
zabbix:x:995:993:Zabbix Monitoring System:/var/lib/zabbix:/sbin/nologin
saslauth:x:994:76:Saslauthd user:/run/saslauthd:/sbin/nologin
grafana:x:993:992:grafana user:/usr/share/grafana:/sbin/nologin

5.3) awk environment variables

FIELDWIDTHS: redefines the fields by fixed column widths. Note that $0 cannot be used to print them, because $0 prints the full original line, not the fields you defined
[root@zutuanxue ~]# awk 'BEGIN{FIELDWIDTHS="5 2 8"}NR==1{print $1,$2,$3}' /etc/passwd
root: x: 0:0:root

FS: specifies the field separator in the data source, like the -F command option
[root@zutuanxue ~]# awk 'BEGIN{FS=":"}NR==1{print $1,$3,$NF}' /etc/passwd
root 0 /bin/bash

OFS: specifies the separator placed between fields when they are written to the screen
[root@zutuanxue ~]# awk 'BEGIN{FS=":";OFS="-"}NR==1{print $1,$3,$NF}' /etc/passwd
root-0-/bin/bash

RS:Specifies the separator for the record
[root@zutuanxue ~]# awk 'BEGIN{RS=""}{print $1,$13,$25,$37,$49}' test
1 2 3 4 5

After the record separator is changed to an empty string, the whole file becomes a single record, so all the fields end up in that one record.

ORS: the separator written after each record on output; the default is a newline
[root@zutuanxue ~]# awk 'BEGIN{RS="";ORS="*"}{print $1,$13,$25,$37,$49}' test
1 2 3 4 5*[root@zutuanxue ~]# 

As you can see, the prompt and the output share a line, because the default newline after the record was changed to *

5.4) process control

if judgment statement

Learning use cases
[root@zutuanxue ~]# cat num
1
2
3
4
5
6
7
8
9

single if statement
 Print the rows where $1 is greater than 5
[root@zutuanxue ~]# awk '{if($1>5)print $0}' num
6
7
8
9

if...else statement
 If $1 is greater than 5, print it divided by 2; otherwise print it multiplied by 2
[root@zutuanxue ~]# awk '{if($1>5)print $1/2;else print $1*2}' num
2
4
6
8
10
3
3.5
4
4.5
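When there are more than two cases, branches can be chained with else if. A small sketch (using a scratch file of our own rather than num) that classifies numbers three ways:

```shell
# three sample values in our own scratch file
printf '1\n5\n9\n' > /tmp/awk_if_demo

# three-way branch: below 5, exactly 5, above 5
awk '{if($1<5) print $1, "low"; else if($1==5) print $1, "mid"; else print $1, "high"}' /tmp/awk_if_demo
# prints:
# 1 low
# 5 mid
# 9 high
```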

for loop statement

Learning use cases
[root@zutuanxue ~]# cat num2
60 50 100
150 30 10
70 100 40

Add up all the data in each row: $1+$2+$3
[root@zutuanxue ~]# awk '{sum=0;for (i=1;i<4;i++){sum+=$i}print sum}' num2
210
190
210

If that is hard to read, here is the same program spread across lines
[root@zutuanxue ~]# awk '{
> sum=0
> for (i=1;i<4;i++) {
>     sum+=$i
> }
> print sum
> }' num2
210
190
210

while loop statement: the condition is tested before the body executes

Learning use cases
[root@zutuanxue ~]# cat num2
60 50 100
150 30 10
70 100 40

Accumulate the values of each line in the file, and stop accumulating when the sum is greater than or equal to 150
[root@zutuanxue ~]# awk '{sum=0;i=1;while(sum<150){sum+=$i;i++}print sum}' num2
210
150
170



If that is hard to read, here is the same program spread across lines
[root@zutuanxue ~]# awk '{                                                 
sum=0
i=1
while (sum<150) {
   sum+=$i
   i++
}
print sum
}' num2
210
150
170

do...while loop statement: the body executes before the condition is tested

Learning use cases
[root@zutuanxue ~]# cat num2
60 50 100
150 30 10
70 100 40

Accumulate the values of each line in the file, and stop accumulating when the sum is greater than or equal to 150
[root@zutuanxue ~]# awk '{sum=0;i=1;do{sum+=$i;i++}while(sum<150);print sum}' num2
210
150
170

If that is hard to read, here is the same program spread across lines
[root@zutuanxue ~]# awk '{
> sum=0
> i=1
> do {
> sum+=$i
> i++
> }while (sum<150)
> print sum
> }' num2
210
150
170

Loop control statement

break jumps out of the loop and continues to execute subsequent statements

Learning use cases
[root@zutuanxue ~]# cat num2
60 50 100
150 30 10
70 100 40

Accumulate the value of each line, and stop accumulation when the sum is greater than 150
[root@zutuanxue ~]# awk '{
> sum=0
> i=1
> while (i<4){
>    sum+=$i
>    if (sum>150){
>       break
>    }
>    i++
> }
> print sum
> }' num2
210
180
170
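break has a companion statement, continue, which skips the rest of the current iteration but keeps the loop running. A sketch, on the same kind of data, that sums each row while skipping any field greater than 100 (so the 150 on line 2 is left out):

```shell
# same style of data as num2, in our own scratch file
printf '60 50 100\n150 30 10\n70 100 40\n' > /tmp/awk_cont_demo

# continue skips fields greater than 100 instead of adding them
awk '{sum=0; for(i=1;i<4;i++){ if($i>100) continue; sum+=$i } print sum}' /tmp/awk_cont_demo
# prints:
# 210
# 40
# 210
```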

Posted by Suchy on Fri, 29 Oct 2021 09:24:12 -0700