Production tuning 1: HDFS core parameters

Keywords: Hadoop

Contents

1 HDFS core parameters

Parameters that should be consulted when building an HDFS cluster.

1.1 NameNode memory production configuration

Problem description

1) NameNode memory calculation

In NameNode memory, each file block's metadata occupies roughly 150 bytes. Taking a server with 128 GB of memory as an example, how many file blocks can it hold?
128 * 1024 * 1024 * 1024 / 150 bytes ≈ 910 million
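
A quick sanity check of that arithmetic in the shell (bash integer division, so the result is truncated):

[ranan@hadoop102 hadoop]$ echo $((128 * 1024 * 1024 * 1024 / 150))
916259689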

2) Configuring NameNode memory in the Hadoop 3.x series

hadoop-env.sh under /opt/module/hadoop-3.1.3/etc/hadoop explains that Hadoop heap memory is allocated dynamically by default:

# The maximum amount of heap to use (Java -Xmx). If no unit
# is provided, it will be converted to MB. Daemons will
# prefer any Xmx setting in their respective _OPT variable.
# There is no default; the JVM will autoscale based upon machine
# memory size.
#   (i.e. if this variable is not set, the heap is sized from the server's memory)
# export HADOOP_HEAPSIZE_MAX=

# The minimum amount of heap to use (Java -Xms). If no unit
# is provided, it will be converted to MB. Daemons will
# prefer any Xms setting in their respective _OPT variable.
# There is no default; the JVM will autoscale based upon machine
# memory size.
# export HADOOP_HEAPSIZE_MIN=
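
Uncommenting those variables would pin the heap globally for every Hadoop daemon; for example (a bare number is interpreted as MB):

export HADOOP_HEAPSIZE_MAX=1024
export HADOOP_HEAPSIZE_MIN=1024

As the comments note, a daemon's own _OPTS variable takes precedence over these, and that per-daemon mechanism is what is used below.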

View the memory occupied by the NameNode:

[ranan@hadoop102 hadoop]$ jpsall
=============== hadoop102 ===============
15473 JobHistoryServer
15268 NodeManager
14933 DataNode
15560 Jps
14749 NameNode
=============== hadoop103 ===============
13969 Jps
13218 DataNode
13717 NodeManager
13479 ResourceManager
=============== hadoop104 ===============
13012 Jps
12869 NodeManager
12572 DataNode
12750 SecondaryNameNode

[ranan@hadoop102 hadoop]$ jmap -heap 14749
Heap Configuration:
MaxHeapSize  = 1023410176 (976.0MB)

View the memory occupied by the DataNode (PID 14933 from the jpsall output above):

[ranan@hadoop102 hadoop]$ jmap -heap 14933
MaxHeapSize  = 1023410176 (976.0MB)

Notice that on hadoop102 the NameNode and DataNode heaps are both auto-allocated and equal (976 MB each; the JVM's default maximum heap is typically a quarter of physical memory). If both daemons reach that ceiling at the same time, they will start competing with the Linux system itself for memory, which is unreasonable. The heap sizes should therefore be configured manually.
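
To see where those numbers come from, compare the machine's physical memory with the JVMs' chosen limits (PIDs taken from the jpsall output above; jinfo ships with the JDK):

[ranan@hadoop102 hadoop]$ free -h                         # total physical memory
[ranan@hadoop102 hadoop]$ jinfo -flag MaxHeapSize 14749   # NameNode's effective -XX:MaxHeapSize
[ranan@hadoop102 hadoop]$ jinfo -flag MaxHeapSize 14933   # DataNode's effective -XX:MaxHeapSize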

Configuration in hadoop-env.sh

Experience: a common rule of thumb (e.g. Cloudera's sizing guidance) is to give the NameNode at least 1 GB of heap plus roughly 1 GB for every additional million blocks, and to give each DataNode at least 4 GB, increasing it as the number of block replicas on the node grows.

Specific modification: hadoop-env.sh

[ranan@hadoop102 hadoop]$ vim hadoop-env.sh

Leave the default option strings in place and append -Xmx1024m to them:
export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m" 
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"

[ranan@hadoop102 hadoop]$ xsync hadoop-env.sh

Restart cluster

[ranan@hadoop102 hadoop]$ myhadoop.sh stop

[ranan@hadoop102 hadoop]$ myhadoop.sh start

View NameNode memory (it should now report MaxHeapSize = 1024 MB):

[ranan@hadoop103 hadoop]$ jpsall
=============== hadoop102 ===============
22292 NodeManager
22500 JobHistoryServer
21957 DataNode
21766 NameNode
22598 Jps
=============== hadoop103 ===============
19041 Jps
18531 ResourceManager
18824 NodeManager
18314 DataNode
=============== hadoop104 ===============
16787 SecondaryNameNode
16602 DataNode
16906 NodeManager
17069 Jps
[ranan@hadoop102 hadoop]$ jmap -heap 21766

View DataNode memory (likewise 1024 MB):

[ranan@hadoop102 hadoop]$ jmap -heap 21957
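
Side note: jmap -heap was removed in JDK 9, so on newer JDKs the same check can be done with either of these instead:

[ranan@hadoop102 hadoop]$ jps -v | grep NameNode         # the appended -Xmx1024m appears in the JVM arguments
[ranan@hadoop102 hadoop]$ jhsdb jmap --heap --pid 21766  # JDK 9+ replacement for jmap -heap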

1.2 NameNode heartbeat concurrency configuration

DataNode working mechanism

After a DataNode starts, it reports its local block information to the NameNode (including whether each block is intact), and thereafter re-reports all of its blocks periodically (every 6 hours by default). It also sends a heartbeat to the NameNode every 3 seconds by default.

When there are many DataNodes, the NameNode must have enough threads ready to handle all of these reports and heartbeats, while clients are sending metadata requests to the NameNode at the same time.

Question: how many threads are appropriate for NameNode?

Modify hdfs-site.xml configuration

The NameNode has a worker thread pool that handles the concurrent heartbeats of the DataNodes as well as concurrent metadata operations from clients.

The pool size is set by the parameter dfs.namenode.handler.count; the default value is 10. A common rule of thumb is 20 × ln(cluster size); for a 3-node cluster that gives 20 × ln(3) ≈ 21, which can be computed with Python:

[ranan@hadoop102 ~]$ sudo yum install -y python
[ranan@hadoop102 ~]$ python

Python 2.7.5 (default, Apr 11 2018, 07:36:10)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> import math
>>> print int(20*math.log(3))
21
>>> quit()

Add the new property to hdfs-site.xml

[ranan@hadoop102 hadoop]$ vim hdfs-site.xml 

<!-- newly added -->
<property>
    <name>dfs.namenode.handler.count</name>
    <value>21</value>
</property>
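
Distribute the change and verify it (hdfs getconf reads the local configuration file, so this works even before the NameNode is restarted to pick the value up):

[ranan@hadoop102 hadoop]$ xsync hdfs-site.xml
[ranan@hadoop102 hadoop]$ hdfs getconf -confKey dfs.namenode.handler.count
21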

1.3 Enable recycle bin configuration

When the recycle bin function is enabled, deleted files can be restored as long as they have not yet expired, which guards against accidental deletion and provides a safety net for backups. It is disabled by default.

Recycle bin mechanism

Suppose the file survival time is set to fs.trash.interval = 60 (60 min), and the recycle bin is checked every 10 min, i.e. fs.trash.checkpoint.interval = 10. If the checkpoint interval is not set, it defaults to the survival time, i.e. a check every 60 min. A file is only removed by the first check that runs after its survival time has expired, so permanent deletion can lag slightly behind the configured 60 min.

Description of function parameters for opening recycle bin

(1) The default fs.trash.interval = 0 means the recycle bin is disabled; any other value sets the survival time (in minutes) of deleted files.
(2) The default fs.trash.checkpoint.interval = 0 sets the interval between checks of the recycle bin; a value of 0 means it is set equal to fs.trash.interval.
(3) It is required that fs.trash.checkpoint.interval <= fs.trash.interval.
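
A configuration matching the example above (60 min survival, checked every 10 min) would look like this in core-site.xml; the values are illustrative:

<property>
    <name>fs.trash.interval</name>
    <value>60</value>
</property>
<property>
    <name>fs.trash.checkpoint.interval</name>
    <value>10</value>
</property>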

Start Recycle Bin - modify core-site.xml

Modify core-site.xml and set the recycle bin survival time to 1 minute:

<!-- newly added -->
<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>

View recycle bin

Path of the recycle bin directory in the HDFS cluster: /user/ranan/.Trash/

Note: files deleted directly from the web UI do not go through the recycle bin.

Files deleted programmatically also do not go through the recycle bin; you must call moveToTrash() explicitly:

Trash trash = new Trash(conf);   // org.apache.hadoop.fs.Trash
trash.moveToTrash(path);         // move the path into the recycle bin instead of deleting it outright
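
After the move, you can confirm the file landed in the recycle bin:

[ranan@hadoop102 hadoop]$ hadoop fs -ls /user/ranan/.Trash/Current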

Only files deleted on the command line with the hadoop fs -rm command go through the recycle bin automatically:

[ranan@hadoop102 hadoop]$ hadoop fs -rm /test.txt
2021-11-04 16:25:53,705 INFO fs.TrashPolicyDefault: Moved: 'hdfs://hadoop102:8020/test.txt' to trash at: hdfs://hadoop102:8020/user/ranan/.Trash/Current/test.txt
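
Conversely, the recycle bin can be bypassed or emptied explicitly with standard FsShell options:

[ranan@hadoop102 hadoop]$ hadoop fs -rm -skipTrash /test.txt   # delete permanently, bypassing the recycle bin
[ranan@hadoop102 hadoop]$ hdfs dfs -expunge                    # checkpoint the trash and purge expired checkpoints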

Recover recycle bin data

[ranan@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /user/ranan/.Trash/Current/test.txt /
