Implementation mechanism of multi-machine object storage

Characteristics:

1. Multiple machines operate simultaneously.

2. When the client uploads a large file, the server should determine the file's size, split the file, and run some scheduling logic. Suppose the file is split into four parts: the scheduling logic should decide which data server each of the four parts is placed on and return the IP addresses of those four data servers to the client.

3. The server records the mapping between this file and the four data servers; this mapping is called metadata. For example, a structured record like this:

id, filename, storage_path, total_size, splits, data_servers (there can be multiple), plus, for each data_server, the specific storage location on that server and the byte offset range of the file fragment it stores.
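
As a minimal sketch, such a metadata record could be modeled as the Go struct below; the field names are illustrative assumptions, not the schema of any real system.

package main

import "fmt"

// FileMeta is one metadata record kept by the meta server for a file.
type FileMeta struct {
	ID          int64
	Filename    string
	StoragePath string
	TotalSize   int64      // size of the whole file in bytes
	Splits      int        // how many fragments the file was cut into
	Fragments   []Fragment // one entry per data server holding a piece
}

// Fragment records where one piece of the file lives.
type Fragment struct {
	DataServer  string // IP/host of the data server
	LocalPath   string // storage location on that server's disk
	StartOffset int64  // byte range of this fragment within the file
	EndOffset   int64
}

func main() {
	meta := FileMeta{
		ID: 1, Filename: "myfile001", StoragePath: "/data/myfile001",
		TotalSize: 3555, Splits: 4,
		Fragments: []Fragment{{
			DataServer: "192.168.1.11", LocalPath: "/data/frag-0",
			StartOffset: 1, EndOffset: 1000,
		}},
	}
	fmt.Printf("%+v\n", meta)
}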

In other words, the metadata server process has to compute several things for each file write request submitted by the client. What exactly?

  1. How big the file is.
  2. How many parts it is split into, computing a byte offset range (start_offset ~ end_offset) for each part. For example, a 3555-byte file split into four parts gives the ranges 1-1000, 1001-2000, 2001-3000, and 3001-3555; these four fragments are stored on four data servers (see the sketch after this list).
  3. Which four data servers to choose, based on, for example, whether each data server is online, its remaining disk capacity, and so on.
  4. Return the IP addresses of the four data servers to the client.
  5. The client connects to the four data servers simultaneously, sends each server its fragment, writes the four fragments to the data servers' disks, and records the specific storage location on each server, such as /data/companyname/projectname/abc/myfile001.
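
Here is a minimal Go sketch of the offset calculation from step 2: cut a file of a given size into fixed-size chunks and return 1-based inclusive byte ranges, reproducing the 1-1000 / ... / 3001-3555 example. splitRanges is a hypothetical helper, not code from any real system.

package main

import "fmt"

// splitRanges cuts a file of totalSize bytes into chunkSize-byte pieces
// and returns 1-based inclusive [start, end] byte ranges for each piece.
func splitRanges(totalSize, chunkSize int64) [][2]int64 {
	var ranges [][2]int64
	for start := int64(1); start <= totalSize; start += chunkSize {
		end := start + chunkSize - 1
		if end > totalSize {
			end = totalSize // last fragment may be shorter
		}
		ranges = append(ranges, [2]int64{start, end})
	}
	return ranges
}

func main() {
	// A 3555-byte file in 1000-byte chunks:
	// 1-1000, 1001-2000, 2001-3000, 3001-3555.
	for i, r := range splitRanges(3555, 1000) {
		fmt.Printf("part %d: bytes %d-%d\n", i+1, r[0], r[1])
	}
}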

4. When the client queries the meta server for a file, the meta server looks up, from the metadata recorded earlier, which data servers hold the file's fragments; the client then establishes TCP connections to the four data servers and downloads the fragments in parallel. Finally, the client splices the file back together itself.
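
A sketch of the parallel download in step 4, assuming a hypothetical fetchFragment helper that stands in for "open a TCP connection to one data server and pull one byte range":

package main

import (
	"fmt"
	"sync"
)

// fetchFragment is a hypothetical stand-in for downloading one byte
// range from one data server over TCP.
func fetchFragment(server string, start, end int64) []byte {
	return make([]byte, end-start+1) // placeholder payload
}

func main() {
	servers := []string{"192.168.1.11", "192.168.1.12", "192.168.1.13", "192.168.1.14"}
	ranges := [][2]int64{{1, 1000}, {1001, 2000}, {2001, 3000}, {3001, 3555}}

	// Download the four fragments concurrently, one goroutine each.
	parts := make([][]byte, len(servers))
	var wg sync.WaitGroup
	for i := range servers {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			parts[i] = fetchFragment(servers[i], ranges[i][0], ranges[i][1])
		}(i)
	}
	wg.Wait()

	// The client splices the fragments back together in order.
	var file []byte
	for _, p := range parts {
		file = append(file, p...)
	}
	fmt.Println("reassembled size:", len(file)) // 3555
}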

5. The whole cluster must support replica backup of files. After the large file uploaded above is split into four fragments, each fragment should be automatically replicated within the cluster, and replicas must be placed across different data servers. This replica information is stored in the meta server's database, which must itself be highly available.
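
As a toy sketch of the "replicas across different data servers" rule, a placement helper could simply refuse to put a replica on the fragment's primary server; real placement would also weigh disk space, load, racks, and so on. pickReplica is a hypothetical helper.

package main

import "fmt"

// pickReplica chooses a server for a fragment's backup copy, ensuring
// the replica never lands on the same server as the primary copy.
func pickReplica(servers []string, primary string) (string, bool) {
	for _, s := range servers {
		if s != primary {
			return s, true
		}
	}
	return "", false // no other server available
}

func main() {
	servers := []string{"192.168.1.11", "192.168.1.12", "192.168.1.13", "192.168.1.14"}
	if replica, ok := pickReplica(servers, "192.168.1.11"); ok {
		fmt.Println("replica goes to:", replica) // 192.168.1.12
	}
}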

6. The meta server itself must be deployed in HA (high availability) mode.

7. It must be possible to scale out the meta server at any time.

8. It must also be possible to add data servers at any time. For that, the meta server must know the details of every data server (health status, remaining disk space, CPU utilization, memory utilization, NIC traffic, etc.), which are reported via heartbeat packets exchanged between each data server and the meta server.
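
A minimal sketch of such a heartbeat packet, assuming a JSON wire format; both the field set and the encoding are assumptions for illustration.

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Heartbeat carries the per-server details the meta server needs;
// the exact fields and wire format here are assumptions.
type Heartbeat struct {
	ServerID   string    `json:"server_id"`
	Healthy    bool      `json:"healthy"`
	DiskFreeGB float64   `json:"disk_free_gb"`
	CPUPercent float64   `json:"cpu_percent"`
	MemPercent float64   `json:"mem_percent"`
	NICMbps    float64   `json:"nic_mbps"`
	SentAt     time.Time `json:"sent_at"`
}

func main() {
	// A data server would send one of these to the meta server on a
	// fixed interval, e.g. every few seconds.
	hb := Heartbeat{
		ServerID: "data-server-01", Healthy: true,
		DiskFreeGB: 512.3, CPUPercent: 17.5,
		MemPercent: 42.0, NICMbps: 93.1,
		SentAt: time.Now(),
	}
	payload, _ := json.Marshal(hb)
	fmt.Println(string(payload))
}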

Cluster deployment of MinIO

You can't just start one process as in a stand-alone test; MinIO must be deployed as a cluster. With docker compose you can start multiple MinIO containers on a single machine, but you cannot deploy MinIO across physical hosts; with docker swarm or K8s you can deploy MinIO across physical hosts. The following docker-compose.yaml file explains everything.

version: '3.7'

# Settings and configurations that are common for all containers
x-minio-common: &minio-common
  image: quay.io/minio/minio:RELEASE.2021-09-09T21-37-07Z
  command: server --console-address ":9001" http://minio{1...4}/data{1...2}
  expose:
    - "9000"
    - "9001"
  environment:
    MINIO_ROOT_USER: minio
    MINIO_ROOT_PASSWORD: minio123
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
    interval: 30s
    timeout: 20s
    retries: 3

# starts 4 docker containers running minio server instances.
# using nginx reverse proxy, load balancing, you can access
# it through port 9000.
services:
  minio1:
    <<: *minio-common
    hostname: minio1
    volumes:
      - data1-1:/data1
      - data1-2:/data2

  minio2:
    <<: *minio-common
    hostname: minio2
    volumes:
      - data2-1:/data1
      - data2-2:/data2

  minio3:
    <<: *minio-common
    hostname: minio3
    volumes:
      - data3-1:/data1
      - data3-2:/data2

  minio4:
    <<: *minio-common
    hostname: minio4
    volumes:
      - data4-1:/data1
      - data4-2:/data2

  nginx:
    image: nginx:1.19.2-alpine
    hostname: nginx
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "9000:9000"
      - "9001:9001"
    depends_on:
      - minio1
      - minio2
      - minio3
      - minio4

## By default this config uses default local driver,
## For custom volumes replace with volume driver configuration.
volumes:
  data1-1:
  data1-2:
  data2-1:
  data2-2:
  data3-1:
  data3-2:
  data4-1:
  data4-2:

Start a distributed MinIO instance with 8 nodes and 1 disk per node by running the following commands on all 8 nodes:

export MINIO_ACCESS_KEY=<ACCESS_KEY>
export MINIO_SECRET_KEY=<SECRET_KEY>
minio server http://192.168.1.11/export1 http://192.168.1.12/export2 \
               http://192.168.1.13/export3 http://192.168.1.14/export4 \
               http://192.168.1.15/export5 http://192.168.1.16/export6 \
               http://192.168.1.17/export7 http://192.168.1.18/export8

However, it is striking that MinIO has no meta server role at all. Are all nodes peers? The key seems to be its so-called erasure code mechanism.

Erasure code is a mathematical algorithm for recovering lost or damaged data. MinIO uses Reed-Solomon codes to split an object into N/2 data blocks and N/2 parity blocks. This means that with 12 disks, an object is split into 6 data blocks and 6 parity blocks. You can lose any 6 disks (whether they hold data blocks or parity blocks) and still recover the object from the remaining disks.
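
The 6-data/6-parity scheme above can be tried directly with github.com/klauspost/reedsolomon, the Reed-Solomon library that MinIO itself builds on. The sketch below encodes an object into 12 shards, wipes any 6 of them, and reconstructs the rest; the shard counts and sample data are just for illustration.

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 6 data shards + 6 parity shards, matching the 12-disk example.
	enc, err := reedsolomon.New(6, 6)
	if err != nil {
		log.Fatal(err)
	}

	object := bytes.Repeat([]byte("object-data"), 1000)

	// Split the object into 6 data shards, then compute 6 parity shards.
	shards, err := enc.Split(object)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing any 6 of the 12 disks.
	for _, i := range []int{0, 2, 4, 7, 9, 11} {
		shards[i] = nil
	}

	// Rebuild the lost shards from the 6 survivors and verify.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("all 12 shards intact after recovery:", ok)
}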

Erasure code works differently from RAID or replication. RAID6, for example, can survive the loss of two disks without losing data, while MinIO's erasure code can still keep data safe when half of the disks are lost. Moreover, MinIO applies erasure code at the object level, recovering one object at a time, whereas RAID works at the volume level, where data recovery takes a very long time. MinIO encodes each object independently, so once the storage service is deployed, disks usually need no replacement or repair. MinIO's erasure code is designed for performance and uses hardware acceleration wherever possible.

  Bit rot, also known as data rot or silent data corruption, is a serious data-loss problem for hard disks: data on a disk can be corrupted without anyone noticing, and without any error log. As the saying goes, it is easy to dodge a spear thrust in the open but hard to guard against an arrow shot from hiding; such silent errors are more dangerous than an outright disk failure. But don't worry: MinIO's erasure code uses the high-speed HighwayHash algorithm to compute checksums and guard against bit rot.
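
The checksum step can be sketched with MinIO's own HighwayHash implementation, github.com/minio/highwayhash; the fixed 32-byte key and the sample block contents below are purely illustrative.

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash requires a 256-bit (32-byte) key; a zero key is
	// used here only for illustration.
	key := make([]byte, 32)

	block := []byte("contents of one erasure-coded block")

	// Checksum computed and stored when the block is written to disk.
	h, err := highwayhash.New(key)
	if err != nil {
		log.Fatal(err)
	}
	h.Write(block)
	stored := h.Sum(nil)

	// On read, recompute and compare; a mismatch means bit rot, and
	// the block would be rebuilt from the remaining erasure-coded blocks.
	recomputed := highwayhash.Sum(block, key)
	fmt.Println("block intact:", bytes.Equal(stored, recomputed[:]))
}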

 
