[Notes] Apache Pulsar learning manual

Keywords: message queue

1. Apache Pulsar installation and deployment

1.1 Prerequisites

  • ZooKeeper 3.4.5
  • Pulsar installation package 2.8.1
  • Passwordless SSH login between the cluster nodes

1.2 Deployment steps

1.2.1 Upload the installation package to the Linux server

Download address: https://pulsar.apache.org/zh-CN/download/

1.2.2. Extract the archive to the /data directory

tar -zxvf apache-pulsar-2.8.1-bin.tar.gz  -C /data/

1.2.3. Initialize cluster metadata information

Execute on risen-cdh01

bin/pulsar initialize-cluster-metadata \
  --cluster pulsar-cluster \
  --zookeeper risen-cdh01:2181 \
  --configuration-store risen-cdh01:2181 \
  --web-service-url http://risen-cdh01:8089 \
  --web-service-url-tls https://risen-cdh01:8443 \
  --broker-service-url pulsar://risen-cdh01:6650 \
  --broker-service-url-tls pulsar+ssl://risen-cdh01:6651

On success, the output ends with lines like these (the last line echoes the name passed to --cluster):

10:36:09.876 [main] INFO org.apache.bookkeeper.discover.ZKRegistrationManager - Successfully formatted BookKeeper metadata
10:36:09.880 [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x16734464b360002 closed
10:36:09.880 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x16734464b360002
10:36:10.033 [main] INFO org.apache.pulsar.PulsarClusterMetadataSetup - Cluster metadata for 'pulsar-cluster-1' setup correctly

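If initialization fails, connect to ZooKeeper with its command-line client (zkCli.sh). Listing the root path shows entries such as:

[zookeeper, counters, bookies, ledgers, managed-ledgers, schemas, namespace, admin, loadbalance]

Delete the Pulsar and BookKeeper entries left behind by the failed attempt, then retry. The zookeeper node belongs to ZooKeeper itself and should be left in place.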

1.2.4. Modify the BookKeeper configuration file

vim conf/bookkeeper.conf

Change the following:

zkServers=risen-cdh01:2181,risen-cdh02:2181,risen-cdh03:2181

Note: the ports can be changed as needed, as long as they do not conflict with ports already in use.

1.2.5. Modify the broker configuration file

vim conf/broker.conf

Change the following:

zookeeperServers=risen-cdh01:2181,risen-cdh02:2181,risen-cdh03:2181
configurationStoreServers=risen-cdh01:2181,risen-cdh02:2181,risen-cdh03:2181
clusterName=pulsar-cluster
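
The same edits can be scripted instead of made by hand in vim. The following is a minimal sketch, assuming the file is a plain key=value properties file; apply_overrides is a hypothetical helper written for this note, and the sample text is a shortened stand-in for conf/broker.conf.

```python
def apply_overrides(conf_text, overrides):
    """Return conf_text with every key in `overrides` set to its new value."""
    remaining = dict(overrides)
    out = []
    for line in conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Keys that were not present in the file are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Shortened stand-in for conf/broker.conf (assumption, not the real file).
sample = "# Zookeeper quorum connection string\nzookeeperServers=\nclusterName=\n"
zk = "risen-cdh01:2181,risen-cdh02:2181,risen-cdh03:2181"
updated = apply_overrides(sample, {
    "zookeeperServers": zk,
    "configurationStoreServers": zk,
    "clusterName": "pulsar-cluster",
})
print(updated)
```

To use it on a real file, read the file into a string, pass it through the function, and write the result back after checking the diff.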

1.2.6. Change every 8080 port under the conf directory

Port 8080 is so commonly used that it is often already occupied, so it is changed to 8089 here.
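
One way to make this change without touching longer numbers that merely contain 8080 is a guarded text replacement. This is only a sketch: replace_port is a hypothetical helper, the third sample line (somePort) is invented to show the guard, and real files should be reviewed after rewriting.

```python
import re


def replace_port(conf_text, old=8080, new=8089):
    """Replace the port `old` with `new`, leaving longer numbers untouched."""
    # The lookarounds make sure 8080 is not part of a longer digit run,
    # so values such as 18080 are not rewritten by accident.
    pattern = re.compile(rf"(?<!\d){old}(?!\d)")
    return pattern.sub(str(new), conf_text)


sample = (
    "webServicePort=8080\n"
    "webServiceUrl=http://localhost:8080/\n"
    "somePort=18080\n"  # hypothetical setting, must stay unchanged
)
print(replace_port(sample))
```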

1.2.7 Distribute the modified files to the other servers

scp -r apache-pulsar-2.8.1/ risen-cdh02:$PWD
scp -r apache-pulsar-2.8.1/ risen-cdh03:$PWD

1.2.8. Start the BookKeeper cluster

Run on all three machines:

bin/pulsar-daemon start bookie

To stop a bookie:

bin/pulsar-daemon stop bookie

After starting, use the following command to check that the bookie is working:

bin/bookkeeper shell bookiesanity

If the command ends by reporting that the bookie sanity test succeeded, the bookie started correctly.

1.2.9. Start the broker cluster

Run on all three machines:

bin/pulsar-daemon start broker

To stop a broker:

bin/pulsar-daemon stop broker

Then run on risen-cdh01:

bin/pulsar-admin brokers list pulsar-cluster

If the output lists the brokers on all three machines, the cluster started correctly.

2. Pulsar Manager installation and deployment

2.1 Prerequisites

  • The Pulsar cluster is installed
  • Docker is installed on the server

2.2 Installation steps

2.2.1. Pull the latest Docker image

docker pull apachepulsar/pulsar-manager:latest

2.2.2 Run the container

docker run -dit \
    -p 9527:9527 -p 7750:7750 \
    -e SPRING_CONFIGURATION_FILE=/pulsar-manager/pulsar-manager/application.properties \
    apachepulsar/pulsar-manager:latest

2.2.3. Create an account

CSRF_TOKEN=$(curl http://risen-cdh01:7750/pulsar-manager/csrf-token)
curl \
    -H "X-XSRF-TOKEN: $CSRF_TOKEN" \
    -H "Cookie: XSRF-TOKEN=$CSRF_TOKEN;" \
    -H 'Content-Type: application/json' \
    -X PUT http://risen-cdh01:7750/pulsar-manager/users/superuser \
    -d '{"name": "admin", "password": "apachepulsar", "description": "test", "email": "username@test.org"}'
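
For scripting, the second curl call can be expressed with Python's standard library. The sketch below only builds the request object so that it can be inspected without a running Pulsar Manager; build_superuser_request is a hypothetical helper, and "example-token" stands in for the value returned by the csrf-token endpoint.

```python
import json
import urllib.request


def build_superuser_request(base_url, csrf_token, name, password, email,
                            description="test"):
    """Build (but do not send) the PUT request that creates the superuser."""
    body = json.dumps({
        "name": name,
        "password": password,
        "description": description,
        "email": email,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/pulsar-manager/users/superuser",
        data=body,
        method="PUT",
        headers={
            "X-XSRF-TOKEN": csrf_token,
            "Cookie": f"XSRF-TOKEN={csrf_token};",
            "Content-Type": "application/json",
        },
    )


req = build_superuser_request(
    "http://risen-cdh01:7750", "example-token",  # placeholder token
    "admin", "apachepulsar", "username@test.org",
)
print(req.get_method(), req.full_url)
# Sending it requires a reachable Pulsar Manager backend:
# urllib.request.urlopen(req)
```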

2.2.4. Query the cluster

Pulsar Manager calls the Pulsar admin API, which gets its information from the broker, so the cluster must be configured with the broker URL that the admin API should use. First list the clusters:

bin/pulsar-admin clusters list

2.2.5. Set the cluster URL

bin/pulsar-admin clusters update pulsar-cluster --url http://192.168.5.213:8089

2.2.6 Log in and query

Visit http://risen-cdh01:9527

Log in with the account and password created in 2.2.3.

Installation completed!

3. Introduction to Pulsar concepts

3.1. Features and characteristics

3.1.1 Multi-tenancy

The goal is to isolate resources by giving each user a separate allocation. For example, user A may use only 20% of the resources and user B 30% (tenants are used together with namespaces).

Tenants and namespaces are the two core Pulsar concepts that support multi-tenancy.
At the tenant level, Pulsar reserves appropriate storage space and applies authorization and authentication mechanisms for a specific tenant.
At the namespace level, Pulsar offers a set of configuration policies, including quotas, flow control, message expiration policies, and isolation between namespaces.

3.1.2 Flexible messaging system

  • Unifies the queue model and the stream model: only one copy of the data is stored at the topic level, and that copy can be consumed many times. Streaming and queuing consumers simply use different subscription modes, which greatly improves flexibility
  • Transactions provide exactly-once semantics, so data is neither lost nor duplicated during message transmission
  • For stream processing, Pulsar Functions can run streaming ETL over several topics and write the result to another topic
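
The idea of one stored copy that several subscriptions consume independently can be illustrated with a toy model. This is a conceptual sketch, not Pulsar's implementation: the Topic class and its methods are hypothetical, and every subscription starts at the earliest message for simplicity.

```python
class Topic:
    """Toy topic: one shared log, one read position per subscription."""

    def __init__(self):
        self.log = []        # the single stored copy of the messages
        self.positions = {}  # subscription name -> index of the next message

    def publish(self, message):
        self.log.append(message)

    def subscribe(self, subscription):
        # Simplification: a new subscription starts at the earliest message.
        self.positions.setdefault(subscription, 0)

    def read(self, subscription):
        """Return the next message for this subscription, or None if caught up."""
        position = self.positions[subscription]
        if position >= len(self.log):
            return None
        self.positions[subscription] = position + 1
        return self.log[position]


topic = Topic()
topic.subscribe("etl")
topic.subscribe("audit")
topic.publish("a")
topic.publish("b")
print(topic.read("etl"), topic.read("etl"), topic.read("audit"))
```

Both subscriptions see every message, yet the log holds each message only once.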

3.1.3 Cloud-native architecture

  • Cloud-native architecture that separates compute from storage. Data is moved off the brokers into BookKeeper, which acts as shared storage
  • The upper layer, the brokers, is stateless and handles message dispatch and serving
  • The lower layer, the bookies, is the persistent storage layer
  • Pulsar storage is segmented, which avoids being constrained during capacity expansion and allows independent scaling and fast data recovery

3.1.4 Segmented streams

  • Unbounded data is treated as a stream of segments, and the segments are stored across tiered storage, the BookKeeper cluster, and the broker nodes.
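
A rough picture of segmentation: the stream is cut into consecutive segments, and each segment is placed on a subset of storage nodes. The sketch below is illustrative only. segment_stream and place_segments are hypothetical helpers, and the rotating placement is a stand-in, not BookKeeper's actual placement policy.

```python
def segment_stream(num_entries, entries_per_segment):
    """Split entry ids 0..num_entries-1 into consecutive segments."""
    return [
        range(start, min(start + entries_per_segment, num_entries))
        for start in range(0, num_entries, entries_per_segment)
    ]


def place_segments(segments, bookies, replicas):
    """Assign each segment to `replicas` bookies, rotating the first bookie."""
    placement = {}
    for index in range(len(segments)):
        placement[index] = [
            bookies[(index + offset) % len(bookies)] for offset in range(replicas)
        ]
    return placement


segments = segment_stream(10, 4)
placement = place_segments(segments, ["bookie-1", "bookie-2", "bookie-3"], 2)
for index, segment in enumerate(segments):
    print(index, list(segment), placement[index])
```

Because each segment lives on its own set of nodes, adding a node only affects where new segments go; existing segments do not have to move.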

3.1.5 Cross-region replication

  • Enables disaster recovery across clusters and across regions

3.2 Components provided by Pulsar

3.2.1 Tiered storage

  • Data is stored in BookKeeper. When too much data accumulates, read efficiency drops, so part of the data (older segments) can be offloaded to other storage, such as HDFS

3.2.2 Pulsar IO (connectors)

  • The main purpose is to integrate Pulsar with surrounding systems
  • There are two kinds of connector: source and sink
  • Examples: HDFS, Spark, Flink, Flume, ES, HBase

3.2.3. Pulsar Functions (lightweight computing framework)

  • Gives users a FaaS (function-as-a-service) platform that is simple to deploy, program against, and operate
  • Used for lightweight stream processing
  • Similar to Kafka Streams

3.3 Differences between Pulsar and Kafka

3.3.1 Conceptual model

  • Kafka: producer → topic → consumer group → consumer
  • Pulsar: producer → topic → subscription → consumer

In Kafka, consumers belong to a consumer group, and each partition of a topic is consumed by only one consumer in the group.

In Pulsar, consumption follows the publish-subscribe model: consumers attach to a subscription, and the subscription's mode determines the consumption strategy. For example, every consumer can be set up to consume all of the data.

3.3.2 Message consumption mode

  • Kafka: mainly stream-oriented. Each partition is consumed exclusively by a single consumer, and there is no queue consumption mode
  • Pulsar: provides a unified consumption model and API, and the subscription can be freely set to shared, exclusive, or failover
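
How the subscription mode changes who receives what can be shown with a small simulation. This is a sketch of the idea only: dispatch is a hypothetical function, round-robin stands in for shared delivery, and the first consumer in the list plays the active consumer for failover.

```python
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for one subscription in the given mode."""
    received = {consumer: [] for consumer in consumers}
    if mode == "exclusive":
        # Only a single consumer may attach to an exclusive subscription.
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows one consumer")
        received[consumers[0]] = list(messages)
    elif mode == "failover":
        # Standby consumers are attached but idle until the active one fails.
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        # Messages are spread across all attached consumers.
        for message, consumer in zip(messages, cycle(consumers)):
            received[consumer].append(message)
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return received


messages = ["m1", "m2", "m3", "m4"]
print(dispatch(messages, ["c1", "c2"], "shared"))
print(dispatch(messages, ["c1", "c2"], "failover"))
```

Shared mode behaves like a work queue, while exclusive and failover preserve ordering by giving the whole stream to one consumer at a time.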

3.3.3 Message acknowledgement (ack)

  • Kafka uses offsets
  • Pulsar manages a dedicated cursor for each subscription, which tracks acknowledgements precisely so that each message is consumed exactly once
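
The difference from a single offset is that a cursor can remember messages acknowledged out of order. The following is a minimal sketch of that bookkeeping under simplifying assumptions: message ids are plain integers (real ids are more structured), and the Cursor class is hypothetical rather than Pulsar's own data structure.

```python
class Cursor:
    """Toy cursor: a mark-delete position plus individually acked ids."""

    def __init__(self):
        self.mark_delete = -1     # every id <= mark_delete is acknowledged
        self.acked_ahead = set()  # acknowledged ids beyond mark_delete

    def ack(self, message_id):
        if message_id <= self.mark_delete:
            return  # already covered by the mark-delete position
        self.acked_ahead.add(message_id)
        # Advance the mark-delete position over any contiguous acked run.
        while self.mark_delete + 1 in self.acked_ahead:
            self.mark_delete += 1
            self.acked_ahead.remove(self.mark_delete)

    def pending(self, last_id):
        """Ids up to last_id that still need to be delivered or acked."""
        return [
            message_id
            for message_id in range(self.mark_delete + 1, last_id + 1)
            if message_id not in self.acked_ahead
        ]


cursor = Cursor()
for message_id in (0, 2, 3):
    cursor.ack(message_id)
print(cursor.mark_delete, sorted(cursor.acked_ahead), cursor.pending(4))
```

With a plain offset, acknowledging 2 and 3 before 1 could not be recorded; here only messages 1 and 4 remain pending.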

3.3.4 Message retention

  • Kafka: a data retention policy can be specified when creating a topic; the default is 7 days. Expired data is deleted regardless of whether it has been consumed. TTL is not supported
  • Pulsar: a message is not deleted until every subscription has consumed it, so data is not lost. A retention period can also be set to keep consumed data, and TTL (how long a message stays valid) is supported

3.3.5 Comparison and summary

  • Pulsar is much faster than Kafka and uses fewer resources

3.4 Common terms

  • Messages: messages are the basic unit of Pulsar. A message is what a producer publishes to a topic and what a consumer consumes from a topic (sending an acknowledgement once processing is complete). Messages are analogous to letters in a postal system.

  • Producers: a producer is a program that attaches to a topic and publishes messages to a Pulsar broker.

  • Send mode: synchronous (sync) or asynchronous (async)

  • Consumers: a consumer sends a flow permit request to the broker to obtain messages. The consumer side has a queue that receives messages pushed from the broker. Its size is configured with receiverQueueSize (default: 1000). Each call to consumer.receive() takes one message from this buffer.

  • Receive mode: synchronous (sync) or asynchronous (async)

  • Listener: the client provides a listener interface for consumers. Whenever a new message arrives, its received method is called.

  • Acknowledgement: after a consumer successfully consumes a message, it sends an acknowledgement request to the broker. A message is deleted only after all subscriptions have acknowledged it; until then it is stored persistently. To keep messages after they have been acknowledged, configure a message retention policy.

  • Topic: a topic in Pulsar is a named channel that carries messages from producers to consumers. A topic name is a well-structured URL: {persistent|non-persistent}://tenant/namespace/topic

  • Namespace: a namespace is a logical grouping within a tenant. A tenant can create multiple namespaces through the admin API. For example, a tenant with several applications can create a separate namespace for each one. Namespaces let applications create and manage topics hierarchically: in my-tenant/app1, the namespace is app1 and the tenant is my-tenant. Any number of topics can be created in a namespace.

  • Subscriptions: a subscription is a named configuration rule that determines how messages are delivered to consumers. Pulsar has four subscription modes: exclusive, shared, failover, and key_shared.

  • Multi-topic subscription: a Pulsar consumer can subscribe to multiple topics at the same time
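
The topic name structure described in the list above can be taken apart mechanically. This is a small sketch for illustration; TopicName and parse_topic are hypothetical names, and the validation is deliberately minimal.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name):
    """Split '{persistent|non-persistent}://tenant/namespace/topic' into parts."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unknown or missing domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


print(parse_topic("persistent://test-tenant/ns1/tp1"))
```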

4. Pulsar architecture

Core idea: separation of compute and storage

4.1 Composition of a single Pulsar cluster

  • One or more brokers handle and load-balance the messages sent by producers (so that data does not skew onto a single broker) and dispatch those messages to consumers
  • Brokers communicate with the Pulsar configuration store to handle coordination tasks, and store messages in BookKeeper instances (bookies)
  • Brokers rely on the cluster's ZooKeeper ensemble for certain tasks
  • A BookKeeper cluster made up of one or more bookies handles persistent storage of messages
  • A ZooKeeper cluster handles coordination tasks between Pulsar clusters

4.2 Broker

  • A stateless component that mainly runs two other components:
  • An HTTP server, default port 8080 (changed to 8089 in the deployment above). It exposes the REST administration interface and the topic lookup API used by producers and consumers
  • A dispatcher on port 6650: an asynchronous TCP server that transfers data over a binary protocol

For performance, the broker normally dispatches messages to consumers from the managed ledger cache. When the backlog exceeds the cache size, the broker starts reading entries from BookKeeper instead.

4.3 ZooKeeper

Pulsar uses ZooKeeper for metadata storage, cluster configuration, and coordination.

Configuration store: stores tenants, namespaces, and other configuration items that need to be globally consistent

4.4 BookKeeper

BookKeeper is the persistent storage layer: a distributed write-ahead log (WAL).

See the official documentation for its features.

4.5 Pulsar proxy

The proxy acts as a gateway to all brokers. When clients cannot connect to the brokers directly, they can communicate with them through the proxy.

5. Introduction to Pulsar operations

5.1 pulsar-admin namespace commands

5.1.1. Create a namespace for a given tenant

pulsar-admin namespaces create test-tenant/test-namespace

5.1.2 List all namespaces under a tenant

pulsar-admin namespaces list test-tenant

5.1.3. Delete an existing namespace under a tenant

pulsar-admin namespaces delete test-tenant/ns1

5.1.4 Set the backlog quota policy

pulsar-admin namespaces set-backlog-quota --limit 10 --policy producer_request_hold test-tenant/ns1
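
What the policy means once the backlog passes the limit can be sketched as a small decision function. This is a simplified model, not broker code: on_publish is hypothetical, the return values are labels chosen for this note, and the three policy names are the ones Pulsar accepts for --policy.

```python
def on_publish(backlog_bytes, limit_bytes, policy):
    """Describe what happens to a publish request under a backlog quota."""
    if backlog_bytes <= limit_bytes:
        return "accept"
    actions = {
        # The broker holds the request and does not persist the payload.
        "producer_request_hold": "hold",
        # The producer receives an exception.
        "producer_exception": "reject",
        # The broker discards the oldest backlog messages to make room.
        "consumer_backlog_eviction": "evict",
    }
    return actions[policy]


print(on_publish(5, 10, "producer_request_hold"))
print(on_publish(11, 10, "producer_request_hold"))
print(on_publish(11, 10, "consumer_backlog_eviction"))
```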

5.1.5. View backlog quota policy

pulsar-admin namespaces get-backlog-quotas test-tenant/ns1

5.1.6. Remove backlog quota policy

pulsar-admin namespaces remove-backlog-quota test-tenant/ns1

5.1.7 Set the persistence policy

  • bookkeeper-ack-quorum: the number of acks (guaranteed copies) to wait for on each entry. Default: 0

  • bookkeeper-ensemble: the number of bookies used for a single topic. Default: 0

  • bookkeeper-write-quorum: the number of copies written for each entry. Default: 0

  • ml-mark-delete-max-rate: throttling rate for mark-delete operations (0 means unlimited). Default: 0.0

    pulsar-admin namespaces set-persistence --bookkeeper-ack-quorum 2 --bookkeeper-ensemble 3 --bookkeeper-write-quorum 2 --ml-mark-delete-max-rate 0 test-tenant/ns1
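
The three quorum values are related: BookKeeper expects ensemble >= write quorum >= ack quorum. A quick check before running the command can catch an invalid combination. check_persistence below is a hypothetical helper for illustration, not part of pulsar-admin.

```python
def check_persistence(ensemble, write_quorum, ack_quorum):
    """Validate the ordering ensemble >= write quorum >= ack quorum >= 1."""
    if not (ensemble >= write_quorum >= ack_quorum >= 1):
        raise ValueError(
            "expected ensemble >= write quorum >= ack quorum >= 1, got "
            f"{ensemble}, {write_quorum}, {ack_quorum}"
        )
    return {
        "bookies_per_topic": ensemble,
        "copies_per_entry": write_quorum,
        "acks_awaited": ack_quorum,
    }


# The values used in the command above.
print(check_persistence(ensemble=3, write_quorum=2, ack_quorum=2))
```

With these values each entry is written to 2 of the 3 bookies, and a write is confirmed once both copies are acknowledged.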

5.1.8. Get the persistence policy

pulsar-admin namespaces get-persistence test-tenant/ns1

5.1.9. Unload a namespace

pulsar-admin namespaces unload --bundle 0x00000000_0xffffffff test-tenant/ns1
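
The --bundle value is a hash range written as two hexadecimal bounds; 0x00000000_0xffffffff covers the whole 32-bit space, which is the case when the namespace has a single bundle. As a sketch of how the names look when the space is split evenly into more bundles (bundle_ranges is a hypothetical helper, and real namespaces may have uneven ranges after bundle splits):

```python
def bundle_ranges(num_bundles):
    """Name the ranges of an even split of the 32-bit hash space."""
    step = 0x100000000 // num_bundles
    bounds = [index * step for index in range(num_bundles)] + [0xFFFFFFFF]
    return [
        f"0x{low:08x}_0x{high:08x}" for low, high in zip(bounds, bounds[1:])
    ]


print(bundle_ranges(1))
print(bundle_ranges(4))
```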

5.1.10 Clear the message backlog

pulsar-admin namespaces clear-backlog --sub my-subscription test-tenant/ns1

5.1.11 Set message retention

A namespace contains multiple topics. The retained data of each topic is kept only up to a size threshold or for a limited period of time. The following command configures the retention size and retention time for the topics in the specified namespace.

pulsar-admin namespaces set-retention --size 10 --time 100 test-tenant/ns1

5.1.12. Set the message dispatch rate

Sets the message dispatch rate for all topics in the given namespace. The rate can be limited by the number of messages per period (msg-dispatch-rate) or by the number of bytes per period (byte-dispatch-rate). The period is measured in seconds and is set with dispatch-rate-period. The default for both msg-dispatch-rate and byte-dispatch-rate is -1, which disables throttling.

pulsar-admin namespaces set-dispatch-rate test-tenant/ns1 \
--msg-dispatch-rate 1000 \
--byte-dispatch-rate 1048576 \
--dispatch-rate-period 1
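
The effect of the two limits can be pictured by grouping pending messages into per-period batches. This is a simplified model and not the broker's actual rate limiter: throttle is a hypothetical function, each batch stands for one dispatch period, and -1 disables a limit as in the command options.

```python
def throttle(message_sizes, msg_rate=-1, byte_rate=-1):
    """Group message sizes into batches, one batch per dispatch period."""
    batches, current, size = [], [], 0
    for message_size in message_sizes:
        over_msgs = msg_rate != -1 and len(current) + 1 > msg_rate
        over_bytes = byte_rate != -1 and size + message_size > byte_rate
        # Start a new period once either limit would be exceeded.
        if current and (over_msgs or over_bytes):
            batches.append(current)
            current, size = [], 0
        current.append(message_size)
        size += message_size
    if current:
        batches.append(current)
    return batches


pending = [400, 400, 400, 400, 400]  # message sizes in bytes
print(throttle(pending, msg_rate=2))
print(throttle(pending, byte_rate=1200))
print(throttle(pending))
```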

5.1.13. Get the message dispatch rate configuration

Shows the dispatch rate configured for the namespace (messages and bytes per period).

pulsar-admin namespaces get-dispatch-rate test-tenant/ns1

5.2 pulsar-admin tenant commands

5.2.1. List tenants

pulsar-admin tenants list

5.2.2. Create a tenant

pulsar-admin tenants create <tenant-name>

5.2.3 Delete a tenant

pulsar-admin tenants delete <tenant-name>

5.3 pulsar-admin topic commands

5.3.1 List all persistent topics in a namespace

pulsar-admin persistent list my-tenant/my-namespace

5.3.2. Grant a client role permission to perform operations on a topic

pulsar-admin persistent grant-permission \
  --actions produce,consume --role application1 \
  persistent://test-tenant/ns1/tp1

5.3.3 Get permissions

pulsar-admin persistent permissions \
  persistent://test-tenant/ns1/tp1

Example output:

{
    "application1": [
        "consume",
        "produce"
    ]
}

5.3.4. Revoke permissions

pulsar-admin persistent revoke-permission \
  --role application1 \
  persistent://test-tenant/ns1/tp1

This revokes the permissions granted to the role earlier:

{
  "application1": [
    "consume",
    "produce"
  ]
}

5.3.5. Delete a topic

pulsar-admin persistent delete persistent://test-tenant/ns1/tp1

5.3.6. Unload a topic

pulsar-admin persistent unload persistent://test-tenant/ns1/tp1

5.3.7. Peek at 10 messages in a topic

pulsar-admin persistent peek-messages \
  --count 10 --subscription my-subscription \
  persistent://test-tenant/ns1/tp1

5.3.8. Create a topic

Note: whether or not the topic is partitioned, if it sees no activity within 60 seconds of being created, it is considered inactive and is deleted automatically.

Relevant parameters:

brokerDeleteInactiveTopicsEnabled: whether automatic deletion of inactive topics is enabled. Default: true
brokerDeleteInactiveTopicsFrequencySeconds: how often inactive topics are checked for. Default: 60 s

  • Create a non-partitioned topic

    pulsar-admin topics create persistent://my-tenant/my-namespace/mytopic

  • Create a partitioned topic

    pulsar-admin topics create-partitioned-topic persistent://my-tenant/my-namespace/mytopic --partitions 5
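
A partitioned topic is backed by one internal topic per partition, named by appending -partition-N to the topic name. The helper below is a small sketch that lists those names for the command above; partition_names is a hypothetical function written for this note.

```python
def partition_names(topic, partitions):
    """List the internal per-partition topic names of a partitioned topic."""
    return [f"{topic}-partition-{index}" for index in range(partitions)]


for name in partition_names("persistent://my-tenant/my-namespace/mytopic", 5):
    print(name)
```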

5.3.9. Look up the broker that serves a topic

pulsar-admin topics lookup persistent://my-tenant/my-namespace/mytopic

Posted by steveswt on Wed, 01 Dec 2021 02:49:29 -0800