1, Introduction
RocketMQ is Alibaba's open-source distributed messaging middleware. It draws on Kafka's design and supports message publishing and subscription, ordered messages, transactional messages, scheduled messages, message backtracking, dead-letter queues, and more. The RocketMQ architecture consists of four main parts, as shown in the following figure:
- Producer: the message producer, which supports distributed cluster deployment.
- Consumer: the message consumer, which supports distributed cluster deployment.
- NameServer: a very simple Topic routing registry that supports dynamic registration and discovery of Brokers. Producers and Consumers dynamically discover Broker routing information through the NameServer.
- Broker: mainly responsible for message storage, forwarding, and query.
Based on Apache RocketMQ 4.9.1, this article analyzes how the message storage module in the Broker is designed.
2, Storage architecture
The message file path of RocketMQ is shown in the figure.
CommitLog
CommitLog stores the message bodies and metadata written by Producers. Messages are variable-length. By default a single file is 1 GB. The file name is 20 digits long and is formed by left-padding the file's starting offset with zeros. For example, 00000000000000000000 is the first file, with a starting offset of 0 and a size of 1 GB = 1073741824 bytes. When the first file is full, the second file is 00000000001073741824, with a starting offset of 1073741824, and so on.
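The naming rule can be illustrated with a small sketch. The class and method names below are hypothetical and are not RocketMQ's own code; the sketch only assumes the default 1 GB file size and the 20-digit zero-padded naming described above.

```java
import java.util.Locale;

final class CommitLogFileNames {
    // Default CommitLog file size stated above: 1 GB = 1073741824 bytes.
    static final long FILE_SIZE = 1024L * 1024 * 1024;

    // The file name is the starting offset, left-padded with zeros to 20 digits.
    static String fileNameFor(long startOffset) {
        return String.format(Locale.ROOT, "%020d", startOffset);
    }

    // Starting offset of the file that holds the given global physical offset.
    static long startOffsetOf(long physicalOffset) {
        return physicalOffset - (physicalOffset % FILE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor(0L));
        System.out.println(fileNameFor(startOffsetOf(FILE_SIZE + 100)));
    }
}
```

Because the name encodes the starting offset, the file that holds any global offset can be found by arithmetic alone, without a separate lookup table.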
ConsumeQueue
ConsumeQueue is the message consumption queue. A ConsumeQueue file can be regarded as an index over the CommitLog. It uses a fixed-length design: each entry is 20 bytes, consisting of an 8-byte CommitLog physical offset, a 4-byte message length, and an 8-byte tag hashcode. A single file holds 300,000 entries, and each entry can be accessed randomly like an array element. Each ConsumeQueue file is therefore about 5.72 MB (300,000 × 20 bytes).
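The fixed-length layout is what makes array-style access possible: the position of entry i is simply i × 20. The following is a minimal sketch of that layout on a plain ByteBuffer. The class and method names are hypothetical, and only the field sizes come from the description above.

```java
import java.nio.ByteBuffer;

final class ConsumeQueueEntrySketch {
    static final int ENTRY_SIZE = 20;            // 8-byte offset + 4-byte size + 8-byte tag hashcode
    static final int ENTRIES_PER_FILE = 300_000; // entries per ConsumeQueue file

    // Writes entry number `index` at its fixed position.
    static void putEntry(ByteBuffer file, int index, long commitLogOffset, int size, long tagsCode) {
        int pos = index * ENTRY_SIZE;
        file.putLong(pos, commitLogOffset);
        file.putInt(pos + 8, size);
        file.putLong(pos + 12, tagsCode);
    }

    // Reads entry number `index` back as {commitLogOffset, size, tagsCode}.
    static long[] getEntry(ByteBuffer file, int index) {
        int pos = index * ENTRY_SIZE;
        return new long[] {file.getLong(pos), file.getInt(pos + 8), file.getLong(pos + 12)};
    }

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(ENTRY_SIZE * 4);
        putEntry(file, 2, 1073741824L, 256, "TagA".hashCode());
        long[] entry = getEntry(file, 2);
        System.out.println(entry[0] + " " + entry[1] + " " + entry[2]);
        System.out.println("bytes per file: " + (long) ENTRY_SIZE * ENTRIES_PER_FILE);
    }
}
```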
IndexFile
IndexFile provides a way to query messages by key or by time range. A single IndexFile is about 400 MB and can store 20 million index entries. Its underlying storage design is similar to the JDK's HashMap data structure.
Other files: the config folder stores runtime configuration; the abort file indicates whether the Broker shut down normally; the checkpoint file stores the timestamps of the last flush of the CommitLog, ConsumeQueue, and IndexFile files. These are beyond the scope of this article.
By comparison, in Kafka each partition of each Topic corresponds to its own file, which is written sequentially and flushed periodically. Once a single Broker hosts too many Topics, however, these sequential writes degenerate into random writes. In RocketMQ, all Topics on a single Broker are written sequentially to the same CommitLog, which guarantees strictly sequential writes. The trade-off is that RocketMQ must first get a message's physical offset from the ConsumeQueue and then read the message content from the CommitLog, which results in random reads.
2.1 Page Cache and mmap
Before introducing the implementation of the Broker message storage module, this section explains two concepts: Page Cache and mmap.
Page Cache is the OS's cache of file data, used to speed up file reads and writes. In general, a program's sequential file reads and writes are almost as fast as memory access, mainly because the OS sets aside part of memory as Page Cache to optimize read and write operations. For writes, the OS first writes to the cache, and the pdflush kernel thread then flushes the cached data to the physical disk asynchronously. For reads, if the Page Cache is missed, the OS reads the requested data from the physical disk and also prefetches the adjacent blocks.
mmap maps a file on disk directly into the user-space address range, which avoids the overhead in traditional IO of copying file data back and forth between the kernel buffer and the user-space buffer. FileChannel in Java NIO provides the map() method to implement mmap. For a comparison of the read-write performance of FileChannel and mmap, refer to this article.
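As a minimal illustration of FileChannel.map(), the sketch below writes through a memory mapping and then reads the file back with an ordinary file read. It uses only standard JDK calls. The class name, temp-file prefix, and method name are made up for the example and are not part of RocketMQ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {
    // Writes text through a memory mapping, then reads it back with a plain file read.
    static String writeViaMmapAndReadBack(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        try {
            Path path = Files.createTempFile("mmap-sketch", ".dat");
            path.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping a READ_WRITE region past the end of the file grows the file.
                MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
                mapped.put(data);  // an ordinary memory write; the data lands in the Page Cache
                mapped.force();    // ask the OS to write the dirty pages to the storage device
            }
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeViaMmapAndReadBack("hello mmap"));
    }
}
```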
2.2 Broker module
The following figure is the Broker storage architecture diagram, showing the business flow process of the Broker module from receiving a message to returning a response.
Service access layer: RocketMQ implements its underlying communication on Netty's multithreaded Reactor model. The Reactor main thread pool eventLoopGroupBoss is responsible for accepting TCP connections and has only one thread by default. Once a connection is established, it is handed to the Reactor sub-thread pool eventLoopGroupSelector, which handles read and write events.
defaultEventExecutorGroup is responsible for SSL authentication, encoding and decoding, idle checks, and network connection management. Then, based on the business request code of the RemotingCommand, the corresponding processor is looked up in the local cache variable processorTable, wrapped in a task, and submitted to that processor's business thread pool for execution. The Broker module improves system throughput through this four-level thread pool design.
Business processing layer: processes various business requests called through RPC, including:
- SendMessageProcessor handles Producer requests to send messages;
- PullMessageProcessor handles Consumer requests to pull messages;
- QueryMessageProcessor handles requests to query messages by message key.
Storage logic layer: DefaultMessageStore is the core storage logic class of RocketMQ, which provides message storage, reading, deletion and other capabilities.
File mapping layer: maps CommitLog, ConsumeQueue, and IndexFile files to the storage object MappedFile.
Data transport layer: supports reading and writing messages through mmap memory mapping, as well as reading messages through mmap while writing messages to off-heap memory.
The following chapters will analyze how RocketMQ implements high-performance storage from the perspective of source code.
3, Message writing
Taking the production of a single message as an example, the sequence of the message write is shown in the figure below. The business logic flows between the layers shown in the Broker storage architecture above.
The core code for the underlying message write is in the asyncPutMessage method of CommitLog. It consists of three main steps: obtaining the MappedFile, writing the message to the buffer, and submitting the flush request. Note that the first two steps run while holding a spin lock or ReentrantLock, which ensures that messages on a single Broker are written serially. In 4.9.1 the flush request is submitted after the lock is released.
    // org.apache.rocketmq.store.CommitLog::asyncPutMessage
    public CompletableFuture<PutMessageResult> asyncPutMessage(final MessageExtBrokerInner msg) {
        ...
        putMessageLock.lock(); // spin lock or ReentrantLock, depending on the store config
        try {
            // Get the latest MappedFile
            MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile();
            ...
            // Write the message to the buffer
            result = mappedFile.appendMessage(msg, this.appendMessageCallback, putMessageContext);
            ...
        } finally {
            putMessageLock.unlock();
        }
        ...
        // Submit the flush request (outside the lock)
        CompletableFuture<PutMessageStatus> flushResultFuture = submitFlushRequest(result, msg);
        ...
    }
Here's what these three steps do.
3.1 MappedFile initialization
The AllocateMappedFileService asynchronous thread that manages MappedFile creation is started when the Broker is initialized. The message processing thread and the AllocateMappedFileService thread are associated through the queue requestQueue.
When a message is written, the putRequestAndReturnMappedFile method of AllocateMappedFileService is called to put MappedFile creation requests into the requestQueue. Two AllocateRequest objects are built and put into the queue at the same time.
The AllocateMappedFileService thread loops, taking AllocateRequest objects from the requestQueue and creating a MappedFile for each. The message processing thread waits on a CountDownLatch for the first MappedFile and returns once it has been created.
When the message processing thread next needs a new MappedFile, it can directly use the one that was pre-created. Pre-creating MappedFiles in this way reduces the time spent waiting for file creation.
    // org.apache.rocketmq.store.AllocateMappedFileService::putRequestAndReturnMappedFile
    public MappedFile putRequestAndReturnMappedFile(String nextFilePath, String nextNextFilePath, int fileSize) {
        // Request creation of the MappedFile needed now
        AllocateRequest nextReq = new AllocateRequest(nextFilePath, fileSize);
        boolean nextPutOK = this.requestTable.putIfAbsent(nextFilePath, nextReq) == null;
        ...
        // Request pre-creation of the next MappedFile
        AllocateRequest nextNextReq = new AllocateRequest(nextNextFilePath, fileSize);
        boolean nextNextPutOK = this.requestTable.putIfAbsent(nextNextFilePath, nextNextReq) == null;
        ...
        // Get the MappedFile created for this request
        AllocateRequest result = this.requestTable.get(nextFilePath);
        ...
    }

    // org.apache.rocketmq.store.AllocateMappedFileService::run
    public void run() {
        ...
        while (!this.isStopped() && this.mmapOperation()) {
        }
        ...
    }

    // org.apache.rocketmq.store.AllocateMappedFileService::mmapOperation
    private boolean mmapOperation() {
        ...
        // Take an AllocateRequest from the queue
        req = this.requestQueue.take();
        ...
        // Check whether the off-heap memory pool is enabled
        if (messageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
            // MappedFile backed by off-heap memory
            mappedFile = ServiceLoader.load(MappedFile.class).iterator().next();
            mappedFile.init(req.getFilePath(), req.getFileSize(), messageStore.getTransientStorePool());
        } else {
            // Ordinary MappedFile
            mappedFile = new MappedFile(req.getFilePath(), req.getFileSize());
        }
        ...
        // MappedFile preheating
        if (mappedFile.getFileSize() >= this.messageStore.getMessageStoreConfig().getMappedFileSizeCommitLog()
            && this.messageStore.getMessageStoreConfig().isWarmMapedFileEnable()) {
            mappedFile.warmMappedFile(this.messageStore.getMessageStoreConfig().getFlushDiskType(),
                this.messageStore.getMessageStoreConfig().getFlushLeastPagesWhenWarmMapedFile());
        }
        req.setMappedFile(mappedFile);
        ...
    }
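The hand-off between the message processing thread and the allocation thread can be sketched in isolation. The following is a simplified model, not RocketMQ's implementation: the class PreallocateSketch, its Request type, and the byte array standing in for a MappedFile are all hypothetical. Only the pattern comes from the text above: a request table, a request queue, and a CountDownLatch per request.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PreallocateSketch {
    static final class Request {
        final String path;
        final CountDownLatch done = new CountDownLatch(1);
        volatile byte[] file; // stands in for a MappedFile
        Request(String path) { this.path = path; }
    }

    private final ConcurrentMap<String, Request> requestTable = new ConcurrentHashMap<>();
    private final BlockingQueue<Request> requestQueue = new LinkedBlockingQueue<>();
    private final AtomicInteger created = new AtomicInteger();

    PreallocateSketch() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Request req = requestQueue.take(); // block until a request arrives
                    req.file = new byte[16];           // "create" the file
                    created.incrementAndGet();
                    req.done.countDown();              // release the waiting caller
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // treat interruption as shutdown
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Requests the file needed now plus the one after it; waits only for the first.
    byte[] putRequestAndReturn(String nextPath, String nextNextPath) {
        submit(nextPath);
        submit(nextNextPath);
        Request req = requestTable.get(nextPath);
        try {
            if (!req.done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("file creation timed out: " + nextPath);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        requestTable.remove(nextPath);
        return req.file;
    }

    // Enqueues a request only if no request for this path is already pending or pre-created.
    private void submit(String path) {
        Request req = new Request(path);
        if (requestTable.putIfAbsent(path, req) == null) {
            requestQueue.offer(req);
        }
    }

    int createdCount() { return created.get(); }
}
```

In this sketch, a first call for ("f0", "f1") waits for f0 only. A later call for ("f1", "f2") finds the f1 request already in the table, so it usually returns without waiting for file creation.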
Each time an ordinary MappedFile is created, a mappedByteBuffer is created for it. The following code shows how mmap is done in Java.
    // org.apache.rocketmq.store.MappedFile::init
    private void init(final String fileName, final int fileSize) throws IOException {
        ...
        this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
        this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
        ...
    }
If off-heap memory is enabled, that is, when transientStorePoolEnable = true, mappedByteBuffer is used only to read messages and off-heap memory is used to write them, which separates message reads from message writes. The off-heap memory objects are not created each time a MappedFile is created. Instead, they are initialized at system startup according to the size of the off-heap memory pool. Each off-heap DirectByteBuffer has the same size as a CommitLog file. The off-heap memory is locked so that it will not be swapped out.
    // org.apache.rocketmq.store.TransientStorePool
    public void init() {
        for (int i = 0; i < poolSize; i++) {
            // Allocate off-heap memory of the same size as a CommitLog file
            ByteBuffer byteBuffer = ByteBuffer.allocateDirect(fileSize);
            final long address = ((DirectBuffer) byteBuffer).address();
            Pointer pointer = new Pointer(address);
            // Lock the off-heap memory so that it will not be swapped out
            LibC.INSTANCE.mlock(pointer, new NativeLong(fileSize));
            availableBuffers.offer(byteBuffer);
        }
    }
The mmapOperation method above contains MappedFile preheating logic. Why do files need to be preheated, and how is it done?
mmap only establishes the mapping between the process's virtual address range and the file. It does not load the file's contents into physical memory. When data is read or written and the page is not in the Page Cache, a page fault occurs and the data is loaded from disk into memory, which hurts read-write performance. To avoid page faults and to prevent the operating system from swapping the relevant memory pages out to swap space, RocketMQ preheats files as follows.
    // org.apache.rocketmq.store.MappedFile::warmMappedFile
    public void warmMappedFile(FlushDiskType type, int pages) {
        ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
        int flush = 0;
        // Write a zero byte into every page of the 1 GB file so that the OS actually
        // allocates physical memory. Pages that are never touched get no physical memory,
        // and writing messages to them later would trigger page faults.
        for (int i = 0, j = 0; i < this.fileSize; i += MappedFile.OS_PAGE_SIZE, j++) {
            byteBuffer.put(i, (byte) 0);
            // force flush when flush disk type is sync
            if (type == FlushDiskType.SYNC_FLUSH) {
                if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                    flush = i;
                    mappedByteBuffer.force();
                }
            }
            // prevent gc
            if (j % 1000 == 0) {
                Thread.sleep(0);
            }
        }
        // force flush when prepare load finished
        if (type == FlushDiskType.SYNC_FLUSH) {
            mappedByteBuffer.force();
        }
        ...
        this.mlock();
    }

    // org.apache.rocketmq.store.MappedFile::mlock
    public void mlock() {
        final long beginTime = System.currentTimeMillis();
        final long address = ((DirectBuffer) (this.mappedByteBuffer)).address();
        Pointer pointer = new Pointer(address);
        {
            // Lock the file's Page Cache with the mlock system call so that it is not swapped out
            int ret = LibC.INSTANCE.mlock(pointer, new NativeLong(this.fileSize));
        }
        {
            // Call madvise to advise the OS that the file will be accessed in the near future
            int ret = LibC.INSTANCE.madvise(pointer, new NativeLong(this.fileSize), LibC.MADV_WILLNEED);
        }
    }
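The page-touching part of preheating can be tried with nothing but the JDK. The sketch below is an illustration under stated assumptions: the 4096-byte page size and the WarmupSketch class are assumptions made for the example, and the mlock and madvise steps are left out because they need native calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class WarmupSketch {
    static final int OS_PAGE_SIZE = 4096; // assumed page size

    // Touches one byte per page so the OS backs each page with physical memory.
    // Returns the number of pages touched.
    static int warm(MappedByteBuffer mapped, int fileSize) {
        int pages = 0;
        for (int i = 0; i < fileSize; i += OS_PAGE_SIZE) {
            mapped.put(i, (byte) 0);
            pages++;
        }
        mapped.force(); // write the zero-filled pages to the storage device
        return pages;
    }

    // Maps a temporary file of the given size and warms it.
    static int warmTempFile(int fileSize) {
        try {
            Path path = Files.createTempFile("warm-sketch", ".dat");
            path.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
                return warm(mapped, fileSize);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pages touched: " + warmTempFile(16 * OS_PAGE_SIZE));
    }
}
```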
To sum up, RocketMQ pre-creates the next file each time to reduce file creation latency, and uses file preheating to avoid page faults during reads and writes.
3.2 Message writing
3.2.1 Write CommitLog
The logical view of each message storage in CommitLog is shown in the following figure. TOTALSIZE is the storage space occupied by the whole message.
The following table describes which fields are included in each message, as well as the space occupied by these fields and field introduction.
The message is written by calling the appendMessagesInner method of MappedFile.
    // org.apache.rocketmq.store.MappedFile::appendMessagesInner
    public AppendMessageResult appendMessagesInner(final MessageExt messageExt, final AppendMessageCallback cb,
        PutMessageContext putMessageContext) {
        // Decide whether to write through the DirectBuffer (writeBuffer) or the MappedByteBuffer
        ByteBuffer byteBuffer = writeBuffer != null ? writeBuffer.slice() : this.mappedByteBuffer.slice();
        ...
        byteBuffer.position(currentPos);
        AppendMessageResult result = cb.doAppend(this.getFileFromOffset(), byteBuffer,
            this.fileSize - currentPos, (MessageExtBrokerInner) messageExt, putMessageContext);
        ...
        return result;
    }

    // org.apache.rocketmq.store.CommitLog.DefaultAppendMessageCallback::doAppend
    public AppendMessageResult doAppend(final long fileFromOffset, final ByteBuffer byteBuffer, final int maxBlank,
        final MessageExtBrokerInner msgInner, PutMessageContext putMessageContext) {
        ...
        ByteBuffer preEncodeBuffer = msgInner.getEncodedBuff();
        ...
        // The message is only written to the buffer; it has not been flushed to disk yet
        byteBuffer.put(preEncodeBuffer);
        msgInner.setEncodedBuff(null);
        ...
        return result;
    }
At this point the message has only been written to a ByteBuffer and has not been persisted to disk. When is it persisted? The next section covers the flush mechanism. First, a question: how are ConsumeQueue and IndexFile written?
The answer is the ReputMessageService in the storage logic layer of the storage architecture diagram. When the MessageStore is initialized, it starts a ReputMessageService asynchronous thread. Once started, the thread calls the doReput method in a loop to notify ConsumeQueue and IndexFile to update. ConsumeQueue and IndexFile can be updated asynchronously because the CommitLog stores the queue, Topic, and other information needed to rebuild them. Even if the Broker goes down abnormally, it can recover ConsumeQueue and IndexFile from the CommitLog after restart.
    // org.apache.rocketmq.store.DefaultMessageStore.ReputMessageService::run
    public void run() {
        ...
        while (!this.isStopped()) {
            Thread.sleep(1);
            this.doReput();
        }
        ...
    }

    // org.apache.rocketmq.store.DefaultMessageStore.ReputMessageService::doReput
    private void doReput() {
        ...
        // Get the new messages stored in the CommitLog
        DispatchRequest dispatchRequest =
            DefaultMessageStore.this.commitLog.checkMessageAndReturnSize(result.getByteBuffer(), false, false);
        int size = dispatchRequest.getBufferSize() == -1 ? dispatchRequest.getMsgSize() : dispatchRequest.getBufferSize();
        if (dispatchRequest.isSuccess()) {
            if (size > 0) {
                // If there is a new message, call CommitLogDispatcherBuildConsumeQueue and
                // CommitLogDispatcherBuildIndex to build ConsumeQueue and IndexFile respectively
                DefaultMessageStore.this.doDispatch(dispatchRequest);
            }
            ...
        }
        ...
    }
3.2.2 Write ConsumeQueue
As shown in the figure below, each record of ConsumeQueue has 20 bytes in total, including 8-byte CommitLog physical offset, 4-byte message length and 8-byte tag hashcode.
The persistence logic of ConsumeQueue records is as follows.
    // org.apache.rocketmq.store.ConsumeQueue::putMessagePositionInfo
    private boolean putMessagePositionInfo(final long offset, final int size, final long tagsCode,
        final long cqOffset) {
        ...
        this.byteBufferIndex.flip();
        this.byteBufferIndex.limit(CQ_STORE_UNIT_SIZE);
        this.byteBufferIndex.putLong(offset);   // 8-byte CommitLog physical offset
        this.byteBufferIndex.putInt(size);      // 4-byte message length
        this.byteBufferIndex.putLong(tagsCode); // 8-byte tag hashcode
        final long expectLogicOffset = cqOffset * CQ_STORE_UNIT_SIZE;
        MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile(expectLogicOffset);
        if (mappedFile != null) {
            ...
            return mappedFile.appendMessage(this.byteBufferIndex.array());
        }
        return false;
    }
3.2.3 Write IndexFile
The logical structure of IndexFile is shown in the following figure, which is similar to the array and linked list structure of HashMap in JDK. It is mainly composed of Header, Slot Table and Index Linked List.
Header: the header of the IndexFile, occupying 40 bytes. It mainly includes the following fields:
- beginTimestamp: the minimum store time of the messages contained in the IndexFile.
- endTimestamp: the maximum store time of the messages contained in the IndexFile.
- beginPhyoffset: the minimum CommitLog offset of the messages contained in the IndexFile.
- endPhyoffset: the maximum CommitLog offset of the messages contained in the IndexFile.
- hashSlotCount: the total number of hash slots contained in the IndexFile.
- indexCount: the number of Index entries used in the IndexFile.
Slot Table: contains 5 million hash slots by default. Each hash slot stores the position of the first IndexItem with that hash value.
Index Linked List: holds up to 20 million IndexItems by default. Each IndexItem is composed as follows:
- Key Hash: the hash of the message key. When searching by key, the hash is compared first, and then the key itself is compared.
- CommitLog Offset: the physical offset of the message.
- Timestamp: the difference between the message's store time and the timestamp of the first message.
- Next Index Offset: the position of the next IndexItem in the chain after a hash conflict.
Each hash slot in the Slot Table stores the position of an IndexItem in the Index Linked List. On a hash conflict, the new IndexItem is inserted at the head of the chain, and its Next Index Offset stores the position of the IndexItem that was previously at the head. At the same time, the hash slot in the Slot Table is overwritten with the position of the newest IndexItem. The code is as follows:
    // org.apache.rocketmq.store.index.IndexFile::putKey
    public boolean putKey(final String key, final long phyOffset, final long storeTimestamp) {
        int keyHash = indexKeyHashMethod(key);
        int slotPos = keyHash % this.hashSlotNum;
        int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;
        ...
        // Read the position of the current chain head from the Slot Table
        int slotValue = this.mappedByteBuffer.getInt(absSlotPos);
        ...
        int absIndexPos = IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
            + this.indexHeader.getIndexCount() * indexSize;
        this.mappedByteBuffer.putInt(absIndexPos, keyHash);
        this.mappedByteBuffer.putLong(absIndexPos + 4, phyOffset);
        this.mappedByteBuffer.putInt(absIndexPos + 4 + 8, (int) timeDiff);
        // Store the position of the IndexItem that was previously at the head of the chain
        this.mappedByteBuffer.putInt(absIndexPos + 4 + 8 + 4, slotValue);
        // Point the hash slot in the Slot Table at the newest IndexItem
        this.mappedByteBuffer.putInt(absSlotPos, this.indexHeader.getIndexCount());
        if (this.indexHeader.getIndexCount() <= 1) {
            this.indexHeader.setBeginPhyOffset(phyOffset);
            this.indexHeader.setBeginTimestamp(storeTimestamp);
        }
        if (invalidIndex == slotValue) {
            this.indexHeader.incHashSlotCount();
        }
        this.indexHeader.incIndexCount();
        this.indexHeader.setEndPhyOffset(phyOffset);
        this.indexHeader.setEndTimestamp(storeTimestamp);
        return true;
        ...
    }
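To make the head-insertion behaviour concrete, here is a small in-memory model of the same layout: header, slot table, then index entries. It is a sketch under assumptions rather than RocketMQ's class. IndexFileSketch and its methods are hypothetical, the timestamp field is left at zero, the hash is simply the non-negative String hashCode, and index 0 is reserved so that a slot value of 0 can mean "empty".

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class IndexFileSketch {
    static final int HEADER_SIZE = 40;  // header bytes
    static final int SLOT_SIZE = 4;     // each slot holds one int: index of the chain head
    static final int INDEX_SIZE = 20;   // 4-byte hash + 8-byte offset + 4-byte time diff + 4-byte next
    static final int INVALID_INDEX = 0; // slot value meaning "no entry"

    private final int slotNum;
    private final int indexNum;
    private final ByteBuffer buf;
    private int indexCount = 1;         // entry 0 is reserved (see INVALID_INDEX)

    IndexFileSketch(int slotNum, int indexNum) {
        this.slotNum = slotNum;
        this.indexNum = indexNum;
        this.buf = ByteBuffer.allocate(HEADER_SIZE + slotNum * SLOT_SIZE + indexNum * INDEX_SIZE);
    }

    private static int hashOf(String key) {
        return key.hashCode() & 0x7fffffff; // non-negative hash (simplification)
    }

    private int slotPosition(int keyHash) {
        return HEADER_SIZE + (keyHash % slotNum) * SLOT_SIZE;
    }

    private int indexPosition(int index) {
        return HEADER_SIZE + slotNum * SLOT_SIZE + index * INDEX_SIZE;
    }

    // Inserts the new entry at the head of its slot's chain.
    boolean putKey(String key, long phyOffset) {
        if (indexCount >= indexNum) {
            return false; // file is full
        }
        int keyHash = hashOf(key);
        int absSlotPos = slotPosition(keyHash);
        int previousHead = buf.getInt(absSlotPos);
        int absIndexPos = indexPosition(indexCount);
        buf.putInt(absIndexPos, keyHash);
        buf.putLong(absIndexPos + 4, phyOffset);
        buf.putInt(absIndexPos + 12, 0);            // time diff omitted in this sketch
        buf.putInt(absIndexPos + 16, previousHead); // link to the old chain head
        buf.putInt(absSlotPos, indexCount);         // slot now points at the new entry
        indexCount++;
        return true;
    }

    // Walks the chain from the slot and collects offsets whose hash matches, newest first.
    List<Long> lookup(String key) {
        List<Long> offsets = new ArrayList<>();
        int keyHash = hashOf(key);
        int next = buf.getInt(slotPosition(keyHash));
        while (next != INVALID_INDEX) {
            int absIndexPos = indexPosition(next);
            if (buf.getInt(absIndexPos) == keyHash) {
                offsets.add(buf.getLong(absIndexPos + 4));
            }
            next = buf.getInt(absIndexPos + 16);
        }
        return offsets;
    }
}
```

With a single slot every key collides, which makes the chain easy to observe: the most recently inserted offset for a key is returned first.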
To sum up, a complete message write consists of a synchronous write to the CommitLog file cache and asynchronous construction of the ConsumeQueue and IndexFile files.
3.3 Message flushing
RocketMQ flushes messages to disk in two main ways: synchronous flushing and asynchronous flushing.
(1) Synchronous flushing: the Broker returns a successful ACK to the Producer only after the message has actually been persisted to disk. Synchronous flushing provides a strong guarantee of message reliability, but it has a significant impact on performance. This mode is commonly used in financial services.
(2) Asynchronous flushing: this mode takes full advantage of the OS Page Cache. As soon as the message is written to the Page Cache, a successful ACK can be returned to the Producer. The flush itself is carried out by a background asynchronous thread, which reduces read-write latency and improves MQ performance and throughput. Asynchronous flushing comes in two variants: with off-heap memory enabled and with off-heap memory disabled.
When a flush request is submitted in CommitLog, the current Broker configuration determines whether the flush is synchronous or asynchronous.
    // org.apache.rocketmq.store.CommitLog::submitFlushRequest
    public CompletableFuture<PutMessageStatus> submitFlushRequest(AppendMessageResult result, MessageExt messageExt) {
        // Synchronous flush
        if (FlushDiskType.SYNC_FLUSH == this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
            final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
            if (messageExt.isWaitStoreMsgOK()) {
                GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes(),
                    this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
                service.putRequest(request);
                return request.future();
            } else {
                service.wakeup();
                return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
            }
        }
        // Asynchronous flush
        else {
            if (!this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
                flushCommitLogService.wakeup();
            } else {
                // Asynchronous flush with off-heap memory enabled
                commitLogService.wakeup();
            }
            return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
        }
    }
The inheritance relationship of GroupCommitService, FlushRealTimeService, and CommitRealTimeService is shown in the figure.
GroupCommitService: the synchronous flush thread. As shown in the following figure, after the message is written to the Page Cache, it is flushed synchronously through GroupCommitService, and the message processing thread blocks while waiting for the flush result.
    // org.apache.rocketmq.store.CommitLog.GroupCommitService::run
    public void run() {
        ...
        while (!this.isStopped()) {
            this.waitForRunning(10);
            this.doCommit();
        }
        ...
    }

    // org.apache.rocketmq.store.CommitLog.GroupCommitService::doCommit
    private void doCommit() {
        ...
        for (GroupCommitRequest req : this.requestsRead) {
            boolean flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();
            for (int i = 0; i < 2 && !flushOK; i++) {
                CommitLog.this.mappedFileQueue.flush(0);
                flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();
            }
            // Wake up the message processing thread that is waiting for the flush to complete
            req.wakeupCustomer(flushOK ? PutMessageStatus.PUT_OK : PutMessageStatus.FLUSH_DISK_TIMEOUT);
        }
        ...
    }

    // org.apache.rocketmq.store.MappedFile::flush
    public int flush(final int flushLeastPages) {
        if (this.isAbleToFlush(flushLeastPages)) {
            ...
            // When writeBuffer is used or the position of fileChannel is not 0, force the flush through fileChannel
            if (writeBuffer != null || this.fileChannel.position() != 0) {
                this.fileChannel.force(false);
            } else {
                // Otherwise force the flush through the MappedByteBuffer
                this.mappedByteBuffer.force();
            }
            ...
        }
        return this.getFlushedPosition();
    }
FlushRealTimeService: the asynchronous flush thread used when off-heap memory is not enabled. As shown in the following figure, after the message is written to the Page Cache, the message processing thread returns immediately and the disk is flushed asynchronously by FlushRealTimeService.
    // org.apache.rocketmq.store.CommitLog.FlushRealTimeService::run
    public void run() {
        ...
        // Check whether flushing is strictly periodic
        if (flushCommitLogTimed) {
            // Sleep for a fixed interval
            Thread.sleep(interval);
        } else {
            // Non-periodic: flush as soon as the thread is woken up, or after the interval elapses
            this.waitForRunning(interval);
        }
        ...
        // The same forced flush method is used here as in GroupCommitService
        CommitLog.this.mappedFileQueue.flush(flushPhysicQueueLeastPages);
        ...
    }
CommitRealTimeService: the asynchronous commit thread used when off-heap memory is enabled. As shown in the figure below, the message processing thread returns immediately after writing the message to off-heap memory. CommitRealTimeService then asynchronously commits the message from off-heap memory to the Page Cache, and the FlushRealTimeService thread flushes it to disk asynchronously.
Note: after the message is asynchronously submitted to the Page Cache, the business can read the message from the MappedByteBuffer.
After the message is written to the writeBuffer in off-heap memory, the isAbleToCommit method determines whether the uncommitted data has reached the minimum number of pages to commit (4 pages by default). If it has, the data is committed in a batch. Otherwise it stays in off-heap memory, where there is a risk of message loss. With this batching, the Page Cache pages being read and those being written are several pages apart, which reduces the probability of read-write conflicts on the Page Cache and achieves read-write separation. The implementation is as follows:
    // org.apache.rocketmq.store.CommitLog.CommitRealTimeService
    class CommitRealTimeService extends FlushCommitLogService {
        @Override
        public void run() {
            while (!this.isStopped()) {
                ...
                int commitDataLeastPages =
                    CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitCommitLogLeastPages();
                ...
                // Commit the message to the Page Cache, which eventually calls MappedFile::commit0.
                // Data is committed only when the minimum number of pages is reached;
                // otherwise it stays in off-heap memory.
                boolean result = CommitLog.this.mappedFileQueue.commit(commitDataLeastPages);
                if (!result) {
                    // result == false means some data was committed, so wake up
                    // flushCommitLogService to flush it to disk
                    flushCommitLogService.wakeup();
                }
                ...
                this.waitForRunning(interval);
            }
        }
    }

    // org.apache.rocketmq.store.MappedFile::commit0
    protected void commit0() {
        int writePos = this.wrotePosition.get();
        int lastCommittedPosition = this.committedPosition.get();
        // The message is committed to the Page Cache; the disk is not flushed here
        if (writePos - lastCommittedPosition > 0) {
            ByteBuffer byteBuffer = writeBuffer.slice();
            byteBuffer.position(lastCommittedPosition);
            byteBuffer.limit(writePos);
            this.fileChannel.position(lastCommittedPosition);
            this.fileChannel.write(byteBuffer);
            this.committedPosition.set(writePos);
        }
    }
The following summarizes the usage scenarios, advantages, and disadvantages of the three flush mechanisms.
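The batching rule can be tried out with a small stand-alone model of an off-heap write buffer in front of a FileChannel. Everything here is a hypothetical sketch rather than RocketMQ code: the CommitSketch class, the 4096-byte page size, and the single-threaded fields are assumptions, and the page-count check is modeled on the behaviour described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class CommitSketch {
    static final int OS_PAGE_SIZE = 4096; // assumed page size

    private final ByteBuffer writeBuffer; // off-heap buffer that receives writes
    private final FileChannel fileChannel;
    private int wrotePosition = 0;
    private int committedPosition = 0;

    CommitSketch(FileChannel fileChannel, int fileSize) {
        this.fileChannel = fileChannel;
        this.writeBuffer = ByteBuffer.allocateDirect(fileSize);
    }

    // Appends to off-heap memory only; nothing reaches the file yet.
    void append(byte[] data) {
        ByteBuffer view = writeBuffer.duplicate();
        view.position(wrotePosition);
        view.put(data);
        wrotePosition += data.length;
    }

    // True when at least commitLeastPages pages are waiting (or any data, if the threshold is 0).
    boolean isAbleToCommit(int commitLeastPages) {
        if (commitLeastPages > 0) {
            return (wrotePosition / OS_PAGE_SIZE) - (committedPosition / OS_PAGE_SIZE) >= commitLeastPages;
        }
        return wrotePosition > committedPosition;
    }

    // Copies the uncommitted range to the file (Page Cache) and returns the committed position.
    int commit(int commitLeastPages) throws IOException {
        if (isAbleToCommit(commitLeastPages)) {
            ByteBuffer view = writeBuffer.duplicate();
            view.position(committedPosition);
            view.limit(wrotePosition);
            long filePos = committedPosition;
            while (view.hasRemaining()) {
                filePos += fileChannel.write(view, filePos);
            }
            committedPosition = wrotePosition;
        }
        return committedPosition;
    }

    // Returns {committed after a small write, committed after a large write, file size}.
    static long[] demo() {
        try {
            Path path = Files.createTempFile("commit-sketch", ".dat");
            path.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                CommitSketch file = new CommitSketch(channel, 64 * 1024);
                file.append(new byte[100]);
                long afterSmallWrite = file.commit(4); // fewer than 4 pages pending: stays off-heap
                file.append(new byte[4 * OS_PAGE_SIZE]);
                long afterLargeWrite = file.commit(4); // threshold reached: everything is committed
                return new long[] {afterSmallWrite, afterLargeWrite, channel.size()};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The first commit attempt leaves the 100 bytes in off-heap memory, which is the window in which data could be lost if the process died. The second attempt crosses the four-page threshold and commits the whole pending range at once.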
4, Message reading
The message reading logic is much simpler than the writing logic. The following focuses on how to query messages by offset and by key.
4.1 Query by offset
Reading a message takes two steps: first look up the message's physical offset in the CommitLog from the ConsumeQueue, and then read the message content from the CommitLog file.
    // org.apache.rocketmq.store.DefaultMessageStore::getMessage
    public GetMessageResult getMessage(final String group, final String topic, final int queueId,
        final long offset, final int maxMsgNums, final MessageFilter messageFilter) {
        long nextBeginOffset = offset;
        GetMessageResult getResult = new GetMessageResult();
        final long maxOffsetPy = this.commitLog.getMaxOffset();
        // Find the corresponding ConsumeQueue
        ConsumeQueue consumeQueue = findConsumeQueue(topic, queueId);
        ...
        // Find the MappedFile of the ConsumeQueue that contains the given offset
        SelectMappedBufferResult bufferConsumeQueue = consumeQueue.getIndexBuffer(offset);
        status = GetMessageStatus.NO_MATCHED_MESSAGE;
        long maxPhyOffsetPulling = 0;
        int i = 0;
        // Upper bound on the ConsumeQueue bytes scanned per pull:
        // max(16000, maxMsgNums * 20) bytes, that is, at least 800 entries
        final int maxFilterMessageCount = Math.max(16000, maxMsgNums * ConsumeQueue.CQ_STORE_UNIT_SIZE);
        for (; i < bufferConsumeQueue.getSize() && i < maxFilterMessageCount; i += ConsumeQueue.CQ_STORE_UNIT_SIZE) {
            // CommitLog physical offset and message size
            long offsetPy = bufferConsumeQueue.getByteBuffer().getLong();
            int sizePy = bufferConsumeQueue.getByteBuffer().getInt();
            maxPhyOffsetPulling = offsetPy;
            ...
            // Get the message from the CommitLog by offset and size
            SelectMappedBufferResult selectResult = this.commitLog.getMessage(offsetPy, sizePy);
            ...
            // Put the message into the result set
            getResult.addMessage(selectResult);
            status = GetMessageStatus.FOUND;
        }
        // Update the offset
        nextBeginOffset = offset + (i / ConsumeQueue.CQ_STORE_UNIT_SIZE);
        long diff = maxOffsetPy - maxPhyOffsetPulling;
        long memory = (long) (StoreUtil.TOTAL_PHYSICAL_MEMORY_SIZE
            * (this.messageStoreConfig.getAccessMessageInMemoryMaxRatio() / 100.0));
        getResult.setSuggestPullingFromSlave(diff > memory);
        ...
        getResult.setStatus(status);
        getResult.setNextBeginOffset(nextBeginOffset);
        return getResult;
    }
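The two-step lookup can be reduced to a few lines using two in-memory buffers in place of the real files. This is a hypothetical sketch: TwoStepReadSketch and its methods are invented for illustration, messages are plain UTF-8 strings, and filtering, tags, and file rollover are ignored.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TwoStepReadSketch {
    static final int CQ_UNIT_SIZE = 20; // 8-byte offset + 4-byte size + 8-byte tag hashcode

    private final ByteBuffer commitLog = ByteBuffer.allocate(4096);
    private final ByteBuffer consumeQueue = ByteBuffer.allocate(CQ_UNIT_SIZE * 64);
    private int commitLogWritePos = 0;
    private int queueLength = 0;

    // Appends the body to the "CommitLog" and records its position in the "ConsumeQueue".
    void put(String body) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ByteBuffer log = commitLog.duplicate();
        log.position(commitLogWritePos);
        log.put(bytes);
        int entryPos = queueLength * CQ_UNIT_SIZE;
        consumeQueue.putLong(entryPos, commitLogWritePos);
        consumeQueue.putInt(entryPos + 8, bytes.length);
        consumeQueue.putLong(entryPos + 12, 0L); // tag hashcode unused in this sketch
        commitLogWritePos += bytes.length;
        queueLength++;
    }

    // Step 1: read (offset, size) from the queue entry. Step 2: read the body from the log.
    List<String> get(int queueOffset, int maxMsgNums) {
        List<String> result = new ArrayList<>();
        for (int i = queueOffset; i < queueLength && result.size() < maxMsgNums; i++) {
            int entryPos = i * CQ_UNIT_SIZE;
            int offsetPy = (int) consumeQueue.getLong(entryPos);
            int sizePy = consumeQueue.getInt(entryPos + 8);
            byte[] body = new byte[sizePy];
            ByteBuffer log = commitLog.duplicate();
            log.position(offsetPy);
            log.get(body);
            result.add(new String(body, StandardCharsets.UTF_8));
        }
        return result;
    }
}
```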
4.2 Query by key
Reading a message by key takes two steps: first find the matching records in the IndexFile by topic and key, and then read the message content from the CommitLog file using the CommitLog offset stored in each record.
    // org.apache.rocketmq.store.DefaultMessageStore::queryMessage
    public QueryMessageResult queryMessage(String topic, String key, int maxNum, long begin, long end) {
        QueryMessageResult queryMessageResult = new QueryMessageResult();
        long lastQueryMsgTime = end;
        for (int i = 0; i < 3; i++) {
            // Get the CommitLog physical offsets of the messages recorded in the IndexFile
            QueryOffsetResult queryOffsetResult = this.indexService.queryOffset(topic, key, maxNum, begin, lastQueryMsgTime);
            ...
            for (int m = 0; m < queryOffsetResult.getPhyOffsets().size(); m++) {
                long offset = queryOffsetResult.getPhyOffsets().get(m);
                ...
                MessageExt msg = this.lookMessageByOffset(offset);
                if (0 == m) {
                    lastQueryMsgTime = msg.getStoreTimestamp();
                }
                ...
                // Get the message content from the CommitLog file
                SelectMappedBufferResult result = this.commitLog.getData(offset, false);
                ...
                queryMessageResult.addMessage(result);
                ...
            }
        }
        return queryMessageResult;
    }
The lookup of CommitLog physical offsets in the IndexFile is implemented as follows:
    // org.apache.rocketmq.store.index.IndexFile::selectPhyOffset
    public void selectPhyOffset(final List<Long> phyOffsets, final String key, final int maxNum,
        final long begin, final long end, boolean lock) {
        int keyHash = indexKeyHashMethod(key);
        int slotPos = keyHash % this.hashSlotNum;
        int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;
        // Get the position of the first IndexItem with this hash value, that is, the head of the linked list
        int slotValue = this.mappedByteBuffer.getInt(absSlotPos);
        // Traverse the linked list nodes
        for (int nextIndexToRead = slotValue; ; ) {
            if (phyOffsets.size() >= maxNum) {
                break;
            }
            int absIndexPos = IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
                + nextIndexToRead * indexSize;
            int keyHashRead = this.mappedByteBuffer.getInt(absIndexPos);
            long phyOffsetRead = this.mappedByteBuffer.getLong(absIndexPos + 4);
            long timeDiff = (long) this.mappedByteBuffer.getInt(absIndexPos + 4 + 8);
            int prevIndexRead = this.mappedByteBuffer.getInt(absIndexPos + 4 + 8 + 4);
            if (timeDiff < 0) {
                break;
            }
            timeDiff *= 1000L;
            long timeRead = this.indexHeader.getBeginTimestamp() + timeDiff;
            boolean timeMatched = (timeRead >= begin) && (timeRead <= end);
            // Add the offset to the results if both the hash and the time range match
            if (keyHash == keyHashRead && timeMatched) {
                phyOffsets.add(phyOffsetRead);
            }
            if (prevIndexRead <= invalidIndex
                || prevIndexRead > this.indexHeader.getIndexCount()
                || prevIndexRead == nextIndexToRead || timeRead < begin) {
                break;
            }
            // Continue traversing the linked list
            nextIndexToRead = prevIndexRead;
        }
        ...
    }
5, Summary
This article has introduced the core implementation of the RocketMQ storage system from the perspective of source code, including the storage architecture, message writing, and message reading.
RocketMQ writes the messages of all Topics into the CommitLog, which achieves strictly sequential writes. File preheating, together with memory locking, prevents the Page Cache from being swapped out to swap space and reduces page faults when files are read and written. The CommitLog file is read and written through mmap, which turns file operations into direct operations on memory addresses and greatly improves read-write efficiency.
For scenarios with high performance requirements and lower data consistency requirements, off-heap memory can be enabled to separate reads from writes and improve disk throughput. In short, studying the storage module requires some understanding of operating system principles, and the extreme performance optimizations adopted by the RocketMQ authors are well worth studying.
6, References
Author: vivo Internet server team - Zhang Zhenglin