Interpretation of MySQL source code -- flushing data to disk

Keywords: Database, MySQL, SQL

1, Data

Earlier we looked at how the database flushes its logs to disk. Today we analyze how the data itself reaches disk, which is messier, but the principle is the same. From the previous analysis we already know that in MySQL, whatever the data is, it first enters a cache and is only later written out to disk for safekeeping. And what matters most in a database? The data, of course. Two-phase commit, caches, threads and all the rest ultimately exist to keep the data safe -- in plain terms, to support every kind of SQL operation, as well as recovery, backup, and database migration. As Jack Ma said, the DT (data technology) era is coming and data is king; hence the boom in the domestic database industry in recent years.

2, Data writing

Data pages are written out after the transaction's log has been committed; until then the modified pages sit in the buffer pool. What guarantees data safety if something unusual happens at this point, say a power failure? As mentioned earlier, the redo log does: every transaction commit flushes the redo log, and if the binlog is enabled, the binlog is flushed as well.
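To make that ordering concrete, here is a minimal sketch of the commit-time rule, using hypothetical stand-in types rather than InnoDB's actual API:

#include <cstdint>

// Hypothetical stand-ins for the redo log and binlog, for illustration only.
struct RedoLog {
  uint64_t flushed_lsn{0};
  void write_and_fsync(uint64_t lsn) {
    if (lsn > flushed_lsn) flushed_lsn = lsn;  // pretend this reached disk
  }
};
struct Binlog {
  void write_and_fsync() {}
};

// A commit is acknowledged only after the redo log (and the binlog, if
// enabled) are durably on disk; the dirty pages stay in the buffer pool
// and are flushed later, in the background.
void commit_transaction(RedoLog &redo, Binlog *binlog, uint64_t commit_lsn) {
  redo.write_and_fsync(commit_lsn);  // a crash after this point is safe
  if (binlog != nullptr) {
    binlog->write_and_fsync();  // kept consistent with the redo log via 2PC
  }
  // ...only now is the client told that the transaction committed.
}
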
Once a transaction completes (that is, both logs have been flushed successfully), MySQL's doublewrite mechanism writes the dirty pages from the buffer pool to the doublewrite buffer -- classically an area in the shared (system) tablespace, and separate doublewrite files since MySQL 8.0.20. At this point the pages are on disk, but not yet in their final location. The purpose is to protect against torn pages: if an in-place write fails partway through, the intact copy in the doublewrite area, together with the redo log, guarantees safe recovery. (Redo records are applied to pages and assume the page itself is intact; the doublewrite copy is what guarantees an intact page to apply them to.)
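To see why that extra copy matters, here is a toy model of crash recovery; StoredPage, checksum and friends are simplified assumptions, not InnoDB's real structures:

#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>

constexpr std::size_t kPageSize = 16 * 1024;
using Page = std::array<uint8_t, kPageSize>;

// Toy checksum: InnoDB stores a real checksum in each page; a torn
// (partially written) page fails the check.
static uint32_t checksum(const Page &p) {
  return std::accumulate(p.begin(), p.end(), 0u);
}

struct StoredPage {
  Page data{};
  uint32_t sum{0};
  bool valid() const { return checksum(data) == sum; }
};

// On recovery: if the copy in the data file is torn, restore the intact
// copy from the doublewrite area, then let redo recovery replay on top.
std::optional<Page> recover_page(const StoredPage &datafile_copy,
                                 const StoredPage &dblwr_copy) {
  if (datafile_copy.valid()) return datafile_copy.data;  // common case
  if (dblwr_copy.valid()) return dblwr_copy.data;  // torn write repaired
  return std::nullopt;  // both torn: the batch never finished its fsync
}

int main() {
  StoredPage good, torn;
  good.data.fill(1);
  good.sum = checksum(good.data);
  torn = good;
  torn.data[0] ^= 0xFF;  // simulate a partial in-place write
  return recover_page(torn, good) ? 0 : 1;
}
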
One more thing about how tables are updated: a table has two kinds of indexes, the clustered index and secondary (non-clustered) indexes. The clustered index is the easy case, since its pages are laid out in key order. Secondary indexes are not: updating them in place on every change would mean lots of random IO and very poor efficiency. Anyone with some experience will recognize the pattern by now -- cache the changes and apply them in bulk later. MySQL calls this the Insert Buffer (generalized into the Change Buffer in newer versions).
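A toy version of that caching idea, with purely illustrative names, might look like this:

#include <cstdint>
#include <map>
#include <vector>

using PageNo = uint32_t;
struct Change { uint64_t key; };

// Sketch of the insert buffer idea: when a secondary-index page is not in
// the buffer pool, record the change in memory instead of doing a random
// disk read right away; apply everything in one pass when the page is
// finally read in.
class InsertBuffer {
  std::map<PageNo, std::vector<Change>> pending_;

 public:
  void buffer(PageNo page, Change c) { pending_[page].push_back(c); }

  // Called when the page is eventually brought into the buffer pool.
  std::vector<Change> take_pending(PageNo page) {
    auto it = pending_.find(page);
    if (it == pending_.end()) return {};
    auto changes = std::move(it->second);
    pending_.erase(it);
    return changes;  // merge these into the page in one go
  }
};

int main() {
  InsertBuffer ibuf;
  ibuf.buffer(42, {100});  // page 42 is not cached: defer the insert
  ibuf.buffer(42, {101});
  return ibuf.take_pending(42).size() == 2 ? 0 : 1;  // merged on read
}
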

3, Source code analysis

First, let's analyze the code that exports InnoDB status variables:

/** Function to pass InnoDB status variables to MySQL */
void srv_export_innodb_status(void) {
  buf_pool_stat_t stat;
  buf_pools_list_size_t buf_pools_list_size;
  ulint LRU_len;
  ulint free_len;
  ulint flush_list_len;

  buf_get_total_stat(&stat);
  buf_get_total_list_len(&LRU_len, &free_len, &flush_list_len);
  buf_get_total_list_size_in_bytes(&buf_pools_list_size);

  mutex_enter(&srv_innodb_monitor_mutex);

  export_vars.innodb_data_pending_reads = os_n_pending_reads;

  export_vars.innodb_data_pending_writes = os_n_pending_writes;

  export_vars.innodb_data_pending_fsyncs =
      fil_n_pending_log_flushes + fil_n_pending_tablespace_flushes;

  export_vars.innodb_data_fsyncs = os_n_fsyncs;

  export_vars.innodb_data_read = srv_stats.data_read;

  export_vars.innodb_data_reads = os_n_file_reads;

  export_vars.innodb_data_writes = os_n_file_writes;

  export_vars.innodb_data_written = srv_stats.data_written;

  export_vars.innodb_buffer_pool_read_requests =
      Counter::total(stat.m_n_page_gets);

  export_vars.innodb_buffer_pool_write_requests =
      srv_stats.buf_pool_write_requests;

  export_vars.innodb_buffer_pool_wait_free = srv_stats.buf_pool_wait_free;

  export_vars.innodb_buffer_pool_pages_flushed = srv_stats.buf_pool_flushed;

  export_vars.innodb_buffer_pool_reads = srv_stats.buf_pool_reads;

  export_vars.innodb_buffer_pool_read_ahead_rnd = stat.n_ra_pages_read_rnd;

  export_vars.innodb_buffer_pool_read_ahead = stat.n_ra_pages_read;

  export_vars.innodb_buffer_pool_read_ahead_evicted = stat.n_ra_pages_evicted;

  export_vars.innodb_buffer_pool_pages_data = LRU_len;

  export_vars.innodb_buffer_pool_bytes_data =
      buf_pools_list_size.LRU_bytes + buf_pools_list_size.unzip_LRU_bytes;

  export_vars.innodb_buffer_pool_pages_dirty = flush_list_len;

  export_vars.innodb_buffer_pool_bytes_dirty =
      buf_pools_list_size.flush_list_bytes;

  export_vars.innodb_buffer_pool_pages_free = free_len;

#ifdef UNIV_DEBUG
  export_vars.innodb_buffer_pool_pages_latched = buf_get_latched_pages_number();
#endif /* UNIV_DEBUG */
  export_vars.innodb_buffer_pool_pages_total = buf_pool_get_n_pages();

  export_vars.innodb_buffer_pool_pages_misc =
      buf_pool_get_n_pages() - LRU_len - free_len;

  export_vars.innodb_page_size = UNIV_PAGE_SIZE;

  export_vars.innodb_log_waits = srv_stats.log_waits;

  export_vars.innodb_os_log_written = srv_stats.os_log_written;

  export_vars.innodb_os_log_fsyncs = fil_n_log_flushes;

  export_vars.innodb_os_log_pending_fsyncs = fil_n_pending_log_flushes;

  export_vars.innodb_os_log_pending_writes = srv_stats.os_log_pending_writes;

  export_vars.innodb_log_write_requests = srv_stats.log_write_requests;

  export_vars.innodb_log_writes = srv_stats.log_writes;
  // Focus on these two variables:
  // number of pages written through the doublewrite buffer
  export_vars.innodb_dblwr_pages_written = srv_stats.dblwr_pages_written;
  // number of doublewrite batch writes performed
  export_vars.innodb_dblwr_writes = srv_stats.dblwr_writes;

  export_vars.innodb_pages_created = stat.n_pages_created;

  export_vars.innodb_pages_read = stat.n_pages_read;

  export_vars.innodb_pages_written = stat.n_pages_written;

  export_vars.innodb_redo_log_enabled = srv_redo_log;

  export_vars.innodb_row_lock_waits = srv_stats.n_lock_wait_count;

  export_vars.innodb_row_lock_current_waits =
      srv_stats.n_lock_wait_current_count;

  export_vars.innodb_row_lock_time = srv_stats.n_lock_wait_time / 1000;

  ......

  mutex_exit(&srv_innodb_monitor_mutex);
}
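
For monitoring, the two highlighted counters surface in SHOW GLOBAL STATUS as Innodb_dblwr_pages_written and Innodb_dblwr_writes; dividing the first by the second approximates the average number of pages per doublewrite batch.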

Now let's take a closer look at the DoubleWrite flow. Several entry points start a page write -- buf_flush_page_try, buf_flush_try_neighbors, and buf_flush_single_page_from_LRU -- and they all end up calling the same function:

ibool buf_flush_page(buf_pool_t *buf_pool, buf_page_t *bpage,
                     buf_flush_t flush_type, bool sync) {
  BPageMutex *block_mutex;

  ut_ad(flush_type < BUF_FLUSH_N_TYPES);
  /* Hold the LRU list mutex iff called for a single page LRU
  flush. A single page LRU flush is already non-performant, and holding
  the LRU list mutex allows us to avoid having to store the previous LRU
  list page or to restart the LRU scan in
  buf_flush_single_page_from_LRU(). */
  ut_ad(flush_type == BUF_FLUSH_SINGLE_PAGE ||
        !mutex_own(&buf_pool->LRU_list_mutex));
  ut_ad(flush_type != BUF_FLUSH_SINGLE_PAGE ||
        mutex_own(&buf_pool->LRU_list_mutex));
  ut_ad(buf_page_in_file(bpage));
  ut_ad(!sync || flush_type == BUF_FLUSH_SINGLE_PAGE);

  block_mutex = buf_page_get_mutex(bpage);
  ut_ad(mutex_own(block_mutex));

  ut_ad(buf_flush_ready_for_flush(bpage, flush_type));

  bool is_uncompressed;

  is_uncompressed = (buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
  ut_ad(is_uncompressed == (block_mutex != &buf_pool->zip_mutex));

  ibool flush;
  rw_lock_t *rw_lock = nullptr;
  bool no_fix_count = bpage->buf_fix_count == 0;

  if (!is_uncompressed) {
    flush = TRUE;
    rw_lock = nullptr;
  } else if (!(no_fix_count || flush_type == BUF_FLUSH_LIST) ||
             (!no_fix_count &&
              srv_shutdown_state.load() < SRV_SHUTDOWN_FLUSH_PHASE &&
              fsp_is_system_temporary(bpage->id.space()))) {
    /* This is a heuristic, to avoid expensive SX attempts. */
    /* For table residing in temporary tablespace sync is done
    using IO_FIX and so before scheduling for flush ensure that
    page is not fixed. */
    flush = FALSE;
  } else {
    rw_lock = &reinterpret_cast<buf_block_t *>(bpage)->lock;
    if (flush_type != BUF_FLUSH_LIST) {
      flush = rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE);
    } else {
      /* Will SX lock later */
      flush = TRUE;
    }
  }

  if (flush) {
    /* We are committed to flushing by the time we get here */

    mutex_enter(&buf_pool->flush_state_mutex);

    buf_page_set_io_fix(bpage, BUF_IO_WRITE);

    buf_page_set_flush_type(bpage, flush_type);

    if (buf_pool->n_flush[flush_type] == 0) {
      os_event_reset(buf_pool->no_flush[flush_type]);
    }

    ++buf_pool->n_flush[flush_type];

    if (bpage->get_oldest_lsn() > buf_pool->max_lsn_io) {
      buf_pool->max_lsn_io = bpage->get_oldest_lsn();
    }

    if (!fsp_is_system_temporary(bpage->id.space()) &&
        buf_pool->track_page_lsn != LSN_MAX) {
      auto frame = bpage->zip.data;

      if (frame == nullptr) {
        frame = ((buf_block_t *)bpage)->frame;
      }
      lsn_t frame_lsn = mach_read_from_8(frame + FIL_PAGE_LSN);

      arch_page_sys->track_page(bpage, buf_pool->track_page_lsn, frame_lsn,
                                false);
    }

    mutex_exit(&buf_pool->flush_state_mutex);

    mutex_exit(block_mutex);

    if (flush_type == BUF_FLUSH_SINGLE_PAGE) {
      mutex_exit(&buf_pool->LRU_list_mutex);
    }

    if (flush_type == BUF_FLUSH_LIST && is_uncompressed &&
        !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) {
      if (!fsp_is_system_temporary(bpage->id.space()) && dblwr::enabled) {
        dblwr::force_flush(flush_type, buf_pool_index(buf_pool));
      } else {
        buf_flush_sync_datafiles();
      }

      rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);
    }

    /* If there is an observer that wants to know if the
    asynchronous flushing was sent then notify it.
    Note: we set flush observer to a page with x-latch, so we can
    guarantee that notify_flush and notify_remove are called in pair
    with s-latch on a uncompressed page. */
    if (bpage->get_flush_observer() != nullptr) {
      bpage->get_flush_observer()->notify_flush(buf_pool, bpage);
    }

    /* Even though bpage is not protected by any mutex at this
    point, it is safe to access bpage, because it is io_fixed and
    oldest_modification != 0.  Thus, it cannot be relocated in the
    buffer pool or removed from flush_list or LRU_list. */

    buf_flush_write_block_low(bpage, flush_type, sync);
  }

  return (flush);
}
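
One design choice worth noting: the page is protected by an SX (shared-exclusive) latch for the duration of the write, taken with rw_lock_sx_lock_nowait here or, for flush-list flushes, a little later. An SX latch is compatible with S latches, so other sessions can keep reading the page while it is written out; only modifications are blocked. The heuristic branch above simply declines the flush when the page is buffer-fixed and the SX attempt would likely be expensive.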

From here there are several paths down to the actual disk write: dblwr::force_flush, buf_flush_sync_datafiles, and buf_flush_write_block_low. Along the way, if an observer is waiting on this flush, bpage->get_flush_observer()->notify_flush(buf_pool, bpage) is called to notify it.
Let's look at the relevant source, abridged to the interesting parts:

void force_flush(buf_flush_t flush_type) noexcept {
  for (;;) {
    mutex_enter(&m_mutex);
    if (!m_buf_pages.empty() && !flush_to_disk(flush_type)) {
      ut_ad(!mutex_own(&m_mutex));
      continue;
    }
    break;
  }
  mutex_exit(&m_mutex);
}
bool flush_to_disk(buf_flush_t flush_type) noexcept {
  ut_ad(mutex_own(&m_mutex));

  /* Wait for any batch writes that are in progress. */
  if (wait_for_pending_batch()) {
    ut_ad(!mutex_own(&m_mutex));
    return false;
  }

  MONITOR_INC(MONITOR_DBLWR_FLUSH_REQUESTS);

  /* Write the pages to disk and free up the buffer. */
  write_pages(flush_type);

  ut_a(m_buffer.empty());
  ut_a(m_buf_pages.empty());

  return true;
}
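
The for (;;) loop in force_flush is a small retry protocol: when flush_to_disk has to wait for an in-flight batch, wait_for_pending_batch releases m_mutex and flush_to_disk returns false, so force_flush re-acquires the mutex and tries again. Only a pass that actually ran write_pages, or that found nothing queued, breaks out of the loop.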

void Double_write::write_pages(buf_flush_t flush_type) noexcept {
  ut_ad(mutex_own(&m_mutex));
  ut_a(!m_buffer.empty());

  Batch_segment *batch_segment{};

  auto segments = flush_type == BUF_FLUSH_LRU ? s_LRU_batch_segments
                                              : s_flush_list_batch_segments;

  while (!segments->dequeue(batch_segment)) {
    std::this_thread::yield();
  }

  batch_segment->start(this);

  // Write the whole batch to the doublewrite file
  batch_segment->write(m_buffer);

  m_buffer.clear();

#ifndef _WIN32
  if (is_fsync_required()) {
    batch_segment->flush();
  }
#endif /* !_WIN32 */

  batch_segment->set_batch_size(m_buf_pages.size());

  for (uint32_t i = 0; i < m_buf_pages.size(); ++i) {
    const auto bpage = std::get<0>(m_buf_pages.m_pages[i]);

    ut_d(auto page_id = bpage->id);

    bpage->set_dblwr_batch_id(batch_segment->id());

    ut_d(bpage->take_io_responsibility());
    auto err =
        write_to_datafile(bpage, false, std::get<1>(m_buf_pages.m_pages[i]),
                          std::get<2>(m_buf_pages.m_pages[i]));

    if (err == DB_PAGE_IS_STALE || err == DB_TABLESPACE_DELETED) {
      write_complete(bpage, flush_type);
      buf_page_free_stale_during_write(
          bpage, buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);

      const file::Block *block = std::get<1>(m_buf_pages.m_pages[i]);
      if (block != nullptr) {
        os_free_block(const_cast<file::Block *>(block));
      }
    } else {
      ut_a(err == DB_SUCCESS);
    }
    /* We don't hold io_responsibility here no matter which path through ifs and
    elses we've got here, but we can't assert:
      ut_ad(!bpage->current_thread_has_io_responsibility());
    because bpage could be freed by the time we got here. */

#ifdef UNIV_DEBUG
    if (dblwr::Force_crash == page_id) {
      DBUG_SUICIDE();
    }
#endif /* UNIV_DEBUG */
  }

  srv_stats.dblwr_writes.inc();

  m_buf_pages.clear();

  os_aio_simulated_wake_handler_threads();
}
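
Note the ordering inside write_pages: the whole batch is first written sequentially into a doublewrite segment (batch_segment->write) and fsynced where required, and only then is each page written to its final, scattered location through write_to_datafile. That ordering is exactly what makes a torn in-place write recoverable. It also matches the counters from earlier: dblwr_writes is incremented once per batch at the end, while dblwr_pages_written counts the individual pages that pass through the buffer. Further down the stack, the actual file write goes through os_file_write_retry and os_file_write_func: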

dberr_t os_file_write_retry(IORequest &type, const char *name,
                            pfs_os_file_t file, const void *buf,
                            os_offset_t offset, ulint n) {
  dberr_t err;
  for (;;) {
    err = os_file_write(type, name, file, buf, offset, n);

    if (err == DB_SUCCESS || err == DB_TABLESPACE_DELETED) {
      break;
    } else if (err == DB_IO_ERROR) {
      ib::error(ER_INNODB_IO_WRITE_ERROR_RETRYING, name);
      std::chrono::seconds ten(10);
      std::this_thread::sleep_for(ten);
      continue;
    } else {
      ib::fatal(ER_INNODB_IO_WRITE_FAILED, name);
    }
  }
  return err;
}
dberr_t os_file_write_func(IORequest &type, const char *name, os_file_t file,
                           const void *buf, os_offset_t offset, ulint n) {
  ut_ad(type.validate());
  ut_ad(type.is_write());

  /* We never compress the first page.
  Note: This assumes we always do block IO. */
  if (offset == 0) {
    type.clear_compressed();
  }

  const byte *ptr = reinterpret_cast<const byte *>(buf);

  return os_file_write_page(type, name, file, ptr, offset, n,
                            type.get_encrypted_block());
}
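
The retry wrapper treats DB_IO_ERROR as transient: it logs the failure and retries the same write every ten seconds, indefinitely; it returns on DB_SUCCESS or DB_TABLESPACE_DELETED; any other error is fatal and aborts the server.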

Two macros need to be noted here:

#define os_file_write(type, name, file, buf, offset, n) \
  os_file_write_pfs(type, name, file, buf, offset, n)
#define os_file_write_pfs(type, name, file, buf, offset, n) \
  os_file_write_func(type, name, file, buf, offset, n)

The call chain therefore ends in os_file_write_func, which finally calls os_file_write_page to put the bytes on disk. The remaining details, such as tablespace handling, can be worked out from the other write functions.
Of course, in some cases a page is flushed straight to disk without going through DOUBLE WRITE: when the option is turned off (innodb_doublewrite = OFF), or for operations that simply don't need it, such as Drop Table.
Now look at how the page reaches the data file itself:

// write_to_datafile is called from the write_pages function shown above
/** Writes a page that has already been written to the
doublewrite buffer to the data file. It is the job of the
caller to sync the datafile.
@param[in]  in_bpage          Page to write.
@param[in]  sync              true if it's a synchronous write.
@param[in]  e_block           block containing encrypted data frame.
@param[in]  e_len             encrypted data length.
@return DB_SUCCESS or error code */
static dberr_t write_to_datafile(const buf_page_t *in_bpage, bool sync,
                                 const file::Block *e_block, uint32_t e_len)
    noexcept MY_ATTRIBUTE((warn_unused_result));

dberr_t Double_write::write_to_datafile(const buf_page_t *in_bpage, bool sync,
                                        const file::Block *e_block,
                                        uint32_t e_len) noexcept {
  ut_ad(buf_page_in_file(in_bpage));
  ut_ad(in_bpage->current_thread_has_io_responsibility());
  ut_ad(in_bpage->is_io_fix_write());
  uint32_t len;
  void *frame{};

  if (e_block == nullptr) {
    Double_write::prepare(in_bpage, &frame, &len);
  } else {
    frame = os_block_get_frame(e_block);
    len = e_len;
  }

  /* Our IO API is common for both reads and writes and is
  therefore geared towards a non-const parameter. */
  auto bpage = const_cast<buf_page_t *>(in_bpage);

  uint32_t type = IORequest::WRITE;

  if (sync) {
    type |= IORequest::DO_NOT_WAKE;
  }

  IORequest io_request(type);
  io_request.set_encrypted_block(e_block);

#ifdef UNIV_DEBUG
  {
    byte *page = static_cast<byte *>(frame);
    ut_ad(mach_read_from_4(page + FIL_PAGE_OFFSET) == bpage->page_no());
    ut_ad(mach_read_from_4(page + FIL_PAGE_SPACE_ID) == bpage->space());
  }
#endif /* UNIV_DEBUG */

  auto err =
      fil_io(io_request, sync, bpage->id, bpage->size, 0, len, frame, bpage);

  /* When a tablespace is deleted with BUF_REMOVE_NONE, fil_io() might
  return DB_PAGE_IS_STALE or DB_TABLESPACE_DELETED. */
  ut_a(err == DB_SUCCESS || err == DB_TABLESPACE_DELETED ||
       err == DB_PAGE_IS_STALE);

  return err;
}

dberr_t fil_io(const IORequest &type, bool sync, const page_id_t &page_id,
               const page_size_t &page_size, ulint byte_offset, ulint len,
               void *buf, void *message) {
  auto shard = fil_system->shard_by_id(page_id.space());
#ifdef UNIV_DEBUG
  if (!sync) {
    /* In case of async io we transfer the io responsibility to the thread
    which will perform the io completion routine. */
    static_cast<buf_page_t *>(message)->release_io_responsibility();
  }
#endif

  auto const err = shard->do_io(type, sync, page_id, page_size, byte_offset,
                                len, buf, message);
#ifdef UNIV_DEBUG
  /* If the error prevented async io, then we haven't actually transferred
  the io responsibility at all, so we revert the debug io responsibility
  info. */
  if (err != DB_SUCCESS && !sync) {
    static_cast<buf_page_t *>(message)->take_io_responsibility();
  }
#endif
  return err;
}
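
The UNIV_DEBUG bookkeeping in fil_io deserves a comment: for an asynchronous write, responsibility for the page's IO is handed off here to whichever thread will run the completion routine, and it is taken back only if the submission failed. In other words, for the async doublewrite batches above, the write really finishes later, in an AIO handler thread.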

Finally, take a look at the function called when DOUBLE WRITE is not used:

// Writing data directly to the hard disk generally happens in a few situations:
// 1. DML touching a huge amount of data
// 2. Workloads insensitive to partial page corruption
// 3. Write load too heavy for the extra doublewrite IO

/** Flush a batch of writes to the datafiles that have already been
written to the dblwr buffer on disk. */
static void buf_flush_sync_datafiles() {
  /* Wake possible simulated AIO thread to actually post the
  writes to the operating system */
  os_aio_simulated_wake_handler_threads();

  /* Wait that all async writes to tablespaces have been posted to
  the OS */
  os_aio_wait_until_no_pending_writes();

  /* Now we flush the data to disk (for example, with fsync) */
  fil_flush_file_spaces(FIL_TYPE_TABLESPACE);
}
/** Flush to disk the writes in file spaces of the given type
possibly cached by the OS.
@param[in]	purpose		FIL_TYPE_TABLESPACE or FIL_TYPE_LOG, can be
ORred. */
void fil_flush_file_spaces(uint8_t purpose) {
  fil_system->flush_file_spaces(purpose);
}

The code above is really quite simple: wake the simulated AIO handler threads, wait until all pending asynchronous writes have been posted to the OS, then fsync the tablespace files. Finally, take a look at the asynchronous flush path:

/** Does an asynchronous write of a buffer page.
@param[in]	bpage		buffer block to write
@param[in]	flush_type	type of flush
@param[in]	sync		true if sync IO request */
static void buf_flush_write_block_low(buf_page_t *bpage, buf_flush_t flush_type,
                                      bool sync) {
  page_t *frame = nullptr;

#ifdef UNIV_DEBUG
  buf_pool_t *buf_pool = buf_pool_from_bpage(bpage);
  ut_ad(!mutex_own(&buf_pool->LRU_list_mutex));
#endif /* UNIV_DEBUG */

  DBUG_PRINT("ib_buf", ("flush %s %u page " UINT32PF ":" UINT32PF,
                        sync ? "sync" : "async", (unsigned)flush_type,
                        bpage->id.space(), bpage->id.page_no()));

  ut_ad(buf_page_in_file(bpage));

  /* We are not holding block_mutex here. Nevertheless, it is safe to
  access bpage, because it is io_fixed and oldest_modification != 0.
  Thus, it cannot be relocated in the buffer pool or removed from
  flush_list or LRU_list. */
  ut_ad(!buf_flush_list_mutex_own(buf_pool));
  ut_ad(!buf_page_get_mutex(bpage)->is_owned());
  ut_ad(bpage->is_io_fix_write());
  ut_ad(bpage->is_dirty());

#ifdef UNIV_IBUF_COUNT_DEBUG
  ut_a(ibuf_count_get(bpage->id) == 0);
#endif /* UNIV_IBUF_COUNT_DEBUG */

  ut_ad(recv_recovery_is_on() || bpage->get_newest_lsn() != 0);

  /* Force the log to the disk before writing the modified block */
  if (!srv_read_only_mode) {
    const lsn_t flush_to_lsn = bpage->get_newest_lsn();

    /* Do the check before calling log_write_up_to() because in most
    cases it would allow to avoid call, and because of that we don't
    want those calls because they would have bad impact on the counter
    of calls, which is monitored to save CPU on spinning in log threads. */

    if (log_sys->flushed_to_disk_lsn.load() < flush_to_lsn) {
      Wait_stats wait_stats;

      wait_stats = log_write_up_to(*log_sys, flush_to_lsn, true);

      MONITOR_INC_WAIT_STATS_EX(MONITOR_ON_LOG_, _PAGE_WRITTEN, wait_stats);
    }
  }

  DBUG_EXECUTE_IF("log_first_rec_group_test", {
    recv_no_ibuf_operations = false;
    const lsn_t end_lsn = mtr_commit_mlog_test(*log_sys);
    log_write_up_to(*log_sys, end_lsn, true);
    DBUG_SUICIDE();
  });

  switch (buf_page_get_state(bpage)) {
    case BUF_BLOCK_POOL_WATCH:
    case BUF_BLOCK_ZIP_PAGE: /* The page should be dirty. */
    case BUF_BLOCK_NOT_USED:
    case BUF_BLOCK_READY_FOR_USE:
    case BUF_BLOCK_MEMORY:
    case BUF_BLOCK_REMOVE_HASH:
      ut_error;
      break;
    case BUF_BLOCK_ZIP_DIRTY: {
      frame = bpage->zip.data;
      BlockReporter reporter =
          BlockReporter(false, frame, bpage->size,
                        fsp_is_checksum_disabled(bpage->id.space()));

      mach_write_to_8(frame + FIL_PAGE_LSN, bpage->get_newest_lsn());

      ut_a(reporter.verify_zip_checksum());
      break;
    }
    case BUF_BLOCK_FILE_PAGE:
      frame = bpage->zip.data;
      if (!frame) {
        frame = ((buf_block_t *)bpage)->frame;
      }

      buf_flush_init_for_writing(
          reinterpret_cast<const buf_block_t *>(bpage),
          reinterpret_cast<const buf_block_t *>(bpage)->frame,
          bpage->zip.data ? &bpage->zip : nullptr, bpage->get_newest_lsn(),
          fsp_is_checksum_disabled(bpage->id.space()),
          false /* do not skip lsn check */);
      break;
  }

  dberr_t err = dblwr::write(flush_type, bpage, sync);

  ut_a(err == DB_SUCCESS || err == DB_TABLESPACE_DELETED);

  /* Increment the counter of I/O operations used
  for selecting LRU policy. */
  buf_LRU_stat_inc_io();
}
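
The block that forces the log to disk before writing the modified page is the classic write-ahead-logging rule. Reduced to a sketch (illustrative names only):

#include <cstdint>

// WAL invariant: a data page may go to the data file only if the redo log
// is already durable up to that page's newest modification LSN.
bool page_write_allowed(uint64_t page_newest_lsn,
                        uint64_t log_flushed_to_disk_lsn) {
  // When this is false, buf_flush_write_block_low first calls
  // log_write_up_to(*log_sys, flush_to_lsn, true), then writes the page.
  return page_newest_lsn <= log_flushed_to_disk_lsn;
}

The listing continues with dblwr::write, which buf_flush_write_block_low calls at the end:
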
dberr_t dblwr::write(buf_flush_t flush_type, buf_page_t *bpage,
                     bool sync) noexcept {
  dberr_t err;
  const space_id_t space_id = bpage->id.space();

  ut_ad(bpage->current_thread_has_io_responsibility());
  /* This is not required for correctness, but it aborts the processing early.
   */
  if (bpage->was_stale()) {
    /* Disable batch completion in write_complete(). */
    bpage->set_dblwr_batch_id(std::numeric_limits<uint16_t>::max());
    buf_page_free_stale_during_write(
        bpage, buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
    /* We don't hold io_responsibility here no matter which path through ifs and
    elses we've got here, but we can't assert:
      ut_ad(!bpage->current_thread_has_io_responsibility());
    because bpage could be freed by the time we got here. */
    return DB_SUCCESS;
  }

  if (srv_read_only_mode || fsp_is_system_temporary(space_id) ||
      !dblwr::enabled || Double_write::s_instances == nullptr ||
      mtr_t::s_logging.dblwr_disabled()) {
    /* Skip the double-write buffer since it is not needed. Temporary
    tablespaces are never recovered, therefore we don't care about
    torn writes. */
    bpage->set_dblwr_batch_id(std::numeric_limits<uint16_t>::max());
    err = Double_write::write_to_datafile(bpage, sync, nullptr, 0);
    if (err == DB_PAGE_IS_STALE || err == DB_TABLESPACE_DELETED) {
      buf_page_free_stale_during_write(
          bpage, buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
      err = DB_SUCCESS;
    } else if (sync) {
      ut_ad(flush_type == BUF_FLUSH_LRU || flush_type == BUF_FLUSH_SINGLE_PAGE);

      if (err == DB_SUCCESS) {
        fil_flush(space_id);
      }
      /* true means we want to evict this page from the LRU list as well. */
      buf_page_io_complete(bpage, true);
    }

  } else {
    ut_d(auto page_id = bpage->id);

    /* Encrypt the page here, so that the same encrypted contents are written
    to the dblwr file and the data file. */
    uint32_t e_len{};
    file::Block *e_block = dblwr::get_encrypted_frame(bpage, e_len);

    if (!sync && flush_type != BUF_FLUSH_SINGLE_PAGE) {
      MONITOR_INC(MONITOR_DBLWR_ASYNC_REQUESTS);

      ut_d(bpage->release_io_responsibility());
      Double_write::submit(flush_type, bpage, e_block, e_len);
      err = DB_SUCCESS;
#ifdef UNIV_DEBUG
      if (dblwr::Force_crash == page_id) {
        force_flush(flush_type, buf_pool_index(buf_pool_from_bpage(bpage)));
      }
#endif /* UNIV_DEBUG */
    } else {
      MONITOR_INC(MONITOR_DBLWR_SYNC_REQUESTS);
      /* Disable batch completion in write_complete(). */
      bpage->set_dblwr_batch_id(std::numeric_limits<uint16_t>::max());
      err = Double_write::sync_page_flush(bpage, e_block, e_len);
    }
  }
  /* We don't hold io_responsibility here no matter which path through ifs and
  elses we've got here, but we can't assert:
    ut_ad(!bpage->current_thread_has_io_responsibility());
  because bpage could be freed by the time we got here. */
  return err;
}

So dblwr::write has three branches: skip the doublewrite buffer entirely (read-only mode, temporary tablespaces, or doublewrite disabled) and write straight to the data file; submit the page to a batch for asynchronous doublewrite; or perform a synchronous single-page flush. When a batch fills up or a flush is forced, the same force_flush / flush_to_disk code we read above runs, so the two ends of the path stay consistent.

4, Summary

The code is tangled and the reading makes one dizzy, but fortunately the overall picture can basically be seen clearly.
Keep at it, returning youth!

Posted by dunn_cann on Fri, 08 Oct 2021 01:36:37 -0700