Is it difficult to start everything - everything starts with taking instructions

7.3 implementation of hummingbird E200 processor

7.3.1 overall design idea of IFU


  • Simply decode the retrieved address
  • Simple branch prediction
  • Generate fetched PC
  • Access the ITCM or BIU according to the address of the PC

Address determination and ICB bus control module control access to ITCM and BIU. Hummingbird E200 is oriented to embedded scenarios with a small amount of code. Assuming that all codes are loaded and executed in ITCM, single cycle finger retrieval can be realized. If it needs to be read from external memory, it can be accessed through BIU. After the value is returned, it is decoded in the same clock cycle. If it is a branch jump instruction, simple BPU predicts in the same cycle. Finally, the decoding result and branch prediction result are used to generate the next instruction PC.

7.3.2 Mini-Decode

The mini decode module does not need to decode all the information, but only needs to obtain whether the instruction has branch jump, branch jump type and details.

Instantiate a complete decode and ignore some signals, so that the synthesis tool optimizes the irrelevant logic

`include "e203_defines.v"

module e203_ifu_minidec(

  // The IR stage to Decoder
  input  [`E203_INSTR_SIZE-1:0] instr,
  // The Decoded Info-Bus

  output dec_rs1en,
  output dec_rs2en,
  output [`E203_RFIDX_WIDTH-1:0] dec_rs1idx,
  output [`E203_RFIDX_WIDTH-1:0] dec_rs2idx,

  output dec_mulhsu,
  output dec_mul   ,
  output dec_div   ,
  output dec_rem   ,
  output dec_divu  ,
  output dec_remu  ,

  output dec_rv32,//Indicates that the current instruction is 16 bit or 32 bit
  output dec_bjp,//Indicates whether the current instruction is a branch jump instruction
  output dec_jal,//Is it a jal instruction
  output dec_jalr,//Is it a jalr instruction
  output dec_bxx,//Is it a conditional branch instruction
  output [`E203_RFIDX_WIDTH-1:0] dec_jalr_rs1idx,
  output [`E203_XLEN-1:0] dec_bjp_imm 


  e203_exu_decode u_e203_exu_decode(

  .i_pc(`E203_PC_SIZE'b0),//Unused input signals are connected to 0

  .i_misalgn (1'b0),
  .i_buserr  (1'b0),

  .dbg_mode  (1'b0),

      .dec_misalgn(),//Unused output signals are suspended


  .dec_mul   (dec_mul   ),
  .dec_div   (dec_div   ),
  .dec_rem   (dec_rem   ),
  .dec_divu  (dec_divu  ),
  .dec_remu  (dec_remu  ),

  .dec_bjp (dec_bjp ),
  .dec_jal (dec_jal ),
  .dec_bxx (dec_bxx ),

  .dec_bjp_imm    (dec_bjp_imm    )  


7.3.3 simple BPU branch prediction

`include "e203_defines.v"

module e203_ifu_litebpu(

  // Current PC
  input  [`E203_PC_SIZE-1:0] pc,

  // The mini-decoded info 
  input  dec_jal,
  input  dec_jalr,
  input  dec_bxx,
  input  [`E203_XLEN-1:0] dec_bjp_imm,
  input  [`E203_RFIDX_WIDTH-1:0] dec_jalr_rs1idx,

  // The IR index and OITF status to be used for checking dependency
  input  oitf_empty,
  input  ir_empty,
  input  ir_rs1en,
  input  jalr_rs1idx_cam_irrdidx,
  // The add op to next-pc adder
  output bpu_wait,  
  output prdt_taken,  
  output [`E203_PC_SIZE-1:0] prdt_pc_add_op1,  
  output [`E203_PC_SIZE-1:0] prdt_pc_add_op2,

  input  dec_i_valid,

  // The RS1 to read regfile
  output bpu2rf_rs1_ena,
  input  ir_valid_clr,
  input  [`E203_XLEN-1:0] rf2bpu_x1,
  input  [`E203_XLEN-1:0] rf2bpu_rs1,

  input  clk,
  input  rst_n
  // The JAL and JALR is always jump, bxxx backward is predicted as taken  
  assign prdt_taken   = (dec_jal | dec_jalr | (dec_bxx & dec_bjp_imm[`E203_XLEN-1]));  //dec_bjp_imm is the offset of the conditional jump instruction. If it is a negative number, it is a backward jump
  // The JALR with rs1 == x1 have dependency or xN have dependency
    wire dec_jalr_rs1x0 = (dec_jalr_rs1idx == `E203_RFIDX_WIDTH'd0);//Determine rs1 index number
  wire dec_jalr_rs1x1 = (dec_jalr_rs1idx == `E203_RFIDX_WIDTH'd1);
  wire dec_jalr_rs1xn = (~dec_jalr_rs1x0) & (~dec_jalr_rs1x1);

  wire jalr_rs1x1_dep = dec_i_valid & dec_jalr & dec_jalr_rs1x1 & ((~oitf_empty) | (jalr_rs1idx_cam_irrdidx));//Determine whether there is RAW data correlation
  wire jalr_rs1xn_dep = dec_i_valid & dec_jalr & dec_jalr_rs1xn & ((~oitf_empty) | (~ir_empty));

                      // If only depend to IR stage (OITF is empty), then if IR is under clearing, or
                          // it does not use RS1 index, then we can also treat it as non-dependency
  wire jalr_rs1xn_dep_ir_clr = (jalr_rs1xn_dep & oitf_empty & (~ir_empty)) & (ir_valid_clr | (~ir_rs1en));

  wire rs1xn_rdrf_r;
  wire rs1xn_rdrf_set = (~rs1xn_rdrf_r) & dec_i_valid & dec_jalr & dec_jalr_rs1xn & ((~jalr_rs1xn_dep) | jalr_rs1xn_dep_ir_clr);
  wire rs1xn_rdrf_clr = rs1xn_rdrf_r;
  wire rs1xn_rdrf_ena = rs1xn_rdrf_set |   rs1xn_rdrf_clr;
  wire rs1xn_rdrf_nxt = rs1xn_rdrf_set | (~rs1xn_rdrf_clr);

  sirv_gnrl_dfflr #(1) rs1xn_rdrf_dfflrs(rs1xn_rdrf_ena, rs1xn_rdrf_nxt, rs1xn_rdrf_r, clk, rst_n);

  assign bpu2rf_rs1_ena = rs1xn_rdrf_set;

  assign bpu_wait = jalr_rs1x1_dep | jalr_rs1xn_dep | rs1xn_rdrf_set;//If there is data correlation, wait

  assign prdt_pc_add_op1 = (dec_bxx | dec_jal) ? pc[`E203_PC_SIZE-1:0]
                         : (dec_jalr & dec_jalr_rs1x0) ? `E203_PC_SIZE'b0
                         : (dec_jalr & dec_jalr_rs1x1) ? rf2bpu_x1[`E203_PC_SIZE-1:0]
                         : rf2bpu_rs1[`E203_PC_SIZE-1:0];  

    assign prdt_pc_add_op2 = dec_bjp_imm[`E203_PC_SIZE-1:0];  //Generate two operands and send them to the adder for calculation


It can be seen that all the combinational logic in the code is completed by assign, and the triggers involved are instantiated by templates, which is conducive to the realization of the circuit desired by the author by the synthesis tool.

7.3.4 PC generation

PC Auto increment: if the current instruction is 32 bits, the next instruction is PC+4, otherwise + 2. Data input is calculated by a shared adder.

7.3.5 access to ITCM and BIU

Residual cache technology

  • 64 bit SRAM reads out 64 bits each time, but IFU takes 32 bits each time. You can read directly without SRAM enable next time.
  • If the address is not aligned, the current maximum 16 bits of SRAM can be stored in the remaining cache, and a new read operation can be initiated, and then the newly read low 16 bits and the data in the remaining cache can be spliced
  • If the addresses are not aligned during non sequential finger fetching, using two cycles to read will result in a cycle performance waste, which is inevitable.

There are a lot of codes. I can't sort them out. Let's jump first

7.3.6 ITCM

64 bit ITCM can achieve lower power consumption overhead

  • For SRAM with small capacity, 64 bits are more compact than 32 bits. Therefore, with the same capacity, the 64 bit data width area is smaller.
  • Taking out 64 bits at a time can consume less dynamic power consumption

7.4 summary

Posted by iankent on Sun, 05 Dec 2021 09:06:24 -0800