Multiplier
Trivially simple.
```systemverilog
module MUL(
    input  logic [31:0] src1,
    input  logic [31:0] src2,
    output logic [63:0] out
);
    // Single-cycle, purely combinational 32x32 -> 64 multiply
    assign out = src1 * src2;
endmodule
```
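Done.
Of course we cannot really build the multiplier this way, even though this code does synthesize to DSP blocks. Why? A 32×32 multiply is a very long combinational path. Stuffing that much logic into the EX stage would wreck the clock frequency.
The fix is to split the multiplication and compute one part per clock cycle, which eases timing somewhat.
First, a question: what extra control logic and hazards does a multi-cycle pipelined multiplier introduce?
To begin with, the module has to output signals that say whether it has finished the current operation and whether it can accept a new one. Next, while a multiply is in flight, the source registers and the write-back destination register of the multiply instruction must be remembered. Otherwise, when the instruction is executing in the side-pipeline multiplier and its result is not yet available, a following instruction that needs that result cannot be detected, and in that case the pipeline has to stall. Finally, at write-back, if the multiplier's destination register is the same as that of the main-pipeline instruction completing at that moment, the value of the later instruction is the one that must end up in the register. There is quite a lot to consider.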
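Splitting the multiplication
To multiply two 32-bit numbers, split each into a low 16-bit half and a high 16-bit half, multiply the halves pairwise, and add the four partial products, each shifted to its proper weight.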
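Written out, with $a$ and $b$ as the two operands and the subscripts $h$ and $l$ marking the high and low halves (these symbols are introduced here only for illustration and do not appear in the code):

$$
a = a_h \cdot 2^{16} + a_l, \qquad b = b_h \cdot 2^{16} + b_l
$$

$$
a \cdot b = a_h b_h \cdot 2^{32} + (a_h b_l + a_l b_h) \cdot 2^{16} + a_l b_l
$$

The low halves are always non-negative; for a signed operand the sign lives entirely in the high half. A quick check with $a = \texttt{0x00030002}$ and $b = \texttt{0x00050004}$, so $a_h = 3$, $a_l = 2$, $b_h = 5$, $b_l = 4$:

$$
a \cdot b = 15 \cdot 2^{32} + 22 \cdot 2^{16} + 8 = 64425951240 = 196610 \times 327684
$$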
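```systemverilog
// Excerpt. src1_signed / src2_signed are the 33-bit extended operands;
// the *_hi halves are assumed to be declared as signed 17-bit values.

// On the inputs: low x low (unsigned), registered into s1_pp_ll
assign pp_ll_comb = src1_signed[15:0] * src2_signed[15:0];

// After S1: the two cross terms. The low half gets a leading 0 so that
// it is treated as a non-negative number in the signed multiply.
always_comb begin
    pp_lh_comb = $signed({1'b0, s1_a_lo}) * s1_b_hi;
    pp_hl_comb = s1_a_hi * $signed({1'b0, s1_b_lo});
end

// After S2: high x high, plus the sum of the two cross terms
always_comb begin
    pp_hh_comb  = s2_a_hi * s2_b_hi;
    pp_mid_comb = $signed(s2_pp_lh) + $signed(s2_pp_hl);
end

// After S3: sign-extend the 34-bit terms to 64 bits, shift each to its
// weight (2^16 and 2^32) and add
assign product_comb = {32'b0, s3_pp_ll}
                    + ({{30{s3_pp_mid[33]}}, s3_pp_mid} << 16)
                    + ({{30{s3_pp_hh[33]}},  s3_pp_hh}  << 32);
```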
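Pipeline registers
Internally we pass along the intermediate values together with a valid bit for the multiply in flight. When the valid bit reaches the last stage it is driven out, signalling that the result is ready to be output.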
```systemverilog
// S1: capture the operand halves and the low x low partial product
always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
        s1_valid    <= 1'b0;
        s1_op       <= 2'b0;
        s1_rd       <= 5'b0;
        s1_canceled <= 1'b0;
        s1_a_hi     <= 17'b0;
        s1_b_hi     <= 17'b0;
        s1_a_lo     <= 16'b0;
        s1_b_lo     <= 16'b0;
        s1_pp_ll    <= 32'b0;
    end else if (flush_i) begin
        s1_valid    <= 1'b0;
        s1_op       <= 2'b0;
        s1_rd       <= 5'b0;
        s1_canceled <= 1'b0;
        s1_a_hi     <= 17'b0;
        s1_b_hi     <= 17'b0;
        s1_a_lo     <= 16'b0;
        s1_b_lo     <= 16'b0;
        s1_pp_ll    <= 32'b0;
    end else begin
        s1_valid    <= mul_valid_i;
        s1_op       <= mul_op_i;
        s1_rd       <= mul_rd_i;
        // Cancel on entry if a later instruction already overwrites rd
        s1_canceled <= (cancel_rd_i != 5'b0) && (cancel_rd_i == mul_rd_i) && mul_valid_i;
        s1_a_hi     <= src1_signed[32:16];
        s1_b_hi     <= src2_signed[32:16];
        s1_a_lo     <= src1_signed[15:0];
        s1_b_lo     <= src2_signed[15:0];
        s1_pp_ll    <= pp_ll_comb;
    end
end

// S4: final product plus the control bits that travel with it
always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
        s4_valid    <= 1'b0;
        s4_op       <= 2'b0;
        s4_rd       <= 5'b0;
        s4_canceled <= 1'b0;
        s4_product  <= 64'b0;
    end else if (flush_i) begin
        s4_valid    <= 1'b0;
        s4_op       <= 2'b0;
        s4_rd       <= 5'b0;
        s4_canceled <= 1'b0;
        s4_product  <= 64'b0;
    end else begin
        s4_valid    <= s3_valid;
        s4_op       <= s3_op;
        s4_rd       <= s3_rd;
        // Stay canceled once canceled; also cancel if hit in this cycle
        s4_canceled <= s3_canceled || ((cancel_rd_i != 5'b0) && (cancel_rd_i == s3_rd) && s3_valid);
        s4_product  <= product_comb;
    end
end
```
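The control half of the two stages shown above follows one pattern: pass `valid` and `rd` along, and make `canceled` sticky. As a minimal sketch, that pattern can be pulled into a helper module. The module name `mul_ctrl_stage` and its port names are hypothetical and not part of the original design, which writes each stage out by hand.

```systemverilog
// Hypothetical helper (not in the original design): the control half
// of one multiplier pipeline stage, following the S1/S4 code above.
module mul_ctrl_stage (
    input  logic       clk,
    input  logic       rst_n,        // asynchronous, active low
    input  logic       flush_i,      // synchronous flush
    input  logic [4:0] cancel_rd_i,  // x0 means "nothing to cancel"
    input  logic       valid_i,
    input  logic [4:0] rd_i,
    input  logic       canceled_i,   // tie to 1'b0 for the first stage
    output logic       valid_o,
    output logic [4:0] rd_o,
    output logic       canceled_o
);
    // The incoming instruction is hit by a cancel request in this cycle
    logic cancel_hit;
    assign cancel_hit = valid_i && (cancel_rd_i != 5'd0) && (cancel_rd_i == rd_i);

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            valid_o    <= 1'b0;
            rd_o       <= 5'd0;
            canceled_o <= 1'b0;
        end else if (flush_i) begin
            valid_o    <= 1'b0;
            rd_o       <= 5'd0;
            canceled_o <= 1'b0;
        end else begin
            valid_o    <= valid_i;
            rd_o       <= rd_i;
            // Once canceled, an instruction stays canceled
            canceled_o <= canceled_i || cancel_hit;
        end
    end
endmodule
```

Connecting `s3_valid`, `s3_rd` and `s3_canceled` to the inputs would give the same behaviour as the hand-written S4 control registers. At the last stage, the write enable of the multiplier's register-file port would presumably be `valid_o && !canceled_o`, although the original text does not show that wiring.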
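RegisterF changes
The multiplier now produces a result of its own, and it cannot be written at the same time as the main pipeline's result, because the register file has only one write port. Serializing the two writes would cost a stall cycle, and the whole point of pipelining the multiplier is to stall as little as possible. So the register file gets a second write port: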
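```systemverilog
always_ff @(posedge clk) begin
    // Port 1: main pipeline
    if (rf_we && wR != 5'd0) begin
        rf_in[wR] <= wD;
    end
    // Port 2: multiplier. Dropped when port 1 writes the same register
    // in this cycle, because the main-pipeline instruction is the newer one.
    if (rf_we2 && wR2 != 5'd0 && !(rf_we && wR == wR2)) begin
        rf_in[wR2] <= wD2;
    end
end
```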
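For context, here is a minimal sketch of a complete register file around that write logic. The write-side names follow the snippet above; the module name `regfile_2w`, the read ports `rR1`/`rR2`/`rD1`/`rD2`, and the choice of combinational reads with x0 hard-wired to zero are assumptions, since the original read side is not shown.

```systemverilog
// Sketch only: read side and module name are assumed, not from the original.
module regfile_2w (
    input  logic        clk,
    // Read ports (assumed: combinational, x0 reads as zero)
    input  logic [4:0]  rR1,
    input  logic [4:0]  rR2,
    output logic [31:0] rD1,
    output logic [31:0] rD2,
    // Write port 1: main pipeline
    input  logic        rf_we,
    input  logic [4:0]  wR,
    input  logic [31:0] wD,
    // Write port 2: multiplier
    input  logic        rf_we2,
    input  logic [4:0]  wR2,
    input  logic [31:0] wD2
);
    logic [31:0] rf_in [31:0];

    always_ff @(posedge clk) begin
        if (rf_we && wR != 5'd0) begin
            rf_in[wR] <= wD;
        end
        // Port 1 has priority when both ports target the same register
        if (rf_we2 && wR2 != 5'd0 && !(rf_we && wR == wR2)) begin
            rf_in[wR2] <= wD2;
        end
    end

    // x0 is never written, so return a constant zero for it
    assign rD1 = (rR1 == 5'd0) ? 32'd0 : rf_in[rR1];
    assign rD2 = (rR2 == 5'd0) ? 32'd0 : rf_in[rR2];
endmodule
```

One likely cost of this design is that a memory with two write ports usually cannot be mapped onto a simple dual-port RAM primitive, so synthesis tools tend to build it from flip-flops.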
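HazardUnit changes
New conflicts, new analysis
With two write ports, the multiplier result and the main-pipeline result are written back independently, so both may target the same destination register in the same cycle. As in the register-file code above, the main-pipeline port wins: a multiply is a long instruction, so a short instruction that writes back at the same moment comes later in program order and carries the newer value. Per the ISA, the later instruction must be the last one to write; otherwise program correctness is broken.
Start by detecting every possible WAW conflict: for each stage of the multiplier, a WAW conflict exists if that stage holds a valid MUL, the instruction in ID writes a register other than x0, and the two destination registers are the same.
If the ID instruction also reads the MUL's destination register, the conflict is WAW plus RAW, and the pipeline must stall for the RAW dependency, because the product is not ready until the fourth cycle: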
```asm
# Cycle 1
MUL x5, x1, x2    # S1, will write x5
# Cycle 2
ADD x6, x5, x7    # ID, reads x5 (RAW dependency)
# Cycle 3
MUL x5, x5, x4    # ID, reads x5 and writes x5 (WAW + RAW)
# Detection result:
#   mul_waw_conflict[1] = 1   (conflicts with the first MUL, in S1)
#   id_reads_mul_rd[1]  = 1   (ID reads x5)
#   mul_waw_hazard      = 1   -> stall the pipeline
```
| Cycle | PC | IF | ID | EX | MUL S1→S4 | Action |
|---|---|---|---|---|---|---|
| 1 | 0x00 | MUL1 | - | - | - | MUL1 enters S1 |
| 2 | 0x04 | MUL2 | MUL1 | | S1 | WAW+RAW detected, mul_waw_hazard=1 |
| 3 | | | | | S2 | Stall (PC held) |
| 4 | | | | | S3 | Stall |
| 5 | | | | | S4 | |
| 6 | 0x08 | ... | MUL2 | MUL1 | - | MUL1 done, MUL2 reads the correct x5 |
There is also the pure WAW conflict, where two instructions write the same destination register but there is no RAW dependency:
```asm
# Cycle 1
MUL x5, x1, x2    # S1, will write x5
# Cycle 2
ADD x5, x3, x4    # ID, writes x5 but does not read it (WAW only)
# Detection result:
#   mul_waw_conflict[1] = 1   (conflicts with the MUL in S1)
#   id_reads_mul_rd[1]  = 0   (ID does not read x5)
#   pure_waw_conflict   = 1
#   mul_cancel_rd       = x5  -> cancel the first MUL's write-back
```
For this kind of conflict, the mul_cancel_rd signal carries the address of the register whose multiplier write-back is to be canceled.
| Cycle | PC | IF | ID | EX | MEM | WB | MUL S1→S4 | Action |
|---|---|---|---|---|---|---|---|---|
| 1 | 0x00 | MUL | - | - | - | - | - | MUL enters S1 |
| 2 | 0x04 | ADD | MUL | | | | S1 | Pure WAW detected, mul_cancel_rd=x5 |
| 3 | 0x08 | ... | ADD | MUL | | | S2 (canceled) | MUL marked as canceled |
| 4 | 0x0C | ... | | ADD | MUL | | S3 (canceled) | ADD executes normally |
| 5 | 0x10 | ... | | | ADD | MUL | S4 (canceled) | ADD writes x5 |
| 6 | 0x14 | ... | | | | ADD | - | MUL does not write x5 |
WAW dependency logic
Inside the HazardUnit module, add the signals exported from the multiplier's internal pipeline registers:
```systemverilog
module HazardUnit(
    // Excerpt: only the multiplier-related ports are listed here
    input  logic [ 3:0]      mul_stage_busy,   // valid bit of S1..S4
    input  logic [ 3:0][4:0] mul_rd_s,         // destination register of S1..S4
    input  logic             is_mul_instr_ID,
    input  logic             is_mul_instr_EX,
    output logic [ 4:0]      mul_cancel_rd     // x0 means "cancel nothing"
);

    logic mul_use_hazard;
    logic mul_struct_hazard;
    logic mul_waw_hazard;
    logic pure_waw_conflict;

    // Structural hazard: a MUL in ID while S1 is still occupied
    assign mul_struct_hazard = is_mul_instr_ID && mul_stage_busy[0];

    // Index 0 is the MUL in EX, indices 1..4 are S1..S4
    logic [4:0][4:0] mul_rd_all;
    logic [4:0]      mul_vld_all;
    assign mul_rd_all  = {mul_rd_s, wR_EX};
    assign mul_vld_all = {mul_stage_busy, is_mul_instr_EX};

    logic [4:0] mul_raw_hit_r1;
    logic [4:0] mul_raw_hit_r2;
    logic [4:0] id_reads_mul_rd;
    logic [4:0] mul_waw_conflict;

    always_comb begin
        mul_raw_hit_r1   = '0;
        mul_raw_hit_r2   = '0;
        id_reads_mul_rd  = '0;
        mul_waw_conflict = '0;

        for (int i = 0; i < 5; i++) begin
            // RAW: ID reads a register that an in-flight MUL will write
            mul_raw_hit_r1[i] = mul_vld_all[i] && (mul_rd_all[i] != 5'd0) &&
                                rs1_used_ID && (mul_rd_all[i] == rR1_ID);
            mul_raw_hit_r2[i] = mul_vld_all[i] && (mul_rd_all[i] != 5'd0) &&
                                rs2_used_ID && (mul_rd_all[i] == rR2_ID);

            // Does ID read this stage's destination register?
            id_reads_mul_rd[i] = (mul_rd_all[i] != 5'd0) &&
                                 ((rs1_used_ID && (rR1_ID == mul_rd_all[i])) ||
                                  (rs2_used_ID && (rR2_ID == mul_rd_all[i])));

            // WAW: ID writes the same non-x0 register as a valid MUL
            mul_waw_conflict[i] = mul_vld_all[i] && rf_we_ID &&
                                  (wR_ID != 5'd0) && (mul_rd_all[i] == wR_ID);
        end

        mul_use_hazard    = (|mul_raw_hit_r1) || (|mul_raw_hit_r2);

        // WAW + RAW stalls; pure WAW cancels the MUL's write-back instead
        mul_waw_hazard    = |(mul_waw_conflict &  id_reads_mul_rd);
        pure_waw_conflict = |(mul_waw_conflict & ~id_reads_mul_rd);
    end

    assign mul_cancel_rd = pure_waw_conflict ? wR_ID : 5'd0;

    logic any_hazard;
    assign any_hazard = load_use_hazard || mul_use_hazard ||
                        mul_struct_hazard || mul_waw_hazard;

    always_comb begin
        keep_pc     = any_hazard;
        stall_IF_ID = any_hazard;
        flush_IF_ID = branch_predicted_result;
        flush_ID_EX = (branch_predicted_result || any_hazard);
    end

endmodule
```