Multiplier

It's very simple.

module MUL(
    input  logic [31:0] src1,
    input  logic [31:0] src2,
    output logic [63:0] out
);
    // Operands are extended to the 64-bit width of `out`, giving the full unsigned product
    assign out = src1 * src2;
endmodule

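Done.


Of course we can't implement the multiplier this way, even though this code can be synthesized into a DSP block. Why? Because the multiplication is one very long combinational path and takes far too long to settle. Stuffing that much logic into the EX stage would wreck the achievable clock frequency.

What to do? We split the multiplication up and compute one part of it per clock cycle, which shortens the critical path and improves timing somewhat.

Four-stage pipelined multiplier: MUL.sv

First, a question: what extra timing control and hazards does a multi-cycle pipelined multiplier introduce?

To begin with, the module has to output signals indicating whether it has finished the current operation and whether it can accept a new one. Also, during execution we have to remember the source registers and the write-back destination register of the multiply instruction. Otherwise, while the instruction is executing in the multiplier's side pipeline and its result is not ready yet, a following instruction may need that result, and the pipeline must stall. Likewise, at write-back, if the multiplier's destination register matches the destination of the instruction that is completing in the main pipeline, the most recent value must be the one written. There's quite a lot to think about.

Splitting the multiply

To multiply two 32-bit numbers, we can split each into its low 16 bits and high 16 bits, multiply the halves pairwise, and finally add the four partial products together. With A = A_hi·2^16 + A_lo and B = B_hi·2^16 + B_lo, the product is A·B = A_lo·B_lo + (A_lo·B_hi + A_hi·B_lo)·2^16 + A_hi·B_hi·2^32.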

// Stage 1
pp_ll_comb = src1_signed[15:0] * src2_signed[15:0];

// Stage 2
always_comb begin
    // A_lo * B_hi (16-bit unsigned * 17-bit signed = 33-bit signed)
    pp_lh_comb = $signed({1'b0, s1_a_lo}) * s1_b_hi;
    // A_hi * B_lo (17-bit signed * 16-bit unsigned = 33-bit signed)
    pp_hl_comb = s1_a_hi * $signed({1'b0, s1_b_lo});
end

// Stage 3
always_comb begin
    // A_hi * B_hi (17-bit signed * 17-bit signed = 34-bit signed)
    pp_hh_comb = s2_a_hi * s2_b_hi;
    // Sum of middle partial products (with sign extension)
    pp_mid_comb = $signed(s2_pp_lh) + $signed(s2_pp_hl);
end

// Stage 4
product_comb = {32'b0, s3_pp_ll} + ({{30{s3_pp_mid[33]}}, s3_pp_mid} << 16) +
               ({{30{s3_pp_hh[33]}}, s3_pp_hh} << 32);
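
The arithmetic behind the four stages can be sanity-checked in software. Below is a minimal Python sketch, not the hardware itself. It assumes, based on the `src1_signed[32:16]` slices used in the module, that each 32-bit operand is first extended to 33 bits (sign-extended for a signed operand, zero-extended otherwise), so that the high half is a 17-bit signed value and the low half is a 16-bit unsigned value. The helper names `extend_operand`, `split_multiply` and `low64` are made up for this illustration.

```python
def extend_operand(x32, is_signed):
    """Model the assumed 33-bit extension of a 32-bit register value."""
    x32 &= 0xFFFFFFFF
    if is_signed and (x32 & 0x80000000):
        x32 -= 1 << 32          # reinterpret as a negative two's-complement value
    return x32


def split_multiply(a, b):
    """Multiply two extended operands using four partial products."""
    a_lo, b_lo = a & 0xFFFF, b & 0xFFFF   # unsigned low halves
    a_hi, b_hi = a >> 16, b >> 16         # signed high halves (arithmetic shift)

    pp_ll = a_lo * b_lo                   # stage 1
    pp_lh = a_lo * b_hi                   # stage 2
    pp_hl = a_hi * b_lo                   # stage 2
    pp_hh = a_hi * b_hi                   # stage 3
    pp_mid = pp_lh + pp_hl                # stage 3

    # Stage 4: weight each partial product and sum
    return pp_ll + (pp_mid << 16) + (pp_hh << 32)


def low64(value):
    """Truncate to the 64-bit pattern that the product register would hold."""
    return value & ((1 << 64) - 1)


if __name__ == "__main__":
    samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0x12345678]
    for x in samples:
        for y in samples:
            for sx in (False, True):
                for sy in (False, True):
                    a, b = extend_operand(x, sx), extend_operand(y, sy)
                    assert split_multiply(a, b) == a * b
```

Because Python integers are unbounded, the sketch checks only that the decomposition is exact; it says nothing about whether the bit widths chosen in the RTL are sufficient, which still has to be verified in simulation.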

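Pipeline signals

Internally we need to pass along a bunch of intermediate values, plus a signal saying whether the multiplication in each stage is valid. When that signal reaches the last stage it is driven out of the module to indicate that a result is ready.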

always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
        s1_valid    <= 1'b0;
        s1_op       <= 2'b0;
        s1_rd       <= 5'b0;
        s1_canceled <= 1'b0;
        s1_a_hi     <= 17'b0;
        s1_b_hi     <= 17'b0;
        s1_a_lo     <= 16'b0;
        s1_b_lo     <= 16'b0;
        s1_pp_ll    <= 32'b0;
    end else if (flush_i) begin
        s1_valid    <= 1'b0;
        s1_op       <= 2'b0;
        s1_rd       <= 5'b0;
        s1_canceled <= 1'b0;
        s1_a_hi     <= 17'b0;
        s1_b_hi     <= 17'b0;
        s1_a_lo     <= 16'b0;
        s1_b_lo     <= 16'b0;
        s1_pp_ll    <= 32'b0;
    end else begin
        s1_valid    <= mul_valid_i;
        s1_op       <= mul_op_i;
        s1_rd       <= mul_rd_i;
        // Check if this stage should be canceled (WAW without RAW)
        s1_canceled <= (cancel_rd_i != 5'b0) && (cancel_rd_i == mul_rd_i) && mul_valid_i;
        // Store split operands for next stage
        s1_a_hi     <= src1_signed[32:16];
        s1_b_hi     <= src2_signed[32:16];
        s1_a_lo     <= src1_signed[15:0];
        s1_b_lo     <= src2_signed[15:0];
        s1_pp_ll    <= pp_ll_comb;
    end
end

// ...

always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
        s4_valid    <= 1'b0;
        s4_op       <= 2'b0;
        s4_rd       <= 5'b0;
        s4_canceled <= 1'b0;
        s4_product  <= 64'b0;
    end else if (flush_i) begin
        s4_valid    <= 1'b0;
        s4_op       <= 2'b0;
        s4_rd       <= 5'b0;
        s4_canceled <= 1'b0;
        s4_product  <= 64'b0;
    end else begin
        s4_valid    <= s3_valid;
        s4_op       <= s3_op;
        s4_rd       <= s3_rd;
        // Propagate cancel or detect new cancel for this stage
        s4_canceled <= s3_canceled ||
                       ((cancel_rd_i != 5'b0) && (cancel_rd_i == s3_rd) && s3_valid);
        s4_product  <= product_comb;
    end
end
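
To see how the valid, rd and canceled flags travel together, here is a small behavioural sketch in Python. It is only a model under assumptions: the S2 and S3 registers are elided in the listing above, so the sketch assumes they follow the same propagate-or-detect pattern as S4, and it assumes the final write-back is gated by `valid && !canceled && rd != 0`. The names `Slot`, `MulPipeModel`, `step` and `writeback` are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    valid: bool = False
    rd: int = 0
    canceled: bool = False


class MulPipeModel:
    """Bookkeeping-only model: stages[0] is S1 ... stages[3] is S4."""

    def __init__(self):
        self.stages = [Slot() for _ in range(4)]

    def step(self, valid_i=False, rd_i=0, cancel_rd=0, flush=False):
        """Advance one clock edge."""
        if flush:
            # flush_i clears every stage
            self.stages = [Slot() for _ in range(4)]
            return

        def hit(valid, rd):
            # Same condition as in the RTL: nonzero cancel_rd matching a valid entry
            return cancel_rd != 0 and cancel_rd == rd and valid

        nxt = [Slot(valid_i, rd_i, hit(valid_i, rd_i))]
        for prev in self.stages[:3]:
            # Assumed for S2/S3, shown in the source for S4:
            # keep an earlier cancel, or pick up a new one
            nxt.append(Slot(prev.valid, prev.rd,
                            prev.canceled or hit(prev.valid, prev.rd)))
        self.stages = nxt

    def writeback(self):
        """Return the register S4 would write, or None (assumed gating)."""
        s4 = self.stages[3]
        if s4.valid and not s4.canceled and s4.rd != 0:
            return s4.rd
        return None


if __name__ == "__main__":
    pipe = MulPipeModel()
    pipe.step(valid_i=True, rd_i=5)   # MUL x5 enters S1
    pipe.step(cancel_rd=5)            # a pure WAW on x5 is seen while MUL moves to S2
    pipe.step()
    pipe.step()                       # MUL is now in S4
    assert pipe.writeback() is None   # its write-back is suppressed
```

The point of carrying `canceled` along with the data is that a cancel can arrive while the MUL is in any stage, yet it only takes effect once the result reaches S4.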

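RegisterF changes

We now have an extra result coming from the multiplier, and it can't be written in the same cycle as the main pipeline's result, because the register file has only one write port. Writing the two results one after the other would cost another stall cycle, and the whole point of pipelining the multiplier is to stall as little as possible. So the register file is changed to have two write ports: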

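// Writes are sequential logic, with two write ports.
// When both ports write the same register in the same cycle, the main-pipeline port wins:
// MUL is a long-latency instruction, so a short instruction writing back at that point
// is later in program order and its result is newer.
// The newer register value is the one that must be kept.
always_ff @(posedge clk) begin
    // Main-pipeline write port
    if (rf_we && wR != 5'd0) begin
        rf_in[wR] <= wD;
    end
    // Multiplier write port (lower priority)
    if (rf_we2 && wR2 != 5'd0 && !(rf_we && wR == wR2)) begin
        rf_in[wR2] <= wD2;
    end
end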
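
The priority rule is easy to check with a tiny software model. The following Python sketch mirrors the two `if` statements above; `RegFileModel` and `clock` are names made up for this illustration, and read ports are left out.

```python
class RegFileModel:
    """Write-side model of a 32-entry register file with two write ports."""

    def __init__(self):
        self.regs = [0] * 32

    def clock(self, we=False, wr=0, wd=0, we2=False, wr2=0, wd2=0):
        """One rising clock edge; port 1 is the main pipeline, port 2 the multiplier."""
        # Main-pipeline port: writes to x0 are ignored
        if we and wr != 0:
            self.regs[wr] = wd
        # Multiplier port: suppressed when the main port targets the same register
        if we2 and wr2 != 0 and not (we and wr == wr2):
            self.regs[wr2] = wd2


if __name__ == "__main__":
    rf = RegFileModel()
    # Both ports hit x5 in the same cycle: the main-pipeline value must survive
    rf.clock(we=True, wr=5, wd=111, we2=True, wr2=5, wd2=222)
    assert rf.regs[5] == 111
    # Different targets: both writes land
    rf.clock(we=True, wr=6, wd=1, we2=True, wr2=7, wd2=2)
    assert (rf.regs[6], rf.regs[7]) == (1, 2)
```

Note that this priority only resolves a same-cycle collision. When the MUL write-back arrives in a later cycle than the younger instruction's write, the register file alone cannot help, which is why the hazard unit also needs a cancel mechanism.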

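HazardUnit changes

New conflicts, new analysis

With a dual-write-port register file, the multiplier result and the main-pipeline result are independent, so both may end up targeting the same destination register at write-back. As the comment above says, MUL is a long-latency instruction, so a short instruction that writes back afterwards is later in program order and its result is newer. The ISA requires that the write of the later instruction is the one that finally takes effect; otherwise program correctness is broken.

First, detect every possible WAW conflict: a WAW conflict exists if a multiplier stage holds a valid MUL instruction, the instruction in ID writes a register other than x0, and the two destination registers are the same.

If the ID instruction also reads the MUL's destination register, this is a WAW+RAW conflict, and the pipeline must stall for the RAW dependency, because the multiplication result is only available in the fourth cycle: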

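# Cycle 1
MUL x5, x1, x2 # in S1, will write x5

# Cycle 2
ADD x6, x5, x7 # in ID, reads x5 (RAW dependency)

# Cycle 3
MUL x5, x5, x4 # in ID, reads x5 and writes x5 (WAW + RAW)
# Detection result:
# mul_waw_conflict[1] = 1 (conflicts with the first MUL in S1)
# id_reads_mul_rd[1] = 1 (ID reads x5)
# mul_waw_hazard = 1 → stall the pipeline

Timing for the two MUL instructions (MUL1 is the first, MUL2 the second):

Cycle  PC    IF    ID    EX    MUL S1→S4  Action
1      0x00  MUL1  -     -     -          MUL1 enters S1
2      0x04  MUL2  MUL1        S1         WAW+RAW detected, mul_waw_hazard=1
3                              S2         stall (PC held)
4                              S3         stall
5                              S4
6      0x08  ...   MUL2  MUL1  -          MUL1 done, MUL2 reads the correct x5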

There is also the pure WAW conflict, where both write-backs target the same register but there is no RAW dependency:

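# Cycle 1
MUL x5, x1, x2 # in S1, will write x5

# Cycle 2
ADD x5, x3, x4 # in ID, writes x5 but does not read x5 (WAW only)
# Detection result:
# mul_waw_conflict[1] = 1 (conflicts with the MUL in S1)
# id_reads_mul_rd[1] = 0 (ID does not read x5)
# pure_waw_conflict = 1
# mul_cancel_rd = x5 → cancel the write-back of the first MUL

For this kind of conflict, the mul_cancel_rd signal carries the address of the register whose write-back should be canceled.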

Cycle  PC    IF   ID   EX   MEM  WB   MUL S1→S4     Action
1      0x00  MUL  -    -    -    -    -             MUL enters S1
2      0x04  ADD  MUL                 S1            pure WAW detected, mul_cancel_rd=x5
3      0x08  ...  ADD  MUL            S2(canceled)  MUL marked as canceled
4      0x0C  ...       ADD  MUL       S3(canceled)  ADD executes normally
5      0x10  ...            ADD  MUL  S4(canceled)  ADD writes x5
6      0x14  ...                 ADD  -             MUL does not write x5

WAW dependency logic

Inside the HazardUnit module, add inputs for the signals held in the multiplier's internal pipeline registers:

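module HazardUnit(
    // ..
    // Multiplier status signals (4-stage pipeline)
    // Convention: mul_stage_busy[0]=S1 ... [3]=S4; mul_rd_s[0]=S1 ... [3]=S4
    input  logic [ 3:0]      mul_stage_busy,  // busy flag of each multiplier pipeline stage
    input  logic [ 3:0][4:0] mul_rd_s,        // destination register held in each multiplier stage
    input  logic             is_mul_instr_ID, // whether the instruction in ID is a multiply
    input  logic             is_mul_instr_EX, // whether the instruction in EX is a multiply
    // ..
    // Multiplier write-back invalidation (cancels the MUL write-back on a WAW hazard)
    output logic [ 4:0]      mul_cancel_rd
);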


// ------------------------------------------------------------
// MUL hazard detection (RAW + structural hazard + WAW handling)
// 4-stage multiplier pipeline: S1->S2->S3->S4 (result available at the end of S4)
// ------------------------------------------------------------
logic mul_use_hazard;
logic mul_struct_hazard;
logic mul_waw_hazard;
logic pure_waw_conflict;

// Structural hazard: a new multiply cannot enter while S1 is occupied
assign mul_struct_hazard = is_mul_instr_ID && mul_stage_busy[0];

// Merge EX + S1..S4 into 5 lanes so they can be handled in one loop
// mul_rd_all[0]=EX, mul_rd_all[1]=S1 ... mul_rd_all[4]=S4
logic [4:0][4:0] mul_rd_all;
logic [4:0]      mul_vld_all;
assign mul_rd_all  = {mul_rd_s, wR_EX};
assign mul_vld_all = {mul_stage_busy, is_mul_instr_EX};

// Debug/visualization vectors (per-stage hits can be read directly in the waveform)
logic [4:0] mul_raw_hit_r1;
logic [4:0] mul_raw_hit_r2;
logic [4:0] id_reads_mul_rd;
logic [4:0] mul_waw_conflict;

always_comb begin
    mul_raw_hit_r1   = '0;
    mul_raw_hit_r2   = '0;
    id_reads_mul_rd  = '0;
    mul_waw_conflict = '0;

    for (int i = 0; i < 5; i++) begin
        // RAW: ID reads rR1/rR2 and it matches the rd of any in-flight MUL
        mul_raw_hit_r1[i] = mul_vld_all[i] && (mul_rd_all[i] != 5'd0) && rs1_used_ID &&
                            (mul_rd_all[i] == rR1_ID);
        mul_raw_hit_r2[i] = mul_vld_all[i] && (mul_rd_all[i] != 5'd0) && rs2_used_ID &&
                            (mul_rd_all[i] == rR2_ID);

        // Does ID read this rd? (distinguishes "WAW that must stall" from "WAW that only cancels the write-back")
        id_reads_mul_rd[i] = (mul_rd_all[i] != 5'd0) &&
                             ((rs1_used_ID && (rR1_ID == mul_rd_all[i])) ||
                              (rs2_used_ID && (rR2_ID == mul_rd_all[i])));

        // WAW conflict: ID will write wR_ID, and it equals the rd of a MUL in some stage
        mul_waw_conflict[i] = mul_vld_all[i] && rf_we_ID && (wR_ID != 5'd0) &&
                              (mul_rd_all[i] == wR_ID);
    end

    mul_use_hazard = (|mul_raw_hit_r1) || (|mul_raw_hit_r2);

    // WAW hazard: a stall is needed only when WAW comes together with a RAW read dependency
    mul_waw_hazard    = |(mul_waw_conflict & id_reads_mul_rd);
    pure_waw_conflict = |(mul_waw_conflict & ~id_reads_mul_rd);
end

// Pure WAW conflict (no RAW dependency): cancel the MUL write-back to that register
assign mul_cancel_rd = pure_waw_conflict ? wR_ID : 5'd0;

// ------------------------------------------------------------
// Pipeline flush and stall
// ------------------------------------------------------------
logic any_hazard;
assign any_hazard = load_use_hazard || mul_use_hazard || mul_struct_hazard || mul_waw_hazard;

always_comb begin
    keep_pc     = any_hazard;
    stall_IF_ID = any_hazard;
    flush_IF_ID = branch_predicted_result;
    flush_ID_EX = (branch_predicted_result || any_hazard);
end

endmodule
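
The stall-or-cancel decision made by the loop above can be mirrored in a few lines of Python, which is handy for trying out instruction sequences by hand. This is a sketch of the RAW/WAW classification only: load-use and structural hazards are left out, `classify` is a hypothetical helper, and passing `None` for a source register stands in for `rs1_used_ID` or `rs2_used_ID` being low.

```python
def classify(in_flight, id_rd, id_writes, id_rs1=None, id_rs2=None):
    """Decide what the instruction in ID must do.

    in_flight: list of (valid, rd) pairs for EX, S1, S2, S3, S4, in that order.
    Returns (stall, cancel_rd), where cancel_rd == 0 means "cancel nothing".
    """
    raw = False        # corresponds to mul_use_hazard
    waw_raw = False    # corresponds to mul_waw_hazard
    pure_waw = False   # corresponds to pure_waw_conflict

    for valid, rd in in_flight:
        # id_reads_mul_rd[i]: ID reads this lane's rd (x0 never counts)
        reads = rd != 0 and (id_rs1 == rd or id_rs2 == rd)
        # mul_waw_conflict[i]: ID writes the same nonzero register as a valid MUL
        waw = valid and id_writes and id_rd != 0 and rd == id_rd

        raw = raw or (valid and reads)
        waw_raw = waw_raw or (waw and reads)
        pure_waw = pure_waw or (waw and not reads)

    stall = raw or waw_raw
    cancel_rd = id_rd if pure_waw else 0
    return stall, cancel_rd


if __name__ == "__main__":
    # MUL x5 sits in S1 (lane 1); every other lane is empty
    lanes = [(False, 0), (True, 5), (False, 0), (False, 0), (False, 0)]

    # ADD x5, x3, x4: writes x5 without reading it -> cancel the MUL write-back
    assert classify(lanes, id_rd=5, id_writes=True, id_rs1=3, id_rs2=4) == (False, 5)

    # MUL x5, x5, x4: reads and writes x5 -> stall, nothing canceled
    assert classify(lanes, id_rd=5, id_writes=True, id_rs1=5, id_rs2=4) == (True, 0)

    # ADD x6, x5, x7: plain RAW on x5 -> stall
    assert classify(lanes, id_rd=6, id_writes=True, id_rs1=5, id_rs2=7) == (True, 0)
```

In this model, as in the RTL, every WAW+RAW case is already a RAW hit, so `mul_waw_hazard` never adds a stall that `mul_use_hazard` would not raise on its own; it mainly separates the two WAW cases so that only the pure one drives `mul_cancel_rd`.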