Alberto Nannarelli Ph.D. Dissertation

4.4 Radix-16 Division

4.4.1 Algorithm and Implementation

The radix-16 division algorithm is implemented by the residual recurrence

w[j+1] = 16w[j] - q_j+1d j = 0,1, ¼13

with initial value w[0] = x , and quotient given by

q =

14
å
j = 1

q_j 16^-j

(20)

Two additional cycles, for initialization and rounding, are required to produce the quotient in conventional representation (53 bits for IEEE double-precision) for a total of 16 cycles. As usual, both d and x are normalized in [0.5, 1) and x < d.

The radix-16 division unit is obtained by overlapping the computation of two radix-4 digits [30]. Consequently, the quotient digit is split into two parts q_H and q_L such that q_j = 4q_H + q_L with digit set {-2,-1,0,1,2} in each part, resulting in the digit-set [-10,10] for q_j (a = 10). The quotient digit is determined, at each iteration, by the selection function depicted in Figure 4.17. Once the digit q_H is chosen, its value is used to select among all the possible combinations of q_H d. The redundancy factor is r = [a/(r-1)] = [2/3]. The residual w[j] is stored in carry-save representation (w_S and w_C). The signed-digit representation of the quotient is converted to conventional two's complement representation and rounded by the on-the-fly convert-and-round unit.

Figure 4.17: Selection function for radix-16.

The implementation of the standard divider, optimized for shortest latency, is shown in Figure 4.18. Table 4.6 shows the delay through the two parts of SEL. Note that the larger delay of SELq_L is compensated by the additional carry-save adder in the recurrence (Figure 4.18) in the path from SELq_H.

The critical path post-layout is 9.2 ns and 16 cycles are required to complete the operation, corresponding to a latency of 150 ns. The energy dissipated by this basic implementation is E_div = 46.0 nJ and the contributions of the blocks is shown in Table 4.9 in column ßtandard".

	path		delay [ns]
q_L	SELq_L - MULT - HA - REG
	5.7 + 1.4 + 0.6 + 1.5	=	9.2
q_H	SELq_H - MULT - HA - FA - REG
	4.0 + 1.4 + 0.6 + 1.1 + 1.5	=	8.6

Table 4.6: Critical path through q_L and q_H.

Figure 4.18: Basic implementation radix-16.

4.4.2 Low-Power Implementation

The retiming of the recurrence is done by moving the selection function from the first part of the cycle to the last part of the previous cycle. Two 4-bit registers are needed to store the quotient digit. After the retiming the critical path is limited to the 10 most-significant bits of the recurrence. As shown in Figure 4.19, in order not to increase the cycle delay in the retimed unit, the clock of register q_L is skewed.

Since in the retimed implementation the selection function is placed after the second CSA, instead of directly after the registers, there is a large increase in the number of glitches, which are responsible for the increased dissipation of the selection function. One way to filter those glitches is to buffer the selection function with multiplexers acting as latches, as described in Figure 3.13 at page pageref. The select signal is driven by a different clock (same period, different phase) that enables the muxes to transmit the value from the CSA when it is stable, and hold the current value otherwise. However, in this case the delay of the mux affects the critical path. For radix-16 the energy dissipated in the selection function is halved, but the critical path is increased by about 5%. For this reason, the value is not included in Table 4.9.

For the 44 least-significant bits the radix-2 CSA is replaced by a radix-16 CSA (R16 CSA) that for each digit of the radix stores only one carry bit. Figure 4.20 shows the radix-16 CSA and Table 4.7 explains how the two level of adders are connected to produce the correct residual. The number of flip-flops in register Wc is reduced from 57 to 25.

Figure 4.19: Retiming and critical path.

Figure 4.20: Radix-16 CSA.


iter. j		iteration j+1

		first level					second level

w_s0		0 m_aS		m_a0	S₀		0	m_bS	m_b0	w_s0
w_s1		0		m_a1	S₁		0		m_b1	w_s1
w_s2	w_c2	4w_s0		m_a2	S₂		4S₀		m_b2	w_s2	w_c2
w_s3		4w_s1		m_a3	S₃		4S₁		m_b3	w_s3
w_s4		4w_s2	4w_c2	m_a4	S₄	C₄	4S₂		m_b4	w_s4
w_s5		4w_s3		m_a5	S₅		4S₃		m_b5	w_s5
w_s6	w_c6	4w_s4		m_a6	S₆		4S₄	4C₄	m_b6	w_s6	w_c6
w_s7		4w_s5		m_a7	S₇		4S₅		m_b7	w_s7
w_s8		4w_s6	4w_c6	m_a8	S₈	C₈	4S₆		m_b8	w_s8
w_s9		4w_s7		m_a9	S₉		4S₇		m_b9	w_s9
w_s10	w_c10	4w_s8		m_a10	S₁₀		4S₈	4C₈	m_b10	w_s10	w_c10
w_s11		4w_s9		m_a11	S₁₁		4S₉		m_b11	w_s11
......		......					......

Table 4.7: Bit arrangement in two-level adders.

In order not to increase the cycle time when using radix-16 CSA the two paths shown in Table 4.8 should have the same delay. Therefore, the condition to be satisfied is:

t_R16 CSA £

t_HA + t_{SEL_{q_L}}

= 3.15 ns

The easiest way to implement this R16 CSA is by using a 4-bit carry-ripple adder. The corresponding delay, in our library, is

t_Sn = t_C₁ + 3t_ripple + t_HA = 0.8 + 3(0.35) + 0.5 ns @ 2.0 ns .

MSBs:	MULT	HA	SEL QL	REG
LSBs:	MULT	R16 CSA	R16 CSA	REG

Table 4.8: Paths in MSBs and LSBs in the recurrence.

Furthermore, the 7 most-significant bits assimilated in the selection function could be stored in the register Ws, saving 7 additional flip-flops. However, in the radix-16 case, the assimilated value must be selected among the 5 possible alternatives (see Figure 4.17) and this requires an additional multiplexer driven by q_H that increases the load both on q_H and q_L. For this reason the 7 MSBs bits are stored in carry-save representation.

In the original on-the-fly conversion and rounding algorithm the partial quotient is stored in two registers (Q and QM). By implementing the modified algorithm, with r = [2/3], only register Q (54 bits) and register C (14 bits) are needed. With the implementation of the modified algorithm the number of flip-flops in the convert-and-round unit is reduced from 108 to 69 and the power dissipated from 10.7nJ to 2.4nJ, resulting in a reduction of about 20% in the whole divider.

4.4.3 Dual Voltage Implementation

When applying dual voltage to the recurrence, the two cases with a radix-2 and a radix-16 CSA must be considered. As explained in Section 3.6, the time slack is longer for the radix-2 CSA implementation and the possible voltage V₂ is lower. In particular, for the radix-16 divider (critical path is 9.2 ns), we get the following values:

radix-2 CSA: t_slack = 4.6 ns Þ V₂ = 2.0 V Þ E_div = 22 nJ.
radix-16 CSA: t_slack = 1.8 ns Þ V₂ = 2.7 V. Þ E_div = 27 nJ.

The values of E_div indicated above take into account both the different number of flip-flops in register Wc and the voltage scaling in the convert-and-round unit.

It is clear that the implementation with radix-2 CSA is the most convenient for dual voltage.

4.4.4 Optimization with Synopsys Power Compiler

The synthesized selection function showed a delay of 4.0 ns through q_L, and 3.0 ns through q_H with delay constraints met. In addition the reduction in energy dissipation, obtained by incremental compilation with power dissipation constraints resulted to be about 25%.

4.4.5 Summary of Results for Radix-16

Table 4.9 reports the average energy dissipation and area for the standard and the low-power implementation, which is shown in Figure 4.21.

Figure 4.21: Low-power radix-16 divider.

	standard	low-power	synopsys	dual voltage	synopsys
blocks	nJ	nJ	(est.) nJ	(est.) nJ	(est.) nJ
control	0.5	0.5		0.5
clk tree	0.5	0.5		0.5
mux	2.6	0.4		0.4
mul. gen. H	2.5	1.6		0.8
CSA H	3.3	3.3		1.9
mul. gen. L	2.7	1.8		1.0
CSA L	5.0	4.3		2.5
sel. func.	5.9	8.2	6.1	8.2	6.1
register Ws	4.4	4.3		^*2.1
register Wc	4.2	1.6		^*1.9
register q_H	-	0.2		0.2
register q_L	-	0.2		0.2
SZD	3.7	0.6		0.6
C&R unit	10.7	2.4		^*0.9
E_div [nJ]	46.0	30.0	28.0	22.0	20.0
Ratio	1.00	0.65	0.60	0.50	0.45

Area [mm²]	2.2	1.8	-	-	-

Values marked ^* include level shifters.

Table 4.9: Energy-per-division for radix-16.

Figure 4.22 shows the breakdown, as a percentage of the total, of the energy dissipated in the main blocks composing the unit.

Figure 4.22: Percentage of energy dissipation in radix-16 divider.

In the std the largest part of energy is dissipated in the convert-and-round unit (about 25%). With the application of the energy reduction techniques, the energy dissipated in C&R unit is reduced to less than 10% of the total for l-p and less than 5% for d-v. On the other hand the contribution of the selection function to the total energy dissipation increases up to 35% of the total in d-v. By optimizing the selection function with Synopsys Power Compiler, a reduction of 25% in the block would reflect in a contribution of about 30% of the total in d-v.

File translated from T_EX by T_TH, version 1.1 and by M_E. Last Modified : Fri Jul 9 11:14:34 PDT 1999