# Standard CMOS Voltage-mode QLUT Using a Clock Boosting Technique

Diogo Brito, Jorge Fernandes, Paulo Flores, José Monteiro INESC-ID / Instituto Superior Técnico - TU Lisbon, Portugal {diogo.brito, jorge.fernandes, pff, jcm}@inesc-id.pt

Abstract: Interconnect has become preponderant in many aspects of digital circuit design, namely delay, power and area. This is particularly true for FPGAs, where interconnection is often the most limiting factor. Multiple-valued logic reduces the number of interconnections, both within logic cells and between them, and so mitigates their impact. In this paper we propose a new look-up table structure based on a low-power, high-speed quaternary voltage-mode device. Our quaternary implementation overcomes the drawbacks of previously proposed techniques by using a standard CMOS technology and a clock boosting technique to enhance speed without increasing consumption. Moreover, we present an ASIC prototype of a full adder based on the designed look-up table, and compare the experimental results with simulation. The prototype is designed to work at 100 MHz and consumes 128 µW.

#### I. INTRODUCTION

Improvements in integration techniques for systems on chip (SoCs) have led to an exponential reduction in the dimensions of circuit elements, allowing better performance, power savings and lower fabrication costs. They also allow more components per circuit, increasing its computational potential. This in turn significantly increases the number and length of interconnections, which, due to their resistance and capacitance, are now the dominant cause of delay in digital circuits. To compensate for these large interconnect parasitics, it has been suggested to compact the information carried by a single logic gate or signal path. Multiple-valued logic (MVL) was proposed as a solution [1], since a single wire carrying a signal with N logic levels can replace $\log_2(N)$ wires carrying binary signals.

Hence, there is renewed interest in MVL as a means to reduce the number of wires, which reduces the node capacitance and therefore increases speed and reduces power consumption. In addition, the logic levels are closer to each other, so smaller charges are needed for transitions, further reducing power consumption. Moreover, with less routing to be done, the area used is smaller.

Several solutions using MVL have been developed for adders, multipliers and programmable devices. However, the existing implementations either are based on current-mode circuits, which have high power consumption [2], or require extra steps in the fabrication process to generate transistors with different threshold voltages [3].

Modern FPGAs are good candidates for MVL, because this class of devices deals with a large number of interconnections. Moreover, the maximum operating frequency of some FPGA circuits is determined not by the logic cells but by the high load due to the interconnections. Furthermore, FPGAs commonly have large area and high power consumption.

In this work we present a quaternary look-up table (QLUT) that can be applied in an FPGA context. We use only simple voltage-mode structures and a clock boosting technique to increase the speed of the switches, with the whole circuit implemented in a standard CMOS technology. Results show that this solution overcomes the drawbacks of previous techniques.

This paper is organized as follows. Section II presents the guidelines for the circuit design, the circuit implementations and the test bench development. In Section III we show and discuss the experimental results. Finally, Section IV presents brief conclusions of this work.

#### II. VOLTAGE-MODE QUATERNARY LOOK-UP TABLE

The proposed implementation comes as a follow-up work to the look-up table (LUT) design presented in [4]. Some limitations were identified in this previous design and this new proposal intends to overcome them.

## A. Improved Quaternary Look-up Table

For this type of circuit to be applied in FPGAs, it must have enough drive capability to charge a capacitive load of up to 14 pF within a predefined time [5]. The previous implementation was designed to work at 100 MHz with an output load capacitance of 2 pF, which limits its application in an FPGA context. To overcome this limitation, our new proposal targets the same operating frequency with an output load capacitance of 10 pF, which covers most cases of interconnect load found on FPGA nodes.

Look-up tables can be seen as memories, usually composed of numerous switches actuated by control signals generated in a decoder. Each function is implemented by changing the configuration of voltage levels applied to the inputs of these switches.

The switches are usually implemented with transmission gates based on NMOS and PMOS transistors. For a MOS transistor in the triode region, the resistance  $r_{ds}$  is given by (1).

$$r_{\rm ds} \simeq \frac{1}{\mu_{\rm n,p} C_{\rm ox} \frac{W}{L} (v_{\rm gs} - V_{\rm th})},\tag{1}$$

where $\mu_{n,p}$ is the carrier mobility, $C_{ox}$ is the oxide capacitance, W and L are, respectively, the width and length of the transistor, $v_{gs}$ is the gate-source voltage and $V_{th}$ is the threshold voltage. For large signals, (1) exhibits non-linear behavior; however, it gives a first-order guideline for dimensioning the switches.

Taking the load capacitance of 10 pF as a requirement, one solution to decrease the charging time is to reduce the resistance on the signal path, which according to (1) can be obtained by using wider transistors (larger $W/L$). However, there is a trade-off: the switch capacitance is proportional to $W \times L$, so larger transistors require more power to control them. This trade-off leads to the following reasoning on the switch design.
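As a rough illustration of this trade-off, the sketch below evaluates (1) for a few switch widths. The process values (`MU_N_COX`, `V_TH`, `L_MIN`) and the gate voltage are placeholder assumptions chosen only for illustration; they are not parameters of the technology used in this work.

```python
# First-order triode resistance from (1) versus switch width.
# All numeric values below are illustrative assumptions, not process data.

MU_N_COX = 300e-6   # assumed mu_n * C_ox, in A/V^2
V_TH = 0.35         # assumed NMOS threshold voltage, in V
L_MIN = 0.13e-6     # assumed channel length, in m


def r_ds(width, v_gs, length=L_MIN, mu_cox=MU_N_COX, v_th=V_TH):
    """Return the triode-region resistance of (1), in ohms."""
    overdrive = v_gs - v_th
    if overdrive <= 0:
        raise ValueError("transistor is off: v_gs must exceed V_th")
    return 1.0 / (mu_cox * (width / length) * overdrive)


def gate_area(width, length=L_MIN):
    """Gate area W x L, to which the switch capacitance is proportional."""
    return width * length


if __name__ == "__main__":
    for width_um in (1, 2, 4, 8):
        width = width_um * 1e-6
        print(f"W = {width_um} um: r_ds = {r_ds(width, 1.2):7.1f} ohm, "
              f"W*L = {gate_area(width) * 1e12:.2f} um^2")
```

In this model, doubling W halves $r_{ds}$ while doubling the gate area that has to be charged, which is the tension the design choices below try to resolve.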

- 1) With two quaternary input variables in the LUT, the natural solution would be to have two switches in series on the signal path. We have decided to use a single switch, transferring the added complexity to the decoder. This overhead is largely compensated in terms of power consumption, because we assume that most of the power is used to charge the parasitic capacitances, which are largely dominant.
- 2) We have decided to use an NMOS transistor because of its higher mobility ($\mu_n > \mu_p$), so that, for equally sized transistors, a lower resistance than with a PMOS can be achieved. This choice also reduces area, as it does not require a complementary control signal. In addition, since a smaller transistor is used as the switch, less energy is needed to turn it ON and OFF.
- 3) However, a transmission gate implemented solely with an NMOS transistor does not have a full-swing dynamic range. To overcome this limitation, a clock boosting technique is used, which allows full swing and further reduces the switch ON resistance by increasing $v_{gs}$ in (1). The area and power consumption of the clock boosting circuit [6], depicted in Fig. 1(a), are small, because it uses only minimum-dimension transistors and a small capacitor. The capacitor is fully charged only at power-up; from then on, the only recharge required is due to charge sharing between the capacitor and the switch transistor, so it should not be counted as a power consumption overhead of the clock boosting circuit.
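The effect of boosting described in item 3) can be sketched with the usual first-order pass-transistor model, in which an NMOS switch passes an input only up to about one threshold below its gate voltage. The threshold value and the boosted gate level (`2 * V_DD`, an idealized figure) are assumptions for illustration; the actual boosted level depends on charge sharing in the circuit of Fig. 1(a), and body effect is ignored.

```python
# First-order model of an NMOS pass switch, with and without a boosted gate.
# V_TH and V_BOOST are illustrative assumptions, not measured values.

V_DD = 1.2                        # supply voltage used in this work, in V
V_TH = 0.35                       # assumed NMOS threshold voltage, in V
V_BOOST = 2 * V_DD                # assumed (idealized) boosted gate level, in V
LEVELS = (0.0, 0.44, 0.71, 1.2)   # quaternary levels 0_4..3_4, in V


def passed_voltage(v_in, v_gate, v_th=V_TH):
    """Highest voltage an NMOS switch can pass: min(v_in, v_gate - v_th)."""
    return min(v_in, v_gate - v_th)


def overdrive(v_in, v_gate, v_th=V_TH):
    """Overdrive v_gs - V_th with the source at v_in; (1) scales as 1/overdrive."""
    return v_gate - v_in - v_th


if __name__ == "__main__":
    for v_in in LEVELS:
        plain = passed_voltage(v_in, V_DD)
        boosted = passed_voltage(v_in, V_BOOST)
        print(f"in = {v_in:.2f} V: plain gate -> {plain:.2f} V, "
              f"boosted gate -> {boosted:.2f} V, "
              f"boosted overdrive = {overdrive(v_in, V_BOOST):.2f} V")
```

Under these assumptions the unboosted switch clips the top level below $V_{DD}$, whereas the boosted gate passes all four levels and keeps a positive overdrive even at the highest one.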

Considering the model depicted in Fig. 1(b) it is possible to obtain the required resistance value by using (2).

$$v_{\rm out} = v_{\rm in} \left( 1 - e^{-\frac{t}{R_{\rm on}C}} \right) \tag{2}$$

Following these guidelines, a QLUT was developed; we expect savings in power consumption and area while maintaining the operating frequency.



Fig. 2. Half adder simulation results at 100 MHz.

## B. QLUT Half Adder

To convert the quaternary input signals $(0_4, 1_4, 2_4, 3_4)$ to binary $(0_2, 1_2)$, self-referenced comparators, described in [4], are used. Fig. 1(c) shows these comparators, CP, CI and CN, which have the reference values $1/6\,V_{DD}$, $3/6\,V_{DD}$ and $5/6\,V_{DD}$, respectively. Using two decoders to convert the two quaternary input signals to binary, the sixteen control signals are implemented in combinational logic according to the coding presented in Table I. All the gates are full-custom designed in standard CMOS with minimum-dimension transistors in order to achieve high speed with reduced overhead in area and consumption.

The switch is designed so that the rise time is 15% of the period at the target operating frequency (100 MHz).
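To give a feel for what this requirement implies, the sketch below solves (2) for the largest ON resistance that meets the rise-time target with the 10 pF load. The 10%-90% definition of rise time is an assumption, since the text does not state which definition is used; a different definition changes the logarithmic factor.

```python
import math

# Required switch ON resistance from (2), for a rise time of 15% of the period.
# The 10%-90% rise-time definition is an assumption; the text does not state it.

F_CLK = 100e6          # target operating frequency, in Hz
C_LOAD = 10e-12        # output load capacitance, in F
RISE_FRACTION = 0.15   # rise time as a fraction of the clock period


def time_to_fraction(fraction, r_on, c_load):
    """Time for v_out / v_in in (2) to reach `fraction` (0 < fraction < 1)."""
    return -r_on * c_load * math.log(1.0 - fraction)


def max_r_on(f_clk=F_CLK, c_load=C_LOAD, rise_fraction=RISE_FRACTION,
             low=0.1, high=0.9):
    """Largest R_on whose low-to-high rise time meets the target."""
    t_rise = rise_fraction / f_clk
    # From (2): t_high - t_low = R_on * C * ln((1 - low) / (1 - high)).
    return t_rise / (c_load * math.log((1.0 - low) / (1.0 - high)))


if __name__ == "__main__":
    r = max_r_on()
    t_rise = time_to_fraction(0.9, r, C_LOAD) - time_to_fraction(0.1, r, C_LOAD)
    print(f"R_on <= {r:.1f} ohm gives a rise time of {t_rise * 1e9:.2f} ns")
```

With these assumptions the target of 1.5 ns corresponds to an ON resistance of roughly 68 Ω.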

With this QLUT a half adder is implemented by using the voltage reference configuration presented in Table II and the topology in Fig. 1(d), leading to the simulation results shown in Fig. 2.
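The decoding and table look-up described in this subsection can be summarized by the behavioural sketch below. It is only a functional model: the switch index order (`4 * a + b`) and the table contents are assumptions standing in for Tables I and II, and the function names are illustrative.

```python
# Behavioural sketch of the QLUT half adder: comparator decoding, one-hot
# switch selection and a 16-entry table. The index order (4 * a + b) and the
# table contents are assumptions standing in for Tables I and II.

V_DD = 1.2
# Comparator references for CP, CI and CN, as fractions of V_DD.
REFERENCES = (1 / 6, 3 / 6, 5 / 6)


def decode_level(voltage, v_dd=V_DD):
    """Quaternary level (0..3) = number of comparator references exceeded."""
    return sum(voltage > ref * v_dd for ref in REFERENCES)


def control_signals(a, b):
    """One-hot list of the sixteen switch controls; exactly one switch is ON."""
    return [int(index == 4 * a + b) for index in range(16)]


# Assumed LUT configurations: entry 4 * a + b holds the output level.
SUM_LUT = [(a + b) % 4 for a in range(4) for b in range(4)]
CARRY_LUT = [(a + b) // 4 for a in range(4) for b in range(4)]


def qlut(config, v_a, v_b):
    """Output the configured level selected by the single ON switch."""
    select = control_signals(decode_level(v_a), decode_level(v_b))
    return sum(level * on for level, on in zip(config, select))


if __name__ == "__main__":
    # 3_4 + 2_4 = 11_4, using the input levels 1.2 V and 0.71 V.
    print(qlut(SUM_LUT, 1.2, 0.71), qlut(CARRY_LUT, 1.2, 0.71))
```

Because only one of the sixteen switches conducts at a time, the signal path contains a single switch, at the cost of a more complex decoder.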

To evaluate the performance, the rise and fall times are measured for every distinct transition. The results are presented in Table III, and the measured power consumption was 33.8 $\mu$W.

Fig. 3. QLUT based full adder.

In [4] we presented results for two circuits with similar functionality, one binary, consuming 45 $\mu$W, and one quaternary, consuming 35 $\mu$W, when driving a capacitive load of 0.2 pF; these circuits were limited to a maximum load of 2 pF. The proposed implementation has about the same power consumption while working at 100 MHz; however, it is able to drive loads of 10 pF, which reveals gains in terms of functionality and applicability to FPGA circuits.

To attest the robustness of this QLUT to process and mismatch variations, the circuit was simulated successfully for 500 Monte Carlo runs.

## C. QLUT Full Adder

To design a full adder, the QLUT presented in Section II-B has to be modified to include the carry-in input. The proposed full adder is illustrated in Fig. 3. It makes use of two QLUTs with small modifications and a two-to-one multiplexer, which also uses the developed switches.

Table IV shows the full adder logic function, from which we can observe that the results for *Sum* and $C_{OUT}$ can be obtained by performing shifts according to the $C_{IN}$ value. Our solution uses two QLUTs configured with the *Sum* and $C_{OUT}$ results for $C_{IN}$ equal to $0_4$. When $C_{IN}$ is $1_4$, the results are obtained by performing shifts on the input variables, $Q_A$ or $Q_B$. This solution saves two further QLUTs, which would be required to store the configuration for $C_{IN}$ equal to $1_4$, and another multiplexer to select the correct output in each situation.

The logic required to perform the shifts and to decode $C_{\rm IN}$ is added to the QLUT. These blocks are implemented with standard CMOS logic gates, with minimum-dimension transistors, to avoid penalizing area, consumption and delay.

TABLE IV. Full adder logic function.

| $Q_{\rm A}$ | $Q_{\rm B}$ | $C_{\rm IN}$ | Sum   | $C_{\rm OUT}$ | $Q_{\rm A}$ | $Q_{\rm B}$ | $C_{\rm IN}$ | Sum   | $C_{\rm OUT}$ |
|-------------|-------------|--------------|-------|---------------|-------------|-------------|--------------|-------|---------------|
| $0_4$       | $0_4$       | $0_4$        | $0_4$ | $0_4$         | $0_4$       | $0_4$       | $1_4$        | $1_4$ | $0_4$         |
| $0_4$       | $1_4$       | $0_4$        | $1_4$ | $0_4$         | $0_4$       | $1_4$       | $1_4$        | $2_4$ | $0_4$         |
| $0_4$       | $2_4$       | $0_4$        | $2_4$ | $0_4$         | $0_4$       | $2_4$       | $1_4$        | $3_4$ | $0_4$         |
| $0_4$       | $3_4$       | $0_4$        | $3_4$ | $0_4$         | $0_4$       | $3_4$       | $1_4$        | $0_4$ | $1_4$         |
| $1_4$       | $0_4$       | $0_4$        | $1_4$ | $0_4$         | $1_4$       | $0_4$       | $1_4$        | $2_4$ | $0_4$         |
| $1_4$       | $1_4$       | $0_4$        | $2_4$ | $0_4$         | $1_4$       | $1_4$       | $1_4$        | $3_4$ | $0_4$         |
| $1_4$       | $2_4$       | $0_4$        | $3_4$ | $0_4$         | $1_4$       | $2_4$       | $1_4$        | $0_4$ | $1_4$         |
| $1_4$       | $3_4$       | $0_4$        | $0_4$ | $1_4$         | $1_4$       | $3_4$       | $1_4$        | $1_4$ | $1_4$         |
| $2_4$       | $0_4$       | $0_4$        | $2_4$ | $0_4$         | $2_4$       | $0_4$       | $1_4$        | $3_4$ | $0_4$         |
| $2_4$       | $1_4$       | $0_4$        | $3_4$ | $0_4$         | $2_4$       | $1_4$       | $1_4$        | $0_4$ | $1_4$         |
| $2_4$       | $2_4$       | $0_4$        | $0_4$ | $1_4$         | $2_4$       | $2_4$       | $1_4$        | $1_4$ | $1_4$         |
| $2_4$       | $3_4$       | $0_4$        | $1_4$ | $1_4$         | $2_4$       | $3_4$       | $1_4$        | $2_4$ | $1_4$         |
| $3_4$       | $0_4$       | $0_4$        | $3_4$ | $0_4$         | $3_4$       | $0_4$       | $1_4$        | $0_4$ | $1_4$         |
| $3_4$       | $1_4$       | $0_4$        | $0_4$ | $1_4$         | $3_4$       | $1_4$       | $1_4$        | $1_4$ | $1_4$         |
| $3_4$       | $2_4$       | $0_4$        | $1_4$ | $1_4$         | $3_4$       | $2_4$       | $1_4$        | $2_4$ | $1_4$         |
| $3_4$       | $3_4$       | $0_4$        | $2_4$ | $1_4$         | $3_4$       | $3_4$       | $1_4$        | $3_4$ | $1_4$         |

TABLE V. Full adder time analysis results at 100 MHz.

| Logic level transition        | Rise Time [ns] | Fall Time [ns] |
|-------------------------------|----------------|----------------|
| $0_4 - 1_4$ (0 - 0.44 V)      | 1.19           | 1.14           |
| $0_4 - 2_4$ (0 - 0.71 V)      | 1.03           | 0.79           |
| $1_4 - 2_4$ (0.44 - 0.71 V)   | 1.57           | 0.96           |
| $0_4 - 3_4$ (0 - 1.2 V)       | 0.80           | 0.76           |
| $1_4 - 3_4$ (0.44 - 1.2 V)    | 1.54           | 0.99           |
| $2_4 - 3_4$ (0.71 - 1.2 V)    | 1.55           | 1.04           |

The shift block is implemented for one of the inputs and performs a single up shift on the quaternary value. Using the two QLUTs configured with the results when $C_{IN}$ is $0_4$, as presented in Table IV, it is seen that the results for $C_{IN}$ at logic level $1_4$ can be obtained, for the *Sum*, by shifting $Q_B$ and, for $C_{OUT}$, by shifting $Q_A$. The one exception occurs when $Q_A$ has the level $3_4$, which is handled by combining $C_{IN}$ and $Q_A$ to control the multiplexer (MUX).
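The shift scheme can be checked exhaustively against ordinary quaternary addition, as in the sketch below. The tables hold only the $C_{IN} = 0_4$ results. How the MUX resolves the $Q_A = 3_4$ exception is modelled here by simply forcing $C_{OUT}$ to $1_4$, which is an assumption about its behaviour; the function names are illustrative.

```python
# Exhaustive check of the shift-based full adder against plain addition.
# The LUTs hold only the C_IN = 0_4 results; function names are illustrative.

SUM_LUT = [[(a + b) % 4 for b in range(4)] for a in range(4)]
COUT_LUT = [[(a + b) // 4 for b in range(4)] for a in range(4)]


def shift_up(q):
    """Single up shift of a quaternary value, wrapping 3_4 to 0_4."""
    return (q + 1) % 4


def full_adder(q_a, q_b, c_in):
    """Return (Sum, C_OUT) using two QLUTs, the shift blocks and the MUX."""
    if c_in == 0:
        return SUM_LUT[q_a][q_b], COUT_LUT[q_a][q_b]
    total = SUM_LUT[q_a][shift_up(q_b)]          # Sum: shift Q_B
    if q_a == 3:
        carry = 1                                # exception: assumed MUX output 1_4
    else:
        carry = COUT_LUT[shift_up(q_a)][q_b]     # C_OUT: shift Q_A
    return total, carry


def reference(q_a, q_b, c_in):
    """Plain quaternary addition used as the golden model."""
    total = q_a + q_b + c_in
    return total % 4, total // 4


if __name__ == "__main__":
    mismatches = [(a, b, c) for a in range(4) for b in range(4) for c in (0, 1)
                  if full_adder(a, b, c) != reference(a, b, c)]
    print("mismatches:", mismatches)
```

The exception arises because shifting $Q_A = 3_4$ wraps it to $0_4$, for which the stored $C_{OUT}$ is always $0_4$, while the true carry with $C_{IN} = 1_4$ is always $1_4$.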

In Fig. 4 it is shown that the implementation works as expected. This can be further confirmed with the results presented in Table V, which are the worst cases found on the outputs *Sum* or $C_{OUT}$. Compared with those of the previous QLUT (Table III), the only difference is in the transition from level $0_4$ to $1_4$; this is due to the chain of NMOS transmission gates used to pass the $C_{OUT}$ signal, one from the QLUT and the other from the MUX.

The power consumption of this modified QLUT is 36.5 $\mu$W, an overhead of 8% in consumption compared with the QLUT presented in Section II-B.

The results presented in Table VI, corresponding to the power measured on $V_{DD}$, confirm that this approach, with minimal modifications to the QLUT design, is feasible and has advantages in power consumption. This full adder design saves power when compared with the direct implementation with four QLUTs, which would consume at least 135.2 $\mu$W (4 × 33.8 $\mu$W) and would occupy a larger area.
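For reference, the two figures quoted above follow directly from the simulated QLUT powers; the arithmetic is spelled out here using only values stated in the text.

$$\frac{36.5 - 33.8}{33.8} \approx 0.080 \;\; (8\%), \qquad 4 \times 33.8\ \mu\text{W} = 135.2\ \mu\text{W}.$$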

## D. Test Bench Development

To validate the proposed circuit, a layout was designed in UMC 130 nm CMOS technology. The main concern is the active area of the QLUT, since this module is meant to be replicated millions of times on an FPGA. The active die area of the full adder is $108 \times 99 \ \mu\text{m}^2$. Fig. 5(a) shows a photograph of the fabricated circuit under evaluation.

The designed full adder is wire bonded directly on a preliminary test printed circuit board (PCB). The test board is developed to provide the supply voltage (1.2 V), the reference voltages (1.2, 0.71 and 0.44 V) and the three quaternary input signals ($Q_A$, $Q_B$ and $C_{IN}$).

Fig. 4. Full adder simulation result at 100 MHz: (a) inputs and (b) outputs.

All the signals required for testing are generated on an FPGA, with some added resistor elements. The quaternary reference voltages are obtained using variable resistors that can be adjusted to the desired values. The custom high-frequency digital signals are generated directly in the FPGA, which makes it easy to change them simply by reprogramming. Since the FPGA outputs are binary valued, we applied them to a resistor network to produce the custom quaternary inputs suited for the test. A photograph of the test environment is shown in Fig. 5(b).

#### III. EXPERIMENTAL RESULTS

Using the described test bench, we are able to perform preliminary tests on the circuit.

For better observation, we chose to present a test at half the frequency, with the combination of three quaternary signals, $Q_A@50$ MHz, $Q_B@25$ MHz and $C_{IN}@12.25$ MHz, shown in Fig. 6.

The power consumption measured on $V_{\rm DD}$, for both simulation and experiment, is presented in Table VI for two different frequencies (50 and 100 MHz). Comparing the simulated and experimental values, we can conclude that the results are of the same order of magnitude, the difference being due to a preliminary test setup with large parasitics.

These results show that the binary to quaternary circuit is feasible and efficient in terms of power consumption while being implemented in a standard CMOS technology with a single type of NMOS and PMOS transistors and a single supply voltage. Furthermore, the use of the clock boosting technique proved to be more efficient than the use of regular transmission gates when higher loads must be charged.

#### IV. CONCLUSION

In this work we have reported the design of an innovative quaternary look-up table that makes use of a clock boosting technique, together with experimental results obtained on an actual ASIC implementation. This circuit can be used in combinational logic or as a building block in FPGAs. Simulation and experimental results have shown advantages of this quaternary implementation over previous equivalent circuits, regarding power consumption and operating frequency with a load of 10 pF. The proposed design is a valid solution to reduce the impact of interconnections without increasing power consumption or losing functionality. The obtained results attest to the feasibility and favorable characteristics of the design. We are currently working on improving the test board in order to improve the consumption measurements; this is also expected to allow better visualization of the circuit output signals at higher frequencies.

Fig. 5. (a) Die and (b) test bench overview photograph.

Fig. 6. Full adder experimental result at 50 MHz: (a) inputs and (b) outputs.

#### ACKNOWLEDGMENT

This work was supported by national funds through FCT-Fundação para a Ciência e Tecnologia, under projects Pest-OE/EEI/LA0021/2011 and QCell (EXPL/EEI-ELC/1016/2012).

#### References

- [1] S. Hurst, "Multiple-Valued Logic: Its Status and Its Future," *IEEE Transactions on Computers*, vol. C-33, no. 12, 1984.
- [2] K. Current, "Current-mode CMOS multiple-valued logic circuits," *IEEE Journal of Solid-State Circuits*, vol. 29, no. 2, pp. 95–107, Feb. 1994.
- [3] R. da Silva, H. Boudinov, and L. Carro, "A novel voltage-mode CMOS quaternary logic design," *IEEE Transactions on Electron Devices*, vol. 53, no. 6, pp. 1480–1483, Jun. 2006.
- [4] C. Lazzari, J. Fernandes, P. Flores, and J. Monteiro, "An Efficient Low Power Multiple-Value Look-Up Table Targeting Quaternary FPGAs," in *Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation*, ser. Lecture Notes in Computer Science, R. van Leuken and G. Sicard, Eds., 2011, vol. 6448, pp. 84–93.
- [5] J. H. Anderson and F. N. Najm, "Power Estimation Techniques for FPGAs," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 12, no. 10, pp. 1015–1027, 2004.
- [6] T. G. Rabuske, C. R. Rodrigues, and S. Nooshabadi, "A 5MSps 8-bit SAR ADC with single-ended or differential input," *Microelectronics Journal*, vol. 43, no. 10, pp. 680–686, 2012.