Vol. No.4, Issue No. 06, June 2016 www.ijates.com ISSN 2348 - 7550

# ENDOCRINE: A NEW METHODOLOGY FOR SELF HEALING ADVANCED DIGITAL SYSTEMS

## Prajeesh.P<sup>1</sup>, Jasmin Basheer<sup>2</sup>

<sup>1</sup>Department of ECE, Sree Buddha College of Engineering (India)

<sup>2</sup>Department of ECE, Sree Buddha College of Engineering (India)

#### **ABSTRACT**

Self-repairing digital systems have recently emerged as one of the most promising alternatives for fault-tolerant systems. However, such systems are still impractical in many cases, particularly because of the complex rerouting process involved once the system turns faulty. Digital systems also lose efficiency as circuit size increases, because of the extra hardware. Within embryonics there is a concept called the endocrine cellular communication system, which is the idea behind the proposed work. This work implements the endocrine concept in digital circuits to achieve the best possible fault tolerance. In the proposed design, each working unit in the digital circuit is surrounded by two spare units, and if any working unit becomes faulty, it is replaced with a spare unit. The spare unit is selected according to a priority set by the system itself. The goal of this work is therefore to overcome these challenges and implement a platform model with fault tolerance techniques. The intended platform model would be used in very large-scale integrated circuits capable of self-repair and self-replication.

Keywords: Rerouting, Embryonic, Endocrine, Fault tolerance, Self-repair, Self-replication

#### I. INTRODUCTION

In the electronics industry, the reliability and quality of an embedded product are critical. If a product does not work properly, its standing in the market declines on both counts. Self-healing digital systems have recently emerged as a technique for building fault-tolerant systems, but such systems are impractical in many cases because of hardware size and context-switching problems.

In the electronic hardware industry, the complexity of designing VLSI (Very Large Scale Integration) based digital systems increases day by day, and systems become more susceptible to faults, especially transient faults. In designs of this kind it is very hard to detect and tolerate faults, so the only way to improve the reliability and quality of the system is to build in fault-tolerant mechanisms that recover from such faults. High-quality verification and testing is a vital step in the design of a successful embedded microprocessor product. Designers must verify the correctness of large, complex systems and ensure that manufactured parts work reliably in varied operating conditions. If they succeed, users will trust that when the processor is put to a task it will render correct results. If they do not, the design falters, often with serious repercussions ranging from bad press, to financial damage, to loss of life.

Gong et al. [1] propose two approaches to fault tolerance that reduce the inter-core communication bandwidth demand: Dual Core Redundancy (DCR) and Triple Core Redundancy (TCR). In DCR, only store instructions are compared before commit, so the bandwidth demand can be largely reduced, and fault recovery is achieved by context saving and recovery. TCR applies Triple Modular Redundancy (TMR) at the core level to exploit the core resources of CMPs efficiently for transient fault masking. In TCR, only the results of store instructions are compared to detect transient faults, which again reduces the inter-core communication bandwidth demand. Once a Single Event Upset (SEU) is detected, TCR can be reconfigured to execute with the two uncorrupted cores for fault detection. However, these methods have several problems.
One of them is that the module is so large that a large part of the circuit must be replaced even if only a small part of the module is malfunctioning.

The human biological system offers a concept called *Endocrine*, which is the starting point of this work. One of the special features of the endocrine system is that if a cell is damaged, it is replaced by an adjacent cell, and communication between the cells in the human body is thereby re-established.

The main motivation for the proposed work is to overcome the challenges faced by customers. Embedded system manufacturers do not provide chip-level service when a system is damaged. They replace the entire PCB instead of replacing the particular faulty IC, because replacing an IC with more than 100 port pins is an extremely difficult task. We expect this new approach to improve the performance of digital electronic circuits and reduce the size of the hardware without any loss of quality.

#### II. OVERVIEW OF THE PROPOSED SYSTEM

An ALU is designed on the basis of the endocrine concept. The proposed system consists of two layers, viz. the functional layer and the supervisory layer. As illustrated in Fig. 1, the functional layer comprises working cell units (WC) and stem cell units (SC), and the supervisory layer consists of index changing units (ICU). Each working cell unit in the functional layer performs one specific function of the ALU (addition, subtraction, multiplication or shifting), while a stem cell unit is capable of performing all of these functions. The working cell unit is selected by the address selection bits. The structure of every cell is identical. The only difference between cells is the code within the cell (called the genome). Every WC unit has two neighboring SC units, and if a fault occurs, the faulty WC can be replaced by any available SC.

In the supervisory layer, each index changing unit (ICU) takes charge of one WC and its two neighboring SCs in the functional layer. To avoid collisions, it chooses the proper candidate SC for the faulty WC. In the layer structure, ICU 1 (yellow) controls the operation of WC 1 and its two surrounding stem cells, SC1 and SC2 (yellow in the functional layer). SC2 in the functional layer is common to ICU 1 and ICU 2 and is therefore shown in brown in the functional layer. The supervisory layer also contains the brown colour, which means that both ICU 1 and ICU 2 can control the operation of the SC2 stem cell. The intermediate colours between the stem cells and working cells indicate this sharing of cell operation.

#### III. FUNCTIONAL LAYER ARCHITECTURE

#### 3.1 Working cell unit architecture

As illustrated in Fig. 2, the working cell consists of a functional unit and a fault detection unit. The functional unit, as the name suggests, performs one of the predefined ALU operations: addition, subtraction, multiplication or shifting.

#### 3.1.1 Fault detection unit

CUT denotes the circuit under test (that is, the ALU functional unit). The CUT takes A, B and test\_sig as inputs, together with inputs from ROM1. It produces two outputs, one to the fault detector (FD) and one to the fault signal checker (FSC). The FSC takes its inputs from the CUT and from the FD, and produces the actual WC output as well as the fault signal. ROM1 is a two-dimensional memory holding predefined 8-bit input values generated by an LFSR. The fault detector is a 16-bit comparator that compares the output of the CUT with ROM2. ROM2 is a one-dimensional memory holding the predefined correct outputs for the input values stored in ROM1 (the values stored in ROM2 depend on the WC). The FSC checks whether the output of the FD is high.
If it is high, the FSC raises a fault signal and isolates the WC output.

#### 3.1.2 Fault detection unit algorithm

The step-by-step algorithm is given below.

1. If test\_sig = 1 (advance mode of operation):
2. The CUT accepts input only from ROM1.
3. The result of the CUT operation is fed to the FD, and another output is fed to the FSC.
4. The CUT output is compared with the data from the ROM2 buffer for fault detection.
5. If they are unequal, a fault signal is generated.
6. In the FSC, if fault signal = 1, all outputs from the working cell are tri-stated and a fault signal is propagated to the ICU.
7. If test\_sig = 0 (normal mode of operation):
8. The CUT accepts only the current ALU inputs and generates the output.
9. During the initial stage, test\_sig is always equal to 1 for a period of time.

Fig. 1. Layer structure of the architecture

#### 3.2 Stem cell unit architecture

The architecture of a stem cell is similar to that of a working cell, with the difference that an SC performs all of the functions (addition, subtraction, multiplication and shifting), whereas a WC performs only one of them.

Fig. 2. Working cell

#### 3.3 Routing architecture

To realize the routing architecture of the proposed system, the WCs and SCs are arranged as shown in Fig. 4. Fig. 4 makes clear that each SC is common to two WCs and each WC is common to two SCs. When a WC becomes faulty, it generates a fault signal that is fed to the ICU. The ICU finds an available free stem cell for replacement by using the control signal from the supervisory layer.

Fig. 3. Fault detection unit

#### IV. SUPERVISORY LAYER ARCHITECTURE

In the proposed system, a WC can be replaced by either of its two neighboring SCs for fault tolerance. The supervisory layer must therefore control the functional layer properly, without any collision.
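
The neighbor relationship described in the routing architecture can be made concrete with a short sketch. It assumes, which the paper shows only in Fig. 4, that four WCs and four SCs alternate around a closed ring and are numbered from zero. The function names are hypothetical and serve only to illustrate how one SC is shared by two WCs.

```python
from typing import Tuple

# Assumption: four WCs (one per ALU operation) alternate with four SCs
# around a closed ring. The cell count and 0-based numbering are illustrative.
N_CELLS = 4


def neighbor_stem_cells(wc: int, n: int = N_CELLS) -> Tuple[int, int]:
    """Return the (left, right) stem cell indices adjacent to working cell `wc`."""
    return wc % n, (wc + 1) % n


def sharing_working_cells(sc: int, n: int = N_CELLS) -> Tuple[int, int]:
    """Return the two working cells that can both claim stem cell `sc`."""
    return (sc - 1) % n, sc % n


if __name__ == "__main__":
    for wc in range(N_CELLS):
        left, right = neighbor_stem_cells(wc)
        print(f"WC{wc}: left SC{left}, right SC{right}")
```

Because every SC appears in the neighbor list of two WCs, two ICUs may request the same SC, which is why the supervisory layer has to arbitrate.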

Each ICU in the supervisory layer takes charge of one WC and its two neighboring SCs. The proposed system contains a register bank that stores the status of all the stem cells in the functional layer. The cell replacement mechanism operates on the basis of this register bank, shown in Fig. 4 (these registers are not for general-purpose use). Every ICU receives fault signals from its two neighboring SCs as well as from its WC. When an SC or WC is faulty, the ICU changes the index bits of the corresponding SC or WC in the register bank. Because the ICU controls the two neighboring SCs, it can isolate a WC and replace it with an SC. Every WC and SC has its own index bits in the register bank, and they show the status of each cell in the top layer of the architecture. The register bank holds three types of index bit: the state bit, the direction bit and the differentiation bit. At system start-up, all state bits are cleared to zero. The state bit shows whether the stem cell is available for replacement. The direction bit represents the direction of the stem cell (zero represents the left side of the WC and one represents the right side of the WC). The differentiation bit is used for isolation.

Fig. 4. Routing architecture and register bank

#### 4.1 Index changing unit

The working of the ICU can be explained using Fig. 4 and Table 1. In the table, W represents the working cell unit, and LS and RS are the left and right stem cells. The responsibility of the ICU is to change the index bits of the two SCs and the WC. Cell replacement follows a counterclockwise priority order, implemented by enabling the selection bit of the SCs. Table 1 lists all the possible changes of index bits in W, LS and RS after a fault signal is received from the WC and SCs.
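
The index-bit updates summarized in Table 1 can be sketched in software as follows. This is an illustrative model only, not the authors' hardware description. The `StemIndex` class and `icu_update` function are hypothetical names, and marking a selected SC as occupied (state bit set to 1) is an assumption drawn from the description of the state bit.

```python
from dataclasses import dataclass


@dataclass
class StemIndex:
    """Index bits of one stem cell in the register bank (illustrative model)."""
    state: int = 0            # 0 = available, 1 = occupied or faulty
    direction: int = 0        # 0 = left of the WC, 1 = right of the WC
    differentiation: int = 0  # set to 1 in Table 1 for the SC that is selected


def icu_update(fault_w, fault_ls, fault_rs, ls, rs):
    """Pick a replacement for a faulty WC, trying the left SC first.

    Returns "LS", "RS", "CORRECT" (no SC free, fault correction enabled)
    or None when the WC reports no fault.
    """
    if not fault_w:
        return None
    if not fault_ls and ls.state == 0:
        ls.differentiation = 1
        ls.direction = 0      # initial value already means "left"
        ls.state = 1          # assumption: a selected SC becomes occupied
        return "LS"
    if not fault_rs and rs.state == 0:
        rs.differentiation = 1
        rs.direction = 1
        rs.state = 1          # assumption: a selected SC becomes occupied
        return "RS"
    return "CORRECT"


if __name__ == "__main__":
    left, right = StemIndex(state=1), StemIndex()
    print(icu_update(1, 1, 0, left, right), right)
```

The left-first ordering mirrors the first two rows of Table 1. A different priority rule would only change the order of the two `if` branches.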

In the first row of Table 1, the fault signal of W is one, and the differentiation bit of LS is set to one only if the state bit of LS is zero. A zero state bit indicates that LS is the first available stem cell to replace the faulty W. At this stage the direction bit need not change, because its initial value already represents the left direction. In the second row, the fault signal of W is one and the state bit of LS is also one, which means that W has become faulty and is replaced by RS rather than LS, because LS is occupied by another WC. In this way the algorithm makes the system skip faulty cells. The system also contains a fault correction unit, which performs fault correction through multiple EXOR operations.

#### V. FAULT CORRECTION UNIT

Four working cells are developed, one for each ALU operation. The input is given to the working cells as well as to the stem cells, and faults are detected by the fault detection unit. The position of a faulty bit can be located by XORing the wrong WC output with the expected correct output, and the output can be recovered by XORing the wrong output again, that is, by performing multiple EXOR operations. For example, if a WC becomes faulty and is replaced with an SC, the SC now behaves as the WC and gives the correct output. The faulty WC output and the correct SC output are then XORed. If the result is zero, no error is detected (this is one of the simplest methods of fault detection). If the result is not zero, the error can be located from the positions of the 1 bits. For this purpose the proposed system uses a bit locator circuit, which gives a count value representing the number of faulty bits.
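
A minimal sketch of the XOR-based location and recovery described above is given here. The function names are hypothetical, and the 16-bit width is assumed from the 16-bit comparator used in the fault detector.

```python
from typing import List, Tuple


def locate_faulty_bits(faulty_out: int, correct_out: int,
                       width: int = 16) -> Tuple[int, List[int], int]:
    """XOR the faulty WC output with the correct SC output.

    Returns the difference mask, the positions of the 1 bits (bit 0 is the
    least significant bit) and the number of faulty bits.
    """
    diff = (faulty_out ^ correct_out) & ((1 << width) - 1)
    positions = [i for i in range(width) if (diff >> i) & 1]
    return diff, positions, len(positions)


def correct_output(faulty_out: int, diff: int) -> int:
    """A second XOR with the difference mask flips the faulty bits back."""
    return faulty_out ^ diff


if __name__ == "__main__":
    mask, where, count = locate_faulty_bits(0b1010, 0b1000)
    print(bin(mask), where, count, bin(correct_output(0b1010, mask)))
```

A zero mask means the two outputs agree, which corresponds to the "no error detected" case in the text.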

In Table 1, the fault signals are the condition for the change, the state bits give the state before the fault, and the differentiation and direction bits give the state after the fault.

| Fault signal W | Fault signal LS | Fault signal RS | State bit W | State bit LS | State bit RS | Differentiation bit W | Differentiation bit LS | Differentiation bit RS | Direction bit W | Direction bit LS | Direction bit RS |
|----|----|----|----|----|----|----|----|----|----|----|----|
| 1 | | | | 0 | | | 1 | | | 0 | |
| 1 | 1 | | | 1 | 0 | | | 1 | | | 1 |
| 1 | | 1 | | 0 | 1 | | 1 | | | 0 | |
| 1 | 1 | 1 | Fault correction enabled | | | | | | | | |

Table 1 ICU OPERATION

#### VI. ALU OPERATION BASED ON INPUT ADDRESS

Every operation performed by the ALU is selected by the 2-bit input address. Each operation of the ALU is assigned to its own working cell. Table 2 shows the address bits, the operation performed by the ALU and the WC selected for the operation.

Table 2 ALU operation based on address bits

| Address | Operation performed | Selected working cell |
|---------|---------------------|-----------------------|
| 00 | Addition | WC1 |
| 01 | Subtraction | WC2 |
| 10 | Multiplication | WC3 |
| 11 | Shifting | WC4 |

#### VII. OVERALL ARCHITECTURE OF THE PROPOSED SYSTEM

The proposed system consists of a 2:4 address decoder that takes 2-bit inputs (00, 01, 10 and 11), as shown in Fig. 5. It also has two 8-bit inputs that are supplied to the ALU. First, when the address decoder takes the input "00", WC1 becomes active and performs the addition operation, because WC1 is configured as the addition module of the ALU. Second, when the input to the address decoder is "01", WC2 becomes active and performs the subtraction operation. In the third and fourth phases, "10" and "11" become the decoder inputs, so WC3 and WC4 become active and perform the multiplication and shifting operations respectively.
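
The address-to-cell mapping of Table 2 can be modeled with a small lookup, sketched below. Several details are assumptions made for illustration and are not specified in the paper: the shift is taken to be a left shift of A by the low three bits of B, results are truncated to 16 bits, and a negative difference wraps as a two's complement value. The names `OPERATIONS` and `alu` are hypothetical.

```python
# Address -> (selected working cell, operation), following Table 2.
# Assumption: "shifting" is a left shift of A by the low three bits of B.
OPERATIONS = {
    0b00: ("WC1", lambda a, b: a + b),
    0b01: ("WC2", lambda a, b: a - b),
    0b10: ("WC3", lambda a, b: a * b),
    0b11: ("WC4", lambda a, b: a << (b & 0x7)),
}


def alu(address: int, a: int, b: int):
    """Decode the 2-bit address and run the matching working cell.

    Inputs are masked to 8 bits and the result to 16 bits (assumed widths
    for the output).
    """
    cell, operation = OPERATIONS[address & 0b11]
    return cell, operation(a & 0xFF, b & 0xFF) & 0xFFFF


if __name__ == "__main__":
    for address in range(4):
        print(format(address, "02b"), alu(address, 12, 3))
```

In hardware the 2:4 decoder would enable exactly one cell. The dictionary lookup plays that role here.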

If WC1 becomes faulty, the FDU becomes active and generates a fault signal, which in turn activates the ICU. In this scenario the faulty WC1 is replaced by either SC1 or SC2, whichever is free. The same method applies to the other WCs.

Fig. 5. Structural view of the system

#### VIII. RESULT ANALYSIS

The proposed system is designed using the Xilinx ISE Design Suite. For fault analysis, one additional working cell, named the faulty cell, is designed in software. In the first phase, the simulation is run with the correct working cell and the output is verified against the expected output. In the second phase, the correct working cell is replaced with the faulty working cell and Test\_sig is enabled. The second-phase output was also verified successfully. Fig. 6 shows the RTL schematic view of the proposed system. In the RTL schematic, fault\_sig\_out is active high only if the entire system becomes faulty.

Fig. 6. RTL schematic of the final system

Fig. 7 shows the simulation output of the adder working cell without any fault, and Fig. 8 shows the adder working cell with the fault corrected by cell replacement. At that moment the ICU is activated (Fig. 9) and enables the right stem cell (the working cell and the left stem cell are treated as faulty cells), as indicated by "enable 01" in Fig. 9.

Fig. 7. Simulation of adder working cell without any fault

Fig. 9 makes clear that when a fault is detected in the adder working cell, the ICU updates the index bits in the register bank for cell replacement. The ICU first checks the state bit of the LS (INDEX\_STATEBIT\_LS in Fig. 9). The simulation window (Fig. 9) shows that the state bit of the LS is one, which means that the LS is busy or has become a faulty stem cell. The ICU changes DIRECTION\_BIT to "01", which enables the right stem cell (RS), and generates an isolation signal for the left stem cell in order to isolate the LS from the system.

Fig. 8 shows the output of the right stem cell, whose enable pin receives the DIRECTION\_BIT value "01" from the ICU. This enables the RS for continuous operation. At the same time, the ICU also generates an isolation signal for the faulty working cell in order to isolate it. All of this happens only if a fault signal is received from the fault detection unit in the functional layer.

Fig. 8. Corrected output

Fig. 9. ICU output

#### IX. CONCLUSION

In this paper, a new self-repairing architecture that provides good scalability and fault coverage was proposed. The architecture is composed of a two-layer structure. The top layer, called the functional layer, consists of functional units such as WCs and SCs. The bottom layer, called the supervisory layer, supervises the overall functioning of the functional layer. The bottom layer consists of ICUs, which are the heart of the system. The ICU controls the proper assignment of an SC for the replacement of a faulty cell. The new architecture has a circular shape, which helps to reduce the hardware size and improves performance. Together, these features make the system efficient.

#### **REFERENCES**

- [1] Rui Gong, Kui Dai, Zhiying Wang, "Transient Fault Tolerance on Chip Multiprocessor based on Dual and Triple Core Redundancy," 14th IEEE Pacific Rim International Symposium on Dependable Computing, vol. 24, no. 6, pp. 22-29, 2008.
- [2] Mohammad Salehi and Alireza Ejlali, "A Hardware Platform for Evaluating Low-Energy Multiprocessor Embedded Systems Based on COTS Devices," *IEEE Trans. on Industrial Electronics*, vol. 62, no. 2, pp. 1262–1269, Feb. 2015.
- [3] C. Ortega and A. Tyrrell, "Design of a basic cell to construct embryonic arrays," *IEE Proc. Comput. Digital Tech.*, vol. 145, no. 3, pp. 242–248, May 1998.
- [4] D. Mange, E. Sanchez, A. Stauffer, G. Tempesti, P. Marchal, and C.
Piguet, "Embryonics: A new methodology for designing field-programmable gate arrays with self-repair and self-replicating properties," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 6, no. 3, pp. 387–399, Sep. 1998.
- [5] P. K. Lala and B. K. Kumar, "An architecture for self-healing digital systems," *J. Electron. Testing: Theory Appl.*, vol. 19, no. 5, pp. 523–535, Oct. 2003.
- [6] X. Zhang, G. Dragffy, A. G. Pipe, N. Gunton, and Q. M. Zhu, "A reconfigurable self-embryonic cell architecture," in *Proc. ERSA*, Jun. 2003, pp. 134–140.
- [7] M. Samie, G. Dragffy, A. Popescu, T. Pipe, and C. Melhuish, "Prokaryotic bio-inspired model for embryonics," in *Proc. NASA/ESA Conf. Adapt. Hardw. Syst.*, Jul.-Aug. 2009, pp. 163–170.
- [8] M. Samie, G. Dragffy, and T. Pipe, "Bio-inspired self-test for evolvable fault tolerant hardware systems," in *Proc. NASA/ESA Conf. Adapt. Hardw. Syst.*, Jun. 2010, pp. 325–332.
- [9] M. Samie, G. Dragffy, and T. Pipe, "UNITRONICS: A novel bio-inspired fault tolerant cellular system," in *Proc. NASA/ESA Conf. Adapt. Hardw. Syst.*, Jun. 2011, pp. 58–65.
- [10] J.-M. Moreno, Y. Thoma, E. Sanchez, O. Torres, and G. Tempesti, "Hardware realization of a bio-inspired POEtic tissue," in *Proc. NASA/DoD Conf. Evolvable Hardw.*, 2004, pp. 237–244.
- [11] Jing Huang, Mehdi Baradaran Tahoori, and Fabrizio Lombardi, "Fault Tolerance of Switch Blocks and Switch Block Arrays in FPGA," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 7, Jul. 2005.