Evaluation of the HDLRuby Hardware Description Language by implementing an 8-bit RISC Processor

HDLRuby is a new hardware description language (HDL) based on Ruby, created to improve the productivity of HW designers. This paper presents a study of the from-scratch implementation in HDLRuby of an 8-bit RISC processor called MEI8. This implementation required little effort, and its code is less than half the length of the equivalent VHDL code. The resulting processor was mapped onto a Virtex-7 FPGA, where it ran at 100MHz, and was estimated to run at 28MHz when implemented as a 0.5μm IC.


Introduction
Register-transfer level (RTL) has long been the de facto model for describing HW. Yet, because the design productivity of RTL has stagnated, huge efforts have been spent on creating new methods for synthesizing HW from increasingly abstract representations. These methods are still limited, and in the end designers still rely heavily on RTL synthesis. Hence, time-to-market requirements have pushed designers to adopt processor-centric devices even though their energy and power efficiency is much poorer than that of pure HW (1,2).
When observing the academic and industrial work on improving HW design, it can be noticed that the main goal has been to use increasingly SW-like models for describing HW. This culminates with High Level Synthesis (HLS) (3), which tries to synthesize HW directly from SW code. The idea of getting closer to SW is indeed attractive, but it has proved difficult to synthesize efficient HW using SW-oriented models of computation (4).
Conversely, even though the design productivity of SW has increased a lot, its model of computation has remained mostly unchanged, based on the sequential execution of imperative instructions. Therefore, it can be argued that the choice of the model of computation may not be what matters for improving productivity. In this context, we proposed HDLRuby (5,6), an RTL-based HDL built upon the Ruby programming language (7), in order to bring to HW design paradigms that are efficient in SW design and independent of the model of computation. Namely, we focused on the following: object-oriented programming, genericity, metaprogramming and reflection. It should be noted that the proposed approach is orthogonal to the traditional ways of abstracting HW, and that HDLRuby can be extended to support HLS-like algorithms.
The goal of this paper is to evaluate the benefits of using HDLRuby for designing a full-fledged circuit from scratch. For that purpose, the paper presents the from-scratch implementation of a full processor with HDLRuby, compares its complexity with that of the corresponding VHDL implementation, evaluates the effort required to obtain the final result, checks its validity on FPGA and IC targets, and estimates the final design productivity in gates. The processor is called MEI8 and is an 8-bit RISC Harvard processor including 8 general purpose registers, 37 different instructions and 2 external interrupt ports. In the present implementation, the instruction memory is an on-chip 256-byte ROM. More details about this processor and the corresponding source code are available online (8).
The rest of the paper is organized as follows: Section 2 presents related work, Section 3 presents the HDLRuby language, and Section 4 details the study and its results and discusses their significance. Finally, Section 5 concludes the paper.

Related Works
A major improvement in HW design was the adoption, about 25 years ago, of the RTL model of computation (9). While successful, RTL design is still much more time consuming than SW design. For this reason, tremendous efforts have been spent on further improving the design productivity of HW. Among these efforts we can cite: the early works on behavioral synthesis (10), which try to synthesize HW from clock-free sequential code; component-based design (11,12), which generates wrappers for low-level HW/SW components to allow easy composition of complex systems; the introduction of SW-based HDLs like SystemC (13) or SpecC (14); and the recent efforts for generating HW directly from SW code, like HLS approaches (3) or HW synthesis from Matlab models (15). Most of these approaches have in common that they try to synthesize HW using a model of computation closer to SW than to HW, which has proved difficult in practice (4). By contrast, with HDLRuby we do not try to change the model of computation, but instead focus on improving the quality of the RTL code.
A few approaches are closer to ours. Some works have tried to introduce object orientation to HW design. For instance, SystemVerilog (16) and SystemC (13) include classes and high-level control constructs, but these are limited to data types or non-synthesizable code. SystemC is also notable for being implemented on top of the C++ SW programming language. More recently, P. Tomson (17) presented the draft of an HDL based on the Ruby language. However, this latter language did not evolve past the proof of concept.
There is little work on evaluating the productivity of HDLs. The usual approach is to provide metrics like the number of gates produced per designer per day (1,2). Such metrics are however product-dependent and give reliable results only when applied to a wide range of designs. For evaluating a single design, as is the case in this paper, more relevant metrics can be found in SW design, e.g., source lines of code (SLOC), cyclomatic complexity (18), or code churn (19). Section 4.1 gives more details about the metrics considered for this paper.

The Core of HDLRuby
HDLRuby (5) is a HW description language based on the Ruby (7) programming language. A preliminary version of this language was presented in 2018 (6). The goal is to increase the productivity of HW designers by adapting to HW successful SW paradigms that are not, or only partially, used in existing HDLs like VHDL or Verilog HDL, while ensuring that the language remains synthesizable at the register transfer level (RTL). The main paradigms imported to HDLRuby are the following:
• object-oriented programming: the elements of a description are considered as objects, i.e., collections of attributes and methods (algorithms), that interact through messages.
• generic programming: elements of a description can be parameterized and reused in different contexts.
• reflection: the elements of a description can examine and modify their own structure and behavior.
• metaprogramming: code can be treated as data and be generated during execution; in our case, portions of code can be used as parameters for other elements of the description.
For that purpose, HDLRuby has been designed as a two-level language: a high-level generative language, used by the designer, whose execution produces a low-level set of data structures representing RTL constructs. The high-level generative layer is implemented on top of the Ruby programming language. In terms of syntax, HDLRuby includes all the syntactic constructs of the Ruby language, with additional ones for handling HW-oriented descriptions. These new constructs include for instance HW-specific literals (e.g., the "Z" state), and new constructs for describing signals, processes, instances or modules.
Synthesizability of the language is ensured by keeping an RTL model of computation, while the productivity-oriented features act only at the generative level, similarly to the implementation of object orientation and genericity in the C++ language. In detail, the SW paradigms were adapted to HW as follows:
• object-oriented programming and reflection: all the elements of a HW description are Ruby objects, and therefore include standard as well as reflection-oriented methods. However, these methods are programs that generate RTL code and not programs to be executed by the final circuit.
• generic programming and metaprogramming: they are inherited by construction from the underlying RTL code generation engine implemented in Ruby.
Since the previous publication (6), HDLRuby has been improved and its syntax has slightly changed. The left part of Fig. 1 gives an up-to-date example of an HDLRuby description of a shift register whose structure is defined from a generic argument that can be a bit width, a range or an explicit data type. In the figure, the first line declares the HW module named sreg with typ as generic parameter. The second and third lines analyze this parameter and convert it to a vector type if it is not already a type. This statement makes use of the reflection-oriented Ruby method is_a?, which checks the class of an object. Lines 5-7 declare the input and output signals of the register, namely, clk for the clock, rst for the reset, d for the data input and q for the data output. The type of d and q is the type of one element of the register and is obtained by typ.base. If the data type does not support sub-elements, a compile error is raised. Line 9 declares the storage of the register, named buf, with typ as data type. It can be seen from these initial declarations that HDLRuby is object-oriented and reflection-centric: the data type of the elements of a vector type is obtained directly from it through the base method, and signals are similarly declared through the input, output and inner methods of the relevant type. For clk and rst, the type is implicitly set to bit (i.e., a single bit). Lines 11-18 describe the process handling the update of the register. Line 11 indicates that the process's statements are non-blocking (par) and activated on the rising edge of clk (clk.posedge). The next line checks if there is a reset (hif and helse are keywords describing HW 'if' and 'else'). In case of reset, buf is set to 0; otherwise, its elements are chained using the range objects of Ruby: [-1..1] and [-2..0] represent respectively the range from the last element (-1) down to the second one (1) and the range from the second-to-last element (-2) down to the first one (0) of buf. The equivalent VHDL or Verilog HDL would look similar; however, a generic type argument would not be supported, and the only possible genericity would be the width of the register. To see the difference, the right part of Fig. 1 gives a few examples of instantiation of this register. The first instance is an 8-bit shift register, the second is a shifting buffer of 16 characters, and the last is a shift register containing a floating-point value. With a traditional HDL, each of those three instances would require a different HW description, i.e., about three times more code.
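The shifting behavior described above can be mimicked in plain Ruby. The following is only a software model under stated assumptions, not HDLRuby code and not the content of Fig. 1: the class name ShiftRegisterModel, its methods and the depth and zero parameters are hypothetical. Note also that plain Ruby arrays are sliced with ascending ranges (0..-2), whereas the HDLRuby description uses descending HW-style ranges.

```ruby
# Plain-Ruby behavioral model of a generic shift register (illustrative only,
# not HDLRuby code). ShiftRegisterModel and its methods are hypothetical names.
class ShiftRegisterModel
  attr_reader :buf

  # depth: number of stored elements; zero: the reset value of one element.
  def initialize(depth, zero = 0)
    unless depth.is_a?(Integer) && depth > 0
      raise ArgumentError, "depth must be a positive Integer"
    end
    @zero = zero
    @buf  = Array.new(depth, zero)
  end

  # Models one rising clock edge: a reset clears the storage, otherwise every
  # element moves by one position and d enters at index 0.
  def clock(d, rst: false)
    @buf = rst ? Array.new(@buf.size, @zero) : [d] + @buf[0..-2]
    q
  end

  # Data output: the oldest stored element.
  def q
    @buf[-1]
  end
end

bits  = ShiftRegisterModel.new(8)        # eight one-bit elements
chars = ShiftRegisterModel.new(16, " ")  # sixteen characters
[1, 0, 1].each { |b| bits.clock(b) }
"ab".each_char { |c| chars.clock(c) }
```

The same class serves both instances because the element type is never fixed by the model, which loosely mirrors the role of the generic typ argument in the description above.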

HW Design Patterns
Several kinds of circuits can be described as sets of finite state machines (FSM), decoders, and arithmetic and logic units (ALU). Components like FSMs or decoders may look quite generic but are in practice target-specific and difficult to include in an HDL without losing generality. As a matter of fact, explicit constructs for such components are not present in standard HDLs like VHDL or Verilog HDL. The approach of HDLRuby is to keep a very general language core and to provide libraries of template components that can be parameterized and grafted into general RTL descriptions. Such libraries are possible thanks to the metaprogramming capability of the language (6). We present here the FSM and the decoder templates that have been used in the description of the MEI8 processor.

The FSM Template
This template makes it possible to describe synchronous, asynchronous, mixed, and single or double-edge FSMs by simply specifying the states, the corresponding actions and a few optional configuration parameters. Fig. 2 gives an example of a globally asynchronous FSM with only a few synchronous states. This figure is a simplified version of the MEI8 main FSM where interrupts, IO bus accesses and the handling of specific instructions have been removed. In the figure, the first line is the header of the FSM and indicates that by default its output signals are generated asynchronously (:async), that the state transitions are performed on the rising edge of signal clk and that the reset is done on signal rst. Line 2 gives the default actions and line 4 gives the action to perform in case of reset (here, setting the program counter and the instruction register to 0). The first actual state is described from line 6 and is named :re. Here, state is used for defining a state. This state, asynchronous by default, also has a synchronous part added at line 7 through sync. For this state, no transition is specified and therefore the FSM will go by default to the next declared state, i.e., :fe. State :fe is only synchronous and is therefore added through sync. This state does not have any transition specified either and therefore goes to the following :ex state. An explicit transition can be specified using goto, as in line 16 where the next state is set to :fe. It is also possible to set multiple alternative next states depending on a condition, as is done at line 13 where, depending on whether signal branch is 1 or 0, the next state will be :br or :fe (goto is implemented like a multiplexer so that the number of possible target states is not limited).
For comparison, Fig. 3 gives the equivalent VHDL code. The code is significantly longer and more complex (e.g., it includes two processes and two case statements).
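The transition rules described above (default transition to the next declared state, explicit goto, conditional goto) can be sketched in plain Ruby. This is a software model only, not the HDLRuby FSM library: TinyFsm, state and step are hypothetical names, output actions and the sync/async distinction are left out, and the wrap-around of the last state is an assumption of the sketch.

```ruby
# Plain-Ruby model of FSM transition rules (illustrative only, not the HDLRuby
# fsm template). TinyFsm, state and step are hypothetical names.
class TinyFsm
  attr_reader :current

  def initialize
    @order   = []   # state names in declaration order
    @next    = {}   # state name => block computing the next state, or nil
    @current = nil
  end

  # Declares a state. The optional block computes an explicit next state from
  # a hash of input signals; without it the next declared state is used.
  def state(name, &goto)
    @order << name
    @next[name] = goto
    @current ||= name
    self
  end

  # Performs one state transition (one active clock edge in the HW version).
  def step(inputs = {})
    goto = @next[@current]
    @current = if goto
                 goto.call(inputs)
               else
                 # Assumption of this sketch: the last state wraps around.
                 @order[(@order.index(@current) + 1) % @order.size]
               end
  end
end

fsm = TinyFsm.new
fsm.state(:re)                                      # default: go to :fe
fsm.state(:fe)                                      # default: go to :ex
fsm.state(:ex) { |i| i[:branch] == 1 ? :br : :fe }  # conditional goto
fsm.state(:br) { |_| :fe }                          # explicit goto
trace = [fsm.current]
[{}, {}, { branch: 1 }, {}, {}, { branch: 0 }].each { |i| trace << fsm.step(i) }
```

With these inputs the model visits :re, :fe, :ex, :br, :fe, :ex and :fe in that order, which follows the state sequence described for the simplified MEI8 FSM.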

The Decoder Template
This template makes it possible to describe a decoding circuit by providing a list of decoding formats and the corresponding actions. Fig. 4 gives an example of a decoder with three different formats. This figure is also taken from the description of the MEI8 processor; more precisely, it is a part of its instruction decoder. In the figure, before the decoder is described, line 1 sets accumulator a (index 0 in the register file) to be the default destination register of the ALU by assigning its index (0) to signal dst. Line 2 is the header of the decoder and indicates that signal ir (instruction register) is to be decoded. The remaining lines describe the behavior of the decoder as a list of entries, the first one having the highest priority. Since the circuit of this example is an instruction decoder, the action of each entry is mainly to set up the links between the arithmetic and logic unit (alu) and the registers. The first entry of the decoder describes the case where all the bits of ir are 0. It corresponds to the nop instruction (no operation) and sets signal wr to 0, indicating that the destination register should not be written to. The next entry describes the other cases where the two upper bits of ir are equal to 00. It corresponds to the register moves (copies between general purpose registers), or to the assignment of 0 to the destination register. For this entry, ir is decomposed into two fields, one three-bit field x and one three-bit field y. Note that a field name is always one character long, and that its occurrences in the entry indicate the bits used for the field. If x and y are equal, the ALU of the processor is set to produce a 0; otherwise, the ALU is set to transfer the value of register number x. Finally, destination register index dst is set to y. The last entry of the example describes the case where the two upper bits of ir are equal to 01. It corresponds to the standard arithmetic and logic operations. For this entry, ir is decomposed into field o (3-bit), which indicates the operation code, and field y (3-bit), which gives the number of the second source register. Line 14 sets up the links to the ALU circuit with respectively the operation (field o), the first source register (accumulator a) and the second source register (whose index is obtained from field y).
Equivalent code in Verilog HDL or VHDL would require several if and case statements and additional signal declarations for assigning the fields of the ir register. The connections to the ALU circuit would also require extra signals and statements since, in such HDLs, function call-like connection to instances is not supported. The sample code is omitted for the sake of conciseness, but a VHDL version can be found in the MEI8 code repository (8).
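The format-based decoding described above can be sketched in plain Ruby. This is a software model only, not the HDLRuby decoder library nor the content of Fig. 4: the format strings ("00xxxyyy" and so on), match_format, decode and the shape of the returned hashes are assumptions chosen to follow the three entries discussed in the text.

```ruby
# Plain-Ruby model of format-based decoding (illustrative only, not the HDLRuby
# decoder template). Format strings and method names are assumptions.

# Matches a value against a format such as "00xxxyyy": '0' and '1' are fixed
# bits, a letter names a field made of the bits where that letter occurs.
# Returns a Hash of field values on a match, nil otherwise.
def match_format(value, fmt)
  masked = value & ((1 << fmt.size) - 1)
  bits   = masked.to_s(2).rjust(fmt.size, "0")
  fields = Hash.new(0)
  fmt.each_char.with_index do |c, i|
    bit = bits[i].to_i
    if c == "0" || c == "1"
      return nil unless c.to_i == bit
    else
      fields[c.to_sym] = (fields[c.to_sym] << 1) | bit
    end
  end
  fields
end

# Entries are tried in order, so the first one has the highest priority.
FORMATS = [
  ["00000000", ->(_f) { { op: :nop, wr: 0 } }],
  ["00xxxyyy", ->(f) { { op: (f[:x] == f[:y] ? :zero : :move), src: f[:x], dst: f[:y] } }],
  ["01oooyyy", ->(f) { { op: :alu, code: f[:o], src: f[:y], dst: 0 } }]
].freeze

# Decodes an 8-bit instruction word; returns nil when no format matches.
def decode(ir)
  FORMATS.each do |fmt, action|
    fields = match_format(ir, fmt)
    return action.call(fields) if fields
  end
  nil
end

nop  = decode(0b00000000)
move = decode(0b00001010)   # x = 1, y = 2
zero = decode(0b00011011)   # x = y = 3
alu  = decode(0b01101100)   # o = 5, y = 4
```

Because the all-zero format is listed first, the word 0 is decoded as nop even though it also matches "00xxxyyy", which is the priority rule stated above.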

Methodology
In its current state, the HDLRuby toolchain can automatically compile and convert an HDLRuby description into Verilog HDL (20) or VHDL (21). For our experiments, the toolchain was used to produce VHDL code compatible with both FPGA and IC RTL synthesis. The resulting MEI8 cores were tested by executing a program including all the instructions of the processor and routines for handling interrupts¹ and system calls.
In order to estimate the potential of HDLRuby for improving design productivity, we compared the code describing the MEI8 processor with the corresponding VHDL code using several code metrics. However, while the HDLRuby code was written from scratch, the VHDL code is based on the code generated by the HDLRuby design tool. Please refer to Section 4.3 for more details about this choice. Several metrics exist for estimating the quality of software code (22). While there is a lack of metrics for estimating the quality of HDL descriptions, the similarity in structure between HW and SW descriptions makes it possible to use the existing SW metrics for comparing HDLRuby with VHDL. In this paper we considered the following metrics:
• Lines of Code: the number of lines of code (LOC, also called SLOC, for Source Lines of Code).
• Variables: the number of variables and signals.
• Assignments: the number of signal assignments and connections.
• Operations: the number of operations (arithmetic and logic operations, bit selection, moves, and casts).
• Controls: the number of control statements.
• Cyclomatic complexity: the number of independent decisions in the code, i.e., the number of independent alternatives in 'if' and 'case' statements.
In addition to these SW-oriented metrics we added the following HW-oriented metrics:
• Processes: the number of explicit processes.
• Bit literals: the number of bit vector literals.
• Bits in bit literals: the total number of bits in bit vector literals.
There is a lack of research about metrics for estimating the quality of HW code. However, processes can be a source of errors since they make it more difficult to track the state of a signal, which is why their number is counted. The number of literals is usually not considered an issue when estimating the quality of SW. Yet, HW descriptions often contain many bit vector literals, which are much more error-prone than the literals used in SW. Since the probability of an error increases with the size of the literals, the total number of bits in bit literals is also used as a complexity estimator.
¹ Both interrupts were raised during the test.
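As an illustration of the two bit-literal metrics, the following plain-Ruby sketch counts double-quoted binary literals in VHDL-like text. It is only an approximation under stated assumptions: the method name bit_literal_metrics and the sample text are made up, and the counting procedure actually used for the paper is not described in the text.

```ruby
# Counts bit vector literals such as "0101" in VHDL-like source text, and the
# total number of bits they contain (illustrative approximation only).
def bit_literal_metrics(source)
  literals = source.scan(/"([01]+)"/).flatten
  { literals: literals.size, bits: literals.sum(&:size) }
end

# Made-up sample, not taken from the MEI8 code.
sample = <<~VHDL
  dst <= "000";
  if ir(7 downto 6) = "01" then
    alu_op <= "0" & ir(5 downto 3);
  end if;
VHDL
metrics = bit_literal_metrics(sample)
```

On this sample the sketch finds three literals totalling six bits; single-bit character literals such as '0' are deliberately ignored here since the metrics above refer to bit vector literals.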
In addition to the code complexity, we estimated the effort required for the from-scratch implementation of the HDLRuby code. This estimation was done using code churn-based metrics. More precisely, we counted the lines of code added and deleted for each commit to the repository and extracted from these data the following metrics (19):
• Number of commits: the total number of commits to the code repository.
In order to summarize the total effort required for designing the processor with HDLRuby, the following metrics have also been added:
• Written LOC: the total number of LOC written when designing the HDLRuby description of MEI8.
• Written rate: the ratio between the written LOC and the final LOC.
These code churn-based metrics have not been used for the VHDL code because it was written based on the already designed HDLRuby code.
Table 1 compares the quality metrics for the HDLRuby and the VHDL code, the lower part being dedicated to the HW-specific metrics. In the table, "Ratio" is the ratio between the VHDL metric and the corresponding HDLRuby one. On average over all the SW-oriented metrics, the VHDL code is 2.33 times more complex than the HDLRuby code, with a standard deviation of 0.77. For the HW metrics, the number of processes is consistent with the SW metrics whereas the bit literal metrics are much more in favor of the HDLRuby description. Globally, the complexity of the VHDL description is more than twice that of the HDLRuby one.
Fig. 5 gives the number of lines added and removed for each commit to the code repository of the HDLRuby description of the processor. From these raw data, Table 2 gives the resulting churn metrics. In the table, the small number of commits and the small ratio of 3:1 between the total written LOC and the final LOC tend to show that the design effort was indeed small, even relative to the small LOC of the HDLRuby implementation. In total, the estimated design time of the processor is about 20 hours.²
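The churn-based metrics can be sketched in plain Ruby as follows. This is an illustration under stated assumptions: the commit data are made-up placeholders rather than the MEI8 repository history, churn_metrics is a hypothetical helper, and the sketch assumes that the written LOC is the sum of added lines and that the final LOC is the added lines minus the deleted ones.

```ruby
# Each commit is a pair [added_lines, deleted_lines]. The numbers below are
# made-up placeholders, not the MEI8 repository data.
commits = [[120, 0], [80, 30], [40, 60], [25, 25]]

# Assumptions: written LOC = sum of added lines, final LOC = added - deleted.
def churn_metrics(commits)
  written = commits.sum { |added, _deleted| added }
  deleted = commits.sum { |_added, removed| removed }
  final   = written - deleted
  { commits: commits.size, written_loc: written, final_loc: final,
    written_rate: written.to_f / final }
end

churn = churn_metrics(commits)
```

A written rate close to 1 means that almost every written line survived to the final code, while larger values indicate more rewriting.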

Results
HDLRuby has been designed to be synthesizable RTL so that the increase in productivity should have little impact on the performance of the result. To evaluate whether this assumption holds, the processor was mapped onto the Xilinx Virtex-7 FPGA VC707 Evaluation Kit board (23) (28nm technology) using the Vivado (24) tool chain, and implemented down to DRC check using the Alliance (25) tool chain targeting a 0.5µm CMOS technology with an average cell area of 1837µm² (the default for the tool chain). Table 3 gives the synthesis reports for the FPGA, and Table 4 gives the reports for the IC mapping. As can be seen in the tables, the resulting processor is very small. This was expected, since it is an 8-bit RISC processor. But it is also fast enough to run at 100MHz for the FPGA implementation and at 28MHz in the post-synthesis simulation of the IC implementation. Since the processor is able to execute one instruction every two cycles, with an extra cycle required for branches and other extra cycles required when accessing the external memory, its average performance is between 40 and 50MIPS (Million Instructions Per Second) for the FPGA implementation and between 12 and 14MIPS for the IC implementation.
² Only limited time per day could be assigned to this design.
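The relation between clock frequency, cycles per instruction (CPI) and MIPS used above can be checked with a short plain-Ruby sketch. The helper names mips and implied_cpi are hypothetical, and the average CPI values are derived here from the quoted figures, they are not measurements reported in the text.

```ruby
# Throughput in MIPS for a clock in MHz and an average number of cycles per
# instruction (CPI). Both helper names are hypothetical.
def mips(clock_mhz, cpi)
  clock_mhz.to_f / cpi
end

# Average CPI implied by a clock in MHz and a throughput in MIPS.
def implied_cpi(clock_mhz, mips_value)
  clock_mhz.to_f / mips_value
end

fpga_peak = mips(100, 2)          # two cycles per instruction on the FPGA
ic_peak   = mips(28, 2)           # same CPI at the IC clock frequency
fpga_cpi  = implied_cpi(100, 40)  # CPI implied by the lower FPGA bound
ic_cpi    = implied_cpi(28, 12)   # CPI implied by the lower IC bound
```

The upper bounds (50 and 14 MIPS) correspond to the best case of two cycles per instruction, while the lower bounds correspond to an average of roughly 2.3 to 2.5 cycles per instruction once branches and external memory accesses are accounted for.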
Finally, the productivity in terms of gates per designer per day can be estimated using the number of gates of the IC of Table 4 and the design time in days (assuming 8-hour working days) as follows: 9199 ÷ (20 ÷ 8) ≈ 3680 gates per day, i.e., on the order of 3000 gates per day. While coarse, this value is significantly higher than the average for RTL design, i.e., about 800 gates per designer per day (1).

Fig. 5. The code churns in LOC for each commit.
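The productivity estimate above can be reproduced with the following plain-Ruby arithmetic. The constant names are chosen for this sketch only; the values are those quoted in the text.

```ruby
GATES          = 9199  # gate count of the IC implementation (Table 4)
DESIGN_HOURS   = 20    # estimated design time in hours
HOURS_PER_DAY  = 8     # assumed length of a working day
REFERENCE_RATE = 800   # average RTL productivity in gates per day, from (1)

design_days   = DESIGN_HOURS.to_f / HOURS_PER_DAY  # 2.5 days
gates_per_day = GATES / design_days                # about 3680 gates per day
speedup       = gates_per_day / REFERENCE_RATE     # about 4.6
```

The resulting ratio of about 4.6 is the basis for describing the productivity as roughly four times higher than the average RTL figure.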

Discussion
How to estimate the productivity of a language is a difficult research topic and, as far as we know, has never been addressed in the context of HW design (please note that we are not talking here about the performance of the synthesis tools). In this paper we used estimators that could be questioned. For instance, the LOC is often said to be a poor estimator of code quality (26). Hence, several other estimators have been used. It can also be objected that SW estimators are not relevant for HW design, and so we also used a few HW-oriented ones. Still, such estimators have not been studied, so they are to be taken with caution. Another difficulty in estimating the productivity of HDLRuby is that some of its features (e.g., reflection) are mostly overlooked by the code quality estimators. The evaluation of the design effort was made using the code churns. This is to our knowledge the first time that such metrics have been used in HW design, so that even though the 3:1 ratio of written over final LOC indicates a lower than average effort, it is hard to draw a definite conclusion from this result. By contrast, the evaluation of the designer productivity in number of gates per day is commonly used, and our result, indicating a significant increase (3680 against about 800 on average (1)), is promising. However, this estimate depends on the target circuit. For a processor, the productivity can be low, since such a circuit lacks the regularity that would greatly benefit from generic programming. Moreover, the choice of making a design from scratch precluded the use of IP components, whereas their use is usually the case in recent designs.
Ideally, the HDLRuby and the VHDL code should have been written in parallel by experienced HW designers. Such a setup was difficult in practice for our research structure, where only one person was available for the task. Moreover, the same person reimplementing an already existing circuit would have been influenced by the existing code and design choices. That is why a less time-consuming compromise was selected: since the design in HDLRuby would bias the other implementations anyway, it was decided to use the code generated by the HDLRuby tool chain as the basis for writing the VHDL code. The core of the work was then to improve the compactness and the style of the VHDL code. Further, comment lines in the HDLRuby code were counted as lines of code, while the corresponding VHDL code was left without any comments in order to avoid an artificial increase of its size. The largest drawback of this approach is that it was not possible to estimate the design time of the processor using VHDL. Another drawback is that the resulting VHDL code may in fact be shorter than VHDL code written from scratch, due to a heavy usage of compact logic expressions generated by the HDLRuby engine that designers usually avoid in favor of control flow-like constructs.

Conclusions
This paper presented the from-scratch implementation of MEI8, an 8-bit RISC processor, using the HDLRuby language. Then, it compared the resulting code with an equivalent VHDL implementation and gave an evaluation of the design effort required for the HDLRuby implementation. Finally, it gave details about the resulting circuit in FPGA and IC versions and used these figures to obtain an estimate of the average productivity in number of gates produced per day. The comparison showed that the HDLRuby code was less than half the length of the VHDL code. Moreover, with HDLRuby, the required design effort proved to be low while the productivity was about four times higher than with a standard RTL approach.
While this study shows some of the potential of HDLRuby, the limitations of the estimators used call for further evaluations with several other kinds of circuits. Moreover, not all the features of HDLRuby have been evaluated in this paper; in particular, its extensive generic programming and reflection features have not been fully addressed. Preliminary evaluations of this kind have been made (27), but more thorough work is required.
Regarding HDLRuby, we plan to take advantage of the plasticity of the language for providing libraries for supporting IP, the dynamic partial reconfiguration capabilities of FPGA-based devices, the description of SW executed on processor cores, and the description of high-level communication protocols.