16-bit RISC Processor Verilog Code with Clock Gating

Clock Gating in 16-bit RISC Processor

Clock gating is a technique in which a clock signal is provided to a component or module only when it is needed. This saves power, because only the logic that is actually working gets clocked. It is mainly used in synchronous circuits. A simple gating circuit can be built with an AND gate. The output of the AND gate must be connected to the ENABLE port of each individual component of the microcontroller. The first input of the AND gate is the original clock signal generated by the oscillator; the second input is a control signal. When the control signal is 1, the module is turned ON; when it is 0, the module is turned OFF.
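The AND-gate gating described above can be sketched in Verilog as follows. The module and port names (`and_clock_gate`, `latch_clock_gate`, `clk`, `en`, `gclk`) are hypothetical, not taken from the author's code. A bare AND gate can chop or glitch the clock if the control signal changes while the clock is high, so the sketch also shows the common latch-based clock gate, a swapped-in technique rather than the author's design, in which the control signal is captured by a latch that is transparent only while the clock is low.

```verilog
// Hypothetical module: plain AND-gate clock gate, as described in the text.
module and_clock_gate (
  input  wire clk,   // free-running clock from the oscillator
  input  wire en,    // control signal: 1 = module ON, 0 = module OFF
  output wire gclk   // gated clock for one component
);
  assign gclk = clk & en;
endmodule

// Hypothetical module: latch-based clock gate (swapped-in technique).
// The latch holds en while clk is high, so a change of en cannot
// cut a high clock pulse short.
module latch_clock_gate (
  input  wire clk,
  input  wire en,
  output wire gclk
);
  reg en_latched;

  always @(clk or en)
    if (!clk) en_latched <= en;   // transparent only while clk is low

  assign gclk = clk & en_latched;
endmodule
```

Each component's clock input would be fed from its own `gclk`, with `en` coming from the Control Unit.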

Each component must have an ENABLE port. Enable ports can be positive (active-high) or negative (active-low): 0 turns off a component with a positive enable, and 1 turns off a component with a negative enable. As per Wikipedia, clock gating can also save die area on which the circuit is fabricated, because it replaces many enable multiplexers with gating logic. After all, saving power is what we need.

This technique is mostly used in low-power circuits, for example circuits intended to run from a 1.5 V battery for a year. It can be inserted through behavioral modeling or RTL modeling. However, it is very important to verify the output, because wrong switching of the clock leads to wrong information and data being passed along. In my previous post, I shared the RISC Processor code without clock gating. That code has no enable ports on any component, so every component is clocked on every cycle, which in principle increases power consumption. The current code implements GATING with AND gates. The Control Unit controls the gating signals for all the components and itself receives the clock all the time, so the current code still has an ungated clock in the Control Unit. A further update might replace this with some other logic. For now, let us understand this code.

Here is the RTL Schematic for the gating processor.

Here is a screenshot of the gated clock signals of all the components.

In the RTL schematic you will observe that certain components are not connected/wired. This is not an error but an optimization: when an input or output never changes, the synthesis tool treats it as a constant and trims its wiring. In the RTL above, the register file is an example. As in my previous post, each state enables the clock for the next stage, and this is controlled by the Control Unit.

For example, the circuit below shows clock gating.

The Control Unit (CU) drives each AND gate so that each component receives the clock only when it is needed. With GATING, our processor works as follows.

In the IM stage, the CU switches on the clock for the ID stage and switches it off for every other stage. In the ID stage, it switches on the clock for the ALU stage and switches it off for the other stages. In the ALU stage, it switches on the clock for the MEM stage and switches it off for all other stages. In the MEM stage, it switches on the clock for the WB stage and switches it off for the other stages. In the WB stage, it switches on the clock for the IM stage and the ADDER and switches it off for the other stages.
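The rotation of clock enables described above can be modelled as a one-hot register that the always-clocked Control Unit shifts by one position per cycle. This is a minimal sketch under assumptions: the module and signal names (`gating_control`, `en`, `en_adder`) and the synchronous reset are hypothetical, and each bit of `en` would be ANDed with the clock as in the gating circuit described earlier.

```verilog
// Hypothetical sketch of the CU's gating enables.
// Bit order of en: {WB, MEM, ALU, ID, IM}. Exactly one stage is enabled at a time.
module gating_control (
  input  wire       clk,      // ungated clock (the CU is always clocked)
  input  wire       rst,      // synchronous reset, active high (assumed)
  output reg  [4:0] en,       // one clock enable per stage
  output wire       en_adder  // ADDER is clocked together with IM
);
  always @(posedge clk) begin
    if (rst)
      en <= 5'b00001;           // start by clocking the IM stage
    else
      en <= {en[3:0], en[4]};   // IM -> ID -> ALU -> MEM -> WB -> IM
  end

  assign en_adder = en[0];      // WB switches on IM and ADDER together
endmodule
```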

Code for the 16-bit Gating Microcontroller (pop-up warning): Click here for Verilog Code
Below you can see the output for the instructions, excluding the BRANCH instructions, which did not fit on the screen. I would request readers to decode the instructions, match them against the result, and try to deduce what is happening in the processor, for example how each output is written into the registers.

And here are the beautiful gated clocks too.

Observe that after the MEM stage, we have to clock the Register File to perform the write into the write register. So after the MEM stage, we require clocking of both the Register File and the Adder.

Here is the power analysis of the above processor with GATING.

As per the analysis, my processor uses only 0.014 W. That's far less than the 0.089 W of the previous processor without GATING. Treat this analysis with caution: I doubt whether it is correct, but under a few minor assumptions I guess we can consider it for a moment. Want an explanation of the result above or of any other diagram? Comment !!

Verilog Code of 16 Bit RISC Processor with working

Verilog Code for the 16 bit RISC Processor 

Hello everyone, I know many of you have been waiting for the working code for this processor, along with its RTL schematic. Well, I have successfully coded the non-pipelined processor with R-format, I-format, and Branch instructions. I'll start with the instruction format. Each instruction is 16 bits long.

Instruction format: xxxx yyyy zzzz qqqq, i.e. bits [15:12], [11:8], [7:4] and [3:0].

xxxx is the OPCODE, which decides the operation to be carried out.
yyyy is the address of Register 1.
zzzz is the address of Register 2.
qqqq is the address of the register where data has to be written. For BRANCH instructions, it gives the number of instructions the user wants to jump.

My design takes 5 clock cycles to execute a single instruction. For BRANCH instructions, I stall the processor for 1 clock cycle. So, to make it clear, we have 5 stages here:

1. Instruction Fetch Stage
2. Instruction Decode Stage
3. Arithmetic Stage
4. Memory Stage
5. Write Back Stage

In the first stage, we fetch the instruction, which contains the opcode and the register addresses, from the Instruction Memory. In the Instruction Decode stage, the register file receives the register addresses and reads the data from each register that has to be sent to the ALU. In the Arithmetic stage, the ALU receives the opcode from the control unit and performs the arithmetic operation specified by the opcode. After the Arithmetic stage comes the Memory stage.
It depends on two signals: read and write. If the read signal is activated, the memory reads the value stored at the address received from the Arithmetic stage and outputs it. If the write signal is activated, it writes the Register B data to the address received from the Arithmetic stage. Finally comes the Write Back stage, which writes the data into the write register in the register file.

Don't treat these stages as a pipeline for now. Pipelining is easy, and I will upload it in the coming days.

Here is the datapath of my version of the non-pipelined processor.

RTL Schematic Lovers!! Here you go:

Do remember one thing: ideally, a processor shouldn't have any output other than a display. Our datapath has extra outputs because many "Internet Guys" asked me for them without learning how to add them themselves.

Program Counter:

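module PC(in,pc_select,clk,out);
input [7:0] in;        // current instruction address
input clk, pc_select;  // clk is unused; the PC advances on pc_select from the Control Unit
output reg [7:0] out;  // address sent to the Instruction Memory
reg [8:0] temp;        // one extra bit holds the carry out of in + 2

initial begin
  out = 0;
end

// Advance by 2 bytes (one 16-bit instruction) on every rising edge of pc_select
always @(posedge pc_select) begin
  temp = in + 2;
  out  = temp[7:0];
end
endmodule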

The Program Counter sends the instruction location to the Instruction Memory (IM), from which the IM selects the instruction to work on. The logic increments the counter by 2 units at every rising edge of the pc_select signal. This signal is driven by the Control Unit. I use temp as a temporary variable to hold the carry if the result exceeds 8 bits. Now the question arises: why 2 units and not 1? Follow the example below.

Let us take the 16-bit instruction 1010_0000_1111_0011. One memory location holds 1 byte, so one location stores 1010_0000 and the next location stores 1111_0011, i.e. 8 bits each. In the instruction memory, the instruction is stored as

Imem[1] = 8'b1010_0000
Imem[2] = 8'b1111_0011

The 16-bit instruction consists of two 8-bit halves, i.e. 2 bytes. It is read out of the instruction memory as {Imem[1], Imem[2]}, which is 16'b1010_0000_1111_0011. The second instruction starts at Imem[3], so it is {Imem[3], Imem[4]}. Careful observation shows that every new instruction begins at 1, 3, 5, 7, 9, ... so with in = 1:

Imem[in] = Imem[1]
Imem[in + 2] = Imem[3]

I hope, everything is clear now.

Instruction Memory

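module INSTRUCTION_MEMORY(address,clk,opcode,A_reg,B_reg,W_reg,Sign);
input [7:0] address;
input clk;
reg [3:0] dest;
output reg [3:0] opcode;
output reg [3:0] A_reg;
output reg [3:0] B_reg;
output reg [3:0] W_reg;
output reg [3:0] Sign;
reg [7:0] imem [0:17];    // 18 bytes = 9 instructions of 2 bytes each
reg [15:0] instruction;

initial begin
  // Program bytes imem[0] ... imem[17] are initialised here.
end

// Fetch both bytes of the instruction and split it into its fields
always @(negedge clk) begin
  instruction = {imem[address], imem[address+1]};
  opcode = instruction[15:12];  // xxxx
  A_reg  = instruction[11:8];   // yyyy: Register 1 address
  B_reg  = instruction[7:4];    // zzzz: Register 2 address
  W_reg  = instruction[3:0];    // qqqq: write register address
  Sign   = instruction[3:0];    // qqqq: branch offset
end
endmodule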


The Instruction Memory outputs the opcode to the Control Unit, together with the Register 1 address, the Register 2 address, the write register address, and the branch index. Each field is 4 bits wide; the write register address and the branch index share the same bits, instruction[3:0].

Register File

module REGISTER_FILE(clk,readA,readB,dest,data,reg_wrt,readA_out,readB_out);
input reg_wrt;                  // write enable from the Control Unit
input [3:0] readA, readB, dest; // register addresses from the Instruction Memory
input [7:0] data;               // write-back data
input clk;
output reg [7:0] readA_out, readB_out;
reg [7:0] Register [0:15];

initial begin
  Register[0] = 0;   // R0 always contains zero
  Register[1] = 2;   // Random values stored
  Register[5] = 10;  // You can change any value within this initial block
end

always @(negedge clk) begin
  readA_out <= Register[readA];
  readB_out <= Register[readB];
  if (reg_wrt)
    Register[dest] <= data;     // store the incoming data in the write register
end
endmodule


The register file receives from the Instruction Memory the addresses of the registers to read from and of the write register to write data into. Whenever the reg_wrt flag (from the Control Unit) is high, the incoming data is stored in the write register. In the initial block, I have manually stored some values. Here, readA is the register address received from the Instruction Memory; the same holds for readB.

Arithmetic Logic Unit

The ALU receives the opcode from the Control Unit, which tells the ALU which operation to perform. Currently, I have included add, subtract, increment, decrement, and logical operations, along with the Branch instructions. For every operation, we store the carry bit and the zero bit, which are used by the branch logic. For the BEQ function, if Reg_1 is equal to Reg_2, z turns to 1. The opposite is the case for the BNE function. For BLT (Branch if Less Than), if Reg_1 is less than Reg_2, the carry bit goes high. The opposite is the case for BGT (Branch if Greater Than). Later, I'll remove some functions from the ISA, such as logical OR and NAND, to make room for ADDI and SUBI.
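The flag behaviour described above can be sketched as a small combinational ALU. The ISA table is not reproduced here, so the opcode values, the `OP_CMP` compare operation, and all names are hypothetical. For a compare, the sketch computes Reg_1 - Reg_2 with one extra bit: z is 1 when the operands are equal (BEQ), and the extra bit, used as carry, is 1 when Reg_1 < Reg_2 (BLT).

```verilog
// Hypothetical ALU sketch; opcode values are assumptions, not the real ISA.
module alu_sketch (
  input  wire [3:0] opcode,
  input  wire [7:0] a,       // Reg_1 data
  input  wire [7:0] b,       // Reg_2 data
  output reg  [7:0] result,
  output reg        carry,   // carry out for ADD/INC, borrow (a < b) for SUB/CMP
  output wire       z        // 1 when result is zero
);
  localparam OP_ADD = 4'd0, OP_SUB = 4'd1, OP_INC = 4'd2, OP_DEC = 4'd3,
             OP_AND = 4'd4, OP_OR  = 4'd5, OP_CMP = 4'd6;

  always @(*) begin
    carry = 1'b0;
    case (opcode)
      OP_ADD: {carry, result} = {1'b0, a} + {1'b0, b};
      OP_SUB,
      OP_CMP: {carry, result} = {1'b0, a} - {1'b0, b}; // carry = 1 when a < b
      OP_INC: {carry, result} = {1'b0, a} + 9'd1;
      OP_DEC: {carry, result} = {1'b0, a} - 9'd1;
      OP_AND: result = a & b;
      OP_OR:  result = a | b;
      default: result = 8'd0;
    endcase
  end

  assign z = (result == 8'd0);  // after a compare: z = 1 when a == b
endmodule
```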

Data Memory

Nothing special about this module. It has two signals: re for read and we for write. When re is high, it reads from the location received from the ALU. When we is high, it writes the Reg_2 data from the Register File to the address received from the ALU.
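A minimal sketch of such a data memory follows, assuming a 256-byte array, 8-bit addresses and data, and both read and write taking effect on the rising clock edge; the module and port names are hypothetical.

```verilog
// Hypothetical Data Memory sketch: size and clocking are assumptions.
module data_memory_sketch (
  input  wire       clk,
  input  wire       re,      // read enable
  input  wire       we,      // write enable
  input  wire [7:0] addr,    // address computed by the ALU
  input  wire [7:0] wdata,   // Reg_2 data from the Register File
  output reg  [7:0] rdata    // loaded value, forwarded towards Mux_3
);
  reg [7:0] mem [0:255];

  always @(posedge clk) begin
    if (we) mem[addr] <= wdata;  // store
    if (re) rdata <= mem[addr];  // load
  end
endmodule
```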

Mux_1, Mux_2, and Mux_3 decide the flow of data. Mux_1 selects the write register for R-format or I-format instructions. Mux_2 selects the data forwarded to the ALU for load/store functions or R-format functions. Mux_3 decides whether to take the output of the Data Memory or simply carry forward the result of the ALU.
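All of these Muxes are ordinary 2-to-1 selectors. The sketch below, with hypothetical names (`mux2`, `mux3_writeback`, `mem_to_reg`), shows a width-parameterised 2-to-1 mux and its use as Mux_3, choosing between the ALU result and the Data Memory output; the same module with `W = 4` could serve as Mux_1 for the 4-bit write register address.

```verilog
// Hypothetical 2-to-1 multiplexer, parameterised by data width.
module mux2 #(parameter W = 8) (
  input  wire         sel,
  input  wire [W-1:0] in0,   // chosen when sel = 0
  input  wire [W-1:0] in1,   // chosen when sel = 1
  output wire [W-1:0] out
);
  assign out = sel ? in1 : in0;
endmodule

// Hypothetical use as Mux_3: pick the data to write back.
module mux3_writeback (
  input  wire       mem_to_reg,  // assumed CU signal: 1 for a load instruction
  input  wire [7:0] alu_result,
  input  wire [7:0] mem_data,
  output wire [7:0] wb_data
);
  mux2 #(.W(8)) mux_3 (.sel(mem_to_reg), .in0(alu_result), .in1(mem_data), .out(wb_data));
endmodule
```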

Mux_4, Mux_5, Mux_6, and Mux_7 implement the branch functions. Each receives two addresses: the first is the next address from normal PC execution, and the second, from the adder, is the new jump address. Mux_4 is for BEQ, Mux_5 for BNE, Mux_6 for BLT, and Mux_7 for BGT. Mux_8 decides which of these Muxes' outputs moves ahead to the PC.

Why Sign Extend and Left Shift?

Now to Sign Extend and Shift Left 1. When a user writes a BRANCH instruction, he/she does not specify the IM address of the jump target. Instead, the instruction tells us how many instructions to jump. Let us take some instructions as specified below








Suppose the user wants to jump from imem[0] by 2 instructions. He/she will write the instruction:
BRANCH $r1 $r2 2.
Don't confuse this with the PC code, where I incremented by 2 units. The user doesn't know anything about the internal architecture. If he/she wants to jump by 4 instructions, the addresses step as 0 --> 2 --> 4 --> 6 --> 8.

The input to the sign extender is 0010, taken from imem[1]; 0010 is instruction[3:0]. Sign extension means replicating the MSB into all the new bit positions above it. Hence, the extended data is 00000010. Shifting this left by 1 bit gives 00000100. Now add this to the "current" address, that of imem[0], i.e. 0. The result is 8'b00000000 + 8'b00000100 = 8'b00000100, which is 4. This 4 is fed to the PC, so the new instruction is fetched from imem[4]: the processor has successfully jumped by 2 instructions.

Do the same for imem[4]. Here instruction[3:0] shows that the user wants to jump by 3 instructions (ignore the opcode for now, as this is just an example). The sign-extended value is 00000011, and after the left shift it is 00000110. Adding this to the "current" location, 4, gives 00000100 + 00000110 = 00001010, which is 10. Thus, the new instruction is at imem[10].
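The branch-target calculation in the two examples above can be written as one continuous assignment. This is a sketch with hypothetical names (`branch_target`, `pc`, `offset`); it assumes 8-bit addresses, as in the PC module, and a 4-bit offset in instruction[3:0]. Because the offset is sign-extended, an offset of 1111 (-1) jumps one instruction backwards.

```verilog
// Hypothetical sketch: new jump address = current address + (sign-extended offset << 1).
module branch_target (
  input  wire [7:0] pc,      // "current" instruction address
  input  wire [3:0] offset,  // instruction[3:0]: number of instructions to jump
  output wire [7:0] b_out    // new jump address
);
  // Sign extend: copy the MSB of the offset into bits 7..4
  wire [7:0] ext = {{4{offset[3]}}, offset};

  // Shift left by 1 because each instruction occupies 2 bytes
  assign b_out = pc + (ext << 1);
endmodule
```

For example, pc = 10 with offset 1111 gives 00001010 + 11111110 = 00001000 (the carry out of bit 7 is dropped), i.e. imem[8], one instruction back.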

For a 32-bit processor, where each instruction is 4 bytes, we have to shift by 2 bits. Still confused about how we are jumping? Comment and I'll clear up your doubt.

Control Unit

I would call the Control Unit the heart of this processor. It is responsible for switching the Muxes and signals at the right stages so that the output is correct. As discussed above, I have divided the stages into 5 states.

1. Instruction Fetch: s1
2. Instruction Decode: s2
3. Arithmetic Stage: s3
4. Memory Stage: s4
5. Write Back Stage: s5

While working on the Control Unit, remember that although every module is provided with a clock, this does not mean they all work on valid data at the same time. For example, the IF stage takes 1 CC (clock cycle) to move data from its input to its output. The ID stage can only work on its input once the IF stage has put data on its output. Thus the 1st CC for ID is wasted; in the second CC, it works on the received instruction and itself takes 1 CC to produce its own output. In my code, s0 is the initial state, i.e. the rest state. From s0, the processor moves to s1 and then keeps rotating through s5 and back to s1.
In state s0, we set the signals required to prepare for state s1. Similarly, in state s1 we switch the signals required for state s2, and so on. In the last state, s5, we change the signals required for state s1, the IF stage. For the IF stage, we need pc_select set to 1, so that a new instruction location is sent to the IM. Note that you cannot switch a signal while you are already in the state that needs it: if the processor is in state s1, it cannot set pc_select there, because the signal takes effect only in the next clock cycle. After that CC, the processor would be in the ID state with pc_select set to 1, which is useless, or rather, a big error.


This is because pc_select signals the IM (Instruction Memory) to release a new instruction while the previous instruction is still in the ID (Instruction Decode) stage. At that moment, the destination register changes to the register address in the new instruction. This causes an error: we wanted to write the previous instruction's data to one register, but now it gets written to a different one. One can also conclude from this that pc_select, i.e. the fetch of a new instruction, must be held until the longest instruction has completed all of its states/stages.

For a better understanding, have a look at the image below.

Black lines indicate the processing of instruction A and green lines the processing of instruction B. You may wonder why the WB stage of instruction A and the IF stage of instruction B happen at the same time. In that CC we are only fetching instruction B; until that CC completes, its output is not available to the register file, while the previous instruction is ready to write into its write register. The write register address is overwritten only in the next CC, i.e. the ID stage, the 2nd CC of instruction B: the IM has sent the new addresses, but the Register File does not take the incoming register addresses until that CC. So the overwrite happens in the 6th CC, while the WB of instruction A takes place in the 5th CC.

To summarize the control unit: all one has to remember is to always put out the signals you want active in the next state. State s0 switches the signals required for s1, s1 those for s2, s2 those for s3, s3 those for s4, s4 those for s5, and s5 those for s1 again. Hence, the WB state contains this piece of code:

pc_select <= 1;

which means that as soon as the WB state (s5) completes its execution, the pc_select line is high, so in the next state, s1, all the required components are ready to work. Suppose you are hungry: wouldn't you want your mother to have food ready at home as soon as you arrive, rather than cooking after you get there? Similarly, each state prepares the food (switches the appropriate signals) for the next state, so that as soon as the next state (you) arrives, the food is ready (data flows through the switched signals).
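The "prepare the next state" rule can be sketched as a small state machine. Only two signals are shown, `pc_select` (needed in s1) and a write enable `reg_wrt` (needed in s5); the state encoding, the synchronous reset, and the choice of s4 as the state that raises `reg_wrt` are assumptions that follow the rule described above, not the author's Control Unit. Because each signal is registered, it is set one state early and is actually high during the state that uses it.

```verilog
// Hypothetical Control Unit sketch: each state sets the signals of the next state.
module control_fsm_sketch (
  input  wire       clk,
  input  wire       rst,        // synchronous reset, active high: go to s0
  output reg  [2:0] state,
  output reg        pc_select,  // high during s1 (IF)
  output reg        reg_wrt     // high during s5 (WB)
);
  localparam S0 = 3'd0, S1 = 3'd1, S2 = 3'd2, S3 = 3'd3, S4 = 3'd4, S5 = 3'd5;

  always @(posedge clk) begin
    if (rst) begin
      state <= S0; pc_select <= 1'b0; reg_wrt <= 1'b0;
    end else begin
      // Defaults: a signal is off unless the next state needs it
      pc_select <= 1'b0;
      reg_wrt   <= 1'b0;
      case (state)
        S0: begin state <= S1; pc_select <= 1'b1; end  // prepare IF
        S1: state <= S2;
        S2: state <= S3;
        S3: state <= S4;
        S4: begin state <= S5; reg_wrt <= 1'b1; end    // prepare WB
        S5: begin state <= S1; pc_select <= 1'b1; end  // prepare the next IF
        default: state <= S0;
      endcase
    end
  end
endmodule
```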

Please comment if you want a detailed explanation of the Control Unit and its code.

The ISA of this processor consists of 15 instructions.

How BRANCH Instructions Work

Have a look at those complex multiplexers in the datapath diagram at the top of the post. It is up to the designer to decide how to wire the circuits that execute the instructions. Here we have 4 multiplexers for branching. Mux4 works for the BEQ instruction, Branch if Equal: if $r1 == $r2, the branch is taken. Mux5 works for BNE, Branch if Not Equal: if $r1 != $r2, the branch is taken. Mux6 works for BLT, Branch if Less Than: if $r1 < $r2, the branch is taken. Similarly, Mux7 is for BGT, Branch if Greater Than: if $r1 > $r2, the branch is taken.
The select line for Mux4 (BEQ) is driven by the output of an AND gate. The inputs of this AND gate are the Z flag from the ALU and a line from the Control Unit. If we want the BEQ instruction, the Control Unit line is pulled HIGH. Now there are two cases. If $r1 == $r2, Z goes HIGH, both inputs of the AND gate are 1, and its output is 1. Meanwhile, the Control Unit lines for Mux5, Mux6, and Mux7 stay low, so their AND gate outputs are 0. Concatenating the AND gate outputs for each Mux gives 1000 for BEQ. Similarly, BNE gives 0100, BLT gives 0010, and BGT gives 0001. The select line for Mux8, sel8, is 4 bits long: it is the concatenation of the select lines of the AND gates that control the Muxes, so sel8 <= {sel4,sel5,sel6,sel7}.
Thus the controls for Mux8 will be

1000: in1 ------ from Mux4
0100: in2 ------ from Mux5
0010: in3 ------ from Mux6
0001: in4 ------ from Mux7
0000: out ------ from PC (normal, sequential instruction flow)
else: b_out ---- new jump address, independent of any flags

For BRN, we need to pull down all the Control Unit lines going to the AND gates. The AND gate outputs are then low, and their concatenation is 0000. Now suppose we want BEQ but $r1 != $r2. The Z flag is 0. For BEQ, only the Control Unit line into AND4 is high, to select Mux4, so AND4 outputs 0 (Z) & 1 = 0. Again sel8 is 0000, which means: carry on with the normal sequential flow and do not jump. To branch unconditionally, all we have to do is pull all the Control Unit lines into the AND gates high; sel8 then contains exactly two 1's, irrespective of the flags.
So the question is: how does pulling all the lines high make Mux8 take the b_out (branch out) address as its input? Suppose Z and Carry are both 0. Then ~Z and ~Carry are 1. These signals go to gates whose other input is already held high by the Control Unit. So the outputs for sel8 are as follows.

AND4 inputs are Z and from CU
AND5 inputs are ~Z and from CU
AND6 inputs are Carry and from CU
AND7 inputs are ~Carry and from CU

Z = 0, Carry = 0, Control Unit pulls all signals to 1
AND4: 0 & 1 = 0
AND5: 1 & 1 = 1
AND6: 0 & 1 = 0
AND7: 1 & 1 = 1
Concatenated result: 0101

Z = 0, Carry = 1, Control Unit pulls all signals to 1
AND4: 0 & 1 = 0
AND5: 1 & 1 = 1
AND6: 1 & 1 = 1
AND7: 0 & 1 = 0
Concatenated result: 0110

Z = 1, Carry = 0, Control Unit pulls all signals to 1
AND4: 1 & 1 = 1
AND5: 0 & 1 = 0
AND6: 0 & 1 = 0
AND7: 1 & 1 = 1
Concatenated result: 1001

Z = 1, Carry = 1, Control Unit pulls all signals to 1
AND4: 1 & 1 = 1
AND5: 0 & 1 = 0
AND6: 1 & 1 = 1
AND7: 0 & 1 = 0
Concatenated result: 1010

Thus, whenever sel8 contains two 1's, none of the single-branch cases of Mux8 matches, so it takes the else case and the PC jumps to the new branch address (b_out).
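The branch selection above, from the flags through AND4-AND7 to Mux8, can be sketched as follows. Names such as `branch_select`, `cu_br`, `pc_out`, and `next_pc` are hypothetical; `in1`-`in4` stand for the outputs of Mux4-Mux7. The `default` branch of the `case` catches every pattern with two 1's, which is how pulling all Control Unit lines high produces an unconditional jump.

```verilog
// Hypothetical sketch of the branch AND gates (AND4..AND7) feeding Mux8.
module branch_select (
  input  wire       z, carry,            // flags from the ALU
  input  wire [3:0] cu_br,               // CU lines into AND4..AND7: {BEQ, BNE, BLT, BGT}
  input  wire [7:0] in1, in2, in3, in4,  // outputs of Mux4..Mux7
  input  wire [7:0] pc_out,              // next sequential address ("out" from PC)
  input  wire [7:0] b_out,               // branch target from the adder
  output reg  [7:0] next_pc
);
  wire sel4 = z      & cu_br[3];  // AND4: BEQ
  wire sel5 = ~z     & cu_br[2];  // AND5: BNE
  wire sel6 = carry  & cu_br[1];  // AND6: BLT
  wire sel7 = ~carry & cu_br[0];  // AND7: BGT
  wire [3:0] sel8 = {sel4, sel5, sel6, sel7};

  always @(*) begin
    case (sel8)
      4'b1000: next_pc = in1;     // from Mux4
      4'b0100: next_pc = in2;     // from Mux5
      4'b0010: next_pc = in3;     // from Mux6
      4'b0001: next_pc = in4;     // from Mux7
      4'b0000: next_pc = pc_out;  // no branch: normal sequential flow
      default: next_pc = b_out;   // two 1's: unconditional jump
    endcase
  end
endmodule
```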

Confused about something?

To get the code:

1. Disable your pop-ups.
2. Click the link below; it opens 4 pop-up tabs containing the code.
Click here for Verilog Code

Verilog Code for I2C with RTL Schematic

Hi Guys,

It's been a long time. I was away dealing with my crappy life. Well, let's get to the main point, as the title of this post suggests. This I2C is much less complicated than all my previous I2C Verilog codes. The biggest surprise for my readers in this post is that this I2C comes with an RTL schematic. I'll explain everything about the code.

Let us begin.

I'll begin with the explanation of the Master code. I am assuming here that readers have already read the I2C Basic post. If not, search this blog for it; otherwise, move ahead.

In this I2C, I am assuming a few things.
1. It is a READ-mode-only I2C. No Write mode; that would complicate a hell of a lot of stuff.
2. The Master never malfunctions when it receives data from the Slave, hence the acknowledge bit it sends is always 1 (this does not apply to address matching).

What is in this I2C Code:
In this code, the Master sends the start bit and SCL starts toggling. The Master then sends the 7-bit Slave address. If the address does not match, it restarts, i.e. keeps sending the same address. If the address matches, the Slave sends an acknowledgment, after which the Master sends the 8-bit register address.

Note: The code currently does nothing if register address matching fails. This might be updated later.

After the register address matches, the Slave sends an acknowledgment bit and then the data associated with that register address. The transfer ends with a Stop bit.

Note: I still have some doubts about the Start bit and the Stop bit.

To get the code:
1. Disable Popups
2. Click Here to get the Master code
3. Click Here to get the Slave code

Master Code Description:

Lines 1-4: I assume everyone knows what these are!!
Line 6: Register address from which the data has to be retrieved.
Line 7: SCL flag, decides whether SCL is 1 or 0.
Line 8: Slave_address, 7 bits. It contains the address of the Slave to communicate with.
Line 9: Start_Flag, decides the start condition of I2C.
Line 10: Status, signals the stop condition.

Line 11: Direction, chooses whether SDA is driven by the Master or received from the Slave.
Line 12: sda_val, the value assigned to it is sent through the SDA line.
Line 13:
Line 14: Contains the slave address along with the Read bit (hardcoded to 1).
Line 15: Count, counts the clock.
Line 16: Keeps a backup copy of the Slave address.
Line 17: Stores the Acknowledgement bit.
Line 18: Stores the data from the Slave.
Line 19: Used for START signaling.
Line 20: Used for START signaling.

Lines 39-42: On a falling edge of SDA, check whether SCL is 1; if so, set A to 1.
Lines 46-50: On a rising edge of SDA, if SCL is 1, set B to 1.
Line 52: Set the start flag to the XOR of A and B.
Lines 54-58: On the rising clock edge, if the start flag is 1, complement SCL (SCL clocking begins); otherwise set SCL to 1.

Lines 62-92: Check whether SCL is 0, since SDA may only change while SCL is low. Then check the start flag; if it is 1, transmission begins. In the first 8 counts, the 7-bit address and the 1-bit read bit are sent to the Slave. SDA is an inout port, which must be a wire, so it can only be driven through a tri-state buffer. At count 9, direction is set to 0, which sets the SDA line to 1'bZ, so the Master can accept the acknowledgment from the Slave. At count 10, direction is set back to 1, which means the Master sends data again. The acknowledgment is stored. If it is 1, meaning the address is correct, the Master sends the register address from which it wants data. If the acknowledgment bit is 0, count is set to 1 and the Slave address is retransmitted; this keeps repeating. At count 18, direction is changed again to receive the acknowledgment of the register address. If it is 1, the transfer proceeds; otherwise NOTHING HAPPENS, as I HAVE NOT YET ADDED CODE FOR THE CASE WHERE THE REGISTER ADDRESS FAILS.
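The direction-controlled SDA line and the first byte (7-bit Slave address plus the READ bit) can be sketched as follows. The module and signal names and the `bit_index` counter are hypothetical. Note that the I2C specification uses an open-drain SDA line and signals ACK by the receiver pulling SDA low; like the design described here, this sketch instead drives both levels while `direction` is 1 and simply returns whatever the Slave puts on the line.

```verilog
// Hypothetical sketch: SDA driven through a tri-state buffer, sending the
// address byte {slave_address, READ} MSB first.
module i2c_sda_driver_sketch (
  inout  wire       sda,
  input  wire       direction,      // 1 = Master drives SDA, 0 = release (1'bz)
  input  wire [6:0] slave_address,
  input  wire [2:0] bit_index,      // 0..7: which bit of the byte is on SDA
  output wire       sda_in          // value seen on the line (e.g. the Slave's ACK)
);
  wire [7:0] addr_rw = {slave_address, 1'b1};  // R/W bit hardcoded to 1 (READ)
  wire       sda_val = addr_rw[7 - bit_index]; // MSB first

  assign sda    = direction ? sda_val : 1'bz;  // tri-state buffer
  assign sda_in = sda;
endmodule
```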

Line 93-97: This will store the incoming data from Slave
Line 98-101: Acknowledgment informing the Slave about a successful transaction.
Line 102-105: Set SDA to 0 and status to 1
Line 108-110: If Status is 1, set SDA to 1.

Since SDA always changes while SCL is 0, the condition in Lines 46-49 never comes true during a normal transfer. However, in Lines 108-109 we deliberately set SDA to 1 while SCL is 1. This makes Lines 46-49 execute, which sets B to 1. With B set to 1, the start flag becomes 0 in Line 52 (A XOR B = 1 XOR 1 = 0). As the start flag is 0, Line 58 executes and sets SCL permanently to 1. Thus the I2C connection is closed.

RTL Schematic for RTL Lovers

For newbies, this is the hierarchy

Test.v is a Verilog Test Fixture.
Master.v is a Verilog source file.
Slave.v is a Verilog source file.

Output (Slave Address Matched):

Output (Slave Address Mis-Matched):

Comment !!