ESP32 Cam programming without Adapters!


How to program ESP32cam with an Arduino UNO?


I recently bought an ESP32cam module from Banggood for a very low price of just $4. Well, that was an exciting deal and I knew it was certainly calling me to buy it. I have always wanted a compact, portable streaming module with fewer wires. Using the OV7650 module with an Arduino is a mess and damn! those wires look like they are inviting a spider into their mesh.

Here, have a gaze at the beautiful ESP32cam with Wifi and a low-quality camera.

Now run to your hobby store and grab an Arduino Uno. Yes, I didn't want to spend extra bucks on an FTDI programmer. Better buy the Uno version with a swappable IC, as we have to remove the Atmel IC from it or else our programming will be a mess.

Open the Arduino IDE. Click File and go to Preferences. Then, in the text area in the lower part of the dialogue box labelled Additional Boards Manager URLs, paste the following:
https://dl.espressif.com/dl/package_esp32_index.json

Here is a screenshot of that.


Once done, click Tools > Board > Boards Manager and search for "esp32". Click on the latest version to install and wait for an eternity if your connection is poor. My version of ESP32 for this post was 1.0.2.



Once completed with the above installation, we can move ahead with the circuit section. Follow the below circuit "strictly". I'll explain the reason of being strict here. Remember to remove your Atmel IC from Arduino UNO.




After completing the circuit, head to File > Examples > ESP32 > Camera > CameraWebServer. Everybody's ESP32 can be different. I chose the AI Thinker model. To find out your model, check the back of the metal casing of the ESP32.

In the code, add the SSID of the wifi you use with its password. ESP will connect with this wifi to stream video. Now connect your Arduino with the cable and check for the following settings.


  • Board: ESP32 Wrover Module
  • Flash Mode: QIO
  • Flash Frequency: 40MHz
  • Partition Scheme: Huge App (3MB No OTA)
  • Upload Speed: 115200
  • Core Debug Level: None
  • Port: <Select the port connected to Arduino>
  • Programmer: AVR ISP


Now HIT it!!! and upload the program. The window will show the program burning onto ESP as per below screenshot


Keep scrolling until you see 100%. A message will get printed on the console saying "Leaving... Hard resetting via RTS pin..." (circled in the above image). At this moment, disconnect the IO-0 pin of the ESP32cam and then open the serial monitor with the baud rate set to 115200. After opening the serial monitor, press the RESET button on your ESP32cam. A message will appear as per the below image.


You will get the IP address of your WebStream. Copy the IP address and paste it on the address bar of any browser.

Possible Error:
After resetting, you get a series of dots and the message "WiFi connected" is never displayed. In this case, change the WiFi router the ESP tries to connect to. This is mostly because of a security issue. I use my mobile's hotspot. Avoid iPhones here.

Possible Error:
Brownout detector issue.
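In the setup of your code, paste the following as the first statement:

void setup() {
  WRITE_PERI_REG(RTC_CNTL_BROWN_OUT_REG, 0); // disable the brownout detector
  // ... the rest of the existing setup code stays here
}

At the top of the sketch, include the below headers:

#include "soc/soc.h"
#include "soc/rtc_cntl_reg.h"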

If successful, you would get a window with the IP address like this below.


Enjoy your stream !!.

Verilog Code for 16 Bit MIPS Pipelined Processor


Hello everyone,

Long time no see. I was actually very busy with my job schedule and was also working on the pipeline code. Well, I have successfully completed the pipelined version of the processor. I was working on the 32-bit version but sadly it got corrupted, so I was forced to work on the 16-bit version, which for some reason I don't like.

What is a pipelined processor?

Below is the processor in action. Be careful which data lines you choose.




This is the datapath of the 5-stage processor. I might have missed some wiring. Do comment if a genius mind finds something in the RTL that differs from the below datapath. If I find an error, I will update it myself.



Pipelining is a methodology which helps us to process instructions in parallel, and each stage is passed only the information required for the instruction it is currently handling. Remember that without pipelining, when we set the "regwrt" signal to 1, it remained 1 until the write operation was complete. In pipelining it's the opposite: each pipeline register carries the signals required by its instruction on to the next stage.
The IFID pipe carries the signals and data required for the ID stage. At the same moment, the IDEXE pipe carries the signals and data required for the EXE stage. Similarly, the EXEMEM pipe carries the signals and data required for the MEM stage.

The pipelined processor helps to execute multiple instructions at a time. For instance, if an instruction is in EXE stage, another instruction will be in ID stage while another would be in MEM stage and further on. 
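
To picture this overlap, here is a small Python sketch. It is not part of the Verilog project; the function name pipeline_schedule and the stage labels are my own assumptions. It lists which instruction sits in each of the five stages in every clock cycle, for the ideal case with no hazards, stalls or flushes.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]


def pipeline_schedule(instructions):
    """Return one dict per clock cycle mapping stage name -> instruction.

    Ideal case only: instruction i enters IF in cycle i and advances one
    stage per cycle (no hazards, no stalls, no flushes).
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            instr_index = cycle - stage_index
            if 0 <= instr_index < len(instructions):
                row[stage] = instructions[instr_index]
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    program = ["ADD", "SUB", "AND", "OR"]
    for cycle, row in enumerate(pipeline_schedule(program), start=1):
        print(cycle, row)
```

In the third cycle of this model the first instruction is in EXE, the second in ID and the third in IF, which is exactly the situation that creates the hazards discussed next.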

However, pipelining is not that simple. Hazards are associated with pipelining.

a. Data Hazards: For example,

ADD $1 $2 $3  //It will add the contents of $1 and $2 and store the result in $3
SUB $3 $5 $6  //It will subtract the contents of $3 and $5 and store the result in $6

Now while the ADD instruction is in EXE stage, SUB will be in ID stage. Being in EXE stage, the output of ADD has not yet been written into $3 which means SUB will read wrong or better say old value of $3 from ID stage. This is what we call Data Hazard. We cannot move backward through a pipeline. All pipelines move in only 1 direction. 

b. Control Hazards: Now we come down to the branch instruction. Branch instructions depend mainly on the zero flag from the ALU. To get the value of the ZERO flag, the instruction must be in the EXE stage. Since we are in pipelined mode, another instruction will already be in the ID stage and yet another in the IF stage. This is NOT what we want. If the branch turns out to be taken, the two instructions already in the pipeline must not be allowed to proceed. This is what we call a control hazard.

c. Structural Hazards: This hazard arises when the hardware cannot support what we want. You can't read and write to a register simultaneously.

So how do we resolve data hazards?
a. We can stall the pipeline. I love stalling for no reason though not considered good for a good processor. Stalling pipeline means to stop forwarding the instructions through staging until a required condition is met. Well, we require a data hazard unit to make the processor detect the hazard and then stall the pipeline.

b. Forwarding: Yes, I adopted this technique even though I wanted to use the stall technique. In this technique, we move the data output from the EXE stage back to the ID stage to overwrite the old value. This might seem to contradict what I said earlier, that we can't move back in the pipeline. Well, it still holds true, because forwarded data does not travel through the pipeline registers. It travels through separate wiring. It will be discussed further. So relax.

c. Scheduling: We can schedule instructions either via compiler or via hardware. We are currently not into it right now.

If instruction A+1 depends on instruction A, we need 3 stalls, but only if ID/EX.WriteReg equals one of the read registers ($1 or $2) held in IF/ID. However, this check is difficult because at that point we are not yet sure which register is the destination. If instruction A+2 depends on A, we need 2 stalls, but only if EX/MEM.WriteReg equals one of the IF/ID read registers. If instruction A+3 depends on A, we need 1 bubble, but only if MEM/WB.WriteReg equals one of the IF/ID read registers. The ID stage and the IFID pipeline register must be frozen at the same time. Note that stalls stop instructions in the ID stage, so we need control lines to send a NOP command, i.e. create bubbles. This can be done by setting all control lines that are passed on from ID to 0, which creates a NOP, while new instruction fetches are held back.
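
The three checks above can be written down as a tiny Python model. This is only a sketch of the rule as stated here, with no forwarding; the function name stalls_needed and the (reg_write, write_reg) tuple layout are assumptions of mine and do not appear in the Verilog code.

```python
def stalls_needed(read_regs, id_ex, ex_mem, mem_wb):
    """Number of bubbles the instruction in ID needs when there is no forwarding.

    read_regs: source register numbers of the instruction held in IF/ID.
    id_ex, ex_mem, mem_wb: (reg_write, write_reg) pairs held in the
    ID/EX, EX/MEM and MEM/WB pipeline registers.
    """
    # Check the closest producer first: it forces the longest wait.
    for (reg_write, write_reg), stalls in ((id_ex, 3), (ex_mem, 2), (mem_wb, 1)):
        if reg_write and write_reg in read_regs:
            return stalls
    return 0


if __name__ == "__main__":
    # SUB reads $3 and $5 while ADD (writing $3) is one stage ahead of it.
    print(stalls_needed({3, 5}, (True, 3), (False, 0), (False, 0)))
```

A producer that does not write a register (reg_write is False) never causes a stall, even if its register number happens to match.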

The code for every module is different when compared with the non-pipelined code. In the non-pipelined code, the data flowing out of a stage was sequential, i.e. clock dependent: it only moved forward inside an always @(negedge clk) block. In the pipelined stages it is combinational and flows out through an always @(*) block (exceptions exist). The pipeline registers, however, are clocked. Never pass any signal through a pipeline register without a clock signal.

Working of Forwarding:

We all know that a dependency occurs when the previous instruction wants to write to a register which is required by the next instruction. So when the first instruction is in the EXE stage, the next instruction will be in the ID stage. Since instruction 1 will write only in the WB stage, this poses a problem for us, or better say a hazard. So first we need to know whether the 1st instruction wants to write at all. It is unnecessary to stall or forward when nothing is written. So the first condition is:
   if(MEM_regwrt==1)

Similarly, if Write_Register is equal to any of the source registers(A_Reg or B_Reg) from the next instruction then forwarding occurs. So our condition changes to:

  if(MEM_regwrt==1 && MEM_W_Reg==A_Reg)
         ForwardA = 1;
  else
         ForwardA = 0;

Similarly, for the second register, our code will be:

  if(MEM_regwrt==1 && MEM_W_Reg==B_Reg)
        ForwardB = 1;
 else
        ForwardB = 0;

ForwardA and ForwardB are the select signals for the multiplexors ForA and ForB. What if the dependency is one instruction further apart, i.e. instruction I + 2 depends on instruction I? Then instruction I will be in the WB stage while I + 2 will be in the EXE stage. For that, we will have the following condition:

if (WB_regwrt==1 && WB_W_Reg==A_Reg && (MEM_W_Reg != A_Reg || MEM_regwrt==0))
     ForwardA = 2;
else
     ForwardA = 0;

if(WB_regwrt==1 && WB_W_Reg==B_Reg  &&( MEM_W_Reg != B_Reg || MEM_regwrt==0))
     ForwardB = 2;
else
     ForwardB = 0;
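
The conditions above are shown as separate snippets. If they were simply placed one after another, the later else branch would overwrite an earlier ForwardA = 1. The following Python sketch therefore assumes a single priority chain per operand, with MEM checked before WB. The function name forward_select is hypothetical; the argument names follow the signal names used above.

```python
def forward_select(src_reg, mem_regwrt, mem_w_reg, wb_regwrt, wb_w_reg):
    """Mux select for one ALU operand.

    0 = use the register file value
    1 = forward the result from the MEM stage
    2 = forward the result from the WB stage
    """
    if mem_regwrt and mem_w_reg == src_reg:
        return 1  # the newest result wins
    if wb_regwrt and wb_w_reg == src_reg:
        return 2  # only reached when MEM does not supply this register
    return 0


if __name__ == "__main__":
    # Both MEM and WB want to write register 3: take the MEM value.
    print(forward_select(3, 1, 3, 1, 3))
```

Checking MEM first is equivalent to the extra term (MEM_W_Reg != A_Reg || MEM_regwrt==0) in the WB condition: the WB value is used only when the MEM stage is not about to write the same register.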

Enough theory, for now, let us visualize it in step diagram.

Credits: courses.cs.washington.edu
The 5th cycle register writeback is required for other instructions. This is where we require forwarding.

Another example here below shows how instructions are dependent on each other which is mostly the case.

Credits:courses.cs.washington.edu
The AND instruction requires the value of $2 from SUB instruction which will write the result of $1 and $3 in $2. Similarly OR instruction is dependent on the SUB instruction. We do have an option to stall but that would halt the entire pipeline which we wouldn't want. After all most of the instructions in the real world have tons of dependencies.

Stalling can be easily achieved. All one has to do is freeze the PC and IF/ID pipeline register. This would continue the previous instruction for another clock cycle. During the stall condition, we provide NOP opcode which is 0000. I am currently doing the same for Flush where we erase a pipeline data.

Flushing:

If one looks carefully at the datapath diagram, I have included a comparator. This comparator just compares the register values and sends the signal to the Control Unit. The Control Unit looks at the opcode and then at the signal and decides whether to flush the pipeline or not. With a BEQ instruction, the Control Unit would not know whether to flush or not until it reads the Zero flag. To read that flag it would have to wait for 2 clock cycles, which is another pain for us. I tried this methodology a lot but I was unable to work out the logic which would tell the Control Unit to stall or flush the pipeline. Do not confuse stall and flush. A stall is like stopping the flow: what was flowing before will continue to flow for another n clock cycles. It is like a clock enable or disable. Flushing is like erasing the pipeline contents. It doesn't halt the previous instruction. It just erases the unnecessary instruction. This is done by disabling writes to all components for the flushed instruction, since the data in the pipeline is invalid.
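
To make the difference between stall and flush concrete, here is a toy Python model of a single pipeline register. The class name PipelineRegister is my own, and giving flush priority over stall is an assumption of this sketch, not something taken from the Verilog code.

```python
class PipelineRegister:
    """Toy model of one pipeline register (for example IF/ID)."""

    NOP = 0b0000  # the post uses opcode 0000 as NOP

    def __init__(self):
        self.value = self.NOP

    def clock(self, next_value, stall=False, flush=False):
        """Apply one clock edge and return the value now stored."""
        if flush:
            # Erase: the unwanted instruction is replaced by a NOP.
            self.value = self.NOP
        elif not stall:
            # Normal flow: capture the incoming value.
            self.value = next_value
        # Stall: keep the old value, like a disabled clock enable.
        return self.value


if __name__ == "__main__":
    reg = PipelineRegister()
    print(reg.clock(0b0101))              # normal load
    print(reg.clock(0b0111, stall=True))  # old value is held
    print(reg.clock(0b0111, flush=True))  # contents erased
```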

Remember that MIPS was designed to avoid stalls. Although, we are not that clever to simulate and replicate a real-life MIPS but who is stopping us from trying?

Comparator: The comparator in the ID stage will help us to reduce two cycle flush during Branch instruction. The comparator will output 4 bits. 1st bit checks for equality function. 2nd bit checks whether value A is less than value B. 3rd bit checks whether value B is less than the value A. 4th bit checks whether value A is equal to value B. This 4 bit value is sent to the Control Unit which decides whether to flush the instruction or not.

CODES:
IF_STAGE
Verilog Code For Fetch Stage 
Verilog Code For Program Counter 
Verilog Code For Instruction Memory 

ID_STAGE
Verilog Code for Decode Stage
Verilog Code for Control Unit
Verilog Code for Register File
Verilog Code for Adder
Verilog Code for Sign Extend
Verilog Code for Comparator
Verilog Code for Decode Pipeline

EXE_STAGE
Verilog Code for ALU
Verilog Code for ForA and ForB Mux
Verilog Code for Execute Pipeline
Verilog Code for Forwarding Unit

MEM_STAGE
Verilog Code for RAM
Verilog Code for Memory Stage
Verilog Code for Memory Pipeline

WB_STAGE
Verilog Code for Write Back Stage

DATAPATH and TOP MODULE
Verilog Code for Datapath and TestBench/ Top Module


 Confused about something? Feel Free to comment.



To Be Continued...With Code as well

16-bit RISC Processor Verilog Code with Clock Gating

Clock Gating in 16-bit RISC Processor


Clock gating is a technique where we provide a clock signal to a component or module only when it is needed. This saves power because only the logic that is actually in use keeps switching. It is mainly used in synchronous circuits. We can form a simple gating circuit using an AND gate. The output of the AND gate must be connected to the ENABLE port of the individual components of the microcontroller. The first input of the AND gate is the original clock signal generated by the oscillator. The second input is a control signal: when it is 1 the module is turned ON, and when it is 0 the module is turned OFF.


Having an ENABLE port is a must for each of the components. Enable ports can be positive enable or negative enable: 0 turns off components with a positive enable and 1 turns off components with a negative enable. As per Wikipedia, this also helps to save die area on which the circuit is fabricated. After all, saving power is what we need.

This technique is mostly used in low power circuits which are intended to run from a 1.5V battery for a year. We can insert it via behavioral modeling or RTL modeling. However, it is very important to verify the output, as wrong switching of the clock will lead to wrong information and data being passed on. In my previous post, I shared the RISC Processor code without clock gating. That code has no enable ports on any components, which inherently increases the power consumption, if we can assume it virtually. The current code runs with GATING using AND gates. The Control Unit controls the gating signals for all the components and itself receives the clock signal all the time, so the current code still has an ungated clock in the Control Unit. A further update might include some other logic. For now, let us understand this code.

Here is the RTL Schematic for the gating processor.


Here is the screenshot of the gated clock signals of all components.




In the RTL Schematic you will observe that certain components are not connected/wired. This is not an error but an optimization. When an input or output does not change, it is treated as a constant by the synthesis tool. Hence, its wiring is trimmed. In the above RTL, the register file is an example. Now as per my previous post, each state will enable the clock for the next cycle. This is controlled by the Control Unit.

For example, in the below circuit, you can see clock gating.


Now the CU controls each AND gate to provide the clock to each component when needed. Now how our processor will work with GATING is as follows.

IM will switch clock for ID state and will turn off the switching for every other stage. In the ID stage, we will switch the clock for ALU stage and turn it off for other stages. When in the ALU stage, we switch clock for the MEM stage and turn it off for all other stages. When in the MEM stage, we will switch the clock for the WB stage and turn it off for other stages. When in the WB stage, we will switch clock for IM stage and ADDER stage and turn the clock off for other stages.
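
The enable sequence just described can be modelled in a few lines of Python. This is a sketch of the idea only: the table NEXT_ENABLES and the function gated_clocks are names I made up for illustration, and the table covers only what the paragraph above states.

```python
# Which component clocks each state switches on (per the paragraph above).
NEXT_ENABLES = {
    "IM": {"ID"},
    "ID": {"ALU"},
    "ALU": {"MEM"},
    "MEM": {"WB"},
    "WB": {"IM", "ADDER"},
}
COMPONENTS = ["IM", "ID", "ALU", "MEM", "WB", "ADDER"]


def gated_clocks(clk, state):
    """AND the free-running clock with each component's enable bit."""
    enabled = NEXT_ENABLES[state]
    return {name: clk & (1 if name in enabled else 0) for name in COMPONENTS}


if __name__ == "__main__":
    for state in ["IM", "ID", "ALU", "MEM", "WB"]:
        print(state, gated_clocks(1, state))
```

Each gated clock is simply clk AND enable, so a component whose enable is 0 sees a constant 0 and does not switch.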


Code for 16-bit Gating Microcontroller (Pop Up Warning): Click here for Verilog Code
Below you can see the output of the instructions, not including the BRANCH instructions, as they didn't fit on the screen. I would request users to decode and match the result and try to deduce what is happening in the processor, how the output is being written to the registers etc.


And the beautiful gating clocks are here too.


Observe that after the MEM stage, we have to clock the Register File to perform the write step into the write register. So after the MEM stage, we require clocking of the Register File and the Adder too.

Here is the power analysis of the above processor with GATING.

As per the analysis, my processor uses only 0.014 Watts. That's way less than the previous processor without GATING, which used 0.089 Watts. I would not rely on this analysis, as I doubt whether it is correct. But under some minor assumptions I guess we can accept it for the moment. Want some explanation about the result above or any other diagram? Comment !!

Verilog Code of 16 Bit RISC Processor with working

Verilog Code for the 16 bit RISC Processor 


Hello Everyone, I know many of you out there have been waiting for the working code for this processor along with RTL Schematic. Well, I have successfully coded the single cycle processor with R format Instruction, I format and Branch instructions too. I'll start with my instruction first. It is 16 bit in length.

16'bxxxx_yyyy_zzzz_qqqq

xxxx indicates the OPCODE which decides the operation which has to be carried out.
yyyy indicates the location of Register 1.
zzzz indicates the location of Register 2.
qqqq indicates the location of Register where data has to be written. It also helps to determine the number of instructions the user wants to jump.
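
As a quick illustration of this layout, here is a short Python sketch that splits a 16-bit instruction word into its four fields. The function name decode and the dictionary keys are my own choices for the example; the bit positions follow the xxxx_yyyy_zzzz_qqqq format above.

```python
def decode(instruction):
    """Split a 16-bit instruction word into its four 4-bit fields."""
    if not 0 <= instruction <= 0xFFFF:
        raise ValueError("instruction must fit in 16 bits")
    return {
        "opcode": (instruction >> 12) & 0xF,  # xxxx
        "reg_a": (instruction >> 8) & 0xF,    # yyyy
        "reg_b": (instruction >> 4) & 0xF,    # zzzz
        "w_reg": instruction & 0xF,           # qqqq (also the branch offset)
    }


if __name__ == "__main__":
    print(decode(0b1010_0000_1111_0011))
```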

The code I have devised takes 5 clock cycles to execute a single instruction. For BRANCH format, I stall the processor for 1 clock cycle. So, to make it clear, we have 5 stages here,

1. Instruction Fetch Stage
2. Instruction Decode Stage
3. Arithmetic Stage
4. Memory Stage
5. Write Back Stage

In the first stage, we extract the instruction, which contains the opcode and register addresses, from the Instruction Memory. In the Instruction Decode stage, the register file receives the register addresses and then extracts the data from each register, which has to be sent to the ALU. In the Arithmetic stage, the ALU receives the opcode from the control unit and performs the arithmetic operation according to the opcode. After the Arithmetic stage comes the Memory stage.
It depends on two signals: one is read and the other is write. If the read signal is activated, it reads the value stored at the address received from the Arithmetic stage and outputs it. If the write signal is activated, it writes the Register B data to the address received from the Arithmetic stage. Finally comes the Write Back stage. This writes the data into the write register in the register file.

Don't consider these stages as pipeline stages for now. Pipelining is easy and I will upload it in the coming days.

Here is the datapath of my version of the single cycle processor.


RTL Schematic Lovers!! Here you go:


Do remember one thing. Ideally, one shouldn't have any output other than display. Our data path has output as many "Internet Guys" have asked me to do so without learning how to do so. 


Program Counter:

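module PC(in,pc_select,clk,out);
input [7:0]in;          // current instruction address
input clk,pc_select;    // pc_select is driven by the Control Unit
output reg [7:0]out;    // address sent to the Instruction Memory
reg [8:0]temp;          // one extra bit to hold the carry
initial begin
  out = 0;
end
// Advance by 2 bytes (one 16-bit instruction) on every rising edge of pc_select
always @(posedge pc_select)begin
  temp = in + 2;
  out = temp[7:0];
end
endmodule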

The Program Counter is used to send the instruction location to the Instruction Memory (IM), from where the IM will pick the instruction to work upon. The logic is to increment the counter by 2 units at every posedge of the pc_select signal, i.e. whenever it goes high. This signal is being driven by the control unit. I have used temp as a temporary variable to store the carry if the bit length gets exceeded. Now the question arises, why 2 units and not 1? Follow the below example.

Let us take a 16-bit instruction, 1010_0000_1111_0011. Each memory location holds 1 byte, i.e. 8 bits. One location will store 1010_0000 and the next location will store 1111_0011. So in the instruction memory, the instruction is stored as

Imem[1] = 8'b1010_0000
Imem[2] = 8'b1111_0011

The 16-bit instruction therefore consists of two 8-bit halves, i.e. 2 bytes. It is read out of the instruction memory as {Imem[1], Imem[2]}, which is 16'b1010_0000_1111_0011. The second instruction then starts at Imem[3], so the 2nd instruction will be {Imem[3], Imem[4]}. So a careful observation will tell that new instructions begin at 1, 3, 5, 7, 9......

Hence, 
Imem[in] = Imem[1]
Imem[in + 2] = Imem[3]

I hope, everything is clear now.
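
If it helps, the same byte addressing can be tried out in a few lines of Python. The helper names fetch and next_pc are hypothetical and exist only for this example; the memory contents are the two bytes from the example above, stored at locations 1 and 2.

```python
def fetch(imem, address):
    """Join two consecutive bytes into one 16-bit word: {imem[a], imem[a+1]}."""
    return (imem[address] << 8) | imem[address + 1]


def next_pc(address):
    """Advance by one instruction (2 bytes), wrapping like the 8-bit PC output."""
    return (address + 2) & 0xFF


if __name__ == "__main__":
    imem = [0b0000_0000, 0b1010_0000, 0b1111_0011]
    print(format(fetch(imem, 1), "016b"))
    print(next_pc(1))
```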

Instruction Memory

module INSTRUCTION_MEMORY(address,clk,opcode,A_reg,B_reg,W_reg,Sign);
input [7:0]address;
input clk;
reg [3:0]dest;
output reg [3:0]opcode;
output reg [3:0]A_reg;
output reg [3:0]B_reg;
output reg [3:0]W_reg;
output reg [3:0]Sign;
reg [7:0] imem[0:17];
reg [15:0] instruction;

// Each 16-bit instruction occupies two consecutive bytes
initial begin
  imem[0]<=8'b0001_0011;
  imem[1]<=8'b0111_0010;

  imem[2]<=8'b0010_0010;
  imem[3]<=8'b0001_0011;

  imem[4]<=8'b0011_0100;
  imem[5]<=8'b0010_0011;

  imem[6]<=8'b0100_0000;
  imem[7]<=8'b0001_0010;

  imem[8]<=8'b0101_0111;
  imem[9]<=8'b0010_0010;

  imem[10]<=8'b0110_0010;
  imem[11]<=8'b0001_0010;

  imem[12]<=8'b0111_0001;
  imem[13]<=8'b0001_0011;

  imem[14]<=8'b1000_0110;
  imem[15]<=8'b0001_0011;

  imem[16]<=8'b1001_0001;
  imem[17]<=8'b0011_0001;
end

// Read two bytes and split the instruction into its four 4-bit fields
always @(negedge clk)begin
  instruction = {imem[address],imem[address+1]};
  opcode = instruction[15:12];
  A_reg = instruction[11:8];
  B_reg = instruction[7:4];
  W_reg = instruction[3:0];
  Sign = instruction[3:0];
end

endmodule


Instruction memory will output the opcode to Control Unit, Register 1 address, Register 2 address, Write register address and Branch index. Opcode is of 4 bits, same for Register 1 and 2 and for Write register and Branch index.

Register File

module REGISTER_FILE(clk,readA,readB,dest,data,reg_wrt,readA_out,readB_out);
input reg_wrt;
input [3:0]readA,readB,dest;
input [7:0]data;
input clk;
reg [7:0] Register [0:15];
initial begin
Register[0]=0;//R0 always contains zero
Register[1]=2;  //Random values stored
Register[2]=4;
Register[3]=6;
Register[4]=8;
Register[5]=10; // You can change any value within this initial block
Register[6]=12;
Register[7]=14;
end
output reg [7:0]readA_out,readB_out;
always @(negedge clk)begin
readA_out <= Register[readA];
readB_out <= Register[readB];
if(reg_wrt==1)
Register[dest]<=data;
end

endmodule





The register file will get the input from Instruction Memory with register address to read from and write register to write data into. Whenever the reg_wrt flag is high (from Control Unit), the incoming data will get stored in Write register. In the initial begin section, I have stored manually some values. Here "readA" is the address of register received from Instruction Memory. A similar condition is for readB.

Arithmetic Logical Unit.

The ALU receives the opcode from the Control Unit, which dictates which operation it has to perform. Currently, I have included add, subtract, increment, decrement and logical operations along with Branch instructions. For every opcode, we store the carry bit and the zero bit, which are used later for branching. For the BEQ function, if Reg_1 is equal to Reg_2 then z will turn to 1. The opposite is the case for the BNE function. For BLT (Branch if Less Than), if Reg_1 is less than Reg_2, the carry bit goes high. The opposite is the case for BGT (Branch if Greater Than). I'll remove some functions from the ISA, like logical OR and NAND, to put in ADDI and SUBI functions later.



Data Memory

Nothing special about this module. It has two signals, "re" for read and "we" for write. When "re" is active, it reads from the location received from the ALU. When "we" is active, it writes the Reg_2 data received from the Register file to the address received from the ALU.



Mux_1, Mux_2, and Mux_3 decide the flow of data. Mux_1 decides the write register between R format and I format. Mux_2 decides the data that has to be forwarded to the ALU between load, store functions, and R format functions. Mux_3 decides whether to take the output from Data Memory or just carry forward the result from the ALU.

Mux_4, Mux_5, Mux_6 and Mux_7 decide how the branch functions work. Each receives two addresses: the first port receives the new address from the normal execution of the PC, while the second port receives the new jump address from the adder. Mux_4 is for BEQ, Mux_5 is for BNE, Mux_6 is for BLT and Mux_7 is for BGT. Mux_8 decides which data from the Muxes will move ahead to the PC.

Why Sign Extend and Left Shift?

Coming to Sign Extend and shift left 1. Whenever a user gives a BRANCH instruction, he/she does not specify the IM address of the jump target. Instead, the instruction tells us how many instructions to jump. Let us take some instructions as specified below

imem[0]<=8'b0100_0011;
imem[1]<=8'b0111_0010;

imem[2]<=8'b0101_0010;
imem[3]<=8'b0001_0011;

imem[4]<=8'b0011_0100;
imem[5]<=8'b0010_0011;

imem[6]<=8'b0100_0000;
imem[7]<=8'b0001_0010;

imem[8]<=8'b0101_0111;
imem[9]<=8'b0010_0010;

imem[10]<=8'b0110_0010;
imem[11]<=8'b0001_0010;

imem[12]<=8'b0111_0001;
imem[13]<=8'b0001_0011;

Now the user wants to jump from imem[0] by 2 locations. He/ She will put out an instruction:
BRANCH $r1 $r2 2.
Don't confuse with PC code when I had incremented with 2 units. The user doesn't know anything about inside architecture. If He/ She wants to jump by 4 locations it will be as follows 0 --> 2 --> 4 --> 6 --> 8

Now the input to sign extend will be 0010 from imem[1], since 0010 is instruction[3:0]. Sign extension means replicating the MSB into the bit positions above the MSB. Hence, the extended data will be 00000010. Shifting this left by 1 bit gives us 00000100. Now add this to the "current" address, which is imem[0], i.e. 0. The result will be 8'b00000000 + 8'b00000100 = 8'b00000100, which is 4. This 4 will be fed to the PC and thus the new instruction will be fetched from imem[4]. It has successfully jumped by 2 locations.

Do the same for imem[4]. instruction[3:0] shows that the user wants to jump by 3 instructions. Forget the opcode for now, as this is intended just as an example. Here the sign extended value will be 00000011 and after the left shift it will be 00000110. Adding this to the "current" location, i.e. 4, gives us 00000100 + 00000110 = 00001010, which is 10. Thus, our new instruction will be at imem[10].
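
The two worked examples above can be reproduced with a short Python sketch. The function names sign_extend_4_to_8 and branch_target are hypothetical, and the sketch assumes 8-bit addresses that wrap around, as in the PC module.

```python
def sign_extend_4_to_8(value):
    """Copy bit 3 of a 4-bit field into bits 4..7."""
    value &= 0xF
    return (value | 0xF0) if (value & 0x8) else value


def branch_target(current_address, offset_field):
    """current_address + (sign_extend(offset) << 1), kept to 8 bits."""
    offset = (sign_extend_4_to_8(offset_field) << 1) & 0xFF
    return (current_address + offset) & 0xFF


if __name__ == "__main__":
    print(branch_target(0, 0b0010))  # first example: jump from imem[0]
    print(branch_target(4, 0b0011))  # second example: jump from imem[4]
```

A field with the MSB set, such as 1110, is extended to 11111110 and therefore acts as a negative offset, i.e. a backward jump.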

For a 32-bit processor, we will have to shift by 2 bits. Still confused, how we are jumping? Comment and I'll solve your doubt. 


Control Unit

The Control Unit is what I would call the heart of this processor. It is its responsibility to switch the Muxes and signals at the right stages in order to give the correct output. I have divided the stages discussed above into 5 states.

1. Instruction Fetch---- s1
2. Instruction Decode--s2
3. Arithmetic Stage----s3
4. Memory Stage ------s4
5. Write Back Stage----s5

While working on the Control Unit, one must remember that, although every module has been provided with a clock, it does not mean, they will execute simultaneously with thorough data. For example, IF stage will take 1 CC to move data from input to output. ID stage will only be able to work on input when the IF stage sends the data on its output. Thus the 1st CC for ID is wasted and at the second CC, it works on the instruction received and itself takes 1 CC to produce its own output. In my code, s0 state is the initial state, means rest stage. From s0, it will move to s1 and then will keep rotating to s5 and back to s1.
Now in s0 stage, we will decide the signals required to prepare for stage s1. Similarly, when in stage s1, we will switch those signals which will be required for stage s2 and so on. At the end stage, s5, we will change signals which will be required for stage s1 / IF stage. So for the IF stage, we require pc_select signal to be set as 1, so a new instruction location can be sent to the IM. Note that you cannot switch the signal, when you are already in the state i.e. if the processor is in state s0, it cannot set pc_select at that state because it will get activated in the next clock cycle. So after the next CC, our processor will be in ID state with pc_select set as 1 which is useless or basically say a big error. 

Why? 

It is because pc_select will signal the IM (Instruction Memory) to release a new instruction while the previous instruction is still in the ID (Instruction Decode) stage. At this moment, the destination register will change to the register address included in the new instruction. This will cause an error, as we wanted to write our previous instruction's data to one register but now it will get written to a different one. Hence, one can also conclude that one has to hold pc_select, i.e. the new instruction fetch, until the longest instruction has completed all of its states/stages.

For a better understanding, have a look at the below image to understand.


Black lines indicate processing of instruction A and green lines indicate processing of instruction B. One must be wondering that the WB stage of A instruction and IF of B instruction is happening at the same time. Well in that CC(clock cycle) we are fetching instruction for B. Until that CC completes, the output will not be available for the register file. While the previous instruction is ready to write at the write register. Overriding the write register address will take place at the next CC. i.e. the ID stage i.e. the 2nd CC for B signal. The IM has sent the address to Register File but the Register File will override the incoming register addresses until next CC. So at 6th CC, this will take place while the WB for A instruction will take place at the 5th CC.

For an overview explanation of the control unit again, all one has to remember is to always put out those signals which you want to activate for the next state. For example, state s0 will contain the switch of signals which will be required for s1 state. Similarly, s1 will contain signals that are required for s2. s2 state will contain signals which are required for s3. Similarly, s3 will control signals for s4 and s4 will control signals for s5 and s5 will control back for s1. Hence, when the processor is in the WB state, it will contain this piece of code.

pc_select <= 1;

which means that as soon as the WB state (s5) completes its execution, the pc_select line is pulled high, so that the next state, s1, has everything it needs to work. Suppose you are hungry: won't you want your mother to keep food ready at home as soon as you arrive, rather than having to cook after reaching home? Similarly, each state prepares food (switches the appropriate signals) for the next state, so as soon as the next state (you) arrives, it gets its food (data flowing through the switched signals).
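
Here is a minimal Python sketch of this "prepare for the next state" idea. It models only the state rotation and the pc_select signal; the function names next_state and control_outputs are assumptions of mine and the real Control Unit drives many more signals.

```python
STATES = ["s1", "s2", "s3", "s4", "s5"]  # IF, ID, ALU, MEM, WB


def next_state(state):
    """s0 (rest) goes to s1; afterwards s1..s5 repeat forever."""
    if state == "s0":
        return "s1"
    return STATES[(STATES.index(state) + 1) % len(STATES)]


def control_outputs(state):
    """Signals that a state sets up for the state that follows it.

    pc_select is raised in s0 and s5 because in both cases the next
    state is s1 (instruction fetch).
    """
    return {"pc_select": 1 if next_state(state) == "s1" else 0}


if __name__ == "__main__":
    state = "s0"
    for _ in range(7):
        print(state, control_outputs(state))
        state = next_state(state)
```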

Please comment, if you want a detailed explanation of Control Unit and its code.

The ISA of this processor consists of 15 instructions.


BRANCH Instruction working.

Have a look at those complex multiplexors in the datapath diagram at the top of the post. Well, it depends on the programmer, how he/ she decides to wire his circuits to execute the instructions. Now we have 4 Multiplexors. Mux4 will work for BEQ instruction which is Branch if Equal means if $r1 == $r2 then Branching will start. Mux5 will work for BNE instruction which is Branch if Not Equal means if $r1 is not equal to $r2 then Branching will start. Mux6 will work for BLT instruction which is Branch Less Than means if $r1 is less than $r2 Branching will start. Similarly, Mux7 is for BGT which is Branch Greater Than means if $r1 > $r2 Branching will start. 
The select line for Mux4 (BEQ) is controlled by the output from an AND gate. The inputs of the AND gate (for BEQ) are the Z flag from the ALU and a select line from the Control Unit. So if we want a BEQ instruction, the select line from the Control Unit is pulled HIGH. Now, there are two cases here. Firstly, if $r1 == $r2 then Z will go HIGH. Now the inputs of the AND gate are 1 and 1, hence the output will be 1. Meanwhile, all the other select lines for Mux5, Mux6 and Mux7 will remain low, so the AND gate output will be 0 for those. When we concatenate the AND gate outputs for each Mux signal, we get 1000 for BEQ. Similarly, for BNE, we get 0100. For BLT it is 0010 and finally, for BGT it is 0001. The select line for Mux8, i.e. sel8, is 4 bits long. It is the concatenation of the outputs of the AND gates which control the Muxes. So sel8 <= {sel4,sel5,sel6,sel7}.
Thus the controls for Mux8 will be

1000: in1 ------from Mux4
0100: in2 -------from Mux5
0010: in3 -------from Mux6
0001: in4 -------from Mux7
0000: out ---- from PC. Natural new instruction flow
else: b_out ------- New jump address without dependence on any Flags

For BRN, we need to pull down all the select lines that go to the Muxes via the AND gates. Thus the output of every AND gate will be low, and their concatenation will be 0000. Suppose we want a BEQ instruction but $r1 != $r2. This will put the Z flag to 0. For BEQ only the select line for AND4 will be high to select Mux4. So the AND of 0 (Z) and 1 from the Control Unit will be 0. Thus our output for sel8 will again be 0000, which means: carry on with the normal flow of instructions in sequence and do not jump. To branch directly, all we have to do is pull all the select lines that go to the AND gates high. Then sel8 always contains two 1's, irrespective of the flags, so it matches none of the patterns listed above.
So the question here is, how will pulling all lines high set Mux8 to take b_out (branch_out) as its input? Suppose Z and Carry are equal to 0. Thus ~Z and ~Carry will be 1. These signals go to each of the AND gates, whose other input is already pulled high by the Control Unit (CU). So the O/P for sel8 will be as follows.


AND4 inputs are Z and from CU
AND5 inputs are ~Z and from CU
AND6 inputs are Carry and from CU
AND7 inputs are ~Carry and from CU

Z = 0 Carry = 0 Control Units pulls all signal to 1
AND4: 0 & 1 = 0
AND5: 1 & 1 = 1
AND6: 0 & 1 = 0
AND7: 1 & 1 = 1
Concatenate result: 0101

Z = 0 Carry = 1 Control Units pulls all signal to 1
AND4: 0 & 1 = 0
AND5: 1 & 1 = 1
AND6: 1 & 1 = 1
AND7: 0 & 1 = 0
Concatenate result: 0110

Z = 1 Carry = 0 Control Units pulls all signal to 1
AND4: 1 & 1 = 1
AND5: 0 & 1 = 0
AND6: 0 & 1 = 0
AND7: 1 & 1 = 1
Concatenate result: 1001

Z = 1 Carry = 1 Control Units pulls all signal to 1
AND4: 1 & 1 = 1
AND5: 0 & 1 = 0
AND6: 1 & 1 = 1
AND7: 0 & 1 = 0
Concatenate result: 1010

Thus whenever the select line sel8 will have two 1's then it will jump to the new branch address (b_out).
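
The four cases above can be checked with a small Python model of the AND gates and Mux8. The function names sel8 and mux8_source are mine, and the select value is represented as a 4-character string purely for readability.

```python
def sel8(z, carry, cu_beq, cu_bne, cu_blt, cu_bgt):
    """Concatenate the four AND-gate outputs as {sel4, sel5, sel6, sel7}."""
    not_z, not_carry = 1 - z, 1 - carry
    bits = (z & cu_beq, not_z & cu_bne, carry & cu_blt, not_carry & cu_bgt)
    return "".join(str(bit) for bit in bits)


def mux8_source(select):
    """Pick the next-PC source from the 4-bit select string."""
    one_hot = {"1000": "Mux4", "0100": "Mux5", "0010": "Mux6", "0001": "Mux7"}
    if select == "0000":
        return "PC"  # normal sequential flow
    return one_hot.get(select, "b_out")  # anything else: jump to b_out


if __name__ == "__main__":
    # Control Unit pulls all four select lines high for every flag combination.
    for z in (0, 1):
        for carry in (0, 1):
            select = sel8(z, carry, 1, 1, 1, 1)
            print(z, carry, select, mux8_source(select))
```

With all four Control Unit lines high, exactly one of Z and ~Z and exactly one of Carry and ~Carry is 1, which is why the result always has two 1's.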

Confused about something?
Comment!!


To get the code:

1. Disable your Pop-Ups
Upon clicking on the below link, there will be 4 pop up tabs which contain code.
Click here for Verilog Code