UNIT 5

CISC, RISC, Basic MIPS Implementation, Pipelining, Instruction-level Parallelism, Parallel Processing Challenges, Flynn's Classification, Hardware Multi-threading, Multicore Processing

What are RISC and CISC Architectures?


Hardware designers use numerous technologies and techniques to implement a processor architecture that meets the needs of the application. A hardware architecture may lean toward hardware-centric or software-centric solutions, and in practice both are used in the required proportion. As far as the processor hardware is concerned, there are two approaches to implementing the processor architecture: one is RISC and the other is CISC.

CISC Architecture

The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction. Computers based on the CISC architecture are designed to decrease memory cost: large programs need more storage, which increases the memory cost, and large memories are expensive. To address this, the number of instructions per program is reduced by embedding several operations in a single instruction, thereby making the instructions more complex.


In CISC, a single MUL instruction loads two values from memory into separate registers, multiplies them, and stores the result back in memory. CISC thus uses the minimum possible number of instructions by implementing complex operations directly in hardware.
The Instruction Set Architecture (ISA) is the medium that permits communication between the programmer and the hardware. User commands such as executing, copying, deleting, or editing data are carried out by the microprocessor through its instruction set architecture.
The main keywords used in the above Instruction Set Architecture are as below.

Instruction Set: The group of instructions given to the processor to execute a program; they direct the computer by manipulating data. Instructions have the form opcode (operational code) and operand, where the opcode specifies the operation to be performed (load, store, etc.) and the operand specifies the register or memory location to which the instruction applies.

Addressing Modes: Addressing modes are the manner in which data is accessed. Depending upon the type of instruction, addressing modes are of various types, such as direct mode, where the data itself is accessed, or indirect mode, where the location of the data is accessed. Processors having an identical ISA may be very different in organization, and even processors with an identical ISA and nearly identical organization can still differ in their implementations.

CPU performance is given by the fundamental law

CPU Time = Instruction Count × CPI × Clock Cycle Time

Thus, CPU performance is dependent upon Instruction Count, CPI (Cycles per instruction) and Clock cycle time. And all three are affected by the instruction set architecture.
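For example (with illustrative, assumed numbers): a program of 1,000,000 instructions running at an average CPI of 2 on a clock with a 2 ns cycle time takes 1,000,000 × 2 × 2 ns = 4 ms of CPU time; halving either the CPI or the cycle time halves the CPU time.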

This underlines the importance of the instruction set architecture. There are two prevalent instruction set architecture philosophies: CISC and RISC.

Examples of CISC PROCESSORS

IBM 370/168 – Introduced in 1970, this CISC design is a 32-bit processor with four 64-bit floating-point registers.
VAX 11/780 – A CISC design from Digital Equipment Corporation; a 32-bit processor that supports a large number of addressing modes and machine instructions.
Intel 80486 – Launched in 1989, this CISC processor has instruction lengths varying from 1 to 11 bytes and provides 235 instructions.

CHARACTERISTICS OF CISC ARCHITECTURE

(a). Instruction-decoding logic is complex.
(b). One instruction is required to support multiple addressing modes.
(c). Less chip space is needed for general-purpose registers because instructions can operate directly on memory.
(d). Various CISC designs set up two special registers for the stack pointer, handling interrupts, etc.
(e). MUL is referred to as a “complex instruction”; it does not require the programmer to code the load and store operations explicitly.

RISC Architecture

RISC (Reduced Instruction Set Computer) is used in portable devices due to its power efficiency, for example in the Apple iPod and the Nintendo DS. RISC is a type of microprocessor architecture that uses a highly optimized set of instructions. RISC does the opposite of CISC: it reduces the cycles per instruction at the cost of the number of instructions per program. Pipelining is one of the unique features of RISC; it is performed by overlapping the execution of several instructions in a pipeline fashion, giving RISC a high performance advantage over CISC.


RISC ARCHITECTURE CHARACTERISTICS

(a). Simple instructions are used in the RISC architecture.
(b). RISC supports a few simple data types and synthesizes complex data types from them.
(c). RISC utilizes simple addressing modes and fixed-length instructions for pipelining.
(d). RISC permits any register to be used in any context.
(e). One-cycle execution time.
(f). The amount of work done per instruction is reduced by separating the “LOAD” and “STORE” instructions from other operations.
(g). RISC contains a large number of registers in order to reduce the number of interactions with memory.
(h). In RISC, pipelining is easy as the execution of all instructions is done in a uniform interval of time, i.e., one clock cycle.
(i). In RISC, more RAM is required to store the assembly-level instructions.
(j). Reduced instructions need fewer transistors in RISC.
(k). RISC uses the Harvard memory model, i.e., it is a Harvard architecture.
(l). A compiler is used to perform the conversion operation, i.e., to convert a high-level language statement into code of this form.

In RISC, the MUL operation is divided into three instructions, combined as sketched below:
“LOAD” – moves data from the memory bank to a register
“PROD” – finds product of two operands located within the registers
“STORE” – moves data from a register to the memory banks
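As an illustration (hypothetical register names and memory locations, not a real instruction set), multiplying two values held in memory locations 2:3 and 5:2 and storing the product back would take a sequence such as:

LOAD  A, 2:3    ; bring the first operand from memory into register A
LOAD  B, 5:2    ; bring the second operand into register B
PROD  A, B      ; multiply the two register operands
STORE 2:3, A    ; write the product back to memory

In CISC the same work is done by a single complex instruction such as MUL 2:3, 5:2, which is why CISC programs are shorter but each instruction takes more cycles.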
The main difference between RISC and CISC is the number of instructions and its complexity.

Basic MIPS Implementation

1. Instruction fetch cycle (IF):

IR = Mem[PC];

NPC = PC + 4;

Operation:

Send out the PC and fetch the instruction from memory into the instruction register (IR). Increment the PC by 4 to address the next sequential instruction.

Register IR holds the instruction that will be needed on subsequent clock cycles; register NPC holds the next sequential PC.

2. Instruction decode/register fetch cycle (ID):

A = Regs[rs];

B = Regs[rt];

Imm = sign-extended immediate field of IR;

Operation:

Decode the instruction and access the register file to read the registers (rs and rt are the register specifiers). The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles.

Lower 16 bits of IR are sign extended and stored into the temporary register Imm, for use in the next cycle.

3. Execution/effective address cycle (EX):

* ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the MIPS instruction type.

i) Memory reference:

ALUOutput = A + Imm;

ii) Register-Register ALU instruction:

ALUOutput = A func B;

Operation:

a) ALU performs the operation specified by the function code on the value in register A and in register B.

b) Result is placed in temporary register ALUOutput

iii) Register-Immediate ALU instruction:

ALUOutput = A op Imm;

Operation:

a) ALU performs operation specified by the opcode on the value in register A and register Imm.

b) Result is placed in temporary register ALUOutput.

iv) Branch:

ALUOutput = NPC + (Imm << 2);

Cond = (A == 0)

Operation:

a) ALU adds NPC to sign-extended immediate value in Imm, which is shifted left by 2 bits to create a word offset, to compute address of branch target.

b) Register A, which has been read in the prior cycle, is checked to determine whether branch is taken.

c) Considering only one form of branch (BEQZ), the comparison is against 0.

4. Memory access/branch completion cycle (MEM):

* PC is updated for all instructions: PC = NPC;

i. Memory reference:

LMD = Mem[ALUOutput] or

Mem[ALUOutput] = B;

Operation:

a) Access memory if needed.

b) If the instruction is a load, data returns from memory and is placed in LMD (the load memory data register).

c) If the instruction is a store, data from the B register is written into memory.

ii. Branch:

if (cond) PC = ALUOutput

Operation: If the instruction branches, PC is replaced with the branch destination address in register ALUOutput.

5. Write-back cycle (WB):

* Register-Register ALU instruction: Regs[rd] = ALUOutput;

* Register-Immediate ALU instruction: Regs[rt] = ALUOutput;

* Load instruction:

Regs[rt] = LMD;

Operation: Write the result into the register file, depending on the effective opcode.
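The register-transfer statements above can be collected into a short C sketch that walks one instruction through the five cycles. This is a minimal, illustrative model with an assumed toy memory, register file, and a few assumed opcode values; it is not a complete MIPS simulator.

/* One instruction through the five cycles.
   Assumed toy opcode values: lw=0x23, sw=0x2B, addi=0x08,
   and 0x04 standing in for the BEQZ-style branch in the text. */
unsigned int Mem[1024];          /* word-addressed toy memory  */
int Regs[32];                    /* general-purpose registers  */
unsigned int PC = 0;

void step(void) {
    /* 1. IF: fetch the instruction and compute the next sequential PC */
    unsigned int IR  = Mem[PC / 4];
    unsigned int NPC = PC + 4;

    /* 2. ID: decode fields and read the source registers */
    unsigned int opcode = (IR >> 26) & 0x3F;
    unsigned int rs = (IR >> 21) & 0x1F;
    unsigned int rt = (IR >> 16) & 0x1F;
    unsigned int rd = (IR >> 11) & 0x1F;
    int A = Regs[rs], B = Regs[rt];
    int Imm = (short)(IR & 0xFFFF);          /* sign-extended immediate */

    /* 3. EX: ALU operation or effective-address calculation */
    int ALUOutput = 0, Cond = 0;
    if (opcode == 0x23 || opcode == 0x2B || opcode == 0x08)
        ALUOutput = A + Imm;                 /* lw/sw address, or addi  */
    else if (opcode == 0x04) {               /* branch (BEQZ-style)     */
        ALUOutput = NPC + (Imm << 2);
        Cond = (A == 0);
    } else
        ALUOutput = A + B;                   /* R-type, add only here   */

    /* 4. MEM: update the PC, access memory if needed */
    int LMD = 0;
    PC = NPC;
    if (opcode == 0x23) LMD = Mem[ALUOutput / 4];   /* load          */
    if (opcode == 0x2B) Mem[ALUOutput / 4] = B;     /* store         */
    if (opcode == 0x04 && Cond) PC = ALUOutput;     /* taken branch  */

    /* 5. WB: write the result back into the register file */
    if (opcode == 0x23)      Regs[rt] = LMD;        /* load          */
    else if (opcode == 0x08) Regs[rt] = ALUOutput;  /* reg-immediate */
    else if (opcode == 0x00) Regs[rd] = ALUOutput;  /* reg-register  */
}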

Pipelining

Basic concepts
· Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
· Takes advantage of parallelism.
· Key implementation technique used to make fast CPUs.
· A pipeline is like an assembly line.

In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment.
· The stages are connected one to the next to form a pipe—instructions enter at one end, progress through the stages, and exit at the other end.
· The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline.
· The time required between moving an instruction one step down the pipeline is a processor cycle.
· If the starting point is a processor that takes 1 (long) clock cycle per instruction, then pipelining decreases the clock cycle time.
· Pipeline for an integer subset of a RISC architecture that consists of load-store word, branch, and integer ALU operations.
Every instruction in this RISC subset can be implemented in at most 5 clock cycles. The 5 clock cycles are as follows.
1. Instruction fetch cycle (IF):
Send the program counter (PC) to memory and fetch the current instruction from memory. Increment the PC: PC = PC + 4.
2. Instruction decode/register fetch cycle (ID):
Ø Decode the instruction and read the registers.
Ø Do the equality test on the registers as they are read, for a possible branch.
Ø Compute the possible branch target address by adding the sign-extended offset to the incremented PC.
Ø Decoding is done in parallel with reading registers, which is possible because the register specifiers are at a fixed location in a RISC architecture, known as fixed-field decoding.
3. Execution/effective address cycle(EX):
The ALU operates on the operands prepared in the prior cycle, performing one of three functions depending on the instruction type.
Ø Memory reference: The ALU adds the base register and the offset to form the effective address.
Ø Register-Register ALU instruction: The ALU performs the operation specified by the ALU opcode on the values read from the register file.
Ø Register-Immediate ALU instruction: The ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate.
4. Memory access(MEM):
Ø If the instruction is a load, memory does a read using the effective address computed in the previous cycle.
Ø If it is a store, then the memory writes the data from the second register read from the register file using the effective address.
5. Write-back cycle (WB) :
Register-Register ALU instruction or Load instruction: Write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction).
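With these five stages, successive instructions overlap in execution. The chart below is an illustrative timing diagram (one column per clock cycle); once the pipeline fills, one instruction completes every clock cycle:

Instruction        1    2    3    4    5    6    7    8    9
Instruction i      IF   ID   EX   MEM  WB
Instruction i+1         IF   ID   EX   MEM  WB
Instruction i+2              IF   ID   EX   MEM  WB
Instruction i+3                   IF   ID   EX   MEM  WB
Instruction i+4                        IF   ID   EX   MEM  WB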

Instruction-level Parallelism

Pipelining can overlap the execution of instructions when they are independent of one another. This potential overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel.

The amount of parallelism available within a basic block (a straight-line code sequence with no branches in or out except at the entry and exit) is quite small. The average dynamic branch frequency in integer programs was measured to be about 15%, meaning that about 7 instructions execute between a pair of branches.

Since the instructions are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be much less than 7.

To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.

The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism.
 

Example 1

for (i=1; i<=1000; i= i+1)
  x[i] = x[i] + y[i];
This is a parallel loop. Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little opportunity for overlap.
 

Example 2

for (i=1; i<=100; i= i+1){
  a[i] = a[i] + b[i];         //s1
  b[i+1] = c[i] + d[i];      //s2
}
Is this loop parallel? If not how to make it parallel?

Statement s1 uses the value assigned in the previous iteration by statement s2, so there is a loop-carried dependency between s1 and s2. Despite this dependency, this loop can be made parallel because the dependency is not circular:
 -  neither statement depends on itself;
 - while  s1 depends on s2, s2 does not depend on s1.

A loop is parallel unless there is a cycle in the dependencies, since the absence of a cycle means that the dependencies give a partial ordering on the statements.

To expose the parallelism the loop must be transformed to conform to the partial order. Two observations are critical to this transformation:

 - There is no dependency from s1 to s2; therefore, interchanging the two statements will not affect the execution of s2.
 - On the first iteration of the loop, statement s1 depends on the value of b[1] computed prior to initiating the loop.
This allows us to replace the loop above with the following code sequence, which makes possible overlapping of the iterations of the loop:
a[1] = a[1] + b[1];
for (i=1; i<=99; i= i+1){
  b[i+1] = c[i] + d[i];
  a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];
 

Example 3

for (i=1; i<=100; i= i+1){
  a[i+1] = a[i] + c[i];       //S1
  b[i+1] = b[i] + a[i+1];     //S2
}
This loop is not parallel because it has cycles in the dependencies, namely the statements S1 and S2 depend on themselves!
 

There are a number of techniques for converting such loop-level parallelism into instruction-level parallelism. Basically, such techniques work by unrolling the loop, either statically by the compiler or dynamically by the hardware; a source-level sketch of unrolling is shown below.
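As a sketch (an illustrative, compiler-style transformation of Example 1, unrolled four times; it assumes the trip count is a multiple of 4):

for (i=1; i<=1000; i= i+4){
  x[i]   = x[i]   + y[i];      /* four independent copies of the body */
  x[i+1] = x[i+1] + y[i+1];
  x[i+2] = x[i+2] + y[i+2];
  x[i+3] = x[i+3] + y[i+3];    /* loop overhead amortized over four copies */
}

The four statements in each iteration are independent of one another, so a pipelined or multiple-issue processor can overlap them.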

An important alternative method for exploiting loop-level parallelism is the use of vector instructions on a vector processor.


Parallel Processing Challenges


Limitations of ILP

The Hardware Model

An ideal processor is one where all constraints on ILP are removed. The only limits on ILP in such a processor are those imposed by the actual data flows through either registers or memory.

The assumptions made for an ideal or perfect processor are as follows:

1. Register renaming

There are an infinite number of virtual registers available, and hence all WAW and WAR hazards are avoided and an unbounded number of instructions can begin execution simultaneously.

2. Branch prediction

Branch prediction is perfect. All conditional branches are predicted exactly.

3. Jump prediction

All jumps (including jump register used for return and computed jumps) are perfectly predicted. When combined with perfect branch prediction, this is equivalent to having a processor with perfect speculation and an unbounded buffer of instructions available for execution.

4. Memory address alias analysis

All memory addresses are known exactly, and a load can be moved before a store provided that the addresses are not identical. Note that this implements perfect address alias analysis.

5. Perfect caches

All memory accesses take 1 clock cycle. In practice, superscalar processors will typically consume large amounts of ILP hiding cache misses, making these results highly optimistic.

To measure the available parallelism, a set of programs was compiled and optimized with the standard MIPS optimizing compilers. The programs were instrumented and executed to produce a trace of the instruction and data references. Every instruction in the trace is then scheduled as early as possible, limited only by the data dependences. Since a trace is used, perfect branch prediction and perfect alias analysis are easy to do. With these mechanisms, instructions may be scheduled much earlier than they would otherwise, moving across large numbers of instructions on which they are not data dependent, including branches, since branches are perfectly predicted.

The effects of various assumptions are given before looking at some ambitious but realizable processors.

Limitations on the Window Size and Maximum Issue Count

To build a processor that even comes close to perfect branch prediction and perfect alias analysis requires extensive dynamic analysis, since static compile time schemes cannot be perfect. Of course, most realistic dynamic schemes will not be perfect, but the use of dynamic schemes will provide the ability to uncover parallelism that cannot be analyzed by static compile time analysis. Thus, a dynamic processor might be able to more closely match the amount of parallelism uncovered by our ideal processor.

The Effects of Realistic Branch and Jump Prediction

Our ideal processor assumes that branches can be perfectly predicted: The outcome of any branch in the program is known before the first instruction is executed! Of course, no real processor can ever achieve this. We assume a separate predictor is used for jumps. Jump predictors are important primarily with the most accurate branch predictors, since the branch frequency is higher and the accuracy of the branch predictors dominates.

1. Perfect — All branches and jumps are perfectly predicted at the start of execution.

2. Tournament-based branch predictor — The prediction scheme uses a correlating 2-bit predictor and a noncorrelating 2-bit predictor together with a selector, which chooses the best predictor for each branch.

The Effects of Finite Registers

Our ideal processor eliminates all name dependences among register references using an infinite set of virtual registers. To date, the IBM Power5 has provided the largest numbers of virtual registers: 88 additional floating-point and 88 additional integer registers, in addition to the 64 registers available in the base architecture. All 240 registers are shared by two threads when executing in multithreading mode, and all are available to a single thread when in single-thread mode.

The Effects of Imperfect Alias Analysis

Our optimal model assumes that it can perfectly analyze all memory dependences, as well as eliminate all register name dependences. Of course, perfect alias analysis is not possible in practice: The analysis cannot be perfect at compile time, and it requires a potentially unbounded number of comparisons at run time (since the number of simultaneous memory references is unconstrained).

The three models are

1. Global/stack perfect—This model does perfect predictions for global and stack references and assumes all heap references conflict. This model represents an idealized version of the best compiler-based analysis schemes currently in production. Recent and ongoing research on alias analysis for pointers should improve the handling of pointers to the heap in the future.

2. Inspection—This model examines the accesses to see if they can be determined not to interfere at compile time. For example, if an access uses R10 as a base register with an offset of 20, then another access that uses R10 as a base register with an offset of 100 cannot interfere, assuming R10 could not have changed. In addition, addresses based on registers that point to different allocation areas (such as the global area and the stack area) are assumed never to alias. This analysis is similar to that performed by many existing commercial compilers, though newer compilers can do better, at least for loop-oriented programs.

3. None — All memory references are assumed to conflict. As you might expect, for the FORTRAN programs (where no heap references exist), there is no difference between perfect and global/stack perfect analysis.

Flynn's Classification

In 1966, Michael Flynn proposed a classification for computer architectures based on the number of instruction streams and data streams (Flynn’s Taxonomy).

· Flynn uses the stream concept for describing a machine's structure.

· A stream simply means a sequence of items (data or instructions).


Flynn’s Taxonomy

· SISD: Single instruction single data

– Classical von Neumann architecture

· SIMD: Single instruction multiple data

· MISD: Multiple instructions single data

– Non-existent; listed only for completeness

· MIMD: Multiple instructions multiple data

– Most common and general parallel machine

SISD

· SISD (Single-Instruction stream, Single-Data stream)

· SISD corresponds to the traditional mono-processor (von Neumann computer). A single data stream is processed by one instruction stream.

· A single-processor computer (uni-processor) in which a single stream of instructions is generated from the program.

SIMD

· SIMD (Single-Instruction stream, Multiple-Data streams)

· Each instruction is executed on a different set of data by different processors, i.e., multiple processing units of the same type process multiple data streams.

· This group is dedicated to array processing machines.

· Sometimes, vector processors can also be seen as a part of this group.
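For instance (an illustrative fragment, not tied to any particular SIMD instruction set), the element-wise addition below applies one and the same operation to many data items, which is the pattern a SIMD or array machine executes with a single instruction across its processing elements:

for (i=1; i<=N; i= i+1)
  c[i] = a[i] + b[i];      /* one "add" operation applied to N different data elements */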

MISD

· MISD (Multiple-Instruction streams, Single-Data stream)

· Each processor executes a different sequence of instructions.

· In case of MISD computers, multiple processing units operate on one single data stream.

· In practice, this kind of organization has never been used.

MIMD

· MIMD (Multiple-Instruction streams, Multiple-Data streams)

· Each processor has a separate program.

· An instruction stream is generated from each program.

· Each instruction operates on different data.

· This last machine type builds the group for the traditional multi-processors. Several processing units operate on multiple-data streams

Hardware Multi-threading

Coarse-grained multithreading

•Single thread runs until a costly stall

- E.g. a 2nd-level cache miss

• Another thread starts during the stall of the first

-Pipeline fill time requires several cycles!

•Does not cover short stalls

•Less likely to slow execution of a single thread (smaller latency)

•Needs hardware support

- PC and register file for each thread

-little other hardware

Fine-grained multithreading

•Two or more threads interleave instructions

-Round-robin fashion

-Skip stalled threads

•Needs hardware support

-Separate PC and register file for each thread

-Hardware to control alternating pattern

•Naturally hides delays

-Data hazards, Cache misses

-Pipeline runs with rare stalls

•Does not make full use of multi-issue architecture

Simultaneous Multithreading (SMT)

•Instructions from multiple threads issued on same cycle

-Uses register renaming and dynamic scheduling facility of multi-issue architecture

•Needs more hardware support

-Register files, PC’s for each thread

-Temporary result registers before commit

-Support to sort out which threads get results from which instructions

•Maximizes utilization of execution units

Conceptual Diagram

(Figure omitted: issue-slot utilization compared for coarse multithreading, fine multithreading, and simultaneous multithreading.)

Multicore processing

History

•In the early 1970s the first microprocessor was developed by Intel.
•It was a 4-bit machine that was named the 4004.
•The 4004 was followed by Intel’s 8008 and 8080, as well as Motorola’s 6800 and 68000.

Growth

With each new generation of processors there were several developments, such as:
•Smaller size
•Higher speed
•Increased heat dissipation
•Greater power consumption

Single Core Performance

One technique used to increase single-core performance was:
•Pipelining: beginning the execution of waiting instructions before earlier ones finish

Another technique was multithreading:
•Multithreading involves the execution of two separate threads.
•Time is divided and interleaved between the two threads in order to simulate simultaneous execution.

Problems with Single Core

To execute tasks faster you must increase the clock rate.
Increasing the clock rate too far drastically increases power consumption and heat dissipation to extremely high levels, making the processor inefficient.

Multi Core solution

Creating two or more cores on the same die increases processing power while keeping clock speeds at an efficient level.
A processor with 2 cores running at efficient clock speeds can process instructions with similar speed to a single-core processor running at twice the clock speed, yet the dual-core processor still consumes less energy.

Multi-Core Advantages

While working with many threads, a multi-core processor with n cores can execute n threads simultaneously by assigning a core to each thread. If it must process more than n threads, say x, it can apply multithreading procedures, with each core working on an average of x/n threads.
A single-core processor must multithread across every single thread.
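A minimal sketch of this idea in C with POSIX threads (the work function, the thread count N, and the per-thread workload split are assumptions made for illustration; the operating system scheduler, not the program, decides which core runs each thread):

#include <pthread.h>
#include <stdio.h>

#define N 4                        /* assumed number of cores/threads */

void *work(void *arg) {
    long id = (long)arg;           /* each thread handles its own share of the work */
    printf("thread %ld started\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[N];
    long i;
    for (i = 0; i < N; i++)        /* create one thread per core */
        pthread_create(&t[i], NULL, work, (void *)i);
    for (i = 0; i < N; i++)        /* wait for all threads to finish */
        pthread_join(t[i], NULL);
    return 0;
}

Compiled with the -pthread option, the N threads can run simultaneously, one per core, rather than being time-sliced on a single core.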

Other Incentives

Improving an existing single-core design or creating a better one is problematic in that it:
•Is time-consuming to design
•Is risky to modify existing designs


Creating multi core processors is convenient in that:
•The name “dual core” and similar names are good for marketing.
•It has lower manufacturing costs.
•Uses proven processor designs.

Implementations

Two main ways to have multiple cores interact are the shared memory model, and the distributed memory model.
•In the shared memory model, all cores share the same cache memory.
•In the distributed memory model, each core has its own cache memory.

The Intel Core Duo design has a separate L1 cache memory for each core, but both cores share an L2 cache.

The AMD Athlon 64 X2 implementation has separate L1 and L2 cache memory for each core.

Problem

Some of the current problems found with multi core processors include:
•Memory/Cache coherence. As mentioned earlier, some implementations have distributed L1 caches but must share an L2 cache. This poses the problem of making sure each core keeps the other updated with changes in the data in its own cache.
•Multithreading is also a problem when the software being run is not designed to take advantage of the multi-core processor. This may mean that one core does most of the work, which means that the processor is running no more efficiently than a single core.

Future of Multi Core

•Current debates argue over whether future multi-core processors should be homogeneous or heterogeneous.
•That is, should all the cores be the same or should there be a mix of different types?
•Having all cores be the same makes production easier and keeps its complexity to a minimum.
•Having different cores that are specialized in specific tasks increases complexity but has the potential to be much more efficient in speed and power consumption.