The basics of Hyper-Threading
Hyper-threading is an optimization technique that allows a single CPU core to run more than one thread. It is Intel’s implementation of a technique known as simultaneous multithreading (SMT). This article covers only the basics of hyper-threading and is meant as an introduction. It tries to answer the following questions:
What is hyper-threading?
How does hyper-threading improve performance?
What kind of performance improvements can we expect?
Can hyper-threading have a negative impact on performance?
To understand hyper-threading, you need some knowledge of how CPUs work. The first half of the article covers some CPU basics; the second half covers hyper-threading. Please don’t be discouraged by the CPU basics section; it is quite simple. It also contains information that might be useful if you are looking to buy a new processor or computer.
Clock speed and the number of instructions per clock cycle
Clock speed refers to the number of clock cycles or clock ticks per second. A 4 GHz CPU, for example, goes through 4 billion clock cycles every second.
A common misconception
Many people try to calculate the performance of a CPU purely from the clock speed and the number of cores. The calculation goes something like this: a 4 GHz CPU has 4 cores, so the throughput should be 4 * 4 = 16 billion instructions per second. However, there is no such direct relationship between clock speed and throughput. It takes many clock cycles to finish a single instruction, and the number of cycles needed differs between CPU architectures. Some of the Pentium 4 CPUs used to take a crazy 31 clock cycles.
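To make the gap concrete, here is a tiny sketch that contrasts the naive estimate with what a single core would manage if every instruction really took 31 cycles with no overlap at all. The numbers are just the ones from the example above, not measurements of any real CPU.

```cpp
#include <cstdio>

int main() {
    const double clock_hz = 4e9;  // the 4 GHz clock from the example
    const int cores = 4;

    // Naive estimate: one instruction per core per clock cycle.
    const double naive = clock_hz * cores;

    // If every instruction took 31 cycles and none of them overlapped,
    // a single core would be dramatically slower than the naive estimate.
    const double no_overlap_single_core = clock_hz / 31;

    std::printf("Naive estimate:        %.1f billion instructions per second\n", naive / 1e9);
    std::printf("One core, no overlap:  %.2f billion instructions per second\n", no_overlap_single_core / 1e9);
    return 0;
}
```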
Are CPUs much slower than you think?
Based on what we know from the previous section, it would seem like CPUs are pretty slow. CPU performance depends on the set of instructions being executed. In a worst-case scenario, the performance really is not that impressive. But in an average case, CPUs are much faster than what our simplistic calculation shows. A single core of a modern CPU can crank out multiple instructions in a clock cycle.
Black Magic?
How is it possible for a single core to complete multiple instructions in a clock cycle when a single instruction takes multiple clock cycles to finish? The answer to this mystery is a combination of pipelining and the fact that a single core contains multiple execution units. Keeping those units busy with several in-flight instructions at once is called instruction-level parallelism. We will take a closer look at pipelining in the next section.
Pipelining
While it takes many clock cycles to finish an instruction, it is possible to start the next instruction before the current one is finished. This is done by splitting the work of each instruction into multiple steps and is called pipelining. At any given time there will be multiple instructions in the pipeline at different stages of completion.
Modern CPUs can dispatch multiple instructions to the execution core in the same clock cycle. So the actual execution can happen in parallel.
A Factory Analogy
Pipelining works something like the assembly line of a factory. Let us look at a simplified example. We have a factory assembling cars, and it needs to install three parts: the engine, the hood, and the wheels. If the factory has only one station, work can be done on only one car at a time, even if there is a separate crew for each task. But if there are three stations and three crews, work can be done on three cars at the same time.
Back to CPUs: each instruction is split into multiple steps. At the end of a clock cycle, the result of one step is passed on as the input of the next step. This continues until we reach the last step.
Number of pipeline stages and performance
Now, let us look at the impact the number of pipeline steps (typically called pipeline stages) has on performance. Consider the following simplified example: a CPU takes 5 stages to complete an instruction, and each stage takes 2 milliseconds. What if we redesign the CPU to complete the instruction in 12 stages, with each stage taking 1 millisecond? Now it would take 12 milliseconds (12 * 1) to complete the very first instruction. That is more than the 10 milliseconds (5 * 2) required earlier. But if we keep the pipeline full, we can crank out an instruction every millisecond. That is double the throughput of the first design.
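Here is a small sketch of the same arithmetic, using the made-up stage counts and stage times from the example; it assumes an ideal pipeline that never stalls.

```cpp
#include <cstdio>

// Latency of the first instruction versus steady-state throughput for an
// idealized pipeline that never stalls. The numbers match the example above.
struct Design {
    int stages;
    double stage_ms;
};

int main() {
    const Design designs[] = {{5, 2.0}, {12, 1.0}};

    for (const Design& d : designs) {
        double first_result_ms    = d.stages * d.stage_ms;  // time to fill the pipeline
        double per_instruction_ms = d.stage_ms;             // one instruction completes per cycle afterwards
        std::printf("%2d stages: first result after %4.0f ms, then one instruction every %.0f ms\n",
                    d.stages, first_result_ms, per_instruction_ms);
    }
    return 0;
}
```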
What this means is that we can increase the best-case performance of a CPU core simply by doing less work in each clock cycle. This is what Intel used to do during the megahertz race. As you may have noticed, the average performance, the one that really matters, depends on keeping the pipeline reasonably full.
Challenges of pipelining
So is it easy to keep the pipeline full? Unfortunately no; most of the time there will be a lot of empty slots in the pipeline.
info: Empty slots in the pipeline are called pipeline bubbles
The backend of a CPU has integer units, floating-point units, and some special-purpose units. In modern CPUs these units are fully independent and can work in parallel. They perform operations like addition, multiplication, and division. But a program rarely needs all of them at the same time, so keeping the pipeline full is a nearly impossible task. We will look at some of the other challenges of pipelining in the sections below.
Program accuracy and pipelining
If you have some familiarity with programming, you might have noticed a potential deal-breaker in the whole pipelining idea. The job of a CPU is to execute instructions in the order they appear in the program, one after the other. It is supposed to do parallel execution only if the program specifically asks for it.
For example, a program might contain an instruction to fetch a number from a register, multiply it by another number, and store the result. The next instruction might be to fetch the result and add a number to it. The first instruction has to be completed in full before the second instruction can begin. This is the contract between the program and the CPU; pipelining breaks that contract.
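In code, the dependency described above looks something like this (an illustrative fragment, not taken from any real program):

```cpp
// The second line cannot produce a correct result until the first has
// finished: 'product' is an input to the addition. A pipelined CPU must
// respect this ordering even while it overlaps unrelated instructions.
int multiply_then_add(int a, int b, int c) {
    int product = a * b;        // instruction 1: multiply two numbers
    int result  = product + c;  // instruction 2: depends on instruction 1
    return result;
}
```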
How do CPUs use pipelining and still maintain program accuracy?
The CPU analyses incoming instructions in real time. If an instruction depends on a previous instruction that has not yet completed, it is delayed until the result is available.
Out of Order Execution
Delaying dependent instructions solves the accuracy problem, but it introduces a serious performance bottleneck. Single-threaded programs are full of instruction dependencies, and they tend to appear close to each other. If we have to stop filling the pipeline every time we come across a dependency, there isn’t much utility in using a pipeline. Note that this is an especially big problem for processors with deep pipelines, like the Intel and AMD x86 processors.
info: If a pipeline has a large number of stages, it is called a deep pipeline.
To get around this problem, the CPU does not execute instructions in the order they arrive. It fetches a batch of instructions, analyses them, re-orders them, and puts only independent instructions into the pipeline. This would seem like yet another breach of contract between the program and the CPU, but the CPU makes it appear as if the instructions were executed sequentially. The results of operations are first stored in CPU registers, and when the results are written to memory, the original order of the instructions is preserved.
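A hedged illustration of what the CPU is allowed to do: in the fragment below, line (3) depends on (1) and (2), but (4) depends on none of them, so an out-of-order core can start (4) early while the multiplications are still in flight. The result the program observes is the same either way.

```cpp
int reorder_example(int a, int b, int c, int d) {
    int x = a * b;   // (1)
    int y = c * d;   // (2) independent of (1); can execute in parallel with it
    int z = x + y;   // (3) depends on (1) and (2); must wait for both
    int w = a - d;   // (4) independent of (1)-(3); can be started early
    return z + w;
}
```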
Branch Prediction
Out-of-order execution helps to improve pipeline utilization, but there is still one more thorn in our side: jump (branch) instructions. They tell the CPU to repeat certain instructions or to skip certain instructions. The trouble is that the decision to jump or not depends on the results of previous instructions, and if those results are not yet available, we have a problem.
To put it another way, there are multiple branches a program might take. We don’t know in advance which branch the program will end up taking. Programs are full of branch instructions and the CPU cannot afford to stop queuing instructions every time it hits a branch. Instead, it predicts which branch is more likely to be taken and executes the instructions in that branch. This type of execution is called speculative execution.
The CPU lets the results of speculative execution leave the core only if the speculation turns out to be correct. Thus only the results from correctly predicted branches are visible to the outside world.
When branch prediction fails, the CPU has to throw away the results from the wrong branch and start executing the correct one. This is a big hit on performance. Now you can probably see the problem with very deep pipelines: they are hard to keep full, and there will be a lot of speculative execution going on inside them.
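A classic way to see the cost of misprediction from ordinary code is to run a loop whose branch depends on unpredictable data, then sort the data and run it again. The sketch below sets that experiment up (timing is left out; you could wrap each call with a timer or run it under a profiler). The array size and threshold are arbitrary choices for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// A data-dependent branch: whether it is taken depends on values the CPU
// cannot know in advance. With random data the predictor is wrong roughly
// half the time; once the data is sorted it is almost always right.
std::int64_t sum_large_values(const std::vector<int>& data) {
    std::int64_t sum = 0;
    for (int v : data) {
        if (v >= 128) {   // the branch being predicted
            sum += v;
        }
    }
    return sum;
}

int main() {
    std::vector<int> data(1 << 20);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& v : data) v = dist(rng);

    std::int64_t unsorted = sum_large_values(data);  // branch is hard to predict
    std::sort(data.begin(), data.end());
    std::int64_t sorted = sum_large_values(data);    // branch is easy to predict

    // Same answer both times; only the amount of misprediction differs.
    return unsorted == sorted ? 0 : 1;
}
```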
Context Switching
Let us take a break from pipelining and look at a new topic. Modern computers have a large number of processes running at any given time; the total easily exceeds the number of cores in the CPU. But the OS, with the help of the CPU, is able to create the impression that all these processes are running simultaneously.
The OS does not allow any single process to monopolize a core. It rapidly switches processes in and out of the CPU, creating an illusion of concurrency. When the OS kicks one process out of the CPU and schedules a different one, it triggers what is known as a context switch.
When a context switch happens, the state or context of the outgoing process is saved to memory. The process state includes things like the program counter, register values, etc. When it is time to run the same process again, the context data is fetched from memory and restored to the CPU, and execution continues from where it left off. Memory access is time-consuming, which makes context switches expensive.
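A very rough sketch of the kind of state that gets saved and restored; the structure and field names here are made up for illustration, and a real kernel saves considerably more (floating-point and SIMD registers, flags, privilege state, and so on).

```cpp
#include <array>
#include <cstdint>

// Hypothetical, heavily simplified snapshot of a process's CPU state.
// A real context also holds floating-point/SIMD registers, flags and
// privilege information, and is saved by the kernel, not by user code.
struct CpuContext {
    std::uint64_t program_counter;               // where to resume execution
    std::uint64_t stack_pointer;
    std::array<std::uint64_t, 16> gp_registers;  // general-purpose register values
};

struct Process {
    int pid;
    CpuContext context;  // lives in memory while the process is switched out
};

// Conceptual outline of a context switch between two processes:
// the outgoing process's registers are written to its context in memory,
// and the incoming process's context is loaded back into the registers.
void context_switch(Process& outgoing, Process& incoming) {
    // save_registers_to(outgoing.context);    // pseudo-step: store CPU state
    // load_registers_from(incoming.context);  // pseudo-step: restore CPU state
    (void)outgoing;
    (void)incoming;
}

int main() {
    Process a{1, {}}, b{2, {}};
    context_switch(a, b);  // in a real system only the kernel can do this
    return 0;
}
```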
Hyper-threading
Now that we have the basics out of the way, it is time to take a closer look at hyper-threading. As mentioned earlier, hyper-threading allows a CPU core to run more than one thread simultaneously. With hyper-threading, one physical core appears as two logical or virtual cores. The operating system can schedule threads and processes on these virtual cores the same way it can on physical cores.
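You can see the logical cores directly from software. For example, std::thread::hardware_concurrency() in C++ reports the number of hardware threads the system exposes, which on a hyper-threaded CPU is typically twice the number of physical cores.

```cpp
#include <iostream>
#include <thread>

int main() {
    // On a 4-core CPU with hyper-threading this typically prints 8,
    // because the OS sees each physical core as two logical cores.
    // The standard allows the value to be 0 if it cannot be determined.
    unsigned logical_cores = std::thread::hardware_concurrency();
    std::cout << "Logical cores visible to the OS: " << logical_cores << "\n";
    return 0;
}
```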
In order to convert a normal core to a hyper-threaded core, some of the resources in the CPU have to be duplicated, partitioned, or shared.
Duplicated Resources: The two threads need their own copy of certain resources. These include the instruction pointer, the register renaming logic, and some registers.
Partitioned Resources: These mainly consist of buffers and queues. When both threads are running, the resource is split between them; when only one thread is running, it can use the whole resource. Partitioning prevents one thread from monopolizing a resource, which would make sharing difficult.
Shared Resources: Some resources, like the execution units, are hard to utilize fully, and sharing them is the key objective of hyper-threading. But some resources, like the cache, have to be shared as well even though they do not have a utilization problem; more on this later.
According to Intel, only minimal changes are needed to convert a normal core to a hyper-threaded core.
How does hyper-threading improve performance?
It takes a long time to transfer data between the CPU and main memory. One of the advantages of hyper-threading is that it can mitigate this memory latency problem to an extent.
We looked at context switching earlier. With hyper-threading, the need for context switching is reduced since twice as many threads can run on the CPU at the same time.
Without hyper-threading, the pipeline can stall while a thread waits for data from memory. With hyper-threading, instructions from the other thread can keep the core busy in the meantime.
The benefits of hyper-threading go well beyond mitigating the memory latency problem. Hyper-threading also improves the utilization of the execution units. From our earlier discussion about pipelining, it should be clear that maximizing pipeline utilization is both challenging and rewarding. Dependencies between instructions are the bane of pipelining, and x86 CPUs have deep pipelines.
With hyper-threading, we are putting instructions belonging to two different threads on the same CPU core, which reduces dependencies drastically. From the point of view of the CPU, instructions belonging to separate threads are independent of each other, so it can schedule them in any order it likes. Hyper-threading can therefore lead to a significant reduction in pipeline bubbles. It can also reduce the amount of speculative work that gets thrown away after branch mispredictions.
What kind of performance improvements can we expect?
Unfortunately, the answer is that it depends on the workload and the CPU architecture. There is no guarantee that hyper-threading will lead to better utilization of the execution units. Say we have two threads doing heavy floating-point arithmetic on the same core, which happens to be a common occurrence. Both threads then compete for the same floating-point units while the integer units remain idle most of the time, so the second thread gains us little.
There is a common misconception that a core with hyper-threading is the same as having two cores. While a doubling of performance is possible in theory, it is not likely to happen in practice. It can happen only when pipeline utilization of a particular thread is extremely poor and the second thread happens to complement the first thread perfectly.
Think of hyper-threading as a way of optimizing the performance of a single core and you won’t be disappointed. People often talk about a 30% improvement in performance; this is realistic, but please don’t quote me on it. Also note that, from a technical point of view, the improvements provided by hyper-threading are essentially free. An extra core would mean more transistors, more power draw, and more heat to dissipate.
Ultimately, the only way to measure performance is benchmarking. Clock speed and core count tell you very little when comparing CPUs with different architectures. However, when comparing CPUs with the same architecture, we can get an idea of which CPU is better by looking at the specs. For example, we have already established that, for the same architecture, an extra physical core is better than hyper-threading. But if you need more than just a rough ranking, you still need benchmarking.
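As a starting point, here is a minimal timing sketch using std::chrono. It is only a sketch: a real benchmark needs warm-up runs, many repetitions, statistics, and care to stop the compiler from optimizing the work away.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1'000'000, 1.5);  // arbitrary workload size

    auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (double v : data) sum += v;            // the work being measured
    auto stop = std::chrono::steady_clock::now();

    volatile double sink = sum;                // discourage the compiler from removing the loop
    (void)sink;

    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("Summed %zu doubles in %lld microseconds\n",
                data.size(), static_cast<long long>(micros));
    return 0;
}
```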
Can hyper-threading have a negative impact on performance?
When looking for the negatives of hyper-threading, the most obvious place to look would be the duplicated resources. But it turns out that duplicating the resources needed is pretty cheap.
The biggest drawback of hyper-threading is cache contention. With hyper-threading enabled, the two threads on a core have to share that core’s caches. In early CPUs with hyper-threading, this often caused severe performance degradation, leading many people to turn hyper-threading off.
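To make the idea concrete, here is a rough sketch of a workload where contention for a shared per-core cache can show up: each thread repeatedly walks its own buffer, sized (as an assumption; tune it to your CPU) so that one buffer fits comfortably in a core’s private cache but two do not. Whether the two threads actually land on the two logical cores of the same physical core is up to the OS scheduler; on Linux you could force it with a tool such as taskset.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Each thread repeatedly sums its own buffer. If both threads end up on the
// two logical cores of a single physical core, the buffers compete for that
// core's private cache and each thread sees more cache misses than it would alone.
void walk(const std::vector<std::uint8_t>& buf, std::uint64_t& out) {
    std::uint64_t sum = 0;
    for (int pass = 0; pass < 1000; ++pass) {
        for (std::uint8_t byte : buf) sum += byte;
    }
    out = sum;
}

int main() {
    const std::size_t kBufBytes = 192 * 1024;  // assumed: one buffer fits a private L2, two do not
    std::vector<std::uint8_t> a(kBufBytes, 1), b(kBufBytes, 2);
    std::uint64_t result_a = 0, result_b = 0;

    std::thread t1(walk, std::cref(a), std::ref(result_a));
    std::thread t2(walk, std::cref(b), std::ref(result_b));
    t1.join();
    t2.join();
    return (result_a + result_b) != 0 ? 0 : 1;
}
```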
Contention for some of the other shared resources can also be a problem. The CPU cannot always prevent one thread from monopolizing a shared resource and starving the other.
The good news is that newer CPUs do a much better job in these areas; I have seen only performance gains in recent benchmarks. But I must admit that hyper-threading is not benchmarked often enough, and, as always, performance depends on your workload.