AI's Hardware Problem
I worked on this video long before ChatGPT blew up the world, so it had yet to come to my attention. But since then, AI has become a way bigger deal.
I have been thinking of a good follow-up to it. Got any suggestions? Reply to this email or shoot me an email at jon@asianometry.com.
Recently, I was listening to a podcast interview with OpenAI chief scientist Ilya Sutskever, and one of the things I found interesting was that he doesn't think hardware compute is a limitation for AI research. He also touches on the concept at the heart of today's newsletter: the bottleneck between logic and memory.
There still seems to be a lot of mystery around how AI hardware is built - especially the hardware that goes into data centers. I have tried digging into it myself, but most of it seems to be proprietary. It would be nice to do more videos about AI hardware concepts in the coming months.
Since the deep learning explosion started in 2012, the industry's biggest models have grown hundreds of thousands of times.
Today, OpenAI's DALL-E 2 has 3.5 billion parameters, Google's Imagen has 4.6 billion, and GPT-3 has 175 billion.
Even larger models are ahead. Google recently pre-trained a model with 1 trillion parameters.
These increasingly large models strain the ability of our hardware to accommodate them. And many of these limitations tie back to memory and how we use it.
In this video, we are going to look at deep learning's memory wall problem. And some of the memory-centric paradigms researchers are looking at to solve it.
The Von Neumann Computer
Virtually every modern computer uses what is called a Von Neumann architecture, meaning that it stores both its instructions and its data in the same memory.
At its heart are your processing units - a CPU or a GPU. Those processing units access memory to fetch their instructions and process their data.
The Von Neumann architecture has been really good for us. It helped make software as powerful as it is today. But it works nothing like a real human brain.
The brain's compute ability is relatively low precision, but it tightly integrates that compute with memory and input/output communication.
Computers, on the other hand, run on high precision - 32 or 64 bit floating point arithmetic, for instance - but separate that compute from memory and communication. This separation has consequences, especially for memory.
The Memory Wall
The AI hardware industry is scaling up memory and processing unit performance as fast as they can. Nvidia's V100 GPU released in 2017 had a 32 GB offering.
Today, the top-of-the-line Nvidia data center GPUs A100 and H100 sport 80GB of memory.
Despite this bulking-up, hardware performance is not keeping up with how fast these models are growing - especially when it comes to memory.
Memory allocations for leading edge models can easily exceed hundreds of gigabytes. Even with the latest parallelization techniques, a trillion-parameter model is estimated to require 320 A100 GPUs, each with 80 GB of memory.
This gap between compute and memory means that a processing unit wastes many cycles waiting not only for data to travel in and out of memory, but also for the memory itself to perform its read/write operations.
This limitation is known as the Memory Wall, or a Memory Capacity Bottleneck.
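To get a rough feel for where numbers like that come from, here is a back-of-the-envelope sketch in Python. The bytes-per-parameter figures are my own assumptions (roughly fp16 weights for inference, and weights plus gradients plus optimizer state for training), not measurements from any real system.

```python
# Back-of-the-envelope memory footprint for a trillion-parameter model.
# The bytes-per-parameter figures are rough assumptions, not real specs:
# ~2 bytes per weight in fp16 for inference, and ~18 bytes per parameter
# during training (weights, gradients, and Adam optimizer state).

params = 1e12                  # one trillion parameters

inference_bytes = params * 2   # fp16 weights only
training_bytes = params * 18   # weights + gradients + optimizer state (assumed)

a100_capacity = 80e9           # bytes of memory per 80 GB A100

print(f"Inference weights: {inference_bytes / 1e12:.0f} TB")
print(f"Training state:    {training_bytes / 1e12:.0f} TB")
print(f"A100s needed just to hold the training state: "
      f"{training_bytes / a100_capacity:.0f}")
```

Estimates like the 320-GPU figure above come out higher than this simple division because activations, communication buffers, and parallelization overhead also eat into each GPU's memory.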
Adding More
Alright then, the obvious solution would be to just add more memory to our GPUs, right? What is stopping us from doing that?
There are practical and technological limits to how much extra memory you can add on, not to mention the issues of connections and wiring. Just think about how widening a highway does not do much to help with traffic.
Additionally, there are very significant energy costs associated with shuttling data between the chip and the memory. These electrical connections have losses, which waste energy.
I mentioned this in previous videos. Accessing memory off-chip uses 200 times more energy than a floating point operation.
80% of the Google TPU's energy usage comes from its electrical connections rather than its computational logic units.
In some recent GPU and CPU systems, DRAM by itself accounts for 40% of the total system power.
Energy makes up 40% of a data center's operating costs. For this reason, storage and memory have become a significant factor in a data center's ongoing profitability.
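To put that 200x figure into perspective, here is a toy calculation. The per-operation energy values are illustrative assumptions in the spirit of the ratio quoted above, not measurements of any particular chip.

```python
# Toy energy comparison: doing the math vs. moving the data.
# The per-operation energies below are illustrative assumptions only,
# in the same spirit as the ~200x ratio mentioned above.

FLOP_ENERGY_PJ = 1.0            # assumed picojoules per floating point op
DRAM_ACCESS_ENERGY_PJ = 200.0   # assumed picojoules per off-chip access

macs = 1e9             # a billion multiply-accumulates (2 FLOPs each)
weights_fetched = 1e9  # worst case: every weight comes from off-chip DRAM

compute_joules = macs * 2 * FLOP_ENERGY_PJ * 1e-12
memory_joules = weights_fetched * DRAM_ACCESS_ENERGY_PJ * 1e-12

print(f"Compute energy: {compute_joules * 1e3:.0f} mJ")  # ~2 mJ
print(f"Memory energy:  {memory_joules * 1e3:.0f} mJ")   # ~200 mJ
# In this worst case, moving the data costs ~100x more than crunching it.
```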
Economics
On top of those significant energy operating costs come the up-front capital costs of purchasing the AI hardware itself.
As I mentioned earlier, a possible trillion-parameter model would need 320 A100 GPUs, each with 80 GB of memory.
A100s cost $32,000 each at MSRP, so that's a clean $10 million. A hundred trillion parameter model might require over 6,000 such GPUs.
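As a quick sanity check, using only the GPU counts and price quoted above:

```python
# Rough hardware cost, using the figures quoted above.
gpus_1t = 320        # A100s estimated for a 1-trillion-parameter model
gpus_100t = 6_000    # rough estimate for a 100-trillion-parameter model
price_usd = 32_000   # MSRP per A100, as quoted above

print(f"1T model:   ${gpus_1t * price_usd / 1e6:.1f} million")   # ~$10.2 million
print(f"100T model: ${gpus_100t * price_usd / 1e6:.0f} million") # ~$192 million
```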
That is just the cost of purchasing the hardware. It does not even count the aforementioned energy costs of running inference on these things, which is where 90% of a model's total costs lie. All this risks restricting the benefits of advanced AI to ultra-rich tech giants or governments.
Memory Scaling
Many of these shortcomings are tied to historical and technological limits. In the 1960s and 70s, the industry adopted Dynamic Random Access Memory, or DRAM, as the basis of our computers' main memory.
This adoption was largely made for technological and economic reasons. DRAM memories had relatively low latency and were cheap to manufacture in bulk.
This worked fine for a while. As late as 1995, the memory industry was valued at $37 billion, with microprocessors at $20 billion.
But after 1980, compute scaling far outpaced memory scaling. This is because - generally speaking - the CPU and GPU industries have had just one metric to optimize for: transistor density.
The memory industry on the other hand not only has to scale DRAM capacity but also bandwidth and latency at the same time. Something has to give and usually that has been latency.
Over the past twenty years, memory capacity has improved 128 times and bandwidth 20 times.
Latency however has improved by just 30%.
Secondly, the memory industry realized that shrinking DRAM cells beyond a certain size gave you worse performance: less reliability, weaker security, worse latency, poorer energy efficiency, and so on.
Here's why. A DRAM cell stores 1 bit of data in the form of a charge within a capacitor - a capacitor being a device that stores electrical energy within a field. That bit is accessed using an access transistor.
As you scale the cell down to nanoscale sizes, that capacitor and its access transistor get leakier and more vulnerable to outside electrical noise. It also opens up new security vulnerabilities.
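To make the leakage problem a little more concrete, here is a toy model of a DRAM cell as a leaky capacitor. Every value in it is a made-up round number, chosen only to show the trend: shrink the storage capacitance and the retention time shrinks with it, forcing more frequent refreshes.

```python
import math

# Toy 1T1C DRAM cell model: a capacitor that leaks through some
# effective resistance, so V(t) = V0 * exp(-t / (R * C)).
# All values below are illustrative assumptions, not real process data.

V0 = 1.0        # initial stored voltage (a logical "1")
V_MIN = 0.5     # sense threshold: below this, the bit is lost
R_LEAK = 1e12   # assumed effective leakage resistance, in ohms

def retention_time(capacitance_farads: float) -> float:
    """Time until the stored voltage decays below the sense threshold."""
    tau = R_LEAK * capacitance_farads
    return tau * math.log(V0 / V_MIN)

for cap_ff in (30, 20, 10):   # femtofarads, shrinking with the cell
    t = retention_time(cap_ff * 1e-15)
    print(f"{cap_ff} fF cell -> retention ~{t * 1000:.0f} ms")
# Smaller capacitor, shorter retention: the cell must be refreshed
# more often, costing energy and memory bandwidth.
```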
New Hardware
These technical limitations are fundamental to how the hardware works, which makes them extremely difficult to engineer our way around. The industry will keep grinding out incremental solutions, but those gains will be small.
That also opens the door to brand new, radical ideas that might give us a 10x improvement over the current paradigm.
In a previous video, I talked about the silicon photonic AI accelerator, where we try to use light's properties to make data transfer more energy efficient.
Alongside that, we have another idea: alleviate or possibly even eliminate the Von Neumann bottleneck and memory wall problems by making the memory do the computation itself.
Compute in Memory
Compute in Memory refers to a random access memory with processing elements integrated together - the two might be very near each other or even be integrated onto the same die.
I have seen the idea called other things over the years: Processing-in-Memory, Computational RAM, Near-Data Computing, Memory-centric Computing, In-Memory Computation, and so on.
I am aware that there are differences between these usages, but those differences are very subtle. I will generally stick to saying Compute-in-Memory.
The name has also been used to refer to concepts that expand on SRAM - the memory often used for the cache that sits on-chip with the CPU. But what we are referring to here is bringing processing ability to the computer's main memory itself.
The idea is well suited for deep learning. If you recall, running a neural network model is mostly about calculating massive matrices. The Google TPU has lots of circuits for running Multiply-Accumulate, or MAC, operations.
The actual arithmetic is relatively simple. The problem is that there is so much of it to be done. So in the ideal case, a Compute-in-Memory chip executes MAC operations right inside the memory chip.
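To show just how simple that arithmetic is, here is a dense layer written as explicit multiply-accumulate loops in plain Python. In a Compute-in-Memory design, the idea is that work like this happens inside the memory array itself; this sketch only illustrates the operation, not any particular hardware.

```python
def dense_layer(weights, inputs):
    """y = W @ x, written as explicit multiply-accumulate (MAC) operations."""
    outputs = []
    for row in weights:                 # one output neuron per weight row
        acc = 0.0
        for w, x in zip(row, inputs):   # the MAC: multiply, then accumulate
            acc += w * x
        outputs.append(acc)
    return outputs

# Tiny example: 2 outputs, 3 inputs -> 6 MACs.
W = [[0.1, -0.2, 0.3],
     [0.4,  0.5, -0.6]]
x = [1.0, 2.0, 3.0]
print(dense_layer(W, x))   # [0.6, -0.4]
```

A large model performs billions of these MACs per inference, and every weight has to be fetched from somewhere - which is exactly the data traffic Compute-in-Memory tries to avoid.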
This is especially helpful for running inference on the edge outside the data center. These use cases have energy, size, or heat restrictions. Being able to cut up to 80% of a neural network's energy usage is a game changer.
An Idea Decades Old
The idea is decades old - dating back to the 1960s. Professor Harold Stone of Stanford University first notably explored the idea of "Logic in Memory".
Stone noted that the number of transistors in a microprocessor was growing very fast, but the processor's ability to communicate with its memory was limited by the number of pins. So he presented the idea of moving part of the computation into memory caches.
The 1990s saw further explorations of the idea. In 1995, Terasys produced what we would probably call the first processor-in-memory chip: a standard 4-bit memory with an integrated single-bit arithmetic logic unit.
That logic unit could bring in data, apply some simple operations to it as dictated by a program, and then write the result back to memory.
And then in 1997, professors at UC Berkeley, including David Patterson - one of the fathers of RISC - created the IRAM project with the goal of putting a microprocessor and DRAM on the same chip.
Practical Shortcomings
Several other such proposals followed throughout the 1990s. But none of these ever caught on for a number of practical reasons.
First, memory and logic are hard to manufacture together. Their fabrication processes have somewhat opposing goals. Logic transistors are all about speed and performance, while memory transistors have to deliver high density, low cost, and low leakage all at once.
It is hard to build logic circuits with a DRAM process, and vice versa. DRAM designs are very regular, with lots of parallel wires. Logic designs, on the other hand, are far more complex.
Circuit elements in a chip are connected by wiring stacked into what are referred to as "metal layers". Having more metal layers allows for more complexity, but at the cost of current leakage and worse reliability.
Contemporary DRAM processes use 3 to 4 metal layers. Contemporary logic processes use anywhere from 7 to 12 and even more metal layers.
So it is estimated that if you tried to make logic circuits with a DRAM process, then the logic circuits would be 80% bigger and perform 22% worse.
And if you were to try to make DRAM cells with a logic process then you are essentially making Embedded DRAM or eDRAM. The cells use significantly more power, take up to 10x more space, and are less reliable.
Levels of Implementation: Device
The industry has since considered these manufacturing shortcomings and come up with a variety of workarounds. Most Compute-in-Memory proposals operate at one of three levels - the Device, Circuit, or System level.
The Device level leans on new types of memory hardware beyond conventional DRAM and SRAM. Notable examples include Resistive Random Access Memory (ReRAM or RRAM) and Spin-Transfer Torque Magnetoresistive RAM (STT-MRAM).
ReRAM is one of the more promising emerging memory technologies.
As I mentioned earlier, conventional DRAM stores information as a charge within a capacitor.
ReRAM instead stores information by changing the electrical resistance of a material, switching it between a high and a low resistance state. This structure allows ReRAM to compute logical functions directly within the memory cells. Please don't ask me to explain it any further than that.
ReRAM is probably the emerging memory technology closest to commercialization, thanks to its compatibility with silicon CMOS. However, there remain substantial hurdles to overcome before we see products arrive on shelves.
Circuits
The Circuit level is where we modify peripheral circuits to do the calculations right inside the SRAM or DRAM memory arrays themselves.
The phrase I see a lot is "in situ" computing. "In situ" meaning "locally" or "on site".
These approaches are particularly clever, but they require an intimate knowledge of how memory works and can still be difficult to understand.
One prominent example of this is Ambit, an in-memory accelerator proposed by people from Microsoft, Nvidia, Intel, ETH Zurich, and Carnegie Mellon.
A DRAM memory contains subarrays with many rows of DRAM cells.
In normal use, the memory activates one row at a time.
Ambit instead activates three rows at a time to implement an AND or OR function: two rows hold the inputs, and a third control row determines which of the two operations gets performed.
The concept is logically attractive. You can utilize the memory's internal bandwidth to do all the calculations.
However, there are significant concerns. Ambit can perform basic AND, OR, and NOT logic operations, but each takes multiple cycles. Furthermore, more complex operations like XNOR remain challenging to implement.
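Here is a toy software model of that triple-row trick, based on the published description of Ambit's triple-row activation. The real mechanism is analog charge sharing on the bitlines; this sketch only mimics the end result with a bitwise majority function.

```python
def triple_row_activate(row_a, row_b, row_ctrl):
    """Bitwise majority of three DRAM rows, as in Ambit's triple-row activation.
    With the control row preset to all 0s, the result is A AND B;
    preset to all 1s, it becomes A OR B."""
    return [1 if (a + b + c) >= 2 else 0
            for a, b, c in zip(row_a, row_b, row_ctrl)]

A = [1, 1, 0, 0]
B = [1, 0, 1, 0]

print(triple_row_activate(A, B, [0, 0, 0, 0]))  # AND -> [1, 0, 0, 0]
print(triple_row_activate(A, B, [1, 1, 1, 1]))  # OR  -> [1, 1, 1, 0]
```

As I understand the actual design, the operation overwrites the activated rows, so inputs are first copied into designated compute rows, and NOT requires a special dual-contact cell - details this little model ignores.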
System
So far, the big downside with these two compute-in-memory approaches is that their performance still falls short of what can be achieved with current Von Neumann, GPU/ASIC-centric systems.
In other words, they suffer the same drawbacks people saw in the 1990s. Putting memory and logic together still makes a "Jack of all trades, master of none" situation.
So the middle ground the industry seems to be moving towards is implementing compute-in-memory at the system level - integrating discrete processing units and memory very closely together.
This is enabled thanks to new packaging technologies like 2.5D or 3D memory die stacking - where we stack a bunch of DRAM memory dies on top of a CPU die.
The memories are then connected to the CPU using thousands of channels called Through-Silicon Vias or TSVs. This gives you immense amounts of internal bandwidth.
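For a rough sense of why that matters, here is a small bandwidth comparison. The pin counts and per-pin data rates are illustrative assumptions in the spirit of HBM-style stacking, not the specs of any shipping product.

```python
# Rough bandwidth comparison: a conventional off-package DRAM bus
# vs. a wide TSV-connected memory stack. All figures are assumptions.

def bandwidth_gb_s(data_pins: int, gbit_per_pin: float) -> float:
    return data_pins * gbit_per_pin / 8   # gigabits/s -> gigabytes/s

conventional = bandwidth_gb_s(data_pins=64,   gbit_per_pin=6.4)   # narrow, fast pins
stacked      = bandwidth_gb_s(data_pins=1024, gbit_per_pin=3.2)   # wide, slower pins

print(f"Conventional bus: ~{conventional:.0f} GB/s")   # ~51 GB/s
print(f"Stacked memory:   ~{stacked:.0f} GB/s")        # ~410 GB/s
# Thousands of short vertical connections can be both wider and more
# energy efficient than a handful of long board-level traces.
```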
AMD is working on something kind of like this with what they call 3D V-Cache - which is based on a TSMC 3D stacking packaging technology. They used it to add more memory cache to their processor chips.
There is a future where we can use similarly advanced packaging technologies to add hundreds of gigabytes of memory to an AI ASIC. This lets us integrate together world class memory and logic dies closer than ever before without needing to place them on the same die.
Conclusion
Ideas in the laboratory are cheap. What is harder is to execute on those ideas in a way that performs well enough to replace what is already out there on the market.
And the fact is that the current Nvidia A100 and H100 AI GPUs are very formidable competition.
But with leading edge semiconductor technologies slowing down the way they are, we need new ways to leapfrog towards more powerful and robust hardware for running AI.
Generally speaking, bigger models perform better. Today's best-performing Natural Language Processing and Computer Vision models are great, but they still have a ways to go. Which means they might have to get bigger to get better.
But unless we develop new systems and hardware that can overcome these aforementioned limits, it seems that deep learning might fall short of fulfilling its great expectations.