Running Neural Networks on Meshes of Light
In 2016, humanity took another loss when Google's AlphaGo defeated Lee Sedol at the game of Go.
But in the midst of that loss, some people comforted themselves with a small observation.
In terms of energy usage, the human brain is far more efficient than the computer.
The 2016 AlphaGo ran on 48 TPUs, each consuming about 40 watts. The human brain, on the other hand, with its roughly 100 billion neurons, runs on a meager 20 or so watts.
Neural networks have grown steadily larger over the years. And with that, the amount of energy we need to run them has also grown.
Scientists have been looking for ways to bend the curve on this cost trend, and a very promising approach has been to use silicon photonics.
In this video, I want to talk about the growing efforts surrounding neural networks running on meshes of light.
Before we set out on this, I want to thank PhD student Alex Sludds from MIT for not only introducing me to this world of silicon photonics but also holding my hand and walking me through this ultra-complicated technology.
I will inevitably make some errors in this video, which will be a little more technical than most. Forgive me for them. These errors are mine alone, not Alex's.
Let us start at the beginning. Much of our recent advances in deep learning have been enabled by two pieces of silicon technology: GPUs and AI accelerators.
The GPU first allowed us to train large and accurate neural networks in a reasonable amount of time.
A big deal, but an arguably bigger development was when Google founded the multi-billion dollar AI accelerator industry with the Tensor Processing Unit or TPU.
These AI accelerators have helped us bring the benefits of machine learning to the general public by allowing us to run these models on our data at scale. With these, tools like Google Photos and the lot are possible for us to use.
Yet despite their massive impact, these accelerators are conceptually pretty simple. That is because they are geared to do one thing really well: Matrix multiplication.
In practice, neural networks are represented as mathematical matrices. When someone wants to "use" a pre-trained neural network model - like to identify what is in a picture - they use matrices.
The pre-trained model is a matrix. The image data is a matrix. Roughly speaking, when the model is being "run", you are multiplying those matrices together in search of a final answer.
Matrix multiplication is a simple math operation: each entry of the output is the sum of the products of a row of one matrix with a column of the other.
Over 90% of running a pre-trained neural network - a process that the industry refers to as “inference” - involves matrix multiplication operations.
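As a toy illustration of what those inference operations look like, here is a matrix-vector multiply in plain Python (the weights and inputs are made up for the example):

```python
# Multiply a weight matrix W by an input vector x: each output
# entry is the sum of products of one row of W with x.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[1, 2],
     [3, 4]]   # "weights" from a tiny, hypothetical trained model
x = [5, 6]     # input data

print(matvec(W, x))  # [1*5 + 2*6, 3*5 + 4*6] = [17, 39]
```

Real models do this with matrices containing millions or billions of entries, which is why dedicated hardware pays off.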
So inside the TPU is a 2-D array containing tens of thousands of multiply-accumulator circuits or MAC units.
Each MAC unit is geared for doing one thing: Multiplying two numbers at high precision and then adding the product to a running accumulation sum.
When in use, the MAC units retrieve a matrix of pre-determined values from the pre-trained neural network model - referred to as "weights" - so that they can multiply them with the image data matrix.
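A single MAC unit's job can be sketched in a few lines of Python (function names are mine, not hardware terminology). A real TPU runs tens of thousands of these steps in parallel; here they are chained sequentially to produce one output value:

```python
# One multiply-accumulate step: acc + weight * value.
def mac(acc, weight, value):
    return acc + weight * value

weights = [0.5, -1.0, 2.0]   # illustrative model weights
inputs  = [4.0, 3.0, 1.0]    # illustrative input data

# Accumulate the dot product one MAC operation at a time.
acc = 0.0
for w, v in zip(weights, inputs):
    acc = mac(acc, w, v)

print(acc)  # 0.5*4 - 1.0*3 + 2.0*1 = 1.0
```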
Knowing this, it makes perfect sense why the TPU outperforms a big Nvidia GPU. It's the semiconductor equivalent of a guy who only pumps iron with his right arm because he's a professional arm wrestler.
One of the big problems with this computing paradigm has been energy consumption. As I mentioned at the very start of this video, the brain is capable of doing what it does with far less power. Why?
Interestingly enough, if you crack open an AI accelerator like the Google TPU and look at how it uses its 40 watts of power, you will find that about 80% of that power budget is spent on connections and data transfer.
Every time the circuits move data around - like to load in the weights from the system memory - energy is used. This is because its connections are electric. A flow of electrons - negatively charged subatomic particles - circulating around the circuit.
Electrons can interact with other particles. And when they do, they generate losses in the form of heat.
Furthermore, AI accelerators are very parallel. Each of those MAC units runs side by side on a separate task, which lets them finish the whole project faster but at the cost of using more energy.
Companies and designers are aware of this and have designed their systems with it in mind. For instance, they might design the system to bring the model’s weights physically closer to the chip’s MAC circuits.
As a result of these efforts, AI accelerators have gotten to be 20 times more power efficient on a per-MAC basis than a GPU.
AI engineers are also doing more with less hardware. The AlphaGo of 2016 used 48 TPUs across a broad network.
But a year later, Google debuted AlphaZero, a better performing AI that uses just 4 TPUs in a single machine.
Despite all that, we still want to find ways to disruptively improve the energy efficiency of the hardware running these neural network models. It stands in the way of achieving artificial general intelligence, and it also costs companies a lot of money.
So what can we do? This is where silicon photonics comes into play.
I talked a little bit about silicon photonics in another video, so you can watch that if you want the whole tamale. Here's the season recap.
Silicon photonics chips replace the electrical connections used by traditional semiconductors with light-based ones. Using the same principles that allow us to send data through fiber optic cables, you can send light signals around the chip with hardly any losses.
Broadly speaking, the benefits of silicon photonics computing devices over traditional silicon devices are twofold.
First, very high bandwidth - 100 terahertz and above. Light's carrier frequencies are enormously high, so optical signals can carry far more information than electrical ones.
Second, and more importantly, there is the potential of very low power consumption. The only energy used would be that for sending or receiving the light itself. That 80% or so of power that the TPU was using for data connections? Gone.
Of course, that's only the theory. In reality, there are limits to what can be practically achieved in the energy efficiency category. But the overall gain can still be a potential 10x improvement over the market status quo.
Okay. How does this look in practice?
The simplest implementation is a hybrid approach that fuses silicon photonics and traditional semiconductors. The traditional semiconductors will help store the inputs, the weights, and the intermediate results between neural network layers.
That data would then cross over to a photonic circuit through digital-to-analog converters, or DACs, where the crazy light bending stuff happens. The output comes back to the rest of the circuit through ADCs - analog-to-digital converters.
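A heavily simplified model of that hybrid pipeline, assuming hypothetical 8-bit converters (the names, bit widths, and scaling are illustrative, not any vendor's actual design):

```python
# Toy model of the hybrid pipeline: DAC quantizes a digital value
# to a finite set of analog levels, the photonic stage multiplies
# in the analog domain, and the ADC digitizes the result.
def dac(value, bits=8):
    # Quantize a value in [-1, 1] to one of 2^bits - 1 levels.
    levels = 2 ** bits - 1
    code = round(max(-1.0, min(1.0, value)) * levels)
    return code / levels

def adc(analog, bits=8):
    # Digitizing back is the same quantization in reverse.
    return dac(analog, bits)

# The "photonic" multiply happens between the two converters.
w, x = 0.3, 0.7
result = adc(dac(w) * dac(x))
print(round(result, 3))  # close to 0.21, minus quantization error
```

Even in this toy version you can see the key trade-off: every trip through a converter loses a little precision, which becomes important later in this story.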
Let's take a look at a company called Lightmatter.
The MIT spinoff startup has raised over $100 million from venture capitalists and big enterprises for their approach to silicon photonics-enabled deep learning.
Their photonic circuit - which they refer to as a nanophotonic processor - replaces the Google TPU’s 2-D array of Multiply-Accumulator circuits with a mesh of silicon photonics components called Mach-Zehnder interferometers, or MZIs.
The Mach-Zehnder is a basic building block of photonics. If you apply a voltage to it, then it can split and then recombine light in a specific way.
When this happens, the nature of that recombined light changes from its input state. The magnitude of that change can be mapped to a multiplication result. Yes, really.
With this, your weight and input matrices can be converted into arrangements of light - roughly speaking. You then execute the matrix multiplication’s computations by sending that light through the MZI photonic mesh.
Since light travels so fast, this calculation happens quickly - about 100 picoseconds. The photons that come out at the end can be detected, collected, and then mapped to a value that represents your computation result.
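To make that "light as multiplication" idea concrete, here is an idealized, lossless single-MZI sketch. Assuming an MZI with internal phase shift theta transmits a fraction cos²(theta/2) of the input light's power (a standard textbook simplification), a weight between 0 and 1 can be encoded as a phase setting:

```python
import math

# Idealized MZI: fraction of input power that reaches the output.
def mzi_transmission(theta):
    return math.cos(theta / 2) ** 2

# Encode a weight w in [0, 1] as a phase setting...
def phase_for_weight(w):
    return 2 * math.acos(math.sqrt(w))

# ...so that output power = input power * w.
w = 0.25                 # the weight we want to multiply by
x = 8.0                  # the input, encoded as optical power
theta = phase_for_weight(w)
print(x * mzi_transmission(theta))  # ≈ 2.0, i.e. 8.0 * 0.25
```

A real mesh chains many MZIs together so that the interference pattern across the whole array implements a full matrix, but each individual device is doing something like this.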
This photonic mesh produces working neural networks - albeit smaller ones with only a few neurons and layers. And as predicted, these small networks are 3 orders of magnitude more power efficient than their electrical peers. So this stuff actually works.
With all that being said, however, there are still a few other issues. Let's talk about the two biggest.
First, the accuracy isn't quite there yet. The Lightmatter team notes in their 2017 paper that their photonic neural network achieved about a 76.7% accuracy rating in recognizing vowel sounds. Simulations suggested that it should have reached about 91%.
A large part of this has to do with how the system encodes and decodes the data from light - in other words, the conversion from analog to digital. Photons are weird and sometimes you get measurement errors.
There are ideas to improve this - for instance, improving the light's contrast to make it more detectable - but more time and investment is needed.
This brings up another point when it comes to running on photonic meshes and other analog computers of this nature. Because you are dealing with analog signals, you are unlikely to get the same accuracy that you can get with digital signals - as in, you cannot represent as many digits of precision in your numbers.
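A tiny simulation makes the point, assuming (purely for illustration) that every analog multiply picks up about 1% Gaussian noise while the digital version computes exactly:

```python
import random

random.seed(0)  # make the "analog noise" reproducible

weights = [0.2, -0.5, 0.9, 0.4]
inputs  = [1.0,  2.0, 0.5, 3.0]

# Digital: exact at its chosen precision.
digital = sum(w * x for w, x in zip(weights, inputs))

# "Analog": each multiply is perturbed by a little noise.
noise = 0.01  # assumed 1% noise per operation
analog = sum(w * x * (1 + random.gauss(0, noise))
             for w, x in zip(weights, inputs))

print(round(digital, 2))  # 0.85
print(round(analog, 3))   # close to 0.85, but not equal
```

The errors are small per operation, but a deep network chains millions of them, which is how small analog imprecision turns into the accuracy gap described above.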
For this reason, silicon photonics will probably remain a technology for running inference rather than training. Companies will still have to rely on big Nvidia GPUs for that purpose.
All that being said, there have been some recent interesting efforts. One group from Cornell has managed to train unusual neural networks that lack the digital precision I just mentioned, using traditional digital training methods. So perhaps this accuracy obstacle can be overcome.
Another issue is that of size and scale. The system profiled in the 2017 Lightmatter paper had just 2 layers and 56 Mach-Zehnder Interferometers.
In the 1980s, Bell Labs produced optical transistors with the intention of making commercial computers, but the effort didn't live up to its promise in part due to the bulky size of the optics.
The advent of silicon photonics has helped us shrink these components somewhat, but MZIs are still rather bulky.
They are usually about 10,000 square micrometers, which is big in the semiconductor world.
To compare, the Nvidia A100 has hundreds of cores capable of trillions of floating point operations each second. In order to be commercially competitive, can a photonic mesh system scale up to far more layers and neurons while retaining its advantages?
In this particular space, the problem remains outstanding. What Lightmatter has done to tackle it was to produce bigger and bigger silicon photonics chips.
In 2020, they announced their Mars chip, which has a 64 by 64 matrix of Mach-Zehnder interferometers - fabbed on a 90 nanometer process.
Companies like GlobalFoundries and Intel have been investing a lot of money into their silicon photonics platforms. So it is very possible that these chips reach commercialization scale simply based on the fabrication improvements they make.
If they don't however, there have been some theoretical approaches to scaling without needing to fab out all those interferometers.
For instance, it is possible to get a 1-D row of interferometers to perform similarly to a 2-D mesh by essentially replacing one of the dimensions with a time dimension.
This offers a theoretical pathway to much larger neural networks without needing the silicon photonics fabrication technology to catch up.
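The trick is trading hardware for time. Here is a hedged sketch of the idea in Python: a single 1-D row of "multipliers" is reloaded with a new row of weights at each time step, so over several steps it produces the same matrix-vector product a full 2-D mesh would compute at once:

```python
# One 1-D row of multipliers, reused across time steps.
# At step t the row holds row t of the weight matrix, multiplies
# it elementwise against the input, and sums the products.
def timemux_matvec(W, x):
    outputs = []
    for row in W:  # each loop iteration = one time step
        outputs.append(sum(w * xi for w, xi in zip(row, x)))
    return outputs

W = [[1, 0, 2],
     [0, 3, 1]]
x = [4, 5, 6]
print(timemux_matvec(W, x))  # [16, 21]
```

The result is identical to the all-at-once version; you just pay for it in time steps instead of in fabricated interferometers.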
Silicon photonics-powered neural networks sound like a marketing phrase, but the ideas scratch real world itches. Google created the TPU because they realized they needed some way to make it cheaper to run neural network models on real world data.
The same economic rationale applies here. Energy usage is a direct operating cost for these massive data centers. If light-powered neural networks can provide similar performance but with 10 times less energy usage, Google and Amazon will snap them up like hotcakes.
That is of course if these photonic meshes can perform. Competition in the space is heating up, with a number of startups in addition to Lightmatter jumping in with their own products. It shows that the technology has potential. It just needs to deliver on that promise.