Meet the engineer who believes he can reduce AI’s energy consumption by up to 30%

Huge savings are possible simply by changing the maths of tensor computation, says BitEnergy AI’s Hongyin Luo

Replacing computationally complex floating-point tensor multiplication with much simpler integer addition is up to 20 times more energy efficient. Together with incoming hardware improvements, this promises much more efficient training and inferencing in the future, says Hongyin Luo.

"Efficiency has always been at the top of my mind”, says Hongyin Luo, founder of MIT spinoff BitEnergy AI. Luo has spent years researching ways to tame the colossal energy appetite of large language models, creating the startup to put his PhD and post doc studies to beneficial use.

As an illustration of why AI efficiency is such a hot topic, a recent study found that the average energy consumed by ChatGPT in early 2023 was 564 MWh per day, equivalent to the consumption of 18,000 average US homes, while Google’s AI services could be gobbling up as much power as the whole of Ireland. What’s more, inference, the process of querying a model, can in some cases use as much energy as training, or more.

In what Luo believes could be a significant breakthrough, he and his collaborators found that simplifying the computations in key processes used in model training and inferencing - namely the attention mechanism and the linear transformations in the transformer - can reduce the energy consumption of those processes by an order of magnitude, with very little adverse effect on performance. Together, these processes account for about 30% of the total energy used by an LLM over its lifecycle.

The maths may be complicated, but the concept is straightforward. Luo's method replaces computationally complex floating-point tensor multiplication with much simpler integer addition. Tests of BitEnergy AI's L-Mul algorithm (short for linear-complexity multiplication) point to a potential energy saving of up to 95% on those operations. Importantly, precision also increases, at least for some architectures.
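
To make the idea concrete, here is a minimal sketch in Python of an L-Mul-style multiplication on ordinary 32-bit floats. It illustrates the published approximation rather than BitEnergy AI's own code, and the function names are ours: the exponent-and-mantissa bit fields of the two operands are added as plain integers, the doubled exponent bias is subtracted, and a small constant offset stands in for the mantissa-product term that exact multiplication would compute. Special cases (zeros, subnormals, infinities, NaNs, overflow and underflow) are ignored.

```python
import struct

MANTISSA_BITS = 23   # float32 mantissa width
EXP_BIAS = 127       # float32 exponent bias
L_OFFSET = 4         # correction exponent l(m); the L-Mul paper uses 4 for wide mantissas

def float_to_bits(x: float) -> int:
    """Reinterpret a float32 value as its 32-bit unsigned integer pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit unsigned integer pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def lmul(x: float, y: float) -> float:
    """Approximate x * y without multiplying the mantissas.

    Exact multiplication of normalised floats computes
        (1 + mx) * (1 + my) * 2**(ex + ey)
      = (1 + mx + my + mx*my) * 2**(ex + ey);
    L-Mul replaces the costly mx*my term with the constant 2**-l(m).
    """
    bx, by = float_to_bits(x), float_to_bits(y)
    sign = (bx ^ by) & 0x80000000            # sign of the product: XOR of the sign bits

    # Adding the raw exponent+mantissa fields adds the exponents and the mantissa
    # fractions in one go; a mantissa carry spills naturally into the exponent field.
    acc = (bx & 0x7FFFFFFF) + (by & 0x7FFFFFFF)
    acc -= EXP_BIAS << MANTISSA_BITS         # remove the doubled exponent bias
    acc += 1 << (MANTISSA_BITS - L_OFFSET)   # add the 2**-l(m) correction term

    return bits_to_float(sign | (acc & 0x7FFFFFFF))
```

For example, lmul(1.25, 1.25) returns the exact product 1.5625, because here the dropped mantissa product happens to equal the correction term, while lmul(1.5, 3.0) returns 4.25 against a true value of 4.5 - the kind of bounded error the method trades for replacing multiplier circuits with adders.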

Computing spoke to Luo and asked him to talk us through what his research indicates and what he hopes to achieve. This interview has been edited for brevity.

Many groups are trying to reduce the energy consumption of generative AI. Can you tell us where your efforts fit?

Our technology is orthogonal to hardware efforts that improve data I/O, and also to software efforts like model pruning and quantisation, which reduce the complexity of the parameters. We improve the computation itself, so our algorithm can be combined with most existing efforts to improve model efficiency.

What percentage of total energy used by an LLM in its lifecycle is down to the use of floating-point tensor multiplication?

The energy we reduce is for tensor computation. Once the tensors are loaded onto the tensor computation architecture, it transforms them from input to output. This part takes about 30% of the total energy.

In modern GPUs, the energy cost of running AI models can be divided into two parts: one is compute and the other is I/O, which is loading the tensors from high-bandwidth memory (HBM) into SRAM. I/O is the more significant part; it can take 70% of the total energy.

The major players in this area, like Nvidia, Google and Groq, are building new chips that optimise the data I/O. So they are solving this 70% problem, and we’re solving the 30% problem. However, it’s thought that those folks are close to solving the data I/O efficiency problem, and once they’ve done that the attention will shift from data I/O to tensor computation, and our solution will become more significant.

So, 70% of total energy consumption is down to data I/O. How much will those vendors be able to get this I/O consumption down, in your opinion?

Well, there are lots of claims and promises. I've seen companies saying that their chips are a thousand times more efficient than GPUs. So it's hard to say, but 10 times is maybe a reasonable guess. The latest hardware, including TPUs [Tensor Processing Units], NPUs [Neural Processing Units] and ASICs [Application-Specific Integrated Circuits], is trying to minimise data movement. So I think that will be a significant energy saving.

Your L-Mul algorithm has been tested in simulations on 8-bit floating-point numbers, where you found that precision actually increases compared to the traditional methodology. Will it also work with 16-bit or 32-bit tensors?

This algorithm can work with any format. For 8-bit floating point we can improve both efficiency and precision. Our experiments tell us that on 16-bit floating-point numbers we can definitely achieve higher precision than with eight bits, but it will be lower than native 16-bit multiplication.
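
That pattern makes intuitive sense: L-Mul's error comes from swapping the mantissa product for a fixed offset, so it barely shrinks as mantissa bits are added, whereas a format's own rounding error halves with every extra bit. The snippet below - our illustration, not the experiments Luo is describing - estimates the mean relative error of the approximation (1 + mx)(1 + my) ≈ 1 + mx + my + 2^-l at mantissa widths matching FP8 E4M3 (3 bits), bfloat16 (7 bits) and FP16 (10 bits).

```python
import random

def lmul_mantissa_error(mant_bits: int, trials: int = 100_000, seed: int = 0) -> float:
    """Mean relative error of the L-Mul mantissa approximation at a given width.

    The correction exponent l follows the rule reported in the L-Mul paper:
    l = m for m <= 3, l = 3 for m = 4, l = 4 for m > 4.
    """
    l = mant_bits if mant_bits <= 3 else (3 if mant_bits == 4 else 4)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw mantissa fractions on the grid representable with mant_bits bits.
        mx = rng.randrange(2 ** mant_bits) / 2 ** mant_bits
        my = rng.randrange(2 ** mant_bits) / 2 ** mant_bits
        exact = (1 + mx) * (1 + my)
        approx = 1 + mx + my + 2.0 ** -l
        total += abs(approx - exact) / exact
    return total / trials

for bits in (3, 7, 10):   # FP8 E4M3, bfloat16 and FP16 mantissa widths
    print(f"{bits}-bit mantissa: mean relative error ~ {lmul_mantissa_error(bits):.2%}")
```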

Why has no one thought to replace floating-point tensor multiplication with linear-complexity integer addition, as you're doing, before?

First, floating-point multiplication is the IEEE standard. Second, people are currently using the same chip to do both training and inference, and there is a desire to pre-train a model at full precision, 16 bits. So companies are still building chips that support full-precision training, which makes sense.

However, there are some new signals coming up, suggesting that we might want different chips for training and inference.

Also, if we consider the AI models used on edge computing devices, like phones, they naturally prefer lower-bit models, like 8-bit or even 4-bit. They also have a strong preference for lowering energy cost and extending battery life. As a result, new approximation algorithms and architectures will be helpful for inference devices.

How much hardware re-architecting will your algorithm require?

It will be very easy to implement that architecture, because we are not bringing in new on-chip devices or other chip design technologies. Our assumption is simply removing the floating-point multipliers on the chips - that's getting rid of a very complicated part of the chip - and using other existing devices on the chip to handle the multiplications.

Is it possible just to bypass this floating-point multiplier on the chip right now?

Ideally, we envision a chip without floating-point multipliers as an efficient solution for AI inference hardware. With existing Nvidia GPUs it should be possible to bypass the floating-point multipliers, but it will need something written into the CUDA kernels or the instruction set to support this at the software level.
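
As a rough sketch of what such software-level support could look like - an assumption on our part, not a description of BitEnergy AI's kernels - the same bit trick can be vectorised over whole float32 tensors. A real implementation would live inside a GPU kernel rather than in NumPy on the CPU.

```python
import numpy as np

def lmul_array(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Element-wise L-Mul-style approximate product of two float32 arrays.

    Reinterprets the floats as unsigned integers, adds the exponent+mantissa
    fields, removes the doubled bias and adds the 2**-4 correction term.
    Zeros, subnormals, infinities, NaNs and exponent under/overflow are not handled.
    """
    bx = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    by = np.ascontiguousarray(y, dtype=np.float32).view(np.uint32)
    sign = (bx ^ by) & np.uint32(0x80000000)
    acc = (bx & 0x7FFFFFFF) + (by & 0x7FFFFFFF)   # integer addition replaces the multiply
    acc = acc - (127 << 23) + (1 << 19)           # remove doubled bias, add 2**-4 correction
    return (sign | (acc & 0x7FFFFFFF)).astype(np.uint32).view(np.float32)

# Quick check against exact multiplication on random activations.
a = np.random.randn(4, 4).astype(np.float32)
b = np.random.randn(4, 4).astype(np.float32)
print(np.max(np.abs(lmul_array(a, b) - a * b)))
```

On today's chips this only demonstrates the arithmetic; the energy benefit Luo describes depends on kernels, or hardware, that actually route around the floating-point multipliers.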

What are the next steps?

We have filed a provisional patent, and we're hoping to file a full patent within a year. We have also submitted our research paper to ICLR 2025.