Multi-purpose AI models like GPT-4 use 'orders of magnitude' more energy than task-specific ones, research finds
A steam hammer to crack a nut
There has been a surge in the popularity of products based on multi-purpose GenAI systems, which can handle images, video and text and carry out many different tasks. These models, which work across modalities and can tackle tasks "zero-shot" (without task-specific training), include GPT-4 and PaLM and are finding their way into search engines, email and navigation.
However, the "generality" pursued by AI companies comes at a high environmental cost. For many use cases it amounts to using a steam hammer to crack a nut.
Researchers Alexandra Sasha Luccioni and Yacine Jernite from model-hosting platform Hugging Face and Emma Strubell of Carnegie Mellon University found that multi-purpose, generative architectures use "several orders of magnitude" more energy than task-specific systems when performing a variety of inference tasks, including sentiment analysis and question answering, even when controlling for the number of model parameters.
Generally speaking, the larger the model and the more complex the modality, the more energy is used. Image generation consumes, on average, 60 times more energy than text generation, and large image-generation models use far more energy than small ones, their pre-print paper Power Hungry Processing reveals.
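To put that multiplier in perspective, here is a back-of-envelope calculation in Python. Only the 60x ratio comes from the paper; the per-query energy figure and the daily query volume are illustrative assumptions, not measurements.

```python
# Scaling the paper's 60x image-vs-text ratio to a hypothetical service.
# Only the multiplier is from the paper; the other figures are assumed.
TEXT_KWH_PER_QUERY = 0.00005   # assumed energy per text generation (kWh)
IMAGE_MULTIPLIER = 60          # image generation ~60x text, per the paper
QUERIES_PER_DAY = 1_000_000    # hypothetical service volume

text_kwh = TEXT_KWH_PER_QUERY * QUERIES_PER_DAY
image_kwh = text_kwh * IMAGE_MULTIPLIER
print(f"text: {text_kwh:,.0f} kWh/day, images: {image_kwh:,.0f} kWh/day")
# -> text: 50 kWh/day, images: 3,000 kWh/day
```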
While the majority of total AI energy use comes from training the models, inference (querying a model and obtaining an answer) is still more carbon- and energy-intensive than what it replaces, such as queries to traditional search engines, and those per-query costs accumulate with every user.
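The trade-off can be made concrete with a simple break-even sketch: given a one-off training cost and a per-query inference cost, how many queries does it take before cumulative inference energy overtakes training? Both figures below are hypothetical, chosen only to illustrate the shape of the trade-off.

```python
# Break-even between one-off training energy and cumulative inference
# energy. Both constants are hypothetical assumptions, not paper figures.
TRAINING_KWH = 500_000            # assumed one-off training cost (kWh)
INFERENCE_KWH_PER_QUERY = 0.0003  # assumed per-query inference cost (kWh)

break_even_queries = TRAINING_KWH / INFERENCE_KWH_PER_QUERY
print(f"inference overtakes training after ~{break_even_queries:,.0f} queries")
# -> ~1,666,666,667 queries, a volume a popular service could plausibly reach
```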
As more people and organisations use GenAI, it is therefore vital to understand the choices and trade-offs involved, and to pick the right model for the right task rather than defaulting to an unnecessarily wasteful, multi-purpose solution.
The researchers attribute the higher energy consumption of multi-purpose models to two factors: they generate answers token by token from a large output vocabulary, and they are more dependent on the prompting strategy used than task-specific models. Each generated token requires a full forward pass through all of the model's layers, whereas a task-specific model typically makes a single pass and reads its answer off the final layer, which is far more energy-efficient.
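The difference is easy to see in practice. The sketch below (not the authors' exact harness) runs the same sentiment-analysis inputs through a task-specific classifier and a multi-purpose generative model from the Hugging Face Hub, tracking energy with the open-source codecarbon library; the model choices and prompt wording are assumptions for illustration.

```python
# A minimal sketch contrasting a task-specific classifier with a
# multi-purpose generative model on the same sentiment task, with
# emissions estimated by codecarbon. Model choices and prompt wording
# are illustrative assumptions, not the paper's exact setup.
from codecarbon import EmissionsTracker
from transformers import pipeline

TEXTS = ["The battery life is terrible but the screen is lovely."] * 100

def run_with_tracking(name, fn):
    tracker = EmissionsTracker(project_name=name, log_level="error")
    tracker.start()
    fn()
    kg_co2 = tracker.stop()  # estimated emissions in kg CO2-eq
    print(f"{name}: ~{kg_co2:.6f} kg CO2-eq")

# Task-specific: one forward pass per input; the label is read directly
# from the model's final layer.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")
run_with_tracking("task-specific classifier", lambda: clf(TEXTS))

# Multi-purpose: the answer is generated token by token over a large
# output vocabulary, so each input costs several full forward passes.
gen = pipeline("text2text-generation", model="google/flan-t5-base")
prompts = [f"Is the sentiment positive or negative? {t}" for t in TEXTS]
run_with_tracking("multi-purpose generator", lambda: gen(prompts))
```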
The authors say this side-by-side comparison of training, fine-tuning and inference energy requirements is the first of its kind.
Unfortunately, there is no standardised methodology or data for quantifying and comparing the energy consumption and carbon emissions of ML models, or for evaluating the emissions embodied in hardware such as GPUs, which may account for as much as half of a model's overall carbon footprint.
The researchers suggest that model creators should be more open about both the upfront (training) and downstream (inference) costs of ML models, but note a "growing lack of transparency" in these areas.
They also call for more transparency around model architecture and training details, which would enable more research into the environmental impacts of ML.
"Given our findings and the increased deployment of generative, multi-purpose AI models, we hope that both ML researchers and practitioners will practice transparency regarding the nature and impacts of their models, to enable better understanding of their environmental impacts," they conclude.