Microsoft unveils prototype LLM designed to analyse spreadsheets
SpreadsheetLLM excels at tabulated number crunching, researchers claim
Microsoft has unveiled a new experimental large language model (LLM), dubbed SpreadsheetLLM, specifically designed to tackle the challenge of spreadsheets.
Described in a recently published pre-print research paper, SpreadsheetLLM aims to bridge the gap between powerful AI models and the complexities of spreadsheets, which are widely used in business but remain a stumbling block for current AI.
Spreadsheets hold a wealth of data, from simple calculations to intricate financial models. However, their structured format, formulas and references pose difficulties for existing LLMs to process and analyse the information effectively.
Microsoft says SpreadsheetLLM is equipped with a unique encoding method to optimise LLM capabilities for spreadsheets.
One hurdle is the sheer number of tokens (the units of data an LLM processes) that spreadsheets generate. To tackle this issue, Microsoft has developed a framework called SheetCompressor, which it claims can condense spreadsheet data by up to 96%, allowing LLMs to handle even large datasets within their processing limits while still preserving the data's structure and relationships.
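Microsoft has not released SheetCompressor's code, but the underlying token problem is easy to demonstrate. The minimal sketch below (in Python, with all names hypothetical) serialises a toy sheet cell by cell, then as a value-to-cells dictionary that skips empty cells and merges repeated values, and compares the size of the two encodings:

```python
# Rough illustration of why naive spreadsheet serialisation is token-hungry.
# All function names are hypothetical; SheetCompressor itself is not public.

def naive_encoding(cells: dict[str, str]) -> str:
    """Serialise every cell as 'address,value' - the obvious but verbose way."""
    return "|".join(f"{addr},{val}" for addr, val in cells.items())

def dictionary_encoding(cells: dict[str, str]) -> str:
    """Group cell addresses by value, skipping empties: one entry per distinct value."""
    by_value: dict[str, list[str]] = {}
    for addr, val in cells.items():
        if val:  # empty cells contribute nothing
            by_value.setdefault(val, []).append(addr)
    return "|".join(f"{val}:{','.join(addrs)}" for val, addrs in by_value.items())

# A toy sheet: one header row, a repeated label, and a column of empty cells.
sheet = {"A1": "Region", "B1": "Sales", "A2": "North", "A3": "North",
         "A4": "North", "B2": "100", "B3": "100", "B4": "100",
         "C1": "", "C2": "", "C3": "", "C4": ""}

naive = naive_encoding(sheet)
compact = dictionary_encoding(sheet)
print(len(naive), len(compact))  # the compact form is markedly shorter
print(f"saved ~{1 - len(compact) / len(naive):.0%}")
```

Even on this tiny sheet the dictionary form is roughly 40% smaller; the paper's headline figure of up to 96% presumably reflects real spreadsheets, which contain far more empty and repetitive cells than this example.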
SheetCompressor improves performance in detecting spreadsheet tables by over 25% compared with a vanilla encoding approach, the research paper states.
SpreadsheetLLM comprises three modules:
- Structural-anchor-based compression: This method strategically places "anchors" within the spreadsheet to enhance the LLM's understanding of the data layout. It then condenses the table by removing unnecessary rows and columns, creating a simplified "skeleton."
- Inverse index translation: This module tackles the issue of empty cells and repetitive values, which consume excessive tokens. It employs a unique JSON-based method to create a dictionary that identifies non-empty cells and merges identical text, optimising token usage without compromising data integrity.
- Data-format-aware aggregation: This module addresses the challenge of adjacent numerical cells with similar formats. It recognises that the exact numerical values matter less than the overall data structure, so it extracts data types and formats and groups adjacent cells with similar properties, streamlining the process without wasting tokens. (A rough sketch of the last two modules follows below.)
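The paper describes these modules at a high level rather than as published code, so the following is a speculative sketch of the last two ideas: an inverse index mapping each distinct value to the cells that contain it, and format-aware aggregation that replaces runs of similarly formatted cells with a (format, count) pair. Every function name here is an assumption made for illustration:

```python
import json
from itertools import groupby

# Hypothetical sketch of two SheetCompressor ideas described in the paper;
# the real implementation is not public and will differ in detail.

def inverse_index(cells: dict[str, str]) -> str:
    """Inverse-index translation: map each distinct non-empty value to the
    cells that contain it, so repeated text is stored once."""
    index: dict[str, list[str]] = {}
    for addr, val in cells.items():
        if val:
            index.setdefault(val, []).append(addr)
    return json.dumps(index)

def infer_format(val: str) -> str:
    """Collapse a cell value to a coarse data-format token."""
    if val.replace(".", "", 1).isdigit():
        return "float" if "." in val else "int"
    if val.endswith("%"):
        return "percentage"
    return "text"

def aggregate_formats(column: list[str]) -> list[tuple[str, int]]:
    """Data-format-aware aggregation: adjacent cells sharing a format are
    grouped into (format, run_length) pairs instead of listing every value."""
    return [(fmt, len(list(run))) for fmt, run in groupby(column, key=infer_format)]

cells = {"A1": "Region", "A2": "North", "A3": "North", "B2": "1.5", "B3": "2.25"}
print(inverse_index(cells))
# {"Region": ["A1"], "North": ["A2", "A3"], "1.5": ["B2"], "2.25": ["B3"]}

print(aggregate_formats(["Q1", "10.5", "11.2", "9.8", "12%"]))
# [('text', 1), ('float', 3), ('percentage', 1)]
```

The design intuition is the same in both cases: a spreadsheet's redundancy (repeated labels, runs of similarly formatted numbers, swathes of empty cells) can be factored out before the LLM ever sees it.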
SpreadsheetLLM builds on the "Chain of Thought" prompting methodology to introduce a "Chain of Spreadsheet" (CoS) framework, which decomposes spreadsheet reasoning into a series of steps: table detection, matching and reasoning. This approach, says Microsoft, has the potential to significantly transform spreadsheet data management and analysis, paving the way for more efficient user interactions.
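Microsoft presents CoS as a prompting pipeline rather than a library, so the sketch below is a speculative rendering of how the three stages might chain together; `ask_llm` is a hypothetical stand-in for any chat-completion call, and the prompts merely paraphrase the stage names from the paper:

```python
# Speculative sketch of a Chain of Spreadsheet (CoS) pipeline.
# `ask_llm` is a hypothetical stand-in for any chat-completion client;
# the prompts paraphrase the stages named in the paper, not its exact text.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

def chain_of_spreadsheet(compressed_sheet: str, question: str) -> str:
    # Stage 1: table detection - find the table(s) in the compressed encoding.
    tables = ask_llm(
        f"Identify the boundaries of each table in this spreadsheet "
        f"encoding:\n{compressed_sheet}"
    )
    # Stage 2: matching - pick the table region relevant to the question.
    region = ask_llm(
        f"Given these tables:\n{tables}\n"
        f"Which region is needed to answer: {question}?"
    )
    # Stage 3: reasoning - answer using only the matched region as context.
    return ask_llm(
        f"Using only this region:\n{region}\nAnswer the question: {question}"
    )
```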
In tests SpreadsheetLLM surpassed existing methods for spreadsheet table detection, "the foundational task of spreadsheet understanding," by 12.3%, and performed reasonably well on question-answering tasks over spreadsheet data, although high compression rates and long contexts reduced accuracy.
The model was also able to significantly enhance the capabilities of established LLMs like GPT-3.5 and GPT-4 in understanding spreadsheets, with GPT-4 achieving a table detection score of nearly 79% when aided by SpreadsheetLLM.
While promising, SpreadsheetLLM isn't without limitations. Spreadsheets with fancy formatting like background colours and borders can still confuse the model due to increased token usage.
Additionally, SheetCompressor currently struggles with cells containing natural language.
SpreadsheetLLM is a research project, and Microsoft hasn't announced plans to make it public just yet. The research paves the way for some interesting possibilities, although the probabilistic nature of GenAI will not always be a good match for the precision of data held in spreadsheets.
There has been a steady stream of announcements of LLMs claiming enhanced capabilities over their predecessors.
In April, Meta released Llama 3, the latest iteration of its LLM series, claiming significant performance improvements over its predecessors. Llama 3 arrived in two variants: Llama 3 8B with 8 billion parameters and Llama 3 70B with 70 billion parameters.
Meta is training even larger models exceeding 400 billion parameters. These future iterations aim to be multilingual, handle various data formats beyond text, and offer improved reasoning and coding capabilities.