Researchers at Microsoft have developed a framework designed to make it easier for large language models (LLMs) to analyze the content of spreadsheets and perform data management and analysis tasks, because why not?
In a research paper published on the open-access repository arXiv, the Redmond team explains that the format of spreadsheets creates some significant challenges for LLMs.
According to the researchers, large spreadsheets often contain “numerous homogeneous rows or columns,” which “contribute minimally to understanding the layout and structure,” and, for that matter, make analysis difficult for humans as well.
To address this, their LLM-driven analysis tool, called SpreadsheetLLM, serializes the data, incorporating cell addresses, values, and formats into a data stream. But this approach runs slap-bang into another problem: the token limits of many LLMs, a token being a short string of characters or symbols that a model processes as a unit.
In order to solve this conundrum, the research team had to develop yet another framework, called SheetCompressor, to compress the data. It comprises three separate modules: one to analyze the spreadsheet structure and discard anything outside of a table; another to translate the data to a more efficient representation; and a third to aggregate data.
The first does its job by identifying "structural anchors" such as table boundaries and removing other rows and columns to produce a "skeleton" version of the spreadsheet. The second discards the row-and-column layout by converting the data to an inverted index in JSON. Finally, the third clusters together adjacent cells that share the same format.
The result is that a spreadsheet of, for example, 576 rows and 23 columns that would otherwise produce 61,240 tokens can be reduced to a more compact representation of just 708 tokens, according to the example in the paper. In fact, the team claims its experiments found that SheetCompressor typically reduces token usage for spreadsheet encoding by 96 percent.
The upshot is that SpreadsheetLLM appears to greatly reduce the token usage and computational costs for processing spreadsheet data, the Microsoft team claims, which could potentially enable practical applications even for large datasets.
Fine-tuning of various LLMs to enhance spreadsheet understanding could also potentially transform spreadsheet data management and analysis tasks, paving the way for more intelligent and efficient user interactions, the paper says.
Considering the widespread use of spreadsheets in business, this move by a Microsoft research team may have considerable impact, if it lives up to its promise. There is no word on whether this will ever be released as a product or developer resource – such as something baked into Microsoft Excel or Copilot – and for now it seems to be a lab-level effort.
As VentureBeat points out, SpreadsheetLLM could, if it ever emerges in public, allow non-technical users to query and manipulate spreadsheet data using natural language prompts.
An adjunct professor at UCLA reacting on Twitter/X suggested there are “billions of dollars of value here as much of the financial and accounting worlds still run on spreadsheet and manual efforts.”
We’re not sold on mixing unreliable, guesstimating and hallucinating LLMs with grids of numbers, we have to admit. Neural networks like predicting outputs and loosely interpreting inputs, which isn’t quite what you want from a spreadsheet, in our humble view, unless you’re hoping for some kind of computer-aided creative accounting.
And whether this technology can prevent the gaffes that have been seen in corporate use (or misuse) of spreadsheets is perhaps doubtful. Last year, it emerged that trainee doctors had been rendered “unappointable” due to errors in transferring data from one spreadsheet to another. Then there is the infamous Excel blunder that led to the under-reporting of thousands of coronavirus cases during the pandemic in England.
The Microsoft research team also points out that SpreadsheetLLM currently has some limitations that have yet to be addressed. Format details such as the background color of cells are ignored, because encoding them would require too many tokens, even though color is sometimes used to convey information.
SheetCompressor also does not currently support a semantic-based compression method for cells containing natural language, and so cannot categorize terms like "China" and "America" under a unified label, such as "Country". So perhaps all those data analysts and other Excel experts can breathe easy for a while. ®