
Occasionally I love ChatGPT. Like when I gave it a research paper I’d written and the itinerary for my planned trip to Australia this November and asked it to look for related art exhibitions I should visit or people to meet, and it came up with some great suggestions with actual links to actual real things. But every time I get excited at a flash of apparent brilliance, it fails utterly at another task of exactly the type the hype says it can do. Take this Excel sheet with events and make me a nicely formatted paragraph about next week’s events. Nope. Or when I asked it to translate a PDF from German and it just made up some nonsense that even my German was good enough to see was related to a previous conversation, not the document.
I’m not even going to list more examples, because there’s actually a benchmark now that tests whether LLMs can do basic office tasks.
Turns out they can’t. The best language models can only complete around 30% of the basic office tasks they were tested on.
That means they fail 70% of the time. Wow. Or actually, I’m not really surprised, because it matches my experience.
Researchers at Carnegie Mellon University set up The Agent Company, which is a fake software company built to test AI agents:

Xu, Frank F., et al. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. arXiv:2412.14161, arXiv, 19 May 2025. arXiv.org, https://doi.org/10.48550/arXiv.2412.14161.
The AI agent is given tasks like this one:
- Collect equipment requests from Huang Jie, Sophia Rodriguez, Mike Chen, Sarah Johnson by asking them about equipment name and quantity on Rocketchat at http://the-agent-company.com:3000/home
- Retrieve prices from ‘Documents/Administrative Specialist/products.pdf’ on owncloud(http://the-agent-company.com:8092)
- Calculate total costs and write down /workspace/result.txt
Or this one:
Navigate to owncloud at http://the-agent-company.com:8092. Use the content in “TAC_overview.md” in the “Documents” folder on owncloud to create a comprehensive employee manual, ensuring that it includes all relevant sections such as company values, code of conduct, operational procedures, employee benefits, and company policies. The final document should be saved as “Employee_Manual.odt” in “Documents/Human Resources Team” folder. This manual is a draft with room for creativity. If you cannot find the necessary content for some sections in the TAC_overview.md document, please feel free to suggest them in the manual.
Or this:
Please check the “Documents/Data Analysis/Customer.xlsx” spreadsheet available at http://the-agent-company.com:8092. The data sheet contains a list of our customers, and we need to classify them as either domestic or international. For entries with available locations, please enter “Domestic” in the next cell if the location is in the US; otherwise, enter “International”. Next, calculate the total number of domestic and international orders, and send the results to Sarah Johnson at http://the-agent-company.com:3000/ in the following format: “Domestic: {domestic_count}” and “International: {international_count}”.
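That classification step is a purely mechanical rule, which is worth noticing. A minimal sketch of it in Python (the column name `location` and the US-detection rule are my assumptions here, since the actual layout of Customer.xlsx isn’t shown in the task description) might look like:

```python
# Hedged sketch: classify customer locations as Domestic (US) or International,
# then format the message the task asks for. The "location" key and the
# US-detection rule are assumptions, not the benchmark's actual spreadsheet layout.

def classify(location):
    """Return 'Domestic' for US locations, 'International' otherwise, '' if missing."""
    if not location:
        return ""
    country = location.split(",")[-1].strip()
    return "Domestic" if country == "US" else "International"

def tally(rows):
    """Count classified rows and format the result in the requested format."""
    counts = {"Domestic": 0, "International": 0}
    for row in rows:
        label = classify(row.get("location", ""))
        if label:
            counts[label] += 1
    return f"Domestic: {counts['Domestic']}\nInternational: {counts['International']}"
```

A deterministic script like this gives the same answer every time; the benchmark is effectively asking a language model to behave like one.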
The code is on GitHub, and you can see all the tasks there, or look at the descriptions of the fake employees (described as NPCs) that the AI agent is supposed to be messaging. Though the agents very often fail to actually message the NPCs.
Some of the ways the LLMs fail are quite interesting, as described in section 7.3 of the paper “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks”. For instance, an agent that couldn’t find the right person to message cheated by renaming another person to have the right name and messaged them instead! And when one NPC tells the agent to contact somebody else about the issue, the LLM simply gives up.
However, most LLMs achieve a much higher score on the SDE [Software Development Engineering] tasks. LLMs fail these seemingly easier tasks due to lack of ability to understand documents, communicate with other people, navigate complex software and tedious processes, and autonomously automate repetitive tasks. We hypothesize that part of the reason lies in the fact that current LLM development is heavily based on software engineering abilities, such as coding, due to several high profile benchmarks that measure this capability (e.g. HumanEval, SWE-Bench) as well as the abundance of publicly available training data related to software. On the other hand, administrative and financial tasks, are usually private data within companies, not readily available for training LLMs.
It’s interesting that LLMs lack the ability to “navigate complex software and TEDIOUS PROCESSES”, the latter being exactly what they’re often hyped as able to do.
Here is an example of a task in the Admin category, which was one of the areas the AI agents scored worst in, with the best model only completing 13% of the tasks (see Xu et al. 2025, Table 5 for completion rates):
We are collecting employees’ preferences on drinks to help with our purchasing plan. Please navigate to http://the-agent-company.com:8092/ and find drinks_survey.pdf, which contains a questionaire that we have placed in the office. Please organize the employees’ responses into a CSV spreadsheet, clearly indicating the number of people who like each type of beverage.
The spreadsheet is prepared for you at /workspace/drinks_survey.csv and please complete it. (TheAgentCompany GitHub)
Yes, that is rather a tedious task, and the sort of thing we end up spending a fair bit of time doing. It’s also not the sort of task that a language model is obviously going to be good at. Language models model language, so they have a model of what words or concepts or images tend to go together. That allows them to do pretty awesome things – but something like this would be much better done with a classical computer program of the kind that says “if this then do that” rather than a language model which works more along the lines of “if this, then generate a string of things that are similar to what comes after this in the training data.” So you’re more likely to get a list of drinks most people in the training data like than a list of what your employees actually asked for.
This is a perfect example of when using AI is silly. It would have been much better to just have people fill in a structured form connected to a spreadsheet with a few simple formulas automatically adding up the numbers.
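For contrast, the deterministic “if this then do that” version of the drinks tally is only a few lines. A sketch, assuming the survey responses have already been transcribed as plain strings (reading them out of a scanned drinks_survey.pdf is the genuinely hard, unstructured part):

```python
import csv
from collections import Counter
from io import StringIO

# Hedged sketch: tally drink preferences deterministically.
# Assumes responses are already transcribed as strings; the column
# headers below are my assumption, not the benchmark's actual CSV format.

def tally_drinks(responses):
    """Count how many people named each beverage, normalising case and whitespace."""
    return Counter(r.strip().lower() for r in responses if r.strip())

def to_csv(counts):
    """Write the counts as CSV text, one row per beverage."""
    buf = StringIO()
    writer = csv.writer(buf)
    writer.writerow(["beverage", "count"])
    for drink, n in sorted(counts.items()):
        writer.writerow([drink, n])
    return buf.getvalue()
```

The point is that once the data is structured, there is nothing left for a language model to do: a counter and a loop produce the correct answer every time, with no chance of hallucinating a popular drink from the training data.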
Sure, LLMs can often also write a simple Python script to automate some of the process. But these scripts tend to wildly oversimplify the thing you’re trying to do. For instance, when I asked one to write a tweet for each chapter of my book so I could promote it online, it wrote a Python script that took the first 1000 characters of each chapter and selected a random sentence. That’s neither a great use of Python nor of a language model.
The times a language model would be useful are when you have unstructured data, like when I had a few sentences of feedback from each of 40 students in a class and wanted to summarise that for the evaluation report I had to write up. The language model produced something that looked OK if you didn’t really read it, but all it said was “some students thought X, while other students thought Y” without saying whether 90% of students thought X or whether Y might have been fixed if Z had been done.
I wonder whether that subpar summary would have been graded as “completed” by a benchmark like the one used by The Agent Company? A text existed, no worse than one I could have written if I really didn’t care about the job.
Of course, people often wonder whether course evaluation reports are ever read. They are required for compliance and dutifully collected and summarised by the department and sent to the faculty, which summarises the summaries for the university board, and they are logged and archived. If you view them simply as texts, as commodities that must exist, then using an LLM is brilliant. It can summarise all the summaries at every level and nobody has to read them.
But the hour or two I spend reading and analysing the student feedback in order to write a text is how I as a teacher actually evaluate the course and think about how it could be improved. Skipping the writing too often means skipping the thinking as well.