AI & Strategy #2: Autonomous strategy design and the limits of LLMs

In this second post in the series on AI and strategy I want to explore how state-of-the-art AI systems, namely Large Language Models (LLMs), can be helpful in strategy design, and in so doing create some awareness of their limitations and pitfalls.

To frame this discussion I’ll propose a framework for talking about capability levels of autonomous strategy design, much like the levels we use for self-driving cars. We’ll formulate levels of self-designing strategies, where each level is a prerequisite for the next. We’ll also lean on the definitions of strategy and tactics from the first post.

Ultimately I hope to leave you with two insights. The first is that there are real ways in which AI can enhance human strategy design. The second is a deeper understanding of why autonomy in strategy design (and really you may substitute for AGI here) is not just a matter of combining reinforcement learning with LLMs.

Recap: Situatedness in Strategy vs Tactics

In the previous post I redefined the word tactic to mean a response to an abstract situation, and the word strategy to mean a response to a specific situation. In other words, we use the meaning of tactic as in the phrase ‘Pincer Movement’: a type of move that brings a certain result in a certain type of situation. A strategy, on the other hand, is a specific response to a specific situation, where a specific agent has judged that it will yield a specific effect. You can learn tactics, but you cannot learn strategies, as tactics are patterns abstracted from specific situations. Conversely, you can design strategies, but you cannot design tactics, as strategies are responses designed to yield effects in a specific situation.

Not only is this ‘situatedness’ the defining difference between a strategy and a tactic, it is also, as I hope to convince you of, the key characteristic that limits the utility of LLMs.

Levels of autonomous strategy design

Diagram showing a summary of levels 0 through 5 of autonomous strategy design.

Level 0 is your plain old PowerPoint presentation or Word document with your strategy as bullet points. It does nothing by itself, and it isn’t really a system that knows or does anything strategy-related.

Level 1 systems participate in strategy design in some useful way. They might suggest ideas or related resources, such as published articles or people to chat to. Given the right input, they could generate plausible explanations for, or responses to, problems.

Evaluation of a level 1 system is in terms of the relevance and breadth of its output. Using standard psychometric creativity tests, such as the TTCT, would not be out of order either.

A Level 2 system wouldn’t just be able to generate ideas, it would also have the ability to represent hypotheses, such as cause and effect, as well as confidence in those hypotheses. For example, a level 2 system could represent “we will win the contract if we extend the duration of support by two years“ and a strength of belief in it. When a level 2 system generates strategic content it doesn’t just produce the words, it can represent and communicate about it. That probably means it should know abstract concepts such as hypotheses, evidence, framings, alternatives, (in)compatibilities, trade-offs, and weighted scoring matrices. But note, a level 2 system doesn’t yet have to judge its own confidence; this can be outsourced to others. It just needs to be able to represent it and reason with it.
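A minimal sketch of what such a representation might look like. All names here (`Hypothesis`, `StrategyModel`, `evidence_for`) are hypothetical illustrations of the idea, not part of any real system:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a level 2 representation: a hypothesis carries a
# claim, a strength of belief, and the evidence attached to it. The system
# can represent and answer questions about this structure without having
# to produce the confidence judgements itself.

@dataclass
class Hypothesis:
    claim: str                  # e.g. a cause-and-effect statement
    confidence: float           # strength of belief in [0, 1], possibly set by humans
    evidence: list = field(default_factory=list)

@dataclass
class StrategyModel:
    hypotheses: list

    def evidence_for(self, claim: str) -> list:
        """Answer 'what evidence do you have for X?'."""
        for h in self.hypotheses:
            if h.claim == claim:
                return h.evidence
        return []

model = StrategyModel(hypotheses=[
    Hypothesis(
        claim="we win the contract if we extend support by two years",
        confidence=0.7,
        evidence=["procurement said support duration is a key criterion"],
    ),
])
```

The point is not the data structure itself but that the strategy becomes queryable: hypotheses, evidence, and beliefs are explicit objects rather than words in a document.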

Evaluation of a level 2 system would happen by means of assessing its ability to answer questions about the strategy. Those questions would include matters of overall confidence (summing it all up), counterfactuals (e.g. what if an assumption doesn’t hold), and whether there is evidence for something or not.

A Level 3 system would provide its own judgements. It would represent the strategy and then score all its aspects in terms of confidence as well as relevance. That is, it needs to judge that the ideas it dreams up aren’t just compatible with the desired state of the world, but also move closer to it. It also needs to judge how strongly it believes these ideas will achieve the desired state. That comes down to asking itself hard questions. Is this evidence trustworthy? Does it apply to this situation? Will position X yield effect Y? What’s the expected percentage of customers we could upsell this new feature to? Can we get all stakeholders on board? Is this in line with our core values? Have I searched exhaustively enough to judge this line of exploration to be a dead end?

Although level 3 is simple to state, it is hard to evaluate objectively. The best approach is perhaps an intersubjective one: compare the system’s judgements to those of human experts. However, in my personal opinion, humans are pretty bad at systematic judgement too.

At Level 4 we really get into the domain of design. A level 4 system generates plans for activities, driven by lack of confidence (i.e. uncertainty), that explore the strategy space and reduce uncertainty. This amounts to generating ideas for how to build confidence in the judgement of strategic options, be they negative or positive. As an idea it is related to concepts like failing fast and testing your assumptions.

Exploring the space does not necessarily mean executing part of the strategy, although it could. The actions are necessarily intended to be informative, and possibly executive. The key task at level 4 is trading off the cost of informative actions against the expected reduction in the cost of execution, with total failure during execution as the upper bound on that cost. That is, you might as well start executing if you cannot find a way to reduce uncertainty, or if informative action is at least as expensive as executive action.
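The trade-off described above can be sketched as a simple decision rule. This is my illustration of the idea, not an algorithm from the post; the function name and parameters are hypothetical:

```python
def should_probe(cost_probe: float,
                 cost_execute: float,
                 expected_cost_reduction: float) -> bool:
    """Level 4 trade-off, sketched: take an informative action (a 'probe')
    only if it is cheaper than executing outright AND its expected
    reduction in execution cost exceeds what the probe itself costs."""
    if cost_probe >= cost_execute:
        # Informative action at least as expensive as executive action:
        # might as well start executing.
        return False
    return expected_cost_reduction > cost_probe
```

For example, a one-week test campaign costing 1 unit that is expected to save 10 units of wasted execution is worth running; a probe as expensive as execution itself never is.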

The search for a plan that yields information is itself a strategy design process, so this is a clear point of recursion. The spawned process would need to be bounded by the risk budget available in the top-level strategy design process.

Evaluation of a level 4 system would need to address the breadth of predictions and outcomes considered. Another criterion would be optimality of the trade-off between the costs of executive action and informative action, as well as the cost of the search itself.

At Level 5 the system is fully autonomous. It can decide to commit resources, change its position in the world, and thereby execute on the level 4 plans and move towards a strategy it is confident in. It can act either through other embodied agents or it is itself embodied. In any case it needs to be able to adopt information from the environment into its model. In simple terms, you’d be able to hand it control of an organisation, let go of the wheel, and trust your self-designing strategy to not steer straight into oncoming traffic. How close do you think we are to this level today?

Evaluation of a level 5 system is firstly in the quality of the strategy that it designs and commits to, and secondly in the cost of the design process.

Great, how would these systems help me?

A level 1 system you can currently build with LLMs and some simple prompts. It can help you consider aspects and directions you otherwise wouldn’t have. For example, it could tell you to consider "optimising low-touch onboarding because inbound prospects cost less" in response to "I’m trying to get my customer acquisition costs down". These systems could also help you find relevant information and thereby evidence. Their value is that they are a proactive partner that can reduce blind spots. Our prototypes show this level is feasible today, although costly to evaluate and iterate on. However, it’s clear that it often increases cognitive load instead of reducing it. Ignorance is bliss, and users need to be skilled in strategy design already.

A level 2 system can help you stress test a strategy. It can show gaps, calculate overall confidence from the sum of its parts, or show the impact of some dial moving one way or another. Building on the previous example, it could add "optimising low-touch onboarding" as an approach X and "inbound prospects cost less" as evidence Y. It could answer the question "what evidence do you have for approach {X}?", or know that approach X should be discounted if the confidence in evidence Y were marked down.

The level 2 system could also proactively, in the background, do periodic searches for new information that’s relevant and alert you when your strategy perhaps doesn’t hold anymore (e.g. your competitor just launched a product at a price you cannot meet). This certainly reduces cognitive load, but only if you would otherwise have done these tasks manually. Of course, that presumes you already have a pretty thorough strategy design practice.

We have partial prototypes at level 2. Calculation with confidence scores is manageable through either Bayesian or Fuzzy Logic semantics. The tricky bit is finding a level of strategy description at which it is natural for humans to communicate. Very few people are willing or able to work at the level of specificity that seems to be required, and so it seems real utility comes at level 3.
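As one concrete way such calculation can work, here is a noisy-OR combination, a common Bayesian-flavoured choice for aggregating independent supporting evidence. This is my illustration, not a description of the prototypes mentioned above:

```python
def noisy_or(confidences: list) -> float:
    """Combine independent pieces of supporting evidence into an overall
    confidence: the approach only fails to be supported if every piece
    of evidence fails, so overall confidence is 1 - product(1 - c_i)."""
    p_all_fail = 1.0
    for c in confidences:
        p_all_fail *= (1.0 - c)
    return 1.0 - p_all_fail
```

Under this semantics, two independent pieces of evidence at 0.5 each yield an overall confidence of 0.75, and marking down one piece of evidence automatically lowers the confidence in the approach it supports, which is exactly the "mark down evidence Y, discount approach X" behaviour described earlier.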

To further illustrate what is meant by representing strategy it’s worth mentioning DoubleLoop, which has a basic strategy model and can do simple calculation, but cannot be said to reason about the strategic structure, nor does it represent uncertainty.

At level 3 we reach a magical threshold, and here we return to strategy versus tactics. Level 1 and 2 systems can get by if all they do is respond in accordance with previously seen patterns (tactics). At level 3, however, the system must be explicit about its confidence in the applicability of its suggested tactics to the specific situation (strategy). It would add "consider optimising low-touch onboarding, because I’m 60% sure that would work in our situation".

And herein lies the rub: any problem that requires a somewhat sophisticated strategy is always a novel problem. It always concerns a combination of factors never seen before, to the degree that extrapolation is likely to fail. It’s not that extrapolation as a mechanism is flawed, but that the information needed to extrapolate isn’t yet encoded. Often the information to extrapolate from doesn’t even exist: the really sticky problems your organisation is facing are probably in some way being experienced for the first time ever.

To illustrate, imagine asking an LLM to accurately predict whether your coworker Toby will get the crucial Snugflap working by Friday, or whether the UK can get a hard Brexit deal without a border on the island of Ireland. The level of situation-specific knowledge needed is not available in a web-scale scrape of information; perhaps it doesn’t exist at all.

However, a level 3 system could outsource its confidence scores to humans, and if it could just know about the things it doesn’t know, it could perhaps graduate to level 4 with a bit of cheating.

The big win of a level 3 system is that it would unlock cheap and automatic search for novel solutions to novel problems. It would obviate the huge cognitive load on scarce human experts to assess a great many individual hypothetical situations and all their facets.

At level 4 we leverage the ability to act in the world and explore. It would relieve the cognitive load of generating, planning, and coordinating action. For example, it could reason that at "60%" the risk is too high to invest fully in low-touch onboarding, and would generate the plan to "run a one week ad campaign to test whether click rates support this approach".

A level 4 system can plan an iteration, and at level 5 it could commit resources and execute the iteration.

That’s just a planning system!

Yes, on the one hand this is a planning system, and in that sense it falls into the category of A* or reinforcement learning systems. However, it has specific qualities that set it apart.

A base planning system picks actions given a situation. It is typically defined in terms of a desirable future state and possible actions. A policy is then the algorithm that picks actions given a current state and a desirable state. What I’m describing is a system that generates to a large degree its own view of what a desirable state is, its way of framing those states, and picking between states based on criteria it discovers. That is, it builds its own ontology of the state space rather than an a priori one.
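To make the contrast concrete, here is the classical setup in its simplest form. The example is deliberately trivial and the names are mine; the point is what is fixed a priori:

```python
# Classical planning setup, sketched: the state space (positions on a
# line), the set of actions, and the notion of a 'desired state' are all
# given a priori. The policy's only job is to pick actions.

def greedy_policy(current: int, desired: int) -> str:
    """Pick an action given a current state and a desired state.
    The system described in the post would, in addition, have to invent
    for itself the very ontology used here: what counts as a 'position',
    which actions exist, and what 'desired' even means."""
    if current < desired:
        return "increment"
    if current > desired:
        return "decrement"
    return "stay"
```

Everything interesting about the system described in this post lives outside this function signature: constructing the state description, the action vocabulary, and the goal itself.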

Or put differently: it’s a planning system that doesn’t just learn its environment and act in it. It also learns, in situ, the ontology, language, concepts, and affordances that are useful for describing the desirable state; formulates that desirable state; is able to evaluate competing or non-integrable paradigms; and builds an approach for getting to the desirable state, along with a view of how confident it is in that approach, and of how it becomes confident.

What does this mean for the state of the art AI?

Can a combination of reinforcement learning and large language models ultimately reach this kind of performance? Who knows. However, one thing is clear to me: it would have to learn its concepts at least partially in situ, and not just from a large dataset.

Human creativity, seen as the ability to create new solutions and new abstract concepts, is our species’ unique ability to deal with vexing problems. A defining characteristic of these problems is that there isn’t a canned response; there is no simple tactic you get from a book nor some paint-by-colour solution. The situation is novel, and a design process is required to solve for it. That means not just learning the situation, but also learning the right paradigm for the situation, creating new concepts and language, and learning by intervening to get some predictive accuracy.

Perhaps this is fundamentally a class of problems that AIs spawned by the Transformer architecture, like LLMs, cannot hope to address. Their reliance on large textual datasets, pre-training, and fixed conceptual spaces precludes them from dealing with the one-shot nature of unique novel situations. Without situatedness, no AI can ever be an AGI.

Phrased differently, the question is this: judgement (level 3) is a necessary ability for making progress in exploring the problem space (levels 4 and 5), yet how can we hope to achieve level 3 without embodiment and the ability to explore, through which the skill of judgement is surely acquired?

This isn’t in its essence a new idea, as it is synthesising from work in Business Strategy, Creativity, Design, and AI. To illustrate, the topic of how humans make sense of complex situations is studied and used in various ways (to name a couple of entry points: 1, 2, 3). However, it is good to keep in mind the expansive nature of the tricky stuff humans deal with when we dream about AI, and in that way to be aware of the limitations AI systems will have for the foreseeable future.

You cannot outsource your judgement and exploration to AI just yet.