Philosophy
Model-Checking & Agentic AI
Perhaps you read this far because you're trying to figure out whether this is related to agentic AI? Actually, py-mcmas has experimental features along these lines. But before diving into details or the roadmap, let's look at the two elephants in the room:
What is an agent, anyway? - Agent has never been well-defined, and it means different things to different people. There's the multi-agent system[^11] sense and the agent-based modeling[^12] sense, and those are great for a variety of reasons, but that work isn't necessarily related to what most people mean when they talk about agentic AI with LLMs.
Specification languages aren't programming languages. - Proving properties about systems is really nice, but it usually doesn't step you closer to actually building that system. And if you do have a running system already.. modeling it is basically a separate activity that starts from scratch.
In both cases, you could say that there's a gap between theory and practice. One might ask.. does it have to be that way? Formalisms could see more industry adoption if they were easier to use. Meanwhile, agentic-AI could benefit from more rigorous formalisms, both in design at the system level and in implementations at the level of agent-code and agent-frameworks.
So.. building bridges that connect MAS research with agentic-AI, or that connect programming languages with specification languages, is a huge project. But isn't it about time to combine the strengths of these approaches? One goal for py-mcmas is to function as a kind of lab (and lab notebook!) for exploring related ideas.
If you'll tolerate even more background, read on in the next section for more motivation and use-cases. Otherwise, jump directly to a feature list or check out some demo code.
Road Map / Philosophy
So what's the appeal of connecting model-checking and agent-based-modeling with agentic-AI anyway? Let's start with a list of things that would be nice to have:
- Fewer hallucinations
- Less magic, more determinism
- Easy access to specialized solvers
- Grounded reasoning & useful constraints
- Structured workflows & workflow composition
- Agent coordination & intelligent routing between agents
- Agents that can directly track, update, and inspect their own knowledge and beliefs
More controversially.. maybe a lot of this boils down to just a few questions: How do we do more explicit domain modeling? How exactly should we hybridize LLMs and symbolics? How can we get a better layer of persistent metadata, preferably one that can represent things like confidence, provenance, topology, and traces?
There is also (gestures vaguely) all of the rest of the architectural stuff that we need to build around our amazing new stochastic oracles, so that we can stack them up into systems that are predictable and reliable. One recent write-up that touches on lots of these ideas argues that AGI is now an Engineering Problem[^13], and it's well worth a read. But really, you could argue that chain-of-code[^14] as an alternative to chain-of-thought, the emerging superstructure in agent-frameworks[^15], and even MCP in general[^16] are all evidence of movement in this direction.
The rest of this section is about how py-mcmas (or something like it) might fit in.
For lots of work, what we really want is user-prompted domain modeling that's flexible and frictionless, but is also constrained or grounded in useful ways. LLMs nail the flexible and frictionless part, but don't typically create or evolve any kind of explicit domain model. For many use-cases, "creativity" is an ambiguous virtue!
If you're not targeting "code" as an output, then there's a good chance you're looking for something like answers, or evidence, or arguments, and preferably something you can check. Even if you think you can tolerate some hallucinated output as a final leaf, what if we're in the middle of a problem-solving process? There, you would probably prefer an artifact like a witness, counter-example, or trace, plus a problem-specification that you can actually evolve and iterate on.
An LLM that's structured so as to only output valid structures for things like constraint-problems, simulations, and games might seem more useful for planning or general problem-solving than for code. But since model-checkers easily accommodate things like namespaces, basic data-types, states, and transitions, they also serve as a reasonable abstraction for code. Vibe-coding, viewed as a guided search through implementation-space, is missing something: stepping through specification-space as an intermediate step basically partitions implementation-space, throwing away lots of useless candidates.
The last section hints that even if you're not trying to wrangle simulations, you might still be working on something that benefits from creating and updating an explicit world-model "in the background".
Although chain-of-thought "reasoning" is surprisingly capable, it fundamentally burns up tokens only to pile the baseless on top of the baseless. Chain-of-code seems like a better approach in several ways, but if what we're really doing is reaching for an appropriate intermediate language that improves reasoning and reduces hallucinations, how about a "chain of specifications"?
There are various tricks you can use to try and force a situation where LLMs are tracking an explicit domain model. For example, there are lots of agentic-workflow proposals for keeping a running plan in markdown, or if your LLM is good with mermaid, you could try to prompt it to make running changes to a diagram as it emits code changes. Good idea! But these things are just low-fidelity substitutes for specifications, and still aren't very amenable to analysis, validation, or execution.
Can't we do better? Specifications, even incomplete ones, are something that can typically be weighed and measured. And when you really come down to it, don't most problems reduce to something like planning, scheduling, protocols, games, systems, or sims?
The applications so far are pretty obvious, and they argue in favor of combining agentic-AI with formal methods in general. That's a big toolkit to choose from, and the goal of fmtk is to lay some of the groundwork for multiple tools on the backend.
But the main focus of py-mcmas is ISPL because the model-checker itself also has a concept of "Agents". This kind of reflection seems useful for things like meta-analysis, especially if you can generate specifications automatically or semi-automatically. What's that good for? How about things like:
- Intelligent routing
- Design-layer validation of "societies" of agentic-tools
- Proving theorems about your system-level agent-interactions
Does this or that mixture of weighted skeptic/critic/creator roles for your agents ever converge to a stable quorum? If you had this layer for meta-analysis.. could you query it from the agentic-AI implementation layer, giving your agents the chance to reason about how they interact with other agents?
Add to this the ISPL support for explicitly modeling epistemics, temporals, aggregates, and strategies. Is it useful to allow agents to disbelieve other agents, or their own perceptions of their environment? Have you ever wanted to actually rank agents in some kind of hierarchy, maybe even categorize them as stable, unstable, or peculiar[^25]? Are there other ways we should be thinking about organizing stuff, and can we find the scaling laws for agents?
WIP / Wish List
There's not a ton of ISPL to train on, so LLMs aren't great at generating it. There's a chicken-and-egg aspect here, because if you could generate synthetic data easily, you'd have more to train on. Here's the idea:
- Build ISPL expressions pythonically, with mcmas.ispl (pydantic models) and mcmas.logic.symbols (sympy expressions); see the sketch after this list.
- Use any method to generate, recombine, and permute expression-trees programmatically, outside of ISPL.
- Generate ISPL source from the expression trees, then train on the new source.
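To make the first and last bullets concrete, here's a minimal sketch of the pattern: pydantic models for agent structure, programmatic construction of the tree, and a toy serializer back to source. The class names, fields, and the simplified ISPL-ish rendering below are stand-ins and assumptions, not the actual mcmas.ispl API or the full ISPL grammar.

```python
# Illustrative sketch only: these pydantic models and the serializer are stand-ins
# for whatever mcmas.ispl actually provides, not its real API.
from pydantic import BaseModel


class IsplAgent(BaseModel):
    name: str
    vars: dict[str, str]              # e.g. {"state": "{idle, busy}"}
    protocol: dict[str, list[str]]    # guard condition -> enabled actions


class IsplSystem(BaseModel):
    agents: list[IsplAgent]
    formulae: list[str]               # properties to check against the system


def to_ispl(system: IsplSystem) -> str:
    """Render the expression tree as simplified ISPL-style source."""
    chunks = []
    for agent in system.agents:
        var_lines = "\n    ".join(f"{k}: {v};" for k, v in agent.vars.items())
        proto_lines = "\n    ".join(
            f"{cond}: {{{', '.join(acts)}}};" for cond, acts in agent.protocol.items()
        )
        chunks.append(
            f"Agent {agent.name}\n"
            f"  Vars:\n    {var_lines}\n  end Vars\n"
            f"  Protocol:\n    {proto_lines}\n  end Protocol\n"
            f"end Agent"
        )
    chunks.append(
        "Formulae\n  " + "\n  ".join(f"{f};" for f in system.formulae) + "\nend Formulae"
    )
    return "\n\n".join(chunks)


# Build / permute expression trees programmatically, then generate source to train on.
worker = IsplAgent(
    name="Worker",
    vars={"state": "{idle, busy}"},
    protocol={"state = idle": ["start"], "state = busy": ["finish"]},
)
system = IsplSystem(agents=[worker], formulae=["AG(busy -> AF idle)"])
print(to_ispl(system))
```

Because the intermediate representation is just data, recombining and permuting trees (the second bullet) is ordinary Python, and every permutation can be serialized and fed to the model-checker or a training pipeline.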
Another option is to avoid raw ISPL source as the preferred representation altogether. This could fare better with LLMs in general, without any need for fine-tuning. For this to work:
- Leverage pydantic-ai[^17] and/or JSON-schema to constrain LLM output (see the sketch after this list).
- Enjoy "completion" of ISPL fragments or full specifications, converting to ISPL source only to validate or simulate them.
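As a rough illustration of the first bullet, the sketch below uses plain pydantic to derive a JSON schema that an LLM client (pydantic-ai, OpenAI structured outputs, etc.) can be pointed at, and then validates the response before it ever becomes ISPL. The AgentFragment and ProtocolRule models are hypothetical stand-ins, not classes from mcmas.ispl.

```python
# Sketch: constrain an LLM to emit structured ISPL-ish fragments instead of raw source.
# AgentFragment / ProtocolRule are illustrative stand-ins, not real mcmas.ispl classes.
import json

from pydantic import BaseModel, ValidationError


class ProtocolRule(BaseModel):
    condition: str
    actions: list[str]


class AgentFragment(BaseModel):
    name: str
    vars: dict[str, str]
    protocol: list[ProtocolRule]


# 1) Hand the JSON schema to whatever LLM client you use, so generation is
#    constrained up front rather than cleaned up afterwards.
schema = AgentFragment.model_json_schema()
print(json.dumps(schema, indent=2)[:300], "...")

# 2) Validate whatever comes back; reject (or retry) instead of trusting raw text.
raw_llm_output = (
    '{"name": "Critic", "vars": {"verdict": "{accept, reject}"},'
    ' "protocol": [{"condition": "true", "actions": ["vote"]}]}'
)
try:
    fragment = AgentFragment.model_validate_json(raw_llm_output)
    print("valid fragment for agent:", fragment.name)
except ValidationError as err:
    print("LLM output rejected:", err)
```

The payoff is that "completion" happens over validated structures, and conversion to ISPL source is deferred to the point where you actually want to validate or simulate.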
What's actually the input for LLMs in this scenario? In the spirit of simulation wrangling, you can try your luck autocompleting system specifications from natural language.
On the other hand, if you're already deep into practical agentic-AI with a framework like the PydanticAI, OpenAI, or CamelAI SDKs... this suggests something else you might want to "autocomplete". In other words: can we work as usual in our agent-framework and get a concrete multi-agent-system specification for free? For each of these three frameworks, converting your existing agents to a system-level description in ISPL is at least partially automated.
Autocompleting back and forth between specifications and their implementations in programming languages and agent-frameworks is experimental, and your mileage will vary based on how smart your backend LLM is. But py-mcmas works locally, aims for deterministic results, and tries to be clever about context and prompting. More details in this demo.
The last section raises the possibility of completion, but in the spirit of less magic, another approach involves annotations, inspection, and reflection. Think of it like gradual typing: as your practical efforts towards building multi-agent systems evolve, you tag, categorize, and annotate the tools themselves as the expectations around them become clearer. More details in this demo.
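For flavor, here's one hypothetical shape such annotations could take: a small decorator that tags ordinary tool functions with MAS-level metadata, which can later be harvested into per-agent sections of a specification. The mas_role decorator and registry below are assumptions for illustration, not the actual py-mcmas API.

```python
# Hypothetical pattern (not the actual py-mcmas API): annotate ordinary tools with
# MAS-level metadata, then harvest those annotations into a specification skeleton.
from typing import Callable

_REGISTRY: list[dict] = []


def mas_role(agent: str, action: str, observes: list[str] | None = None) -> Callable:
    """Tag a tool with the agent and action it corresponds to at the spec level."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.append({
            "agent": agent,
            "action": action,
            "observes": observes or [],
            "tool": fn.__name__,
        })
        return fn
    return decorator


@mas_role(agent="Critic", action="review", observes=["draft"])
def review_tool(draft: str) -> str:
    return f"review of: {draft}"


@mas_role(agent="Writer", action="revise", observes=["review"])
def revise_tool(review: str) -> str:
    return f"revision addressing: {review}"


# Later, the registry can be turned into per-agent Actions / Protocol sections.
for entry in _REGISTRY:
    print(f"Agent {entry['agent']}: action={entry['action']} "
          f"tool={entry['tool']} observes={entry['observes']}")
```

The gradual-typing analogy holds here: untagged tools keep working as before, and the specification only covers the parts of the system you've gotten around to annotating.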
References
- MCMAS: an open-source model checker for the verification of multi-agent systems, Lomuscio et al., 2015 (pdf)