AI for Software Engineering
Article URL: https://mattamonroe.substack.com/p/ai-for-software-engineering
Comments URL: https://news.ycombinator.com/item?id=47621513
AI is all the rage, especially in the programming scene. Marketers tell us it will make us more productive, write better code, and that we don't even need to know how to code anymore because the AI will just do it for us. Better! Faster! Stronger!
But then you use it, and the out-of-the-box experience leaves you saying "Wow, that is some craptastic code, and what made this thing think that raw dogging unencrypted auth credentials in HTTP headers is a good idea?!?"
As someone who has spent their career searching for ways to make the software they write more reliable, more robust, and more scalable, I have found this experience very frustrating. I’ve spent much of my time experimenting with AI (and using it to ship production code), trying to find ways to make the AI plan the designs and write the code I wanted.
This post is about some of the strategies I found, and the lessons I learned, for how to make AIs write code that you can ship with confidence. It starts with a basic overview of how AIs work, before going into how you should approach working with them and how to get the most out of your agents.
You might have heard that AIs "think" by using a statistical model to generate text, and this is true. But more fundamental than that is HOW they generate that text. Understanding how the next layer down from where you work operates is useful across software development. If you are writing a graphics shader, you are going to want to know how the graphics pipeline works, even if you never write any code to modify that pipeline. So let's go over some of these fundamentals and build a mental model for what the AI is doing at that next layer down.
I’m sure you’ve heard that Nvidia is making bank selling graphics cards to datacenters. The reason behind this is that their cards are good at Matrix Multiplication, the same matrix multiplication operations you probably learned about in your high school math class. When I say these cards are good, I mean they are GOOD. Like really good. Like so good they can gold plate them and they still sell like hot cakes good.
Being good at MatMul matters for AI because the core of an LLM is a giant stack of weight matrices. Run your context through those matrices and out comes a score for every possible next token (i.e. word fragment). Here's the important part: the weights themselves are fixed after training, but every token you generate gets appended to the context, and that changes everything flowing through the model. This sets you up to generate another token, and the probabilities shift, and another, and they shift again, until finally you have your entire generated output. The implication here is that every word included in the prompt and the response changes the state the model is operating in. This is what people refer to when they talk about Context.
This means that every time it generates a fragment of a word, it has to compute a score for every word fragment in its vocabulary, which covers not just the entire English language but every language it has been trained on (programming languages included, not just natural languages). That adds up to billions of operations per token, all of them part of these matrix multiplications. This is why those Nvidia cards are so valuable.
If you think about this in an abstract way, every token transitions the model into a "new" state (the context is different, so every next-token probability is different), and every token available from that "new" state represents another weighted transition to yet another state.
It's a giant weighted graph, where each context is a state, and each token is a weighted edge.
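The token-by-token loop described above can be sketched in a few lines of Python. This is a toy illustration, not a real LLM: the `logits` function here is a stand-in for the billions of matrix multiplications a real model performs, and the vocabulary is made up.

```python
import math
import random

# Toy vocabulary of word fragments (a real model has tens of thousands).
VOCAB = ["def", " add", "(", "a", ",", " b", ")", ":", " return", " a", " +", " b"]

def logits(context):
    # Stand-in for the real model: in an LLM this is where the giant
    # matrix multiplications happen, scoring every token in the vocabulary
    # given everything generated so far.
    return [float(len(tok)) - 0.1 * abs(len(context) - i)
            for i, tok in enumerate(VOCAB)]

def softmax(xs):
    # Turn raw scores into a probability distribution that sums to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, n_tokens):
    context = list(prompt)
    for _ in range(n_tokens):
        probs = softmax(logits(context))               # probabilities shift each step...
        tok = random.choices(VOCAB, weights=probs)[0]  # ...because the context grew.
        context.append(tok)
    return "".join(context)
```

The point of the sketch is the loop shape: score everything, pick one token, append it to the context, and repeat. Every appended token changes the scores for the next step, which is exactly the state-transition behavior described above.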
We can use this mental model to our advantage. Our goal in getting the model to produce good code is to move it out of its default state (producing that craptastic code we talked about at the start) and into one where it will generate the code WE want.
The best tool I found for affecting the model's code quality is the global AGENTS.md or CLAUDE.md (I'm going to refer to both of these files as AGENTS going forward). By specifying in these files HOW you want the code to look or be designed, you control the starting point of every one of your agents, every time. I will warn you not to make these files too big. They are included in every LLM request you send, as part of a limited context. How big is too big? That depends on the model you are using; they all have different maximum context sizes. The thing to keep in mind is that we want our AI Agents to have unused context available for reading our designs, our code, and our other instructions. If you start hitting compaction very frequently, it might be time to trim down your global instructions. Keep the global files high level, and don't make them project or language specific.
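As a sketch, a global AGENTS file might look something like this. The specific rules are illustrative, not prescriptive; yours should encode your own team's taste, and stay high level enough to carry between projects:

```markdown
# Global agent instructions

## Code style
- Prefer small, focused functions; avoid hidden global state.
- Public functions get a doc comment explaining the "why", not the "what".

## Testing
- Every change ships with tests. A test must call production code and
  contain at least one assertion, or it should be rejected.

## Security
- Never log, hard-code, or transmit credentials in plain text.
```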
If you have more specific instructions, there are all sorts of places you can put them and they should all be serving different purposes. Each directory in your project can have an AGENTS, and it will only be included in the context when an AI Agent reads files from that directory. So put project structure information or how to build and run the project in the root directory AGENTS of your project. Package specific instructions should be in the packages themselves. You don’t want code quality instructions sneaking into your DAL AGENTS file because that won’t transfer between projects (or other areas of the code base).
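Concretely, a layout like this keeps each set of instructions scoped to where it applies (the paths and package names are illustrative):

```text
my-service/
├── AGENTS.md          # project structure, how to build and run
├── api/
│   └── AGENTS.md      # endpoint conventions for this package
└── dal/
    └── AGENTS.md      # data-access patterns only; no global code-quality rules
```

Because each file is only pulled into context when an agent reads from that directory, the scoping doubles as context management: agents working in `api/` never pay the token cost of the `dal/` instructions.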
Something to remember about context windows: there is a reason they exist. The more transitions we run away from the model's starting point, the farther we get into uncharted territory. This is why compaction (having the model summarize its context, then replacing the context with that summary) exists: to keep us in the sweet spot of generating good code. A small context keeps the agent focused on the specific things you want it to do, because the state it transitions into is one specific to those instructions.
I know that according to Elon we are only 2 years away from Full Self Driving, so by the time you read this the analogy might be antiquated. Just try to remember the before times.
You need to be driving your agents in the same way you drive your car. You don’t get in your car, and just let it take you wherever it feels like. The car doesn’t make the driving decisions, you do, the car just gets you to your destination faster.
Sure, you can listen to podcasts, and maybe zone out a little on the highway, but when your exit is coming up, you need to be locked in and ready to go. YOU are the one who needs to be making the decisions, not the car.
When I hear people talk about “Just going with the Vibes” and “Forgetting the code even exists” I want to shake them. This is letting the car drive.
LLMs don't actually know good code from bad code. All they "know" is, statistically, what the next token should be given the context, based on the patterns they have seen so far. You are the one who decides what good code is, so don't take your eyes off the road.
LLM Agents are really good at replicating patterns. Like really freaking good. So good that they will find that shit code from the guy who got fired for being a dingus after only 3 months, and they will replicate it over your entire code base. They are basically these guys:
[Image: Stargate SG-1 Replicators. IYKYK. It's not lost on me that they are bugs.]
You need to be constantly vigilant that the code and patterns making it into your project are the code and patterns you want to have. Because once they're in one place, they will soon be everywhere, and ripping them out is going to be a huge pain.
This is one reason I created template repositories for our team to use (we were making microservices, and needed to get them up and going fast). By creating a good starting point for the project, and establishing the patterns we wanted early, it became very easy to throw AI Agents at problems and they would quickly produce the code we wanted.
When making template repositories, don't just have a build, a hello world, and some CI workflows. Build actual implementation patterns that the AI can read and replicate. Your prompts are no longer "make an endpoint that does X"; they're "make an endpoint that does X, base it off /path/to/sample/endpoint". That extra little bit homes the model in on producing the right patterns. As the project grows, you can remove these sample endpoints (and I think you should) and start using the actual code in the project. Eventually you will hit a critical mass where everywhere the agents look they see your patterns, and that's what they replicate.
Is your agent writing unit tests? It should be! Do you have integration tests? It should be writing those, too. Is it running all these tests, every time, automatically in the background? Well, why not? Have you instructed it that a unit test which doesn't call production code, or doesn't have at least one assert, is useless and should be rejected? If you haven't, you can guarantee it's going to write tests exactly like that.
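To make the "useless test" rule concrete, here is a sketch in Python. The first test below is the kind an unguided agent loves to produce: it passes forever and verifies nothing. The second actually exercises production code and asserts on behavior. The function names are illustrative.

```python
# Production code under test.
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

# Useless: never calls the production function, contains no assertion.
# It will pass no matter how broken apply_discount becomes.
def test_discount_useless():
    price = 100.0
    price = price * 0.9  # re-implements the logic inline, checks nothing

# Useful: calls the real function and asserts on results, including edge cases.
def test_discount():
    assert apply_discount(100.0, 10) == 90.0
    assert apply_discount(19.99, 0) == 19.99
    try:
        apply_discount(10.0, 150)
        assert False, "expected ValueError for out-of-range percent"
    except ValueError:
        pass
```

An instruction like "reject any test that does not call production code and contain at least one assertion" in your AGENTS file is what steers the model toward the second shape instead of the first.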
Do you have agents review your plans and code? Do they do it automatically after they create these artifacts? Do you have multiple models doing these reviews?
All of these steps are ways you can refine your code and keep it stable.
Having unit and integration tests for the AI Agents to run allows them to catch the regressions they introduce. A failing test is a big red warning for the AI Agent that it did something wrong, and it's also a warning for you when you see the agent changing unrelated parts of the code to make tests pass.
Having multiple agents reviewing designs and code before you do attacks the problem from different angles. This saves you time. The agent can easily find and fix simple issues with the implementation or tests without you needing to be involved.
I recently watched a Theo video where he started talking about gstack. I would say I 90% agree with the take he had. We agree that having CEO agents seems like it’s…if not bad, then not useful. But there is one thing he said that I think just isn’t correct.
“If the model is smart enough to be all these things, why are we still defining all these things?”
It's a good question, and one that I think the weighted-graph model of thinking about these agents answers. We want to have an AI Agent that is specifically instructed to focus on architecture, so that it is in a state to investigate the project architecture. We want to have an AI Agent that is in a state to inspect the security of our app, so that we can make sure security is considered. We want to have an agent with specific rules to validate that our unit tests ACTUALLY TEST THE CODE and have ACTUAL ASSERTIONS IN THEM! When you throw a generic agent at the code and tell it to "fix all the issues", it's going to do that in a very generic way. We want our models to be general so they can do lots of things, but we want the tasks we set these models to do to be very specific.
And the biggest reason we want our agents specialized is that it allows us to perform all of these checks in parallel using sub-agents. Running multiple agents in parallel with specific instructions lets all our checks run without one AI Agent getting overloaded trying to do 15 things at the same time, or serially grinding through a single checklist for 2 hours. A single agent trying to do all of these things can do them; it's just going to do them poorly. It will miss issues you don't want it to miss. Maybe it finds the performance bug but misses the security issue (or vice versa).
The model has the capability to do all of these things, but not all at the same time reliably, because by default it's a generalist. These AI models were trained on human data. How many articles do you know of that cover software architecture, proper testing, common security vulnerabilities, performance bottlenecks, AND what your business logic should be doing? That's multiple careers' worth of information; it's not all going to show up in one place for the LLMs to train on. So when you hand a generalist model all of these jobs at once, it's not going to do all of them. You need to specifically prompt it into each of these modes, because they are different states (for more on this, look up how Mixture of Experts models work). Sub-agents with specific instructions to perform these tasks will do all of them incredibly well.
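The fan-out pattern this describes is ordinary parallel dispatch. Here's a minimal Python sketch; the agent-spawning API differs per tool, so `review` below is a stub standing in for a real sub-agent call, and the specialist prompts are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Each specialist gets a narrow prompt, putting the model into a specific "state".
SPECIALISTS = {
    "security": "Review this diff for injection, auth, and secret-handling issues only.",
    "performance": "Review this diff for N+1 queries, allocations, and hot-path costs only.",
    "tests": "Verify every test calls production code and has at least one assertion.",
}

def review(role: str, prompt: str, diff: str) -> str:
    # Stub: in a real setup this would spawn a sub-agent in your agent
    # framework, with `prompt` + `diff` as its entire context.
    return f"[{role}] reviewed {len(diff)} chars"

def parallel_review(diff: str) -> dict[str, str]:
    # Fan out one sub-agent per specialty and collect every report.
    with ThreadPoolExecutor() as pool:
        futures = {role: pool.submit(review, role, prompt, diff)
                   for role, prompt in SPECIALISTS.items()}
        return {role: f.result() for role, f in futures.items()}
```

The design choice worth noting: each sub-agent starts from a fresh, small context containing only its narrow instructions and the diff, so no single context window has to hold all fifteen concerns at once.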
Going back to our car analogy: there are two exits you need to be prepared for during development. When you switch from plan → implementation, and when you are all done and ready to commit. These are the times you need to be most vigilant about what the AI is doing. You should be going over your plans with a fine-toothed comb, making sure the AI isn't planning any funny business.
You should be making sure the plan covers not only the requested feature, but that it is designing the feature the way you want it designed. It should be as if you are going to write the code yourself and these are the instructions you would follow. All of your thoughts and taste for how the feature is implemented should be included. If you don’t specify it in the plan explicitly, the AI Agent is going to make a “decision” on its own. There is no guarantee it would be the decision you would have made. A great tactic here is to have the AI Agent present all of the assumptions in the plan for you to review and make a decision on.
And yes, you should also be reading the code. I don’t know why, but this is probably the take that is going to get me flamed the most. You have to read the code. If not for you, then for your co-workers. It is impolite to ask a co-worker to review code you haven’t reviewed yourself and I will die on that hill. I don’t care what the YC founders and all the cool kids are doing. They are building new prototypes every week and don’t have to live with the consequences of their (AI’s) poor decisions. You and your team, maintaining this app for the next X years, do have to live with those decisions.
Not reading the code is how you end up with a marketplace that doesn’t dedup or rate limit the download counter. It’s how you end up with hard coded or exposed secrets.
I don’t think it’s a coincidence that many of the companies pushing this way of thinking have some of the worst uptime you have seen since the 90s.
Source: https://mrshu.github.io/github-statuses/
Source: https://status.claude.com (3/31/26)
I’m not perfect. I have caused my fair share of incidents and I’m not trying to dog on any of these teams. I just want to propose that maybe, just maybe, the actual code you run on your servers matters more for your uptime than the prompt or model you used to generate it. And maybe, just maybe, you should put some eyes on that code.
If you are making a one-off tool, or a prototype that 100%, definitely won't be the cornerstone of your business tomorrow: sure, go with the vibes. That can totally be useful, but it's not engineering, and don't let charlatans tell you it is.
Obviously AI is here to stay in our industry, and over time it is going to keep getting better. The tools we have now are the worst they will ever be; it only goes up from here.
We need to be thinking about how to make these tools work for us, how to mold and shape them into fine pointed chisels that we use to create beautiful works of engineering art. I hope you can apply some of these strategies to make your code better and improve your workflow. If you do, be sure to reach out and let me know. I would love to hear from you!