
How Arthur.ai Turned a Vibe-Coded Jira Bot Into a Reliable Agent in Two Weeks

What started as a simple Slack bot that “just worked” quickly revealed itself as unreliable. Here’s the honest, step-by-step story of how Arthur.ai’s team turned that vibe-coded prototype into a trustworthy agent — and what every team building agents can learn from their process.

Updated on March 26, 2026

Long Slack threads about bugs are a daily reality in most engineering teams. Someone describes a problem in detail, others add context, screenshots get shared, and then the conversation often fades without a clear next step. The ticket either never gets created or arrives missing important details. Everyone moves on, but the friction quietly builds up over time.

That was exactly the situation inside Arthur.ai until Product Manager Madeleine decided to do something about it. In just a couple of days, she built a simple Slack bot using Claude Code. The bot would listen to ongoing threads about bugs or issues, summarize the conversation, and automatically create a Jira ticket with a title and description. For the first two weeks, it delivered real value. Tickets started appearing without anyone needing to manually copy context. The team saved noticeable time, and people were genuinely excited about the improvement.

But that early success soon revealed deeper problems. What felt like a clever shortcut began showing its limitations in everyday use. This is the full story of how the Arthur team recognized those issues and systematically turned their quick prototype into a reliable, production-grade agent in just two weeks. The journey reveals how easy it is to mistake a working prototype for a dependable solution, and how much discipline it takes to close that gap.

The Original Problem

The pain was familiar and constant. Every time a bug or feature request came up in Slack, someone had to pause their work, gather the relevant messages, try to recall the important context, decide on the right priority, and then manually create a Jira ticket. Details often got lost along the way. Reproduction steps were forgotten. Someone would later ask what the original issue was, forcing the team to scroll back through the thread to reconstruct the conversation.

The Arthur team lived heavily in Slack for technical discussions. Bug reports, edge cases, and customer escalations all happened there in real time. The process of turning those conversations into actionable tickets felt inefficient and error-prone day after day. Engineers disliked the constant context-switching. Product managers hated chasing incomplete tickets. The whole workflow created unnecessary drag, a problem everyone could feel but no one had fully solved.

The team knew there had to be a better way. The conversations were already happening in real time with rich context — why couldn’t the ticket creation happen automatically from the same thread? That simple question set the stage for Madeleine’s experiment. At the time, they believed that automating the most painful part of the process would remove the friction once and for all.

The Vibe-Coded Quick Win

The team believed the fastest solution would deliver the most immediate benefit. Instead of spending weeks on a formal build, they decided to move quickly and see what was possible with the tools already available.

Madeleine took the lead and used Claude Code to create a straightforward Slack bot. The bot monitored relevant channels, summarized ongoing threads about bugs or issues, and automatically generated Jira tickets with a title and description. The approach was simple — no complex architecture, no heavy frameworks, just a clean prompt and a few API calls.
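To make the shape of that first version concrete, here is a minimal sketch of a one-shot bot, assuming the Anthropic Python SDK and Jira's REST API; the prompt wording, model choice, and field mapping are illustrative, not Arthur's actual code:

```python
import os

import anthropic
import requests

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def create_ticket_from_thread(thread_messages: list[str]) -> None:
    """One-shot approach: a single LLM call turns a Slack thread into a ticket."""
    transcript = "\n".join(thread_messages)

    # One prompt does everything at once: summarize, title, and prioritize.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # model choice is illustrative
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Summarize this bug thread as a Jira ticket. "
                       "Return a title on the first line, then a description:\n\n" + transcript,
        }],
    )
    title, _, description = response.content[0].text.partition("\n")

    # File the ticket via Jira's REST API (field mapping is illustrative).
    requests.post(
        f"{os.environ['JIRA_BASE_URL']}/rest/api/2/issue",
        auth=(os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"]),
        json={"fields": {
            "project": {"key": "BUG"},
            "summary": title,
            "description": description,  # plain text, which is where formatting later broke
            "issuetype": {"name": "Bug"},
        }},
        timeout=30,
    )
```

Everything lives in a single prompt and a single call, which is exactly why it was fast to build and, as the team soon discovered, why its failures were hard to see.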

In the first two weeks, the results felt impressive. Tickets began appearing without manual copying and pasting. The team saved real hours every week. Internal reactions were positive, with comments highlighting how much smoother the process had become. At that stage, the mood was optimistic: the bot was producing tickets that looked reasonable, so the core problem seemed essentially solved. The team felt they had found a practical shortcut that delivered immediate value without over-engineering the solution.

Everyone involved was encouraged by the early wins. The bot was doing exactly what they hoped it would do: turning chaotic Slack discussions into structured Jira tickets with minimal human effort. For a brief period, it looked like the experiment had succeeded beyond expectations.

The Harsh Reality – Why the Vibe-Coded Bot Was Unreliable

The cracks started appearing soon after the initial excitement. Tickets frequently came through with formatting issues that made them harder to read in Jira. Headers and bullet points rendered incorrectly, and code blocks often turned into unreadable blocks of text. Engineers would open a ticket only to spend extra time cleaning it up before they could begin actual work.
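For context, Jira Cloud renders rich-text fields from the Atlassian Document Format (ADF), a nested JSON structure rather than markdown. When a model emits markdown backticks or hash headers into a plain string, Jira displays them literally, which is exactly the degradation described above. A correctly structured code block, shown here as a Python dict for illustration, looks like this:

```python
# Minimal ADF document containing a single code block. If the model instead
# returns markdown inside a plain string, Jira renders the backtick characters
# as literal text, producing the unreadable tickets described above.
adf_description = {
    "version": 1,
    "type": "doc",
    "content": [
        {
            "type": "codeBlock",
            "attrs": {"language": "python"},
            "content": [{"type": "text", "text": "raise ValueError('repro snippet')"}],
        }
    ],
}
```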

Priority assignment proved especially problematic. Minor issues in development environments were regularly marked with high or blocker severity, the same level reserved for production outages. The bot lacked a clear sense of real impact or context, so nearly everything appeared urgent. Many tickets also arrived incomplete. Important reproduction steps, environment details, customer impact, and direct links to the original Slack discussion were often missing. The bot could summarize a thread, but it struggled to capture nuance or ask for clarification when information was unclear.

The deeper issue became obvious over time. The entire bot operated as one large, opaque LLM call. There was no visibility into what the model was doing step by step, and no systematic way to measure output quality. Problems only surfaced when a ticket looked wrong, which meant the team was always reacting after the fact rather than preventing issues upfront. The team began to see that a bot that worked sometimes was very different from one that worked reliably every single time.

The Turning Point – Deciding to Do It Right

Faced with these recurring problems, the team reached a clear decision point. They could continue tweaking the prompt and hoping the issues would gradually disappear, or they could treat this bot with the same seriousness they applied to any production agent.

They chose the second path.

The pivotal realization was that they could not keep iterating blindly. Without proper visibility into the bot’s internal process, they were essentially guessing at solutions. Without structured evaluations, they had no reliable way to know whether changes actually improved the output. They needed to instrument the agent properly from the start.

So they made a deliberate shift. They decided to use their own Arthur Engine platform — the same technology they provide to customers — to observe, evaluate, and improve the agent in a systematic way.

At first, they had assumed that refining the prompt alone would be enough. However, they quickly saw that this approach left them without clear feedback loops. They changed their thinking entirely: start with full instrumentation, define success through concrete evaluations, and then iterate with confidence and visibility.

That decision marked the real beginning of the two-week transformation. The team moved from treating the bot as a quick hack to treating it as a proper agent that deserved the same level of rigor as anything they shipped to customers. This mindset shift was what allowed them to make rapid, measurable progress instead of endless guesswork.

The Step-by-Step Transformation

With the new approach in place, the team followed a clear sequence of steps that turned the fragile prototype into a reliable agent.

First, they instrumented everything from day one using Arthur Engine. They added full OpenTelemetry tracing so they could see exactly what the model was doing at every step. This immediately revealed that the original bot was making one giant LLM call with hardcoded logic and no tools or reasoning chain.
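The story doesn't show Arthur's instrumentation code, but a minimal version of OpenTelemetry tracing around an LLM step looks roughly like this; the span names, attributes, and collector endpoint are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Send spans to an OTLP-compatible collector (endpoint is illustrative).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("jira-bot")


def summarize_thread(transcript: str) -> str:
    # Each model call becomes a span, so a bad ticket can be traced back to
    # the exact step (and prompt) that produced it.
    with tracer.start_as_current_span("llm.summarize_thread") as span:
        span.set_attribute("thread.length_chars", len(transcript))
        result = call_model(transcript)  # hypothetical wrapper around the LLM call
        span.set_attribute("llm.output_length_chars", len(result))
        return result
```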

Next, they wrote evaluations before fixing anything. They identified the three main failure modes from the traces — incorrect ADF formatting, wrong priority assignment, and missing critical information — and created binary pass/fail evals for each. These evals became the single source of truth for whether the agent was improving.
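A sketch of what binary pass/fail evals for those three failure modes might look like; the ticket field names and priority rule are assumptions, but the principle is that each eval returns a hard True/False rather than a fuzzy score:

```python
def eval_adf_formatting(ticket: dict) -> bool:
    """Pass only if the description is a structurally valid ADF document."""
    desc = ticket.get("description")
    return (
        isinstance(desc, dict)
        and desc.get("type") == "doc"
        and desc.get("version") == 1
        and isinstance(desc.get("content"), list)
    )


def eval_priority(ticket: dict, is_production_impact: bool) -> bool:
    """Pass only if high/blocker priority is reserved for production impact."""
    if ticket.get("priority") in {"High", "Blocker"}:
        return is_production_impact
    return True


def eval_completeness(ticket: dict) -> bool:
    """Pass only if the ticket carries the context engineers kept asking for."""
    required = ("reproduction_steps", "environment", "slack_thread_url")
    return all(ticket.get(field) for field in required)
```

Because each check is binary, a batch of traces reduces to a per-failure-mode pass rate, and "did this change help?" becomes a number instead of an impression.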

Then they refactored the architecture. They moved away from the one-shot LLM call to a proper multi-step agent with tools. Prompts were migrated into Arthur Engine’s prompt management system so they could version them and update without redeploying the entire bot.
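Here is a compressed sketch of that multi-step shape using the Anthropic Messages API's tool-use loop; the tool definition and dispatcher are assumptions about what such an agent would expose, not Arthur's actual implementation:

```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "create_jira_ticket",
        "description": "File a Jira ticket once title, ADF description, and priority are decided.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description_adf": {"type": "object"},
                "priority": {"type": "string", "enum": ["Low", "Medium", "High", "Blocker"]},
            },
            "required": ["title", "description_adf", "priority"],
        },
    },
]


def run_agent(transcript: str) -> None:
    messages = [{"role": "user", "content": f"Turn this thread into a Jira ticket:\n{transcript}"}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # model choice is illustrative
            max_tokens=2048,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            break  # the model finished without requesting another tool call
        # Execute each requested tool and feed the results back to the model.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = dispatch_tool(block.name, block.input)  # hypothetical dispatcher
                results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
        messages.append({"role": "user", "content": results})
```

Splitting the work into explicit tool calls is what makes each decision individually traceable and testable, instead of buried inside one opaque completion.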

Finally, they iterated with continuous evaluations. Every new trace ran through the evals automatically. When a change caused a regression, they caught it immediately. They refined the priority logic to reserve high priority for genuine high-impact issues, added explicit instructions for ADF formatting, and improved how the agent extracted reproduction steps and context from threads.
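The continuous-evaluation loop can be sketched by reusing the binary evals from earlier: every new batch of traces is scored, and a drop in any pass rate flags a regression before it spreads. The trace shape and tolerance here are illustrative:

```python
# Reuses the eval functions sketched earlier; the trace shape is illustrative.
EVALS = {
    "adf_formatting": lambda trace: eval_adf_formatting(trace["ticket"]),
    "priority": lambda trace: eval_priority(trace["ticket"], trace["is_production_impact"]),
    "completeness": lambda trace: eval_completeness(trace["ticket"]),
}


def pass_rates(traces: list[dict]) -> dict[str, float]:
    """Per-failure-mode pass rate over a batch of recent traces."""
    return {name: sum(check(t) for t in traces) / len(traces)
            for name, check in EVALS.items()}


def regressions(current: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Names of evals whose pass rate dropped more than the tolerance."""
    return [name for name, rate in current.items()
            if rate < baseline[name] - tolerance]
```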

This systematic process — instrument, evaluate, refactor, iterate — allowed them to make meaningful improvements every single day instead of guessing in the dark.

The Role of Arthur’s Own Platform

Arthur Engine played a central role throughout the entire transformation. The team didn’t just use it as a nice-to-have tool — it became the foundation that made rapid, confident iteration possible.

The tracing capabilities gave them complete visibility into every step the agent was taking. Instead of wondering why a ticket came out wrong, they could open a trace and see exactly where the model made a poor decision or missed important context. This visibility removed most of the guesswork that had plagued the original version.

The evaluation framework was equally important. By defining clear, automated evals early, they could measure progress objectively. Every change was tested against the same set of criteria, so they knew with confidence whether an update actually made the agent better or introduced new problems.

Prompt management inside Arthur Engine allowed them to version prompts and update them without redeploying the bot. This meant they could experiment safely and roll back quickly if needed. The combination of tracing, evals, and prompt management created a tight feedback loop that accelerated their progress dramatically.
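Arthur Engine's actual prompt-management API isn't shown in this story, but the underlying pattern is generic: prompts live in a registry keyed by name and version, and the running bot fetches the active version at request time instead of hard-coding the text. A hypothetical sketch of that pattern:

```python
# Hypothetical in-memory registry illustrating the pattern; Arthur Engine's
# real prompt-management API is not shown in this story and may differ.
class PromptRegistry:
    def __init__(self) -> None:
        self._versions: dict[str, dict[int, str]] = {}
        self._active: dict[str, int] = {}

    def publish(self, name: str, text: str) -> int:
        """Store a new version of a prompt and make it the active one."""
        versions = self._versions.setdefault(name, {})
        version = max(versions, default=0) + 1
        versions[version] = text
        self._active[name] = version
        return version

    def rollback(self, name: str, version: int) -> None:
        """Point the bot back at an earlier version; no redeploy needed."""
        self._active[name] = version

    def get(self, name: str) -> str:
        """What the running bot fetches at request time."""
        return self._versions[name][self._active[name]]
```

Under this pattern, publishing a tweaked summarization prompt and rolling it back when the evals regress is a registry operation rather than a deployment, which is the safety property described above.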

In the end, their own platform wasn’t just helpful — it was what enabled them to move from a flaky prototype to a reliable agent in only two weeks. The experience reinforced a broader lesson: the same rigor they recommend to customers applies equally to their own internal tools.

Our Take

The Arthur.ai team’s two-week journey offers a practical blueprint for anyone building agents today. What began as an optimistic quick win revealed that speed without visibility and evaluation leads to fragile results. The real breakthrough came when they decided to treat their internal bot with the same discipline they apply to customer-facing work.

By instrumenting early, writing evals first, refactoring into a proper multi-step agent, and iterating with continuous feedback, they created something that now works reliably instead of only sometimes. The process showed that good prompting alone is rarely enough. Reliable agents need observability, structured evaluation, and version control for prompts.

For teams building agents — whether internal tools or customer products — this story highlights an important truth. The difference between a vibe-coded prototype and a production-grade agent is not luck or better prompts. It is systematic instrumentation and evaluation from the very beginning.

As agentic systems become more common, the organizations that succeed will be those that treat reliability as a measurable property rather than an assumption. Arthur’s experience shows that the tools and processes needed to achieve that reliability already exist — the challenge is choosing to use them consistently, even on internal projects that feel simple at first.
