Building an AI Engineering Manager with GitHub and Middleware HQ
17 min read
"Wouldn't it be great if at the start of each week, you could have a tiny little assistant tell you how each team of yours is doing, where they're struggling, and what needs to be addressed, saving you hours of meetings and follow-ups?"
Today's fast-paced development environment leaves little room for friction within organizations, and data-driven decisions are crucial at every crossroads for leaders who want to unlock the trapped potential of their teams. For years, trailblazing companies like Netflix, Spotify and Airbnb have studied their engineering performance via the DORA metrics to understand how their teams are doing and to fill in the gaps proactively, boosting productivity and preventing burnout. While it may seem taxing to set a system like this up, engineering leaders today already have all the data they need sitting quietly in the most trusted collaboration platform in the world -- GitHub.
GitHub holds a rich tapestry of data on your development processes. From its record of commits, pull requests, reviews and deployments, you can calculate key performance indicators such as lead time for changes, deployment frequency, mean time to recovery, and change failure rate -- the DORA metrics -- along with other trends that hold a mirror to your team's performance. These reports are essential for understanding the efficiency and reliability of your software delivery processes.
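To make that concrete, here is a minimal sketch of how one of those indicators, lead time for changes, could be derived from raw pull request timestamps. The records and field names below are hypothetical stand-ins for what you would pull from GitHub's API, and merge time is used as a simple proxy for reaching production.

```python
from datetime import datetime

# Hypothetical PR records; in practice these timestamps come from GitHub's API.
prs = [
    {"number": 101, "first_commit_at": "2024-08-01T09:00:00+00:00", "merged_at": "2024-08-03T15:30:00+00:00"},
    {"number": 102, "first_commit_at": "2024-08-02T11:00:00+00:00", "merged_at": "2024-08-02T18:00:00+00:00"},
]

def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two ISO-8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600

# Lead time for changes: time from first commit to merge,
# a common proxy when merges are deployed continuously.
lead_times = [hours_between(pr["first_commit_at"], pr["merged_at"]) for pr in prs]
print(f"Average lead time for changes: {sum(lead_times) / len(lead_times):.1f} hours")
```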
But how can we transform this raw data into something more meaningful, and hopefully get actionable insights that could improve our processes by even a single percent?
Let's talk AI, and enter the world of Retrieval-Augmented Generation (RAG) systems. In this blog, we'll walk you through building a RAG system on your GitHub data from Middleware to dive deeper into what your DORA metrics, change rates, and other trends mean for your team's velocity.
How Can We Get These Engineering Metrics?
While it's entirely feasible to pull all this data straight from GitHub's API, collecting, sanitizing and pre-processing it is too much work for folks who want insights quickly. We're here to free up more time, not lead you down another rabbit hole!
Free and open-source projects like MiddlewareHQ provide the tools needed to tap into GitHub's potential, delivering detailed analytics on key metrics such as:
DORA Metrics: Deployment frequency, lead time for changes, mean time to recovery, and change failure rate are crucial indicators of your team's efficiency.
Workflow Metrics: Data on your delivery workflows that helps identify bottlenecks and streamline processes.
Pull Request Trends: Analyzing pull request data uncovers areas for improvement in your review process.
By querying a RAG system with the data from Middleware, you can look into your team's development processes at a deeper level and get tangible insights. In simple terms, you can:
Identify Bottlenecks: Use lead time and first response time metrics to identify bottlenecks in the review process and allocate resources more effectively.
Improve Deployment Processes: Analyze deployment frequency trends to streamline and automate deployment pipelines for more consistent releases.
Reduce Failures: Investigate high change failure rates and mean time to recovery to enhance test coverage and improve incident response strategies.
Optimize Resources: Correlate rework time with PR counts to better allocate resources and prioritize critical tasks.
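As a rough illustration of the last two points, the sketch below flags slow first responses and relates rework time to PR counts. The field names (first_response_time, rework_time, in seconds) and the values are assumptions standing in for whatever your export actually contains.

```python
import pandas as pd

# Hypothetical PR-level records; the exact field names depend on your data source.
prs = pd.DataFrame([
    {"number": 1, "author": "alice", "first_response_time": 1200,  "rework_time": 3600},
    {"number": 2, "author": "bob",   "first_response_time": 86400, "rework_time": 0},
    {"number": 3, "author": "alice", "first_response_time": 300,   "rework_time": 7200},
])

# Flag PRs that waited more than a working day (in seconds) for a first review response.
slow_first_response = prs[prs["first_response_time"] > 8 * 3600]
print("PRs with a slow first response:", slow_first_response["number"].tolist())

# Relate rework time to PR counts per author to see where effort concentrates.
print(prs.groupby("author").agg(pr_count=("number", "count"), total_rework_s=("rework_time", "sum")))
```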
Building a RAG System with MiddlewareHQ
"Let's make sure our Artificial Intelligence is actually Intelligent." Retrieval-Augmented Generation (RAG) is a cutting-edge technique that integrates retrieval-based models with generative models to create a system capable of providing detailed, context-aware answers. Unlike traditional generative models that rely solely on their training data, RAG systems leverage external data sources to enhance their responses with real-time, relevant information.
This makes RAG particularly effective in scenarios where accessing up-to-date or specific data is critical. In the context of our engineering activity and DORA metrics, RAG lets us query large datasets from GitHub and MiddlewareHQ and extract relevant insights about our development processes.
Let's get started!
Step 1: Setting Up Your Environment
Ensure you have the necessary packages installed:
pip install langchain-chroma langchain-community pandas ollama
pip install -U sentence-transformers
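Since the final step queries a locally running Ollama server, a quick sanity check is worth doing up front. This optional sketch uses the ollama client installed above; the model name is only an example (llama2 is, at the time of writing, the default model of LangChain's Ollama wrapper used later).

```python
import ollama

# Ask the local Ollama server which models are available
# (this raises a connection error if the server isn't running).
print(ollama.list())

# Pull a model if you don't have one yet. "llama2" is only an example choice.
ollama.pull("llama2")
```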
Step 2: Getting the Data!
You can get all the data you'd need straight from GitHub, but you'd have to perform a fair bit of calculation and sanitization on that data to make it readable for our LLM's context window.
You can choose to do it all manually, or get MiddlewareHQ's Open Source DORA metrics tool to do the dirty work for you.
MiddlewareHQ integrates easily with GitHub, allowing you to access your data quickly. Once connected, you can start analyzing metrics such as DORA and pull request data.
We're going to use this project and its API calls to get the data we need for our RAG. To get started, you can follow the steps listed here for the Developer Setup of MiddlewareHQ.
Once you have the application running, you just need to add an integration with GitHub and link the repositories you want to assess. To keep things interesting for this guide, we're going to get the DORA metrics and PR data from 3 of the most active and watched repositories in the world of engineering and DevOps today: Prometheus, Kubernetes, and Grafana.
It takes Middleware a few minutes to crunch the numbers and spin up a pretty dashboard. While there's plenty to take away from the dashboard itself, we want to go a few layers deeper, so let's use this data from Middleware to build our RAG.
The Middleware team maintains an extensive API to make it easy to interface with your DORA metrics installation. You'll need to access this Postman Collection and use all the APIs listed under the DORA metrics subfolder to get the data we need. These responses will contain the four keys used to calculate the DORA metrics we're looking for and their historic trends -- a very useful datapoint when making informed decisions.
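For reference, the loading code in Step 3 reads a single data.json whose top-level keys are shown in the sketch below. Only those top-level keys are taken from the code that follows; the nested field names and values are placeholders, not Middleware's actual response schema, so treat this purely as an assumed shape.

```python
import json

# Illustrative only: top-level keys match what Step 3 reads;
# nested fields and values are placeholders, not the real API schema.
data = {
    "lead_time_stats": {"current": {"lead_time_hours": 209.51}, "previous": {"lead_time_hours": 150.01}},
    "deployment_frequency_stats": {"current": {"avg_daily": 28}, "previous": {"avg_daily": 41}},
    "change_failure_rate_stats": {"current": {}, "previous": {}},
    "mean_time_to_restore_stats": {"current": {}, "previous": {}},
    "lead_time_prs": [
        {"number": 91228, "title": "K8s: Data Plane Aggregator", "author": {"username": "example-user"}},
    ],
}

# Save the merged API responses to the data.json file used in the next step.
with open("data.json", "w") as f:
    json.dump(data, f, indent=2)
```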
Psst: Star and watch the Middleware repo and never miss out on cool updates and resources from the MW team!
Step 3: Loading and Preparing Your Data
Let's start by loading the JSON data containing our metrics into the environment:
import json
import pandas as pd
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain_core.documents import Document
from langchain_community.embeddings import SentenceTransformerEmbeddings
# Load JSON data from file
with open('data.json', 'r') as f:
    data = json.load(f)
# Convert JSON to DataFrames for each metric
lead_time_df = pd.DataFrame(data['lead_time_stats']).T
mean_time_to_restore_df = pd.DataFrame(data['mean_time_to_restore_stats']).T
change_failure_rate_df = pd.DataFrame(data['change_failure_rate_stats']).T
deployment_frequency_df = pd.DataFrame(data['deployment_frequency_stats']).T
prs_df = pd.DataFrame(data['lead_time_prs'])
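Nothing here is strictly required, but a quick glance at what was loaded confirms the frames and the PR list look the way the next steps expect:

```python
# Quick sanity check before building embeddings.
print(lead_time_df)
print(deployment_frequency_df)
print(f"Loaded {len(prs_df)} pull requests")
```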
Step 4: Generating Document Embeddings
We'll use SentenceTransformerEmbeddings from LangChain to create embeddings from the documents:
# Initialize the SentenceTransformerEmbeddings model
embedding_model = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2"
)
# Create documents for embeddings
docs = [
    Document(
        page_content=f"PR {pr['number']} titled '{pr['title']}' by {pr['author']['username']}",
        metadata=pr
    )
    for pr in data['lead_time_prs']
]
# Extract texts for embedding
texts = [doc.page_content for doc in docs]
# Generate embeddings for each document
embeddings = embedding_model.embed_documents(texts)
Step 5: Simplifying Metadata for Analysis
To ensure compatibility with vector stores like Chroma, simplify the metadata by converting complex data types into simple ones:
# Manually simplify metadata
def simplify_metadata(metadata):
    simplified = {}
    for key, value in metadata.items():
        # Only keep simple types
        if isinstance(value, (str, int, float, bool)):
            simplified[key] = value
        elif isinstance(value, dict):
            # Flatten nested dictionaries
            for subkey, subvalue in value.items():
                if isinstance(subvalue, (str, int, float, bool)):
                    simplified[f"{key}_{subkey}"] = subvalue
    return simplified
simplified_metadata = [simplify_metadata(doc.metadata) for doc in docs]
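To see what this does, a hypothetical PR record with a nested author object flattens like so:

```python
# Example of the flattening: nested objects such as the author become scalar keys.
example_pr = {"number": 123, "title": "Fix flaky test", "author": {"username": "example-user"}}
print(simplify_metadata(example_pr))
# {'number': 123, 'title': 'Fix flaky test', 'author_username': 'example-user'}
```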
Step 6: Initializing the Vector Store
With the embeddings and metadata prepared, initialize the Chroma vector store:
# Initialize Chroma vector store with the embeddings function
vectorstore = Chroma(embedding_function=embedding_model)
# Add texts and metadata to the vector store; Chroma generates the
# embeddings itself using the embedding_function configured above
vectorstore.add_texts(texts, metadatas=simplified_metadata)
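Before wiring up the language model, you can check that retrieval works on its own; the query string below is just an example:

```python
# Sanity-check retrieval: fetch the three PR summaries closest to an example query.
for doc in vectorstore.similarity_search("pull requests with long review times", k=3):
    print(doc.page_content)
```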
Step 7: Querying the RAG System
You can now query your RAG system using natural language questions:
# Define a function to query the RAG
def query_rag(question: str) -> str:
    llm = Ollama()  # uses LangChain's default local model; pass model="..." to choose another
    context = vectorstore.similarity_search(question, k=5)
    context_texts = [doc.page_content for doc in context]  # Extract context texts
    # Concatenate context texts into a single prompt
    prompt = f"Context: {' '.join(context_texts)}\nQuestion: {question}"
    answer = llm.invoke(prompt)  # invoke() returns the completion as a string
    return answer
# Example questions
questions = [
    "How are my 3 teams i.e. Prometheus, Kubernetes and Grafana --- doing according to the data? Use their four keys i.e. Deployment Frequency, Lead Time for Changes, Change Failure Rate and Time to Restore Service to formulate your answer.",
    "Which team is struggling: Prometheus, Kubernetes and Grafana, in terms of lead time and can we say why?",
    "What's working in the other teams that the struggling team from above could learn from?",
    "How is the overall deployment frequency across all teams?",
    "Are there any teams that are struggling with deployments? What are the current bottlenecks in our development pipeline?",
    "Which pull requests across Prometheus, Kubernetes and Grafana are taking the longest to merge, and can we say why?",
    "How does our current change failure rate compare to last week?",
    "Are there any patterns in first response times that could indicate potential bottlenecks in the review process?",
    "Which teams need additional support or resources? Highlight teams with high workloads, frequent overtime, or declining performance metrics."
]
# Query the RAG system and print answers
for question in questions:
    answer = query_rag(question)
    print(f"Question: {question}\n Answer: {answer}\n")
Why Tracking Engineering Metrics Matters
The insights derived from engineering metrics are more than just numbers---they represent the pulse of your development processes. Many organizations, despite being fast-paced, lack visibility into their bottlenecks and inefficiencies, leading to reactive management and missed opportunities for improvement.
MiddlewareHQ provides a platform that encourages proactive monitoring and continuous improvement by delivering lucid insights that help identify issues before they escalate. This approach allows engineering leaders to foster a culture of innovation and efficiency, ensuring that their teams not only meet deadlines but also drive meaningful change.
Star and watch the Middleware repo and get notified of updates, events and resources from the MHQ team! 10 subscribers every month win Middleware merch delivered straight to their doorstep.
Now for what you've been waiting for, let's see what our RAG has to say about the performance of our brave little projects:
Question 1. How are my 3 teams i.e. Prometheus, Kubernetes, and Grafana --- doing according to the data? Use their four keys i.e. Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service to formulate your answer.
Answer: Based on the data:
Deployment Frequency: Current: The average daily deployment frequency is 28, with an average monthly deployment of 840 and a weekly deployment frequency of 196.
Previous: The previous period had a higher average daily deployment of 41, suggesting that the deployment frequency has decreased recently. This could indicate that teams are slowing down in their deployment cadence.
Lead Time for Changes: The lead time has increased significantly, from around 150.01 hours in the previous period to 209.51 hours currently. This suggests that the teams are taking longer to deliver changes, which could be due to various factors such as more complex changes, increased rework, or delays in the review and merge process.
Change Failure Rate: The current change failure rate is approximately 46.5%, which is a significant increase from the previous rate of about 16.2%. This indicates that nearly half of the deployments are failing, which is a serious concern and could be indicative of quality or stability issues in the codebase.
Time to Restore Service: The mean time to recovery (MTTR) has also increased. The current MTTR is approximately 139.88 hours, up from a period where there were no recorded incidents. This suggests that when incidents do occur, they are taking much longer to resolve.
Question 2. Which team is struggling: Prometheus, Kubernetes, and Grafana, in terms of lead time and can we say why?
Answer: Based on the provided data:
The lead time has increased significantly, with Grafana struggling the most with lead time. Some of the PRs in Grafana have extremely high cycle times, indicating a delay in progressing from the initial commit to final deployment.
The increased rework time in Grafana PRs suggests that there might be more issues that require addressing after the initial review, leading to longer lead times.
There could be several reasons for this, including complex PRs that require extensive review, delays in the response from reviewers, or a backlog in the merge process.
Question 3. What's working in the other teams that the struggling team from above could learn from?
Answer: Kubernetes and Prometheus appear to have shorter cycle times and more efficient review processes based on their faster merge times and lower rework times.
Grafana could benefit from adopting practices from these teams, such as:
Improving the review process to ensure faster feedback cycles.
Streamlining and standardizing the merge process to reduce bottlenecks.
Addressing rework issues more proactively by involving multiple reviewers or implementing pair programming.
Kubernetes has a mature issue triaging process in place that the other teams could learn from.
Question 4. How is the overall deployment frequency across all teams?
Answer: The deployment frequency across all teams---Prometheus, Kubernetes, and Grafana---provides an essential snapshot of how often changes are being delivered to production. Here's what the data shows:
Current Deployment Frequency: Total Deployments: 430 deployments. Average Daily Deployments: 28 deployments. Average Weekly Deployments: 196 deployments. Average Monthly Deployments: 840 deployments.
Previous Deployment Frequency: Total Deployments: 619 deployments. Average Daily Deployments: 41 deployments. Average Weekly Deployments: 287 deployments. Average Monthly Deployments: 1,230 deployments.
Trend Analysis: Significant Decrease in Deployment Frequency: The deployment frequency has decreased from 619 to 430 deployments---a 30.5% drop. This reduction is evident across daily, weekly, and monthly deployment frequencies.
Potential Causes:
Extended Lead Times: The increase in lead time for changes across the board means that it takes longer to get changes from development to production, leading to fewer deployments.
Increased Change Failure Rate: The current change failure rate is significantly higher (46.5%) compared to the previous period (16.2%). This higher failure rate likely contributes to the reduced deployment frequency, as teams may be more cautious or experience delays in deploying due to the need for additional fixes and rework.
Rework Cycles: The data indicates longer rework cycles, suggesting that issues identified late in the process are causing delays in deployment, further reducing frequency. The overall deployment frequency across all teams has decreased markedly. This decrease is likely due to a combination of longer lead times, increased change failure rates, and higher rework cycles. These factors are contributing to a slower pace of getting changes into production.
Question 5. Are there any teams that are struggling with deployments more than the rest? What are the current bottlenecks in our development pipeline?
Answer: The decrease in deployment frequency across all teams and the high change failure rate suggest that deployments are becoming more challenging for all teams. With the help of the correlating data, we can conclude that Grafana might be struggling most with deployments. Bottlenecks identified include:
Lead time: Increased lead times across the board suggest delays in getting code changes to deployment.
Rework time: A significant portion of the lead time is spent on rework, which could be due to quality issues or incomplete initial implementations.
Review delays: Delays in the first response and merge times could be causing these bottlenecks, indicating a need for more reviewers or better prioritization of PRs.
However, it must be noted that all three repositories seem mature in most aspects of their operation and thus a few kinks in their deployment metrics could just be a function of their operation and expected by the team members.
Question 6. Which pull requests across Prometheus, Kubernetes, and Grafana are taking the longest to merge, and can we say why?
Answer: The data reveals several pull requests with extended cycle times from creation to merging. Here's a detailed look at some of the PRs with the longest merge times and the factors contributing to these delays.
1. Grafana - "OpenTSDB: Fix data frame construction" PR (#90991):
Cycle Time: 1,288,230 seconds (~14.9 days) Merge Time: 34 seconds (after a long cycle time) Rework Cycles: 1 rework cycle. Rework Time: 10,237 seconds (~2.8 hours)
Analysis: Review Delays: The PR had a substantial delay between its first commit and the first response, taking 1,277,959 seconds (~14.8 days) for the first response. This indicates a bottleneck in the review process, likely due to the complexity of the change or reviewer availability.
Complexity: The rework cycle suggests that issues were discovered late, necessitating further changes, which extended the cycle time.
2. Grafana - "K8s: Data Plane Aggregator" PR (#91228):
Cycle Time: 1,189,567 seconds (~13.8 days) Merge Time: 3,189 seconds (~53 minutes) Rework Cycles: No rework cycles were recorded, but the cycle time is still significant.
Analysis: Complex Changes: The PR involves substantial changes, with 5,554 additions and 56 deletions across 64 files, indicating a high level of complexity.
Review Process: The PR took a long time to progress from the first commit to its eventual merge, suggesting that either the changes were difficult to review or there were delays in assigning a reviewer.
Reviewer Bottleneck: The initial response time was delayed by 62,987 seconds (~17.5 hours), which, while not excessively long, may have compounded with the complexity of the PR to delay the overall merge time.
3. Kubernetes - "Handle all mock header generation with mockery" PR (#126540):
Cycle Time: 911,861 seconds (~10.6 days) Merge Time: 4,074 seconds (~1.1 hours) Rework Cycles: No rework cycles.
Analysis: PR Complexity: Although the number of additions and deletions is relatively small, the PR involved critical changes to how mock headers are generated, which likely required careful review.
Rework Time: The rework time was extensive at 877,648 seconds (~10.2 days), indicating that the PR underwent significant revision before it was finally merged.
Review Delays: The initial review response was delayed by 30,139 seconds (~8.4 hours), but the bulk of the delay appears to have been due to the need for careful revision and rework.
4. Prometheus - "Fix tsdb/db_test.go to satisfy Windows FS" PR (#14649):
Cycle Time: 597,751 seconds (~6.9 days) Merge Time: 2,366 seconds (~39 minutes) Rework Cycles: No rework cycles.
Analysis: Testing Delays: The PR's long cycle time may be attributed to the need to ensure compatibility with Windows File System, a potentially time-consuming task given the platform-specific issues.
Review Process: The PR had a relatively short initial response time of 87 seconds, suggesting that the delay was more likely due to the complexity of the testing process rather than reviewer availability.
Common Themes and Causes of Delays:
Complexity of Changes: The PRs with the longest cycle times generally involve complex changes that touch multiple parts of the codebase or involve critical components. This complexity often necessitates more in-depth reviews, testing, and rework, leading to longer merge times. This is an expected delay and is thus not a cause for concern.
Review Bottlenecks: Delays in assigning reviewers or in the initial review process are evident in several of these PRs. This could be due to a shortage of available reviewers, particularly for complex or specialized areas of the codebase.
In some cases, the first response time is reasonably quick, but subsequent review cycles or rework introduces significant delays.
Rework Cycles: Rework cycles, as seen in the Grafana PRs, significantly extend the overall time to merge. This suggests that issues are often identified late in the process, possibly due to initial oversights or the complexity of the changes.
Addressing the root causes of rework, such as improving initial code quality or enhancing automated testing, could help reduce these delays.
Testing Delays: For PRs like the one in Prometheus, extensive testing (especially for platform-specific changes) can be a significant source of delay. Ensuring that the code works across different environments may require additional cycles of testing and validation.
Question 7. How does our current change failure rate compare to last week?
Answer:
Current vs. Previous Change Failure Rate:
Current Period:
Change Failure Rate: 46.5% (0.465) Failed Deployments: 2 failed deployments. Total Deployments: 430 deployments.
Previous Period:
Change Failure Rate: 16.2% (0.162) Failed Deployments: 1 failed deployment. Total Deployments: 619 deployments.
The current change failure rate is significantly higher than the previous period, jumping from 16.2% to 46.5%. This sharp increase is alarming and suggests that recent changes might be more error-prone or that the codebase has become less stable.
Question 8. Are there any patterns in first response times that could indicate potential bottlenecks in the review process?
Answer: The first response times have varied significantly, with some PRs receiving responses almost immediately and others waiting for much longer periods. This inconsistency suggests potential bottlenecks, such as:
Reviewer availability or prioritization issues.
A lack of automated tools to assist with the initial triage and assignment of PRs to available reviewers.
Overall, the data indicates that Grafana might be the team struggling the most with several key metrics. There is a clear need to address the bottlenecks in the review process and provide additional support to improve deployment frequency and reduce the lead time and change failure rate across the board.
Question 9. Which teams need additional support or resources? Highlight teams with high workloads, frequent overtime, or declining performance metrics.
Answer: Grafana shows signs of needing additional support, given the longer lead times, high rework cycles, and increased time to recovery. The team might benefit from additional resources, whether in the form of more reviewers, more automated testing, or better tooling to manage complex PRs. With more data, such as comments from each pull request, issue, and incident, it would be possible to pinpoint the exact areas where Grafana needs help.
Prometheus and Kubernetes also show some concerning trends but not to the same extent as Grafana.