PyTorch Repository Dora Metrics: Faster Lead Time, Sluggish First Response Time
PyTorch is an open-source machine learning framework that began as an internship project by Adam Paszke, then a student working under Soumith Chintala (one of the developers behind Torch and a researcher at Meta, formerly Facebook). Since its release, PyTorch has become a favorite among researchers and developers, thanks to its flexibility, dynamic computation graphs, and user-friendly "pythonic" coding style.
In the field of Natural Language Processing (NLP), PyTorch has a strong edge, particularly through its close integration with Hugging Face's transformer models, which gives it a larger selection of pre-trained models than TensorFlow. As of 2023, PyTorch has played a significant role in developing some of the most talked-about models, like OpenAI's DALL-E 2, Stable Diffusion, and ChatGPT.
Since it's the AI world and we are just living in it, it's only fair that we analyze the engineering workflow of PyTorch's GitHub repository and see what insights we can gain from it.
We used Middleware OSS to understand their Dora Metrics. Dora Metrics are a set of key performance indicators that help organizations measure their software delivery performance and effectiveness in DevOps practices. The primary Dora metrics are: Deployment Frequency, Lead Time for Changes, Mean Time to Restore, and Change Failure Rate.
To learn more about Dora Metrics, check out the following blog: What are DORA Metrics? - Master DORA Metrics to Master World Class Software Delivery Processes
You can also check out our live demo to see what insights you can get into your own software delivery pipeline using Middleware OSS.
PyTorch Dora Metrics: Exceptional Cycle and Lead Times
PyTorch showcases exceptional cycle and lead times. Their cycle time was 6.5 days in July and 3 days in August and September.
Similarly, their lead time showed efficiency: features reached their final push within 3.1 days in August and September, and within 6.5 days in July.
Some of the features they successfully pushed in the last three months include: Enable NEON ISA detection by @kit1980, Dynamo benchmark skip logic by @zxd1997066, and Flash attention fix by @atalman.
Their efficient cycle and lead times likely stem from their exemplary merge times.
Also read: Lead Time Optimization 101 | Unlock Software Engineering Efficiency
First Response Time Playing a Spoilsport
That said, their first response time paints a grimmer picture, exposing a weak link in their otherwise strong cycle time.
In July, their first response time was 58 days, which improved to 26 days in August and 3.8 days in September.
A first response time measured in days is not a healthy sign, but the steady decline across August and September suggests they are taking measures to bring it down. Even so, they can do better.
Also read: OhMyZsh: Should Up their Deployment and Lead Time to Reach Their Ambitions
What are their Strengths?
Active community of 3.6K contributors: PyTorch is currently maintained by Soumith Chintala, Gregory Chanan, Dmytro Dzhulgakov, Edward Yang, and Nikita Shulga, with major contributions coming from hundreds of talented individuals in various forms.
A non-exhaustive but growing list needs to mention: Trevor Killeen, Sasank Chilamkurthy, Sergey Zagoruyko, Adam Lerer, Francisco Massa, Alykhan Tejani, Luca Antiga, Alban Desmaison, Andreas Koepf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein, Christian Sarofeen, Martin Raison, Edward Yang, Zachary Devito.
Efficient CI/CD Processes: PyTorch's success in fast merges indicates a robust CI/CD architecture.
For instance, PRs such as #137242 and #137148 highlight a consistent pattern of swift integrations without bottlenecks.
Where Are They Falling Behind? How Can They Leverage Their Strengths?
Slow First Response Times
They are taking days to respond to new PRs, and a slow first response creates a blocker at the very first step of the review process.
To improve this, they should:
Set Clear SLAs (Service Level Agreements): Establish response-time goals for issues and support requests, and make sure the team knows them. For example, commit to responding to every issue or pull request within 24 hours.
Assign Dedicated First Responders: PyTorch has a large, active community. They should leverage it by rotating team members to act as "first responders" for incoming issues and support tickets, triaging, acknowledging, and assigning requests to the right individuals. A minimal automation sketch follows this list.
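As an illustration, here is a minimal sketch of such a first-responder rotation, written against GitHub's REST API. The GITHUB_TOKEN environment variable, the RESPONDERS list, and the 24-hour SLA are all assumptions for the example, not PyTorch's actual process:

```python
"""A minimal sketch of a first-responder rotation for unanswered PRs.

Assumptions (not PyTorch's actual process): a GITHUB_TOKEN environment
variable with repo access, a hypothetical RESPONDERS list, and a 24-hour
SLA. Pagination and rate limits are ignored for brevity.
"""
import os
from datetime import datetime, timedelta, timezone
from itertools import cycle

import requests

API = "https://api.github.com/repos/pytorch/pytorch"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}
RESPONDERS = cycle(["maintainer-a", "maintainer-b"])  # hypothetical logins
SLA = timedelta(hours=24)


def past_sla_and_unanswered(pr: dict) -> bool:
    """True if the PR is older than the SLA and has no issue-level comments.

    (Review comments live under /pulls/{n}/comments and are not checked here.)
    """
    opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - opened < SLA:
        return False
    comments = requests.get(
        f"{API}/issues/{pr['number']}/comments", headers=HEADERS, timeout=10
    ).json()
    return len(comments) == 0


open_prs = requests.get(
    f"{API}/pulls", params={"state": "open", "per_page": 50},
    headers=HEADERS, timeout=10,
).json()

for pr in open_prs:
    if past_sla_and_unanswered(pr):
        # Assign the next responder in the round-robin and acknowledge the PR.
        who = next(RESPONDERS)
        requests.post(
            f"{API}/issues/{pr['number']}/assignees",
            headers=HEADERS, json={"assignees": [who]}, timeout=10,
        )
        requests.post(
            f"{API}/issues/{pr['number']}/comments", headers=HEADERS,
            json={"body": f"Thanks for the PR! @{who} will take a first look."},
            timeout=10,
        )
```

Run on a schedule (say, hourly via cron or a scheduled CI job), a script along these lines ensures every new PR gets acknowledged within the SLA even when maintainers are busy.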
Too Much Time and Effort Spent on Bug Fixes
They are currently spending around 35% of their time and effort on bug fixes. It's like being stuck in a loop: unless they streamline their process and focus on quality, bugs will keep popping up, and fixing them will be all they do. Resources that could be pushing new features efficiently will instead be spent on bug fixes.
To improve this, they should:
Automate Testing: Implement automated unit tests, integration tests, and regression tests to catch bugs early in the development cycle. This reduces the manual effort needed to track down bugs later.
Implement Continuous Integration/Continuous Deployment (CI/CD): They already have an efficient CI/CD pipeline. They should make full use of it to ensure that code changes are tested and deployed quickly, reducing delays caused by bugs surfacing late in the process. They can also leverage Dora Metrics to optimize their CI/CD pipeline.
Create a Clear Bug Report Process: Streamline how bugs are reported, ensuring that each bug includes relevant details like steps to reproduce, screenshots, and logs. Well-documented bug reports make it easier to identify and fix issues faster.
Assign Experienced Developers to Critical Bugs: Senior developers or those with deep project knowledge can often diagnose and resolve critical bugs more efficiently. This frees up junior developers to focus on less urgent issues or new features.
Track Metrics and Trends: Use tools like Dora metrics to measure lead time for changes and analyze bug-fix performance. This helps identify bottlenecks and improve future workflows; a minimal sketch of measuring lead time by hand follows this list.
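For illustration, here is a minimal sketch of computing lead time for changes (first commit to merge) over recently merged PRs via GitHub's REST API. The GITHUB_TOKEN environment variable and the 30-PR sample size are assumptions for the example; tools like Middleware OSS compute this continuously:

```python
"""A minimal sketch of measuring Lead Time for Changes by hand.

Assumes a GITHUB_TOKEN environment variable; samples only the 30 most
recently updated closed PRs and ignores pagination, so treat this as an
illustration rather than a production metric pipeline.
"""
import os
import statistics
from datetime import datetime

import requests

API = "https://api.github.com/repos/pytorch/pytorch"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def iso(ts: str) -> datetime:
    """Parse GitHub's ISO-8601 timestamps (e.g. '2024-09-01T12:00:00Z')."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


lead_times_days = []
prs = requests.get(
    f"{API}/pulls",
    params={"state": "closed", "sort": "updated", "direction": "desc",
            "per_page": 30},
    headers=HEADERS, timeout=10,
).json()

for pr in prs:
    if not pr["merged_at"]:
        continue  # closed without merging; not a shipped change
    commits = requests.get(pr["commits_url"], headers=HEADERS, timeout=10).json()
    if not commits:
        continue
    # Lead time here = first commit authored -> PR merged.
    first_commit = min(iso(c["commit"]["author"]["date"]) for c in commits)
    lead_times_days.append((iso(pr["merged_at"]) - first_commit).days)

if lead_times_days:
    print(f"Median lead time over {len(lead_times_days)} merged PRs: "
          f"{statistics.median(lead_times_days)} days")
```

Tracking this median week over week makes regressions visible early: a creeping lead time usually points at review or CI bottlenecks before they show up anywhere else.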
Also read: Ollama says "Oh Lord" when shipping updates: A Dora Metrics case study
Our Verdict: PyTorch Excels in Lead Time but Needs Improvement in First Response Time
PyTorch’s main challenge lies in their first response time, but the solution is already within reach. They can tap into their vibrant community and robust CI/CD pipeline to address this issue. Additionally, encouraging senior contributors to handle bug fixes can help clear the backlog, allowing the team to shift focus towards new features and innovation, rather than getting bogged down in a cycle of bug fixes.
If you are facing similar engineering dilemmas, write to us at productivity@middlewarehq.com and we would be happy to provide actionable insights into your workflow. You can also track your Dora metrics yourself using Middleware Open Source, free of charge!
Did you know?
PyTorch is heavily supported by Meta (formerly Facebook), and it plays a key role in advancing the company's AI research initiatives.