MontyCloud’s interactive CloudOps Copilot simplifies complex cloud operations by leveraging Generative AI technologies and research from IIITH’s Software Engineering Research Centre.
Businesses are rapidly embracing the public cloud for its numerous benefits, including cost efficiency, performance, reliability, and scalability. However, while a shift to the cloud is hard enough, harder still are the challenges that customers face in managing their application stack. “This is due to the shared responsibility model required by cloud providers,” says Venkat Krishnamachari, CPO and co-founder at MontyCloud, a US-based startup involved in intelligent CloudOps solutions. Venkat elaborates that while cloud providers such as Amazon Web Services secure the underlying service, host operating systems, physical hardware and datacenters, customers are responsible for protecting their application stack, including the guest operating system and all application components. To help customers, cloud providers have published best practices, such as Amazon’s “Well-Architected Framework” (WAFR), which outlines the shared responsibilities across six pillars. Despite this, customers still face challenges such as cost overruns, security and compliance gaps, and operational overheads. It is in this space that MontyCloud fits, with its mantra of ‘helping customers innovate more and operate less’. “We are always looking to simplify the cloud operations burden with intelligent automation,” affirms Venkat.
How It Started
While the start-up has launched autonomous bots for security and compliance that operate within the latest policies and guidelines of organisations, the MontyCloud team was looking to further simplify its solution and make it more accessible to customers in the wake of Generative AI. “That’s when we connected with the team at IIITH,” recalls Kannan Parthasarthy, MontyCloud’s CTO and co-founder. There was a perfect alignment of interest with Prof. Karthik Vaidhyanathan, whose research area lies at the intersection of software engineering and machine learning. “Their problem statement was that CloudOps is a complex multi-dimensional space necessitating the need for a simple yet powerful solution,” details Prof. Vaidhyanathan, adding, “They essentially wanted an automated solution, a cloud operations copilot that could help customers find specific answers to their questions without needlessly navigating through multiple UIs, links, dashboards and others.” Prof. Vaidhyanathan had previously worked as an industry consultant on NLP-related projects, developing chatbots that doubled as sales professionals and assistants that helped retrieve accurate data from relevant documents, so the solution sought felt like an excellent match for his skills and interests.
Industry-Research Collab
The professor assembled a team comprising PhD student Rudra Dhar and research assistants Adyansh Kakran and Shrikara A, who work on using generative AI to improve software architectural design, and the team began actively exploring the use of GenAI specifically for simplifying cloud operations. Prof. Vaidhyanathan observes that while companies keen to build LLMs know that their use cases can be solved by these models, they often seek guidance on how exactly to fine-tune the systems for those particular use cases. And that’s not all. Building LLMs comes with its own set of challenges. “They have a tendency to hallucinate and come up with random answers to questions,” explains Prof. Vaidhyanathan.
Speaking about the partnership with IIITH, Kannan Parthasarthy says that the IIITH team was brought in to jointly create and expedite MontyCloud’s CloudOps copilot to act as a conversational AI agent allowing users to interact with the platform beyond the traditional UX constraints. “The copilot, internally nicknamed Marvin, was also positioned to deliver simplified and actionable guidance, based on the insights found by MontyCloud’s automation for Well-Architected Framework Review,” he adds. While MontyCloud’s team built custom workflows and a pipeline to make sure the AI solution is performant and the data is secure, IIITH helped in making the copilot an intelligent agent, reliable for its results with limited or zero hallucination where possible. “Together the teams prototyped, developed and launched the copilot in a record time frame,” he says.
The Edge
In December 2023, the CEO as well as the CPO and Co-founder of MontyCloud were invited as speakers to the Automations Fest, hosted by the AWS Automations Solution Business. It is a global conclave for leaders who wish to better understand current business practices and are looking to increase operational effectiveness through automation solutions. The team demonstrated the power of intelligent automation at the conference, where the centrepiece of the demo was the CloudOps Copilot. “While AWS’s framework has questions and suggested answers, MontyCloud’s own take on WAFR augments it with automated checks that run on the workload and generates a deterministic, actionable report that can be put to use to remediate the issues found,” says Venkat, emphasising that Marvin brings this complex report to every user of the platform and not just the power users of the cloud.
Observing that domain-specific LLMs are the need of the hour, Prof. Vaidhyanathan admits that preserving privacy is a challenge. “This is where Amazon Bedrock really helped us…in ensuring that data and applications are kept secure and private, by design.” The IIITH team also helped bridge a noteworthy chasm between the LLM and end users by guiding GenAI with the precise kind of prompt engineering needed to generate the desired answers from Marvin. “I really don’t feel that it’s engineering per se, but yes, different kinds of inputs need to be experimented with to get the best results,” he says.
In the Pipeline
While an autonomous agent simplifying cloud operations is great news, currently the onus is on the customer or the business to take appropriate actions based on the insights provided. “We want Marvin to evolve beyond the current chat interface, to truly understand the operational needs of a project, policy and compliance requirements without having to write complex rules, to continuously monitor the event stream of the cloud footprint and identify violations or transgressions. It ought to suggest remedial actions to perform during incidents. The idea is to maintain a human in the loop so while CloudOps management itself is automated, the human still makes the final decision,” says the professor.
Traditional Software Lifecycle Vs. ML Lifecycle
As researchers straddling software engineering and ML research, the IIITH team faced a paradigm shift in their traditional thought process. “When you build conventional software systems, you build a front-end based on the back-end data provided to you,” says Prof. Vaidhyanathan, adding that in the case of AI systems, this is no longer the case. “We are increasingly going to build interactive bots and agents. Hence our APIs and the back-end systems ought to be such that they provide data as required by the bots. It’s something we discovered during the creation of Marvin.” Another difference he points out concerns versioning: in a traditional software system, it is the code that is versioned. “In ML systems, we need to version the data that is used to build the models for maintainability purposes,” he reasons. According to him, when code, data and consequently the models themselves are not versioned, it becomes difficult to fathom which model is responsible for which output. Also, over time ML models have a tendency to drift or degrade, necessitating their retraining at frequent intervals. “In order to capture new data, retraining models on different sets of data points is crucial in order to maintain high levels of accuracy,” avers the professor. Moreover, the process of building ML systems brings its own challenges related to communication and collaboration between the ML team and the software team. As a researcher, he is optimistic that the machine learning community is realising and adopting good software engineering practices, not only for building ML systems but also for maintaining them.
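For readers who want to picture what versioning code, data and models together can look like, here is a minimal sketch in Python. It is an illustration only and not a description of how Marvin or MontyCloud's pipeline is built: the names `data_fingerprint`, `ModelRecord` and `ModelRegistry`, the model identifiers and the sample rows are all hypothetical. The idea is simply to record, for every trained model, the code revision and a content hash of the training data, so that an output can later be traced back to the model and the data that produced it.

```python
import hashlib
import json
from dataclasses import dataclass


def data_fingerprint(rows):
    """Return a stable SHA-256 hex digest for a list of JSON-serialisable rows."""
    # Sorting keys and fixing separators makes the digest depend only on content.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    """Ties one model to the code revision and data version it was built from."""
    model_id: str       # hypothetical identifier for a trained model
    code_version: str   # hypothetical code revision, e.g. a commit hash
    data_version: str   # content hash of the training data


class ModelRegistry:
    """Toy in-memory registry; a real system would persist these records."""

    def __init__(self):
        self._records = {}

    def register(self, model_id, code_version, rows):
        record = ModelRecord(model_id, code_version, data_fingerprint(rows))
        self._records[model_id] = record
        return record

    def lookup(self, model_id):
        return self._records[model_id]


# Hypothetical sample data: the second model is retrained on one extra row.
rows_v1 = [{"question": "cost", "answer": "report A"}]
rows_v2 = rows_v1 + [{"question": "security", "answer": "report B"}]

registry = ModelRegistry()
v1 = registry.register("copilot-v1", "abc123", rows_v1)
v2 = registry.register("copilot-v2", "abc123", rows_v2)
```

In this sketch the two models share a code version but carry different data versions, which is exactly the situation where versioning code alone would not reveal which model produced a given answer.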