AI for DevOps Engineers - Part 3: Infrastructure, Operations, Security, and Agents


Bicycle

In the previous parts (part one and part two) of this blog series, we explored the challenges facing DevOps today, how AI can address them, and how to build powerful AI applications using frameworks like LangChain. Now, in this final part, we'll look at infrastructure options for hosting AI applications, optimising their performance, enabling guardrails for secure interaction, and using AI agents to automate complex workflows.

Hosting Options for AI Applications

LLM Hosting

When deploying AI applications, one of the first decisions is choosing the hosting option for our language model. There are two main approaches: cloud-based models and self-hosted models.

Cloud-Based Models

Cloud-based models are provided by companies like OpenAI, Anthropic, Azure AI, and even more cloud providers. These services are popular for their ease of use and scalability.

Advantages:

  • Ease of Use: We simply sign up, connect the app to the service API, and we're ready to go.
  • Scalability: Handle anything from small projects to massive workloads.
  • Regular Updates: Providers roll out new features and improvements regularly.

Considerations:

  • Internet Dependency: Requires a stable internet connection.
  • Limited Control: We're tied to a third-party service, which could mean limited control over the model.
  • Data Privacy: Sensitive data like passwords or user information may require additional safeguards.

Self-Hosted Models

Self-hosting involves running the model on our own infrastructure using tools like Ollama, Llama.cpp, or LM Studio.

Advantages:

  • Full Control: Customize the model and data flow to suit your needs.
  • Enhanced Privacy: Data stays within our environment.

Challenges:

  • Technical Expertise: Requires knowledge to set up and maintain the system.
  • Hardware Requirements: Needs appropriate hardware, such as GPUs.
  • Maintenance Responsibility: We’re responsible for updates and smooth operation.

Decision Factors:

  • Security Needs: For sensitive data, self-hosting offers better control. If using cloud-based models, we might consider anonymizing or hashing data.
  • Scalability: Cloud solutions are better for large or unpredictable workloads.
  • Cost: It's a good idea to compare cloud subscription costs with the expenses of maintaining local infrastructure.

Optimizing Performance: Inference Speed

Inference speed refers to how quickly a model processes and generates responses. Several factors influence this:

  • Hardware Acceleration: Using GPUs or Tensor Processing Units (TPUs) can significantly speed up inference through parallelized computations. This is especially useful for large-scale or complex applications, as CPUs are slower due to sequential processing.
  • Model Size: LLMs contain billions of parameters. Smaller models generally provide faster inference times. While they may sacrifice some accuracy, they are practical for real-time or resource-constrained environments.
  • Quantization: Quantization reduces the precision of model weights (e.g., from 32-bit to 8-bit), improving speed and reducing memory usage with minimal performance loss.
  • Caching: Caching stores common responses or intermediate results, saving computation time for repeated queries and improving efficiency.

Managing Multiple Models: Language Model Proxying

If the application requires different models for various tasks, language model proxying can help. This technique intelligently routes requests to specific models based on predefined factors or tasks.

LiteLLM: Simplifying Multi-Model Management

LiteLLM is an open-source framework designed to simplify working with multiple language models. It provides a standardized API to call over 100 different LLMs, such as OpenAI, Anthropic, Google Gemini, and Hugging Face.

Benefits:

  • Unified Interface: Consistent response formatting across providers.
  • Simplified Management: We can easily switch between models without rewriting code.
  • Advanced Features: Includes automatic retry, fallback mechanisms, and spend tracking.

Example: Using LiteLLM for Multi-Model Applications

Here’s how we can use LiteLLM to manage multiple LLMs in a single application:

 1import streamlit as st
 2from litellm import completion
 3
 4st.title("Multi-Model Chat")
 5
 6# LiteLLM Completion function to get model response
 7def get_model_response(model_name: str, prompt: str) -> str:
 8  response = completion(model=model_name, messages=[{"role": "user", "content": prompt}])
 9  return response.choices[0].message.content
10
11# Streamlit UI Selection for model
12model_option = st.selectbox("Choose a language model:", ("gpt-3.5-turbo", "ollama/llama2", "gpt-4o"))
13
14# Chat history session state
15if 'chat_history' not in st.session_state:
16  st.session_state['chat_history'] = []
17
18user_input = st.text_input("You:")
19
20# Send user input and get model response
21if st.button("Send") and user_input:
22  st.session_state['chat_history'].append({"role": "user", "content": user_input})
23  with st.spinner("Thinking..."):
24    response = get_model_response(model_name=model_option, prompt=user_input)
25  st.session_state['chat_history'].append({"role": "model", "content": response})
26
27for message in reversed(st.session_state['chat_history']):
28  st.write(f"{message['role'].capitalize()}: {message['content']}")

In this example, we create a simple chat application that allows users to choose from different language models. The get_model_response function sends the user input to the selected model and returns the response. The chat_history session state retains the conversation history, and the Streamlit interface displays the chat messages in an interactive web UI.

Monitoring LLM Applications

LLM Monitoring

Once our AI application is running, monitoring its performance is critical to ensure reliability and efficiency.

Key Metrics to Monitor

Performance Metrics:

  • Response Time: How quickly the model generates a response.
  • Throughput: Number of requests handled in a given time.
  • Latency: Time to first token and total generation time.
  • Error Rates: Track failed requests or timeouts to identify instability.

User Engagement Metrics:

  • User Retention Rates: Frequency of users returning.
  • Session Duration: Average time users spend interacting.
  • Interaction Frequency: How often users engage.
  • User Feedback Scores: Direct user input for improvements.

Observability:

  • Comprehensive Logging: Capture detailed system behavior.
  • Distributed Tracing: Track requests to identify bottlenecks or failures.
  • Real-Time Dashboards: Visualize metrics and respond to issues.

Tools for Monitoring

We can use AI-powered tools like Grafana or BigPanda for effective monitoring and analysis.

Cost Considerations for LLM Applications

Running LLM applications can be expensive, so understanding and managing costs is essential.

Key Cost Factors

Cost per Token/Character:

  • Most LLM APIs charge based on the number of tokens processed (input + output).
  • Advanced models (e.g., GPT-4) cost more than simpler ones (e.g., GPT-3.5).

Volume Discounts:

  • Tiered pricing structures reduce costs as usage increases.

Additional Features:

  • Fine-tuning or specialized models enhance performance but add costs.

Usage Limits/Quotas:

  • Exceeding limits can lead to unexpected charges or interruptions.

Best Practices

  • Compare costs between providers to find the best fit.
  • Use caching to reduce API calls and save on token usage.
  • Monitor usage to stay within limits and avoid unexpected charges.
  • Understand limits to design scalable applications.
  • For high-traffic apps, we should consider higher-tier plans or strategies like batching requests and caching responses.

Security Challenges for Publicly Accessible LLM Applications

LLM Security

Public-facing LLM applications come with unique security challenges.

Potential Threats

  • Prompt Injections: Malicious users may manipulate the model by injecting harmful or inappropriate prompts.
  • Sensitive Data Exposure: Models may reveal sensitive information if not properly secured.

Mitigation Strategies

Input Validation and Sanitization:

  • Filter out malicious prompts before processing.
  • Monitor and log interactions to detect suspicious activity.

Generation Guardrails:

  • Define content policies to filter inappropriate content.
  • Use tools like toxicity detectors to classify and block harmful outputs.

Output Validation:

  • Re-check generated content for harmful or inappropriate outputs.
  • Use content moderation APIs to ensure ethical and safe responses.

Agents: Automating Complex Workflows

AI agents go beyond generating text — they perform tasks autonomously, enabling complex, multi-step workflows. Let’s explore their key characteristics and how to build them.

Key Characteristics of AI Agents

  • Autonomy: Make decisions independently.
  • Goal-Oriented: Focus on completing tasks.
  • Interactivity: Respond to changes in their environment.
  • Adaptability: Learn and improve over time.

Key Components of an AI Agent

AI Agent

An AI agent consists of several components that work together to perform tasks effectively:

Tools

  • Access to various tools (e.g., Calendar, CodeInterpreter, Website Scraping, APIs, etc.).
  • Enable tasks ranging from simple calculations to complex problem-solving.
  • Dynamically select and use tools as needed.

Memory

  • Short-Term Memory: Stores information for immediate tasks.
  • Long-Term Memory: Retains knowledge over time, enabling learning and adaptation.

Planning

  • Creates strategies for task execution.
  • Breaks down complex tasks into smaller, manageable steps.

Execution and Feedback

  • Executes planned actions using tools and memory.
  • Feedback Loop: Results feed back into planning, allowing dynamic refinement.

LangGraph: A Visual Tool for Agent Workflows

LangGraph is also a component of the LangChain ecosystem that provides a "visual" interface for designing and managing agent workflows. It simplifies the process of connecting different components and orchestrating complex tasks.

Capabilities of LangGraph:

  • Interact with Users: Handle user inputs and provide meaningful responses.
  • Access External Services: Integrate with APIs, databases, or other tools.
  • Perform Tasks Autonomously: Execute multi-step workflows without manual intervention.
  • Building Graphs: Define states, transitions, and actions to guide the agent's behavior. We can also build mulit-agent systems with different forms of communication between agents.

Example: Research and Summarization Agent

This agent performs web research and summarizes the results.

 1from langchain.chat_models import ChatOpenAI
 2from langchain.tools import DuckDuckGoSearchRun
 3from langgraph.graph import StateGraph, END
 4from typing import TypedDict, Annotated, List
 5import operator
 6
 7# Initialize models and tools
 8research_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
 9summary_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)
10search_tool = DuckDuckGoSearchRun()
11
12# Define graph state structure
13class ResearchState(TypedDict):
14    query: str
15    research_results: Annotated[List[str], operator.add]
16    summary: str
17
18graph = StateGraph(ResearchState)
19
20# Define agent functions
21def research_agent(state):
22    query = state["query"]
23    search_result = search_tool.run(query)
24    return {"research_results": [search_result]}
25
26def summarization_agent(state):
27    research_results = state["research_results"]
28    summary_prompt = f"Summarize the following research results:\n{research_results}"
29    summary = summary_model.predict(summary_prompt)
30    return {"summary": summary}
31
32# Build the graph
33graph.add_node("research", research_agent)
34graph.add_node("summarize", summarization_agent)
35graph.set_entry_point("research")
36graph.add_edge("research", "summarize")
37graph.add_edge("summarize", END)
38
39# Compile and run
40app = graph.compile()
41
42def run_research(query):
43    result = app.invoke({"query": query, "research_results": [], "summary": ""})
44    return result["summary"]
45
46# Example usage
47research_topic = "Latest advancements in AI"
48summary = run_research(research_topic)
49print(f"Research Summary on '{research_topic}':\n{summary}")

In this example, we define a research agent that performs a web search on a given topic and summarizes the results. The agent uses LangGraph to create a state machine that guides the workflow through research and summarization steps. The run_research function initiates the agent with a research query and returns the final summary.

This is what the visual representation of this simple agent workflow would look like:

LangGraph Example

Conclusion

In this final part of our blog series, we explored the critical aspects of deploying and managing AI applications for DevOps engineers. From hosting options and performance optimization to monitoring, cost management, and security, we covered the essential considerations for building reliable and efficient AI-powered systems. Additionally, we talked about AI agents, showcasing their ability to automate complex workflows and adapt to dynamic environments.

Key Takeaways from the Blog Series

This blog series explored how AI is transforming DevOps, from addressing challenges to building and deploying advanced AI applications. Let's quickly recap the key takeaways:

Part 1: The Building Blocks of DevOps AI

  • Challenges in DevOps: Manual processes, delayed issue detection, skill gaps, scalability issues, and security vulnerabilities.
  • AI’s Role: Automates tasks, improves collaboration, predicts issues, and enhances security.
  • Generative AI & LLMs: Large Language Models enable content creation, language understanding, and innovative workflows.
  • RAG: Combines LLMs with external knowledge for accurate, context-aware responses.

Part 2: Building AI Applications with LangChain

  • LangChain Framework: Simplifies LLM-powered app development with tools for document loading, vector stores, prompts, chains, and agents.
  • Practical Examples: Chat apps, Retrieval-Augmented Generation (RAG), and interactive PDF chat apps.
  • Ensuring Quality: Testing, evaluation metrics, and tools like LangSmith ensure reliability and performance.

Part 3: Infrastructure, Operations, Security, and Agents

  • Hosting Options: Cloud-based models offer scalability, while self-hosted models provide control and privacy.
  • Performance Optimization: Techniques like hardware acceleration, quantization, and caching improve efficiency.
  • Operational Excellence: Considering monitoring tools, cost management, and API limits ensure smooth operations.
  • Security Best Practices: Guardrails, input validation, and output moderation protect public-facing applications.
  • AI Agents: Autonomous systems that perform complex workflows, leveraging tools, memory, and planning.

Thank you for joining us on this journey! Stay curious and feel free to reach out if you have any questions or need further guidance.

Again, if you're hungry for more details, make sure to check out our video-recordings of our latest AI for Devops Engineers Workshop on YouTube:

Go Back explore our courses

We are here for you

You are interested in our courses or you simply have a question that needs answering? You can contact us at anytime! We will do our best to answer all your questions.

Contact us