Expanding the Azure AI Model Inference API: Integrating Azure AI into Your Development Environment (2024)

We believe that building AI applications should be intuitive, flexible, and integrated. To simplify the AI development lifecycle, we are thrilled to announce the latest updates to the Azure AI model inference API, enhancing its capabilities and bringing a code-first experience with the power of Azure AI directly into the environments and tools developers prefer. The Azure AI model inference API gives developers a single, consistent API and endpoint URL for interacting with a wide variety of foundation models, including those from Azure OpenAI Service, Mistral, Meta, Cohere, and Microsoft Research.

This announcement focuses on integrating model access via GitHub, introducing inference SDKs, expanding support for self-hosted models, integrating with Azure API Management, and enhancing retrieval augmented generation (RAG) systems with LlamaIndex.

Key features in this announcement:

1. Single Common API for Model Access via GitHub: With the GitHub Models announcement last week, developers can now access models from the Azure AI model catalog directly through GitHub using a single API, and can experiment with different AI models in the playground with just their GitHub account. This integration allows for seamless interaction with various models, including GPT-4o, Llama 3, Mistral Large, Command R+, and Phi-3. By unifying model access, we simplify the AI development process and enable more efficient workflows.

The integration with GitHub is particularly significant for application developers. GitHub is a central hub for coding, collaboration, and version control. By bringing Azure AI capabilities directly into GitHub, we ensure that developers can remain in their preferred development environment to experiment with multiple models, reducing context switching and streamlining the AI development process.
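As a sketch of what this looks like in practice, the snippet below calls GitHub Models through the azure-ai-inference Python package. The model names in the comment are examples from the catalog; the use of a GitHub personal access token in the `GITHUB_TOKEN` environment variable is an assumption for illustration.

```python
import os

# GitHub Models exposes the Azure AI model inference API at this endpoint,
# so the same client and call shape works across every model in the catalog.
GITHUB_MODELS_ENDPOINT = "https://models.inference.ai.azure.com"


def ask(model: str, question: str) -> str:
    """Send one chat turn to the named model and return the reply text."""
    from azure.ai.inference import ChatCompletionsClient
    from azure.ai.inference.models import SystemMessage, UserMessage
    from azure.core.credentials import AzureKeyCredential

    client = ChatCompletionsClient(
        endpoint=GITHUB_MODELS_ENDPOINT,
        credential=AzureKeyCredential(os.environ["GITHUB_TOKEN"]),
    )
    response = client.complete(
        model=model,  # e.g. "gpt-4o", "Mistral-large", "Meta-Llama-3-70B-Instruct"
        messages=[
            SystemMessage(content="You are a concise assistant."),
            UserMessage(content=question),
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__" and "GITHUB_TOKEN" in os.environ:
    print(ask("gpt-4o", "What is the Azure AI model inference API?"))
```

Because every model sits behind the same endpoint, comparing models is a matter of changing the `model` argument rather than rewriting client code.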


2. Inference SDKs for Multiple Languages: To support diverse development environments, we are introducing inference SDKs for Python, JavaScript, and C#. These SDKs allow developers to effortlessly integrate AI models into their applications using inference clients in the language of their choice, making it easier to build and scale AI solutions across different platforms. These SDKs can be integrated with LLM App development tools such as prompt flow, LangChain, and Semantic Kernel.


Inference SDKs for C#, JavaScript, and Python
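For instance, the Python SDK also includes an embeddings client alongside the chat client, with analogous clients in JavaScript and C#. The sketch below assumes the azure-ai-inference package is installed; the endpoint and key are placeholders supplied by the caller.

```python
def embed_texts(endpoint: str, key: str, texts: list[str]) -> list[list[float]]:
    """Return one embedding vector per input string, using the
    Azure AI inference embeddings client."""
    from azure.ai.inference import EmbeddingsClient
    from azure.core.credentials import AzureKeyCredential

    client = EmbeddingsClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    result = client.embed(input=texts)
    # Each item in result.data corresponds to one input string, in order.
    return [item.embedding for item in result.data]
```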

3. Model inference API Expansion to Managed Compute: At Microsoft Build 2024, we introduced the Azure AI model inference API for models deployed as serverless API endpoints and Azure OpenAI Service, enabling developers to consume predictions from a diverse set of models in a consistent way and easily switch between models to compare performance.

The model inference API now includes inference support for open source models deployed to our self-hosted managed online endpoints, providing enhanced flexibility and control over model deployment and inferencing. This feature allows you to leverage the full potential of open models such as Mistral, Llama 3, and Phi-3, tailoring deployments to specific use cases and optimizing both performance and cost-efficiency.
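Because serverless API endpoints and managed online endpoints speak the same inference API, switching between them can be a one-line change. The sketch below uses placeholder endpoint URLs and deployment names invented for illustration.

```python
def make_client(endpoint: str, key: str):
    """Build a chat client for any endpoint that speaks the
    Azure AI model inference API."""
    from azure.ai.inference import ChatCompletionsClient
    from azure.core.credentials import AzureKeyCredential

    return ChatCompletionsClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )


# A serverless API deployment and a self-hosted managed online endpoint,
# consumed through the identical client -- URLs below are placeholders.
DEPLOYMENTS = {
    "serverless-mistral": "https://my-mistral-large.eastus2.models.ai.azure.com",
    "managed-phi3": "https://my-phi3-endpoint.eastus2.inference.ml.azure.com",
}
```

Swapping `DEPLOYMENTS["serverless-mistral"]` for `DEPLOYMENTS["managed-phi3"]` changes where inference runs without touching the rest of the application.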


4. Integration with Azure API Management: We are also expanding the GenAI Gateway capabilities in Azure API Management to support a wider range of large language models through the Azure AI model inference API, in addition to the existing support for Azure OpenAI Service. New policies, such as the LLM Token Limit Policy, LLM Emit Token Metric Policy, and LLM Semantic Caching Policy, provide detailed insights and control over token resources, ensuring efficient and cost-effective use of models. These policies allow for real-time monitoring, cost management, and improved efficiency by caching responses based on the semantic content of prompts. Read this blog to learn more about the integration.
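A minimal sketch of what an inbound policy fragment using the token-limit and token-metric policies named above might look like; the attribute values and metric namespace are illustrative, not a definitive configuration.

```xml
<policies>
    <inbound>
        <base />
        <!-- Cap token consumption per subscription. -->
        <llm-token-limit counter-key="@(context.Subscription.Id)"
                         tokens-per-minute="5000"
                         estimate-prompt-tokens="true"
                         remaining-tokens-header-name="x-remaining-tokens" />
        <!-- Emit token usage as a custom metric for monitoring and cost tracking. -->
        <llm-emit-token-metric namespace="llm-usage">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
        </llm-emit-token-metric>
    </inbound>
</policies>
```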

5. Take RAG to the next level with LlamaIndex: Lastly, we are happy to announce the integration of the Azure AI model inference API into the LlamaIndex ecosystem. Now developers can elevate their RAG systems built with LlamaIndex by leveraging the extensive power of the Azure AI model catalog.

Two new packages have been introduced to the LlamaIndex ecosystem:

  • llama-index-embeddings-azure-inference
  • llama-index-llms-azure-inference

These packages enable the seamless incorporation of Azure AI models into the LlamaIndex framework, allowing users to select the optimal model for each task.
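A minimal sketch of wiring the two packages into LlamaIndex's global settings, assuming both packages are installed and that the endpoint and key environment variable names below are supplied by the caller.

```python
import os


def configure_llamaindex(endpoint: str, key: str) -> None:
    """Point LlamaIndex at Azure AI inference endpoints for both
    query answering (LLM) and document indexing (embeddings)."""
    from azure.core.credentials import AzureKeyCredential
    from llama_index.core import Settings
    from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel
    from llama_index.llms.azure_inference import AzureAICompletionsModel

    Settings.llm = AzureAICompletionsModel(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    Settings.embed_model = AzureAIEmbeddingsModel(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )


# Environment variable names here are placeholders for illustration.
if "AZURE_INFERENCE_ENDPOINT" in os.environ:
    configure_llamaindex(
        os.environ["AZURE_INFERENCE_ENDPOINT"],
        os.environ["AZURE_INFERENCE_CREDENTIAL"],
    )
```

Once configured, any LlamaIndex index or query engine built afterwards uses the selected Azure AI models, so the LLM and the embedding model can be chosen independently per task.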

Getting Started: To begin using the Azure AI model inference API, visit our documentation page for detailed instructions and examples. Whether you're building chatbots, data analytics tools, or sophisticated AI-driven applications, the Azure AI model inference API and SDKs provide the foundation you need to succeed.

While this announcement focuses on the GitHub integration and inference API/SDKs, we are working on bringing additional features in subsequent phases. Stay tuned for more updates as we continue to enhance the Azure AI model inference API to meet all your development needs.
