# Jina-Serve

Jina-serve is a framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets. Scale your services from local development to production while focusing on your core logic.
## Key Features

- Native support for all major ML frameworks and data types
- High-performance service design with scaling, streaming, and dynamic batching
- LLM serving with streaming output
- Built-in Docker integration and Executor Hub
- One-click deployment to Jina AI Cloud
- Enterprise-ready with Kubernetes and Docker Compose support
## Comparison with FastAPI

Key advantages over FastAPI:

- DocArray-based data handling with native gRPC support
- Built-in containerization and service orchestration
## Install
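A typical installation from PyPI (assuming the `jina` package):

```bash
pip install jina
```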
See guides for Apple Silicon and Windows.
## Core Concepts

Three main layers:

- **Data**: `BaseDoc` and `DocList` define the inputs and outputs of your services
- **Serving**: `Executor`s contain the Python logic that processes Documents
- **Orchestration**: `Deployment`s serve individual Executors, and `Flow`s chain them into pipelines
## Build AI Services
Let’s create a gRPC-based AI service using StableLM:
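A minimal sketch of such a service, assuming the Hugging Face `transformers` text-generation pipeline and the `stabilityai/stablelm-base-alpha-3b` checkpoint (schema and class names are illustrative):

```python
from docarray import BaseDoc, DocList
from transformers import pipeline

from jina import Executor, requests


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Load the model once at startup so every request reuses it
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        # Process the whole batch of incoming Documents in one call
        generations = DocList[Generation]()
        for doc, output in zip(docs, self.generator(list(docs.text))):
            generations.append(
                Generation(prompt=doc.text, text=output[0]['generated_text'])
            )
        return generations
```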
Deploy with Python or YAML:
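For example, in Python, or as an equivalent YAML definition (port, module path and `timeout_ready` value are placeholders):

```python
from jina import Deployment

# Serve the StableLM Executor over gRPC (the default protocol)
with Deployment(uses=StableLM, port=12345, timeout_ready=-1) as dep:
    dep.block()
```

```yaml
# deployment.yml (illustrative)
jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py  # illustrative path to the module defining StableLM
  port: 12345
  timeout_ready: -1
```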
Use the client:
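A client call could then look like this, reusing the `Prompt` and `Generation` schemas from above (port and prompt text are illustrative):

```python
from docarray import DocList

from jina import Client

client = Client(port=12345)  # gRPC is the default protocol
response = client.post(
    on='/',
    inputs=[Prompt(text='suggest an interesting image generation prompt')],
    return_type=DocList[Generation],
)
print(response[0].text)
```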
## Build Pipelines
Chain services into a Flow:
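A sketch of a two-step pipeline; `TextToImage` stands in for a second Executor, such as the one containerized below:

```python
from jina import Flow

# Each .add() appends a service; Documents pass through them in order
flow = Flow(port=12345).add(uses=StableLM).add(uses=TextToImage)

with flow:
    flow.block()
```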
## Scaling and Deployment

### Local Scaling

Boost throughput with built-in features: replicas for parallel processing, sharding for data partitioning, and dynamic batching for efficient model inference.
Example of scaling a Stable Diffusion deployment:
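One possible configuration, assuming a `TextToImage` Executor is importable; the replica count and batching parameters are illustrative:

```python
from jina import Deployment
from executor import TextToImage  # illustrative import; any Executor class works here

dep = Deployment(
    uses=TextToImage,
    port=12345,
    replicas=2,  # serve two copies of the Executor in parallel
    uses_dynamic_batching={'/default': {'preferred_batch_size': 10, 'timeout': 200}},
)

with dep:
    dep.block()
```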
### Cloud Deployment

#### Containerize Services
Structure your Executor:
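A typical layout might look like this (file names are illustrative):

```text
TextToImage/
├── executor.py        # defines the TextToImage Executor
├── config.yml         # Executor configuration (see below)
└── requirements.txt   # Python dependencies
```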
Configure:

```yaml
# config.yml
jtype: TextToImage
py_modules:
  # add the path of the module that defines the TextToImage Executor
```
#### Deploy to Kubernetes
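One way to do this is to export the Flow (here assumed to live in `flow.yml`) to Kubernetes manifests and apply them with `kubectl`:

```bash
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
```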
#### Use Docker Compose
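Similarly, the same Flow can be exported to a Compose file and started with Docker Compose:

```bash
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
```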
#### JCloud Deployment
Deploy with a single command:
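Assuming the Flow definition lives in `jcloud-flow.yml`, something like:

```bash
jina cloud deploy jcloud-flow.yml
```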
## LLM Streaming
Enable token-by-token streaming for responsive LLM applications:
First, define the request and output schemas:

```python
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
```
Implement streaming:
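A sketch of a streaming Executor, using GPT-2 from Hugging Face `transformers` as a stand-in model; it generates one token per step and yields each partial result (the `/stream` endpoint name matches the client call below):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

from jina import Executor, requests


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')

    @requests(on='/stream')
    async def stream(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        # Generate one token at a time and yield every partial result so the
        # client receives text as soon as it is produced.
        encoded = self.tokenizer(doc.prompt, return_tensors='pt')
        input_len = encoded['input_ids'].shape[1]
        for _ in range(doc.max_tokens):
            output = self.model.generate(**encoded, max_new_tokens=1)
            if output[0][-1] == self.tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=int(output[0][-1]),
                generated_text=self.tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            # Feed the extended sequence back in for the next step
            encoded = {
                'input_ids': output,
                'attention_mask': torch.ones(1, output.shape[1]),
            }
```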
Serve and use:

```python
# Server
with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()


# Client
async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)
```

## Support

Jina-serve is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE).