
πŸš€ Nano Qwen3 Serving

A high-performance, OpenAI-compatible API server for the nano Qwen3 serving engine, supporting multiple backends (CUDA, MPS, CPU) for efficient local LLM inference.


✨ Features

  β€’ OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/models)
  β€’ Multiple backends: CUDA, MPS (Apple Silicon), and CPU, with auto-detection
  β€’ Streaming responses via server-sent events
  β€’ Block-based memory management for the KV cache
  β€’ Request queuing and scheduling for concurrent load

πŸš€ Quick Start

Prerequisites

  β€’ Python 3.8+
  β€’ pip
  β€’ Optional: an NVIDIA GPU (CUDA) or Apple Silicon (MPS); the CPU backend runs anywhere

Installation

# Clone the repository
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving

# Install dependencies
pip install -r requirements.txt

Start the Service

# Start with auto-detection (recommended)
python tools/start_service.py

# Start with specific device
python tools/start_service.py --device cuda
python tools/start_service.py --device mps
python tools/start_service.py --device cpu

# Start on custom port
python tools/start_service.py --port 8001

# Start with a different model
python tools/start_service.py --model Qwen/Qwen3-1.7B --device auto

Test the API

# Health check
curl -X GET http://127.0.0.1:8000/health

# List available models
curl -X GET http://127.0.0.1:8000/v1/models

# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50
  }'

πŸ“‘ API Endpoints

Chat Completions

POST /v1/chat/completions

Generate chat completions with conversation context.

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

Streaming Chat Completions

Enable real-time token generation:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true,
    "max_tokens": 100
  }'

Legacy Completions

POST /v1/completions

curl -X POST http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "prompt": "The future of artificial intelligence is",
    "max_tokens": 50,
    "temperature": 0.7
  }'

Service Information

GET /health      Service health check
GET /stats       Runtime performance statistics
GET /v1/models   List available models

🐍 Python Client Examples

Basic Usage

import requests

# Chat completion
response = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-0.6b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 50
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

Streaming Usage

import requests
import json

# Streaming chat completion
response = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-0.6b",
        "messages": [
            {"role": "user", "content": "Tell me a story"}
        ],
        "stream": True,
        "max_tokens": 100
    },
    headers={"Accept": "text/event-stream"},
    stream=True
)

for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data_str = line[6:]
            if data_str == '[DONE]':
                break
            try:
                chunk = json.loads(data_str)
                if 'choices' in chunk and chunk['choices']:
                    delta = chunk['choices'][0].get('delta', {})
                    if 'content' in delta:
                        print(delta['content'], end='', flush=True)
            except json.JSONDecodeError:
                continue

OpenAI Client Compatibility

import openai

# Point the client at the local server (requires openai>=1.0)
client = openai.OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="dummy-key",  # not validated by the local server
)

# Use like the OpenAI API
response = client.chat.completions.create(
    model="qwen3-0.6b",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=50,
)

print(response.choices[0].message.content)

πŸ”§ Configuration

Service Options

Option            Default          Description
--host            127.0.0.1        Host to bind to
--port            8000             Port to bind to
--model           Qwen/Qwen3-0.6B  Model name or path
--device          auto             Device (auto, cuda, mps, cpu)
--dtype           float16          Data type
--max-queue-size  1000             Maximum request queue size
--num-blocks      1024             Number of KV-cache memory blocks
--block-size      16               Tokens per memory block
--max-seq-length  4096             Maximum sequence length
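
These options can be combined. For example (the values here are illustrative, not tuned recommendations):

# Serve on all interfaces with an explicit memory configuration
python tools/start_service.py \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen3-0.6B \
    --device auto \
    --dtype float16 \
    --num-blocks 2048 \
    --max-seq-length 4096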

Request Parameters

Parameter          Type          Default   Description
model              string        required  Model name
messages           array         required  Chat messages
max_tokens         integer       100       Maximum tokens to generate
temperature        float         1.0       Sampling temperature (0-2)
top_p              float         1.0       Top-p sampling (0-1)
stream             boolean       false     Enable streaming
stop               string/array  null      Stop sequences
presence_penalty   float         0.0       Presence penalty (-2 to 2)
frequency_penalty  float         0.0       Frequency penalty (-2 to 2)
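
For example, a request that exercises the sampling controls (parameter values are illustrative):

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "user", "content": "Name three uses of Python."}
    ],
    "max_tokens": 80,
    "temperature": 0.7,
    "top_p": 0.9,
    "stop": ["\n\n"],
    "presence_penalty": 0.5
  }'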

πŸ“Š Performance

Benchmarks (Apple M2 Pro)

Model       Tokens/sec  Memory usage  Latency
Qwen3-0.6B  ~25         ~2 GB         ~50 ms
Qwen3-1.5B  ~15         ~4 GB         ~80 ms
Qwen3-3B    ~8          ~8 GB         ~120 ms

Memory Management

The service uses block-based memory management: the KV cache is carved into fixed-size blocks (see --num-blocks and --block-size above) that are allocated to sequences on demand and returned to a free pool when a request finishes.
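
The idea can be sketched in a few lines of Python (a toy illustration, not the project's actual BlockManager):

# Toy sketch of block-based KV-cache allocation (illustrative only,
# not the project's actual BlockManager).
class ToyBlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of free block IDs
        self.allocated = {}                         # seq_id -> list of block IDs

    def allocate(self, seq_id, num_tokens):
        """Reserve enough blocks to hold num_tokens KV entries."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV-cache pool exhausted")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.allocated[seq_id] = blocks
        return blocks

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.allocated.pop(seq_id, []))


manager = ToyBlockManager(num_blocks=1024, block_size=16)
print(manager.allocate("req-1", num_tokens=50))  # 50 tokens -> 4 blocks of 16
manager.free("req-1")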

πŸ› οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                            API Layer                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  FastAPI Server    β”‚  OpenAI Service    β”‚  AsyncLLM             β”‚
β”‚  (HTTP/WebSocket)  β”‚  (Request/Resp)    β”‚  (Async Interface)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           Core Layer                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  LLM (High-level) β”‚  LLMEngine (Orchestrator) β”‚  AsyncLLMEngine β”‚
β”‚  (User Interface) β”‚  (Request Management)     β”‚  (Async Engine) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         Execution Layer                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  ModelRunner  β”‚  DeviceManager   β”‚  Scheduler  β”‚  BlockManager  β”‚
β”‚  (Inference)  β”‚  (CUDA/MPS/CPU)  β”‚  (Queuing)  β”‚  (Memory)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

API Layer

  β€’ FastAPI Server: HTTP/WebSocket entry point
  β€’ OpenAI Service: OpenAI-compatible request/response handling
  β€’ AsyncLLM: async interface to the engine

Core Layer

  β€’ LLM: high-level user-facing interface
  β€’ LLMEngine: orchestrator and request management
  β€’ AsyncLLMEngine: async variant of the engine

Execution Layer

  β€’ ModelRunner: model inference
  β€’ DeviceManager: CUDA/MPS/CPU device selection and management
  β€’ Scheduler: request queuing
  β€’ BlockManager: block-based KV-cache memory

πŸ“š Examples

See the examples/ directory for comprehensive usage examples.

πŸ“– Documentation

πŸ“– Full Documentation: https://nano-qwen3-serving.readthedocs.io/

🌐 Alternative: GitHub Pages

πŸš€ Deployment

Local Development

# Install in development mode
pip install -e .

# Run with auto-reload
python tools/start_service.py --reload

Production Deployment

# Run with multiple workers
python tools/start_service.py --workers 4 --host 0.0.0.0

# Behind reverse proxy (nginx)
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    # Disable buffering so streamed (SSE) tokens reach clients immediately
    proxy_buffering off;
}

Docker Deployment

FROM python:3.12-slim

WORKDIR /app
COPY . .

RUN pip install -r requirements.txt

EXPOSE 8000

CMD ["python", "tools/start_service.py", "--host", "0.0.0.0"]
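
Build and run the image (the tag name is arbitrary):

docker build -t nano-qwen3-serving .
docker run -p 8000:8000 nano-qwen3-serving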

πŸ› οΈ Troubleshooting

Common Issues

  1. Service won't start
    β€’ Check if the port is already in use (see the snippet after this list)
    β€’ Verify the model path is correct
    β€’ Ensure sufficient memory
  2. Slow responses
    β€’ Check the device (MPS recommended for Apple Silicon)
    β€’ Monitor memory usage
    β€’ Adjust batch size and queue settings
  3. Streaming issues
    β€’ Ensure the Accept: text/event-stream header is set
    β€’ Check for network timeouts
    β€’ Verify the client handles streaming properly

Debug Mode

# Enable debug logging
python tools/start_service.py --log-level debug

# Check service status
curl -X GET http://127.0.0.1:8000/health

# Monitor performance
curl -X GET http://127.0.0.1:8000/stats
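
A minimal watcher for the stats endpoint; it prints the raw JSON, since the exact /stats schema is server-defined (Ctrl+C to stop):

import requests
import time

# Print the raw /stats payload every 5 seconds
while True:
    print(requests.get("http://127.0.0.1:8000/stats").json())
    time.sleep(5)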

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Clone and setup
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving

# Install development dependencies
pip install -r requirements.txt
pip install -e .

# Run tests
python -m pytest tests/

# Run linting
python -m flake8 nano_qwen3_serving/

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

πŸ“ž Support



Made with ❀️ for the AI community