
πŸš€ Nano Qwen3 Serving

A high-performance, OpenAI-compatible API server for the nano Qwen3 serving engine, supporting multiple backends (CUDA, MPS, CPU) for efficient local LLM inference.


✨ Features

  β€’ OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/models)
  β€’ Multiple backends: CUDA, MPS (Apple Silicon), and CPU, with auto-detection
  β€’ Streaming responses via server-sent events
  β€’ Block-based memory management for the KV cache
  β€’ Request queuing and scheduling for concurrent load

πŸš€ Quick Start

Prerequisites

  β€’ Python 3.8+
  β€’ pip
  β€’ Optional: an NVIDIA GPU (CUDA) or Apple Silicon (MPS); the CPU backend runs anywhere

Installation

# Clone the repository
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving

# Install dependencies
pip install -r requirements.txt

Start the Service

# Start with auto-detection (recommended)
python tools/start_service.py

# Start with specific device
python tools/start_service.py --device cuda
python tools/start_service.py --device mps
python tools/start_service.py --device cpu

# Start on custom port
python tools/start_service.py --port 8001

# Start with a different model
python tools/start_service.py --model Qwen/Qwen3-1.7B --device auto

Test the API

# Health check
curl -X GET http://127.0.0.1:8000/health

# List available models
curl -X GET http://127.0.0.1:8000/v1/models

# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50
  }'

πŸ“‘ API Endpoints

Chat Completions

POST /v1/chat/completions

Generate chat completions with conversation context.

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

Streaming Chat Completions

Enable real-time token generation:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true,
    "max_tokens": 100
  }'

Legacy Completions

POST /v1/completions

curl -X POST http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "prompt": "The future of artificial intelligence is",
    "max_tokens": 50,
    "temperature": 0.7
  }'

Service Information

GET /health      Service health check
GET /stats       Runtime performance statistics
GET /v1/models   List available models

🐍 Python Client Examples

Basic Usage

import requests

# Chat completion
response = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-0.6b",
        "messages": [
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 50
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

Streaming Usage

import requests
import json

# Streaming chat completion
response = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-0.6b",
        "messages": [
            {"role": "user", "content": "Tell me a story"}
        ],
        "stream": True,
        "max_tokens": 100
    },
    headers={"Accept": "text/event-stream"},
    stream=True
)

for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data_str = line[6:]
            if data_str == '[DONE]':
                break
            try:
                chunk = json.loads(data_str)
                if 'choices' in chunk and chunk['choices']:
                    delta = chunk['choices'][0].get('delta', {})
                    if 'content' in delta:
                        print(delta['content'], end='', flush=True)
            except json.JSONDecodeError:
                continue

OpenAI Client Compatibility

import openai

# Point the client at the local server (requires openai>=1.0)
client = openai.OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="dummy-key",  # not validated by the local server
)

# Use like the OpenAI API
response = client.chat.completions.create(
    model="qwen3-0.6b",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=50,
)

print(response.choices[0].message.content)

πŸ”§ Configuration

Service Options

Option            Default          Description
--host            127.0.0.1        Host to bind to
--port            8000             Port to bind to
--model           Qwen/Qwen3-0.6B  Model name or path
--device          auto             Device (auto, cuda, mps, cpu)
--dtype           float16          Data type
--max-queue-size  1000             Maximum request queue size
--num-blocks      1024             Number of KV-cache memory blocks
--block-size      16               Tokens per memory block
--max-seq-length  4096             Maximum sequence length
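
These options can be combined. For example (the values here are illustrative, not tuned recommendations):

# Serve on all interfaces with an explicit memory configuration
python tools/start_service.py \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen3-0.6B \
    --device auto \
    --dtype float16 \
    --num-blocks 2048 \
    --max-seq-length 4096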

Request Parameters

Parameter          Type          Default   Description
model              string        required  Model name
messages           array         required  Chat messages
max_tokens         integer       100       Maximum tokens to generate
temperature        float         1.0       Sampling temperature (0-2)
top_p              float         1.0       Top-p sampling (0-1)
stream             boolean       false     Enable streaming
stop               string/array  null      Stop sequences
presence_penalty   float         0.0       Presence penalty (-2 to 2)
frequency_penalty  float         0.0       Frequency penalty (-2 to 2)
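
For example, a request that exercises the sampling controls (parameter values are illustrative):

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-0.6b",
    "messages": [
      {"role": "user", "content": "Name three uses of Python."}
    ],
    "max_tokens": 80,
    "temperature": 0.7,
    "top_p": 0.9,
    "stop": ["\n\n"],
    "presence_penalty": 0.5
  }'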

πŸ“Š Performance

Benchmarks (Apple M2 Pro)

Model       Tokens/sec  Memory usage  Latency
Qwen3-0.6B  ~25         ~2 GB         ~50 ms
Qwen3-1.5B  ~15         ~4 GB         ~80 ms
Qwen3-3B    ~8          ~8 GB         ~120 ms

Memory Management

The service uses block-based memory management: the KV cache is carved into fixed-size blocks (see --num-blocks and --block-size above) that are allocated to sequences on demand and returned to a free pool when a request finishes.
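
The idea can be sketched in a few lines of Python (a toy illustration, not the project's actual BlockManager):

# Toy sketch of block-based KV-cache allocation (illustrative only,
# not the project's actual BlockManager).
class ToyBlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of free block IDs
        self.allocated = {}                         # seq_id -> list of block IDs

    def allocate(self, seq_id, num_tokens):
        """Reserve enough blocks to hold num_tokens KV entries."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV-cache pool exhausted")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.allocated[seq_id] = blocks
        return blocks

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.allocated.pop(seq_id, []))


manager = ToyBlockManager(num_blocks=1024, block_size=16)
print(manager.allocate("req-1", num_tokens=50))  # 50 tokens -> 4 blocks of 16
manager.free("req-1")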

πŸ› οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                            API Layer                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  FastAPI Server    β”‚  OpenAI Service    β”‚  AsyncLLM             β”‚
β”‚  (HTTP/WebSocket)  β”‚  (Request/Resp)    β”‚  (Async Interface)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           Core Layer                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  LLM (High-level) β”‚  LLMEngine (Orchestrator) β”‚  AsyncLLMEngine β”‚
β”‚  (User Interface) β”‚  (Request Management)     β”‚  (Async Engine) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         Execution Layer                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚  ModelRunner  β”‚  DeviceManager   β”‚  Scheduler  β”‚  BlockManager  β”‚
β”‚  (Inference)  β”‚  (CUDA/MPS/CPU)  β”‚  (Queuing)  β”‚  (Memory)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

API Layer

  β€’ FastAPI Server: HTTP/WebSocket entry point
  β€’ OpenAI Service: OpenAI-compatible request/response handling
  β€’ AsyncLLM: async interface to the engine

Core Layer

  β€’ LLM: high-level user-facing interface
  β€’ LLMEngine: orchestrator and request management
  β€’ AsyncLLMEngine: async variant of the engine

Execution Layer

  β€’ ModelRunner: model inference
  β€’ DeviceManager: CUDA/MPS/CPU device selection and management
  β€’ Scheduler: request queuing
  β€’ BlockManager: block-based KV-cache memory

πŸ“š Examples

See the examples/ directory for comprehensive usage examples.

πŸ“– Documentation

πŸ“– Full Documentation: https://nano-qwen3-serving.readthedocs.io/

🌐 Alternative: GitHub Pages

πŸš€ Deployment

Local Development

# Install in development mode
pip install -e .

# Run with auto-reload
python tools/start_service.py --reload

Production Deployment

# Run with multiple workers
python tools/start_service.py --workers 4 --host 0.0.0.0

# Behind reverse proxy (nginx)
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    # Disable buffering so streamed (SSE) tokens reach clients immediately
    proxy_buffering off;
}

Docker Deployment

FROM python:3.12-slim

WORKDIR /app
COPY . .

RUN pip install -r requirements.txt

EXPOSE 8000

CMD ["python", "tools/start_service.py", "--host", "0.0.0.0"]
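
Build and run the image (the tag name is arbitrary):

docker build -t nano-qwen3-serving .
docker run -p 8000:8000 nano-qwen3-serving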

πŸ› οΈ Troubleshooting

Common Issues

  1. Service won't start
    β€’ Check if the port is already in use (see the snippet after this list)
    β€’ Verify the model path is correct
    β€’ Ensure sufficient memory
  2. Slow responses
    β€’ Check the device (MPS recommended for Apple Silicon)
    β€’ Monitor memory usage
    β€’ Adjust batch size and queue settings
  3. Streaming issues
    β€’ Ensure the Accept: text/event-stream header is set
    β€’ Check for network timeouts
    β€’ Verify the client handles streaming properly

Debug Mode

# Enable debug logging
python tools/start_service.py --log-level debug

# Check service status
curl -X GET http://127.0.0.1:8000/health

# Monitor performance
curl -X GET http://127.0.0.1:8000/stats
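
A minimal watcher for the stats endpoint; it prints the raw JSON, since the exact /stats schema is server-defined (Ctrl+C to stop):

import requests
import time

# Print the raw /stats payload every 5 seconds
while True:
    print(requests.get("http://127.0.0.1:8000/stats").json())
    time.sleep(5)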

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Clone and setup
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving

# Install development dependencies
pip install -r requirements.txt
pip install -e .

# Run tests
python -m pytest tests/

# Run linting
python -m flake8 nano_qwen3_serving/

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

πŸ“ž Support



Made with ❀️ for the AI community