A high-performance, OpenAI-compatible API server for the nano Qwen3 serving engine, supporting multiple backends (CUDA, MPS, CPU) for efficient local LLM inference.
# Clone the repository
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving
# Install dependencies
pip install -r requirements.txt
# Start with auto-detection (recommended)
python tools/start_service.py
# Start with specific device
python tools/start_service.py --device cuda
python tools/start_service.py --device mps
python tools/start_service.py --device cpu
# Start on custom port
python tools/start_service.py --port 8001
# Start with different model
python tools/start_service.py --model Qwen/Qwen3-1.5B --device auto
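The auto setting picks the best available backend at startup. As a rough illustration only (this is not the engine's actual detection code), the selection boils down to something like:

import torch

def pick_device() -> str:
    """Illustrative auto-detection: prefer CUDA, then Apple MPS, else CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())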
# Health check
curl -X GET http://127.0.0.1:8000/health
# List available models
curl -X GET http://127.0.0.1:8000/v1/models
# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"max_tokens": 50
}'
POST /v1/chat/completions
Generate chat completions with conversation context.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-0.6b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
"max_tokens": 100,
"temperature": 0.7,
"stream": false
}'
Enable real-time token generation:
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"stream": true,
"max_tokens": 100
}'
POST /v1/completions
Generate text completions from a raw prompt.
curl -X POST http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-0.6b",
"prompt": "The future of artificial intelligence is",
"max_tokens": 50,
"temperature": 0.7
}'
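The same completion request from Python using requests; a minimal sketch that assumes the OpenAI-style response shape (choices[0]["text"]), since the server is OpenAI-compatible:

import requests

# Text completion against the local /v1/completions endpoint
response = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "qwen3-0.6b",
        "prompt": "The future of artificial intelligence is",
        "max_tokens": 50,
        "temperature": 0.7,
    },
)
result = response.json()
print(result["choices"][0]["text"])  # assumes OpenAI-style completions schema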
- /v1/models - List available models
- /health - Health check
- /stats - Performance statistics
import requests
# Chat completion
response = requests.post(
"http://127.0.0.1:8000/v1/chat/completions",
json={
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 50
}
)
result = response.json()
print(result["choices"][0]["message"]["content"])
import requests
import json
# Streaming chat completion
response = requests.post(
"http://127.0.0.1:8000/v1/chat/completions",
json={
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"stream": True,
"max_tokens": 100
},
headers={"Accept": "text/event-stream"},
stream=True
)
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data_str = line[6:]
if data_str == '[DONE]':
break
try:
chunk = json.loads(data_str)
if 'choices' in chunk and chunk['choices']:
delta = chunk['choices'][0].get('delta', {})
if 'content' in delta:
print(delta['content'], end='', flush=True)
except json.JSONDecodeError:
continue
import openai  # legacy module-level API (openai<1.0)
# Configure to use local server
openai.api_base = "http://127.0.0.1:8000/v1"
openai.api_key = "dummy-key" # Not used by local server
# Use like OpenAI API
response = openai.ChatCompletion.create(
model="qwen3-0.6b",
messages=[
{"role": "user", "content": "Hello!"}
],
max_tokens=50
)
print(response.choices[0].message.content)
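The snippet above uses the legacy module-level API from openai<1.0. With openai>=1.0 the configuration moves to a client object; a minimal sketch against the same local endpoint:

from openai import OpenAI

# Point the client at the local server; the key is required by the client but ignored here
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy-key")

response = client.chat.completions.create(
    model="qwen3-0.6b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)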
| Option | Default | Description |
|---|---|---|
| --host | 127.0.0.1 | Host to bind to |
| --port | 8000 | Port to bind to |
| --model | Qwen/Qwen3-0.6B | Model name or path |
| --device | mps | Device to use (auto, cuda, mps, cpu) |
| --dtype | float16 | Data type |
| --max-queue-size | 1000 | Maximum request queue size |
| --num-blocks | 1024 | Number of memory blocks |
| --block-size | 16 | Block size |
| --max-seq-length | 4096 | Maximum sequence length |
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model name |
| messages | array | required | Chat messages |
| max_tokens | integer | 100 | Maximum tokens to generate |
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | 1.0 | Top-p sampling (0-1) |
| stream | boolean | false | Enable streaming |
| stop | string/array | null | Stop sequences |
| presence_penalty | float | 0.0 | Presence penalty (-2 to 2) |
| frequency_penalty | float | 0.0 | Frequency penalty (-2 to 2) |
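Combining several of the optional parameters in one request (the values here are illustrative, not recommendations):

import requests

# Chat completion exercising the optional sampling and penalty parameters
response = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-0.6b",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "List three uses of machine learning."},
        ],
        "max_tokens": 120,
        "temperature": 0.7,
        "top_p": 0.9,
        "stop": ["\n\n"],
        "presence_penalty": 0.2,
        "frequency_penalty": 0.2,
    },
)
print(response.json()["choices"][0]["message"]["content"])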
| Model | Tokens/sec | Memory Usage | Latency |
|---|---|---|---|
| Qwen3-0.6B | ~25 | ~2GB | ~50ms |
| Qwen3-1.5B | ~15 | ~4GB | ~80ms |
| Qwen3-3B | ~8 | ~8GB | ~120ms |
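These figures vary with hardware and settings. To measure throughput on your own machine, time a request and divide by the number of generated tokens; this sketch assumes the response includes an OpenAI-style usage block with completion_tokens:

import time
import requests

start = time.time()
response = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen3-0.6b",
        "messages": [{"role": "user", "content": "Explain KV caching in two paragraphs."}],
        "max_tokens": 200,
    },
)
elapsed = time.time() - start

# usage.completion_tokens is assumed to be reported, as in the OpenAI response format
tokens = response.json().get("usage", {}).get("completion_tokens", 0)
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/sec")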
The service uses efficient block-based memory management. The overall architecture is organized in three layers:
┌────────────────────────────────────────────────────────────────────┐
│                             API Layer                              │
├────────────────────────────────────────────────────────────────────┤
│  FastAPI Server    │  OpenAI Service    │  AsyncLLM                │
│  (HTTP/WebSocket)  │  (Request/Resp)    │  (Async Interface)       │
└────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                             Core Layer                             │
├────────────────────────────────────────────────────────────────────┤
│  LLM (High-level)  │  LLMEngine (Orchestrator)  │  AsyncLLMEngine  │
│  (User Interface)  │  (Request Management)      │  (Async Engine)  │
└────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                          Execution Layer                           │
├────────────────────────────────────────────────────────────────────┤
│  ModelRunner  │  DeviceManager  │  Scheduler  │  BlockManager      │
│  (Inference)  │  (CUDA/MPS/CPU) │  (Queuing)  │  (Memory)          │
└────────────────────────────────────────────────────────────────────┘
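As back-of-the-envelope arithmetic (not the BlockManager's actual code), with the default --num-blocks 1024 and --block-size 16, and assuming block size is counted in tokens, the KV cache holds 1024 * 16 = 16384 tokens in total, and a 4096-token sequence occupies ceil(4096 / 16) = 256 blocks:

import math

num_blocks = 1024  # --num-blocks default
block_size = 16    # --block-size default (assumed to be tokens per block)

total_capacity = num_blocks * block_size  # 16384 tokens across all active sequences

def blocks_needed(seq_len: int) -> int:
    """Blocks required to hold the KV cache of one sequence."""
    return math.ceil(seq_len / block_size)

print(total_capacity, blocks_needed(4096))  # 16384, 256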
OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions

See the examples/ directory for comprehensive usage examples:
- examples/openai_client_examples.py - Complete client examples
- examples/basic_usage.py - Basic usage patterns
- examples/streaming_example.py - Streaming examples

Full Documentation: https://nano-qwen3-serving.readthedocs.io/
Alternative: GitHub Pages
# Install in development mode
pip install -e .
# Run with auto-reload
python tools/start_service.py --reload
# Run with multiple workers
python tools/start_service.py --workers 4 --host 0.0.0.0
# Behind reverse proxy (nginx)
location / {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8000
CMD ["python", "tools/start_service.py", "--host", "0.0.0.0"]
Streaming requests should include the Accept: text/event-stream header.
# Enable debug logging
python tools/start_service.py --log-level debug
# Check service status
curl -X GET http://127.0.0.1:8000/health
# Monitor performance
curl -X GET http://127.0.0.1:8000/stats
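The same checks from Python, as a simple monitoring loop; the exact shape of the /health and /stats payloads depends on the server, so this just prints whatever JSON comes back:

import time
import requests

BASE = "http://127.0.0.1:8000"

while True:
    print("health:", requests.get(f"{BASE}/health").json())
    print("stats:", requests.get(f"{BASE}/stats").json())
    time.sleep(30)  # poll every 30 seconds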
We welcome contributions! Please see our Contributing Guide for details.
1. Create a feature branch (git checkout -b feature/amazing-feature)
2. Commit your changes (git commit -m 'Add amazing feature')
3. Push to the branch (git push origin feature/amazing-feature)
# Clone and setup
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving
# Install development dependencies
pip install -r requirements.txt
pip install -e .
# Run tests
python -m pytest tests/
# Run linting
python -m flake8 nano_qwen3_serving/
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ for the AI community