Get up and running with Nano Qwen3 Serving in under 5 minutes! This guide will walk you through installation, basic setup, and your first API request.
Install from PyPI:

```bash
pip install nano-qwen3-serving
```

Or install from source:

```bash
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving
pip install -e .
```
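As an optional sanity check, the installed package should import cleanly (the module name `nano_qwen3_serving` matches the launch command below):

```bash
python -c "import nano_qwen3_serving"
```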
Start the server:

```bash
# Start with default settings (port 8000)
python -m nano_qwen3_serving

# Or specify a custom port
python -m nano_qwen3_serving --port 8001

# Start with a specific model
python -m nano_qwen3_serving --model Qwen/Qwen3-1.5B
```
You should see output like:

```text
Starting nano Qwen3 Serving Service
Model: Qwen/Qwen3-0.6B
Device: mps
Host: 127.0.0.1
Port: 8000
Workers: 1
Log Level: info
--------------------------------------------------
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000
```
Check that the server is up:

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.0.0"
}
```
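The same check from Python, as a minimal sketch using the `requests` library:

```python
import requests

# Probe the health endpoint documented above
resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()
print(resp.json()["status"])  # expected: "healthy"
```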
Send your first chat completion request:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "Hello! How are you today?"}
    ],
    "max_tokens": 100
  }'
```
Or send the same request from Python:

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 100
    }
)
result = response.json()
print(result["choices"][0]["message"]["content"])
```
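The snippet reads `choices[0].message.content`, which assumes an OpenAI-style response body. Abridged to just the fields used above (the values are illustrative, not actual output):

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking."
      }
    }
  ]
}
```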
The server accepts the following command-line options:

| Option | Description | Default |
|---|---|---|
| `--port` | Server port | 8000 |
| `--host` | Server host | 127.0.0.1 |
| `--model` | Model to load | Qwen/Qwen3-0.6B |
| `--device` | Device (mps/cpu) | mps |
| `--workers` | Number of workers | 1 |
| `--log-level` | Logging level | info |
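The flags compose in the usual way. For example, to serve the 1.5B model on CPU at port 8001 with debug logging:

```bash
python -m nano_qwen3_serving --model Qwen/Qwen3-1.5B --device cpu --port 8001 --log-level debug
```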
Configuration can also be supplied through environment variables:

```bash
export NANO_QWEN3_PORT=8001
export NANO_QWEN3_MODEL=Qwen/Qwen3-1.5B
export NANO_QWEN3_DEVICE=mps
export NANO_QWEN3_LOG_LEVEL=debug
```
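Assuming the server reads these variables once at startup, they can also be set inline for a single run:

```bash
NANO_QWEN3_PORT=8001 NANO_QWEN3_LOG_LEVEL=debug python -m nano_qwen3_serving
```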
Available models:

| Model | Parameters | Memory | Speed | Use Case |
|---|---|---|---|---|
| `Qwen/Qwen3-0.6B` | 596M | ~2GB | Fast | Development, Testing |
| `Qwen/Qwen3-1.5B` | 1.5B | ~4GB | Medium | General Purpose |
| `Qwen/Qwen3-3B` | 3B | ~8GB | Slower | High Quality |
Enable streaming for real-time responses:
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "Write a short story about a robot."}
        ],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode('utf-8'))
```
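The loop above prints the raw stream. Assuming the server emits OpenAI-style server-sent events (each line prefixed with `data: ` and a final `data: [DONE]` sentinel; this is an assumption, so verify against your server's actual output), the incremental text can be extracted like this:

```python
import json
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "Write a short story about a robot."}
        ],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if not line:
        continue
    payload = line.decode("utf-8")
    # Strip the SSE "data: " prefix and stop at the end-of-stream sentinel
    if payload.startswith("data: "):
        payload = payload[len("data: "):]
    if payload.strip() == "[DONE]":
        break
    chunk = json.loads(payload)
    # Each chunk carries an incremental delta with newly generated text
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()
```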
If the default port is already in use, pick another with the `--port` option. For more detail when something goes wrong, start the server with debug logging:

```bash
python -m nano_qwen3_serving --log-level debug
```
Congratulations! You've successfully set up Nano Qwen3 Serving!