This guide covers the most common issues you might encounter when using Nano Qwen3 Serving and how to resolve them.
Error:
HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '<nano_qwen3_serving.core.llm.LLM object at 0x152f8b170>'.
Cause: The model identifier is being passed as an LLM object instead of a string.
Solution:
# ❌ Incorrect - passing an LLM object
model_runner = ModelRunner(llm_object)

# ✅ Correct - passing a string identifier
model_runner = ModelRunner("Qwen/Qwen3-0.6B")
Fix in code:
# In nano_qwen3_serving/core/model_runner.py
from transformers import AutoModelForCausalLM, AutoTokenizer

class ModelRunner:
    def __init__(self, model_name: str, device: str = "mps"):
        self.model_name = model_name  # a string like "Qwen/Qwen3-0.6B"
        self.device = device
        self._load_model()

    def _load_model(self):
        # Use self.model_name (a string), never an LLM object
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
Error:
INFO: 127.0.0.1:49260 - "GET /v1/models HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:49279 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found
Cause: The API routes are not properly registered or the server is not running the correct application.
Solution:
# In nano_qwen3_serving/service/server.py
app = FastAPI(title="Nano Qwen3 Serving", version="1.0.0")
# Add all required routes
app.add_api_route("/health", health_check, methods=["GET"])
app.add_api_route("/v1/models", list_models, methods=["GET"])
app.add_api_route("/v1/chat/completions", chat_completions, methods=["POST"])
app.add_api_route("/stats", get_stats, methods=["GET"])
Error:
RuntimeError: CUDA out of memory
(on Apple Silicon the equivalent is: RuntimeError: MPS backend out of memory)
Solutions:
# Use a smaller model
python -m nano_qwen3_serving --model Qwen/Qwen3-0.6B
# In configuration: reduce the batch size
max_batch_size = 1
# Fall back to CPU
python -m nano_qwen3_serving --device cpu
# Check available memory (macOS)
vm_stat
# Check swap usage (macOS manages swap automatically)
sysctl vm.swapusage
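To see how much accelerator memory PyTorch itself is holding, you can query it directly. A minimal sketch (the torch.mps memory APIs require PyTorch 2.0+):
import torch

# Report accelerator memory currently allocated by PyTorch
if torch.backends.mps.is_available():
    print(f"MPS allocated: {torch.mps.current_allocated_memory() / 1e9:.2f} GB")
elif torch.cuda.is_available():
    print(f"CUDA allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")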
Error:
RuntimeError: MPS not available
Solutions:
# Reinstall PyTorch (MPS requires macOS 12.3+ and an arm64 build)
pip install --upgrade torch torchvision torchaudio
# Or fall back to CPU
python -m nano_qwen3_serving --device cpu
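Before reinstalling, confirm whether your current PyTorch build can see the MPS backend at all:
import torch

# is_built: this PyTorch binary was compiled with MPS support
# is_available: the macOS version and hardware can actually use it
print(f"MPS built:     {torch.backends.mps.is_built()}")
print(f"MPS available: {torch.backends.mps.is_available()}")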
Error:
ConnectionError: Failed to download model
Solutions:
# Authenticate (needed for gated or rate-limited repos)
export HUGGING_FACE_HUB_TOKEN=your_token

# Pre-download the model into a local cache
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="./models")
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir="./models")

# Then point the server at a local copy
python -m nano_qwen3_serving --model ./local/path/to/model
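If the model is already in your local cache and only the network check is failing, you can force offline loading; recent huggingface_hub versions honor the HF_HUB_OFFLINE variable:
# Skip all network calls and use only cached files
export HF_HUB_OFFLINE=1
python -m nano_qwen3_serving --model Qwen/Qwen3-0.6B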
Error:
OSError: [Errno 48] Address already in use
Solutions:
# See which process holds port 8000
lsof -i :8000
# Kill it
lsof -ti:8000 | xargs kill -9
# Or start the server on a different port
python -m nano_qwen3_serving --port 8001
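You can also test port availability from Python with the standard library before starting the server; a small sketch:
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # connect_ex returns 0 when something is already listening
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

print(port_in_use(8000))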
Symptoms: slow responses, high per-request latency, low token throughput.
Solutions:
# Reduce context length
max_context_length = 512
# Use smaller batch size
max_batch_size = 1
# Use torch.compile (PyTorch 2.0+)
model = torch.compile(model)
# Use half precision
model = model.half()
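Applied together on Apple Silicon, the two optimizations above look roughly like this (a sketch; torch.compile support on the MPS backend is still maturing, and compilation errors only surface on the first forward pass):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
model = model.half().to("mps")  # half precision on the Apple GPU
model = torch.compile(model)    # PyTorch 2.0+; compiles lazily on first call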
# Check CPU usage
top
# Check memory usage
vm_stat
# Check GPU usage (if available)
sudo powermetrics --samplers gpu_power -n 1
Warning:
UserWarning: Field "model_info" has conflict with protected namespace "model_".
Solution:
# In your Pydantic model
class Config:
    protected_namespaces = ()
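On Pydantic v2, the class-based Config is deprecated in favor of model_config; the equivalent fix, sketched with a hypothetical model class:
from pydantic import BaseModel, ConfigDict

class ModelInfo(BaseModel):  # hypothetical example model
    model_config = ConfigDict(protected_namespaces=())
    model_info: str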
Debugging tips:
# Run the server with verbose logging
python -m nano_qwen3_serving --log-level debug
# Check server health
curl http://localhost:8000/health
# Follow logs in real-time
tail -f logs/nano_qwen3.log
# Search for errors
grep -i error logs/nano_qwen3.log
# Test model loading
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
# Test inference
inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
curl http://localhost:8000/stats
# CPU and memory
htop
# GPU usage (if available)
sudo powermetrics --samplers gpu_power -n 1
# Network connections
netstat -an | grep 8000
import time
import requests

def benchmark_api():
    start_time = time.time()
    response = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "Qwen/Qwen3-0.6B",
            "messages": [{"role": "user", "content": "Hello"}],
            "max_tokens": 50
        }
    )
    end_time = time.time()
    print(f"Response time: {end_time - start_time:.3f}s")
    print(f"Status code: {response.status_code}")
    return response.json()

# Run benchmark
result = benchmark_api()
Always check the logs first:
tail -n 100 logs/nano_qwen3.log
Check existing issues on the project's GitHub repository before filing a new one.
When creating an issue, include the full error message and stack trace, relevant log output, your OS, Python, and PyTorch versions, and the exact command and steps to reproduce.
💡 Pro Tip: Most issues can be resolved by checking the logs and ensuring you're using the latest version of the package.