nano-qwen3-serving

API Reference

Nano Qwen3 Serving provides a fully OpenAI-compatible API. All endpoints follow the OpenAI API specification, so existing OpenAI clients and tools can use it as a drop-in replacement.
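Because the endpoints mirror the OpenAI specification, the official openai Python client can point at the server directly. A minimal sketch (the api_key value is a placeholder, since no authentication is required):

from openai import OpenAI

# Point the official OpenAI client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)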

🔗 Base URL

http://localhost:8000

📋 Authentication

Currently, Nano Qwen3 Serving doesn't require authentication for local development. For production deployments, consider implementing API key authentication.
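If you do add a gate, one common option is an API-key check in front of the endpoints, using the OpenAI-style Authorization: Bearer header. A minimal sketch, assuming a FastAPI-based proxy or wrapper; the key store and error code here are illustrative, not part of nano-qwen3-serving:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Illustrative key store; in practice, load keys from a secrets manager.
VALID_KEYS = {"sk-example-key"}

@app.middleware("http")
async def require_api_key(request: Request, call_next):
    # Expect an OpenAI-style "Authorization: Bearer <key>" header.
    token = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
    if token not in VALID_KEYS:
        return JSONResponse(
            status_code=401,
            content={"error": {"message": "Invalid API key",
                               "type": "invalid_request_error",
                               "param": None,
                               "code": "invalid_api_key"}},
        )
    return await call_next(request)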

📊 Endpoints Overview

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check and server status |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Generate chat completions (non-streaming and streaming) |
| /stats | GET | Performance statistics |

๐Ÿฅ Health Check

GET /health

Check if the server is running and healthy.

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "1.0.0",
  "model": "Qwen/Qwen3-0.6B",
  "device": "mps",
  "uptime": 3600
}

Example:

curl http://localhost:8000/health
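The endpoint also works as a readiness probe in scripts; a small sketch that polls until the server reports healthy:

import time

import requests

def wait_until_healthy(base_url="http://localhost:8000", timeout=60):
    # Poll /health until the server responds with status "healthy".
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=2)
            if r.ok and r.json().get("status") == "healthy":
                return True
        except requests.RequestException:
            pass  # Server not reachable yet; keep waiting.
        time.sleep(1)
    return False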

🤖 Models

GET /v1/models

List all available models.

Response:

{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-0.6B",
      "object": "model",
      "created": 1705312800,
      "owned_by": "nano-qwen3-serving",
      "permission": [],
      "root": "Qwen/Qwen3-0.6B",
      "parent": null
    }
  ]
}

Example:

curl http://localhost:8000/v1/models

💬 Chat Completions

POST /v1/chat/completions

Generate chat completions with the specified model.

Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| model | string | Yes | Model identifier (e.g., "Qwen/Qwen3-0.6B") |
| messages | array | Yes | Array of message objects |
| stream | boolean | No | Whether to stream the response (default: false) |
| max_tokens | integer | No | Maximum tokens to generate (default: 2048) |
| temperature | number | No | Sampling temperature (0.0-2.0, default: 1.0) |
| top_p | number | No | Nucleus sampling parameter (0.0-1.0, default: 1.0) |
| n | integer | No | Number of completions to generate (default: 1) |
| stop | string/array | No | Stop sequences |
| presence_penalty | number | No | Presence penalty (-2.0 to 2.0, default: 0.0) |
| frequency_penalty | number | No | Frequency penalty (-2.0 to 2.0, default: 0.0) |
| logit_bias | object | No | Logit bias for specific tokens |
| user | string | No | User identifier for tracking |

Message Object

{
  "role": "system|user|assistant",
  "content": "message content"
}
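For multi-turn conversations, send the full history in order, oldest message first; an optional leading system message sets the assistant's behavior:

[
  {"role": "system", "content": "You are a concise assistant."},
  {"role": "user", "content": "What is the capital of France?"},
  {"role": "assistant", "content": "Paris."},
  {"role": "user", "content": "And its population?"}
]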

Non-Streaming Response

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1705312800,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm here to help you."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 15,
    "total_tokens": 25
  }
}

Streaming Response

When stream=true, the response is a Server-Sent Events (SSE) stream:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1705312800,"model":"Qwen/Qwen3-0.6B","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1705312800,"model":"Qwen/Qwen3-0.6B","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1705312800,"model":"Qwen/Qwen3-0.6B","choices":[{"index":0,"delta":{"content":" I'm"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1705312800,"model":"Qwen/Qwen3-0.6B","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Examples

Basic Chat Completion

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }'

Streaming Chat Completion

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "Write a short poem about AI."}
    ],
    "stream": true,
    "max_tokens": 200
  }'

Python Example

import json
import requests

# Non-streaming
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "max_tokens": 150,
        "temperature": 0.7
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

# Streaming
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            {"role": "user", "content": "Write a story about a robot."}
        ],
        "stream": True,
        "max_tokens": 300
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data = line[6:]  # Remove 'data: ' prefix
            if data == '[DONE]':
                break
            try:
                chunk = json.loads(data)
                if 'choices' in chunk and chunk['choices']:
                    delta = chunk['choices'][0].get('delta', {})
                    if 'content' in delta:
                        print(delta['content'], end='', flush=True)
            except json.JSONDecodeError:
                continue

📊 Performance Statistics

GET /stats

Get real-time performance statistics.

Response:

{
  "requests_processed": 150,
  "tokens_generated": 2500,
  "average_response_time": 0.045,
  "requests_per_second": 22.5,
  "memory_usage_mb": 2048,
  "gpu_utilization": 0.75,
  "model_info": {
    "name": "Qwen/Qwen3-0.6B",
    "parameters": 596049920,
    "device": "mps"
  }
}

Example:

curl http://localhost:8000/stats
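The endpoint is also easy to poll from a monitoring script; the field names below follow the response shown above:

import requests

stats = requests.get("http://localhost:8000/stats").json()
print(f"{stats['requests_per_second']:.1f} req/s, "
      f"{stats['memory_usage_mb']} MB resident, "
      f"GPU utilization {stats['gpu_utilization']:.0%}")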

โš ๏ธ Error Responses

All endpoints return standard HTTP status codes and error messages in the following format:

{
  "error": {
    "message": "Error description",
    "type": "invalid_request_error",
    "param": "model",
    "code": "model_not_found"
  }
}

Common Error Codes

| Code | Description |
|------|-------------|
| model_not_found | Specified model doesn't exist |
| invalid_request_error | Request format is invalid |
| rate_limit_exceeded | Too many requests |
| server_error | Internal server error |
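A client can branch on the HTTP status and the code field when a request fails; a minimal sketch:

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "no-such-model",
          "messages": [{"role": "user", "content": "hi"}]},
)
if not response.ok:
    err = response.json().get("error", {})
    # e.g. "model_not_found: Specified model doesn't exist"
    print(f"{err.get('code')}: {err.get('message')}")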

🔧 Rate Limiting

Currently, Nano Qwen3 Serving doesn't implement rate limiting. For production use, consider implementing rate limiting based on your requirements.
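Until then, a client-side guard is a reasonable stopgap. A sketch of retrying with exponential backoff on HTTP 429 (the status a rate limiter would typically return; nano-qwen3-serving itself never sends it today):

import time

import requests

def post_with_backoff(url, payload, max_retries=5):
    # Retry on 429 with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        response = requests.post(url, json=payload)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)
    return response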

๐Ÿ“ Request/Response Headers

Request Headers

| Header | Description |
|--------|-------------|
| Content-Type | Must be application/json |
| Accept | application/json for non-streaming, text/event-stream for streaming |

Response Headers

| Header | Description |
|--------|-------------|
| Content-Type | application/json or text/event-stream |
| Cache-Control | Caching directives |
| X-Request-ID | Unique request identifier |
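The request identifier is useful for correlating client and server logs; reading it from a response with requests (assuming the server sets the header on every response):

import requests

response = requests.get("http://localhost:8000/health")
print(response.headers.get("X-Request-ID"))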

🔄 WebSocket Support

WebSocket support is planned for future releases. Currently, use Server-Sent Events (SSE) for streaming responses.

📚 Next Steps