API Documentation

The Answira API is fully compatible with the OpenAI API format. You can use any OpenAI SDK or library by simply changing the base URL.

Introduction

Our API provides access to GLM-4.7, a state-of-the-art reasoning model with extended thinking capabilities. The API supports chat completions, legacy text completions, streaming, tool calling, JSON mode, structured outputs, and prompt caching.

Authentication

Authenticate requests using an API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Sign up at the Answira Portal to get your API key.

Base URL

https://answira.ai/api

All API endpoints are relative to this base URL.

Quick Start

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://answira.ai/api/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=500
)

print(response.choices[0].message.content)

cURL

curl https://answira.ai/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 500
  }'

JavaScript

import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'https://answira.ai/api/v1',
    apiKey: 'your-api-key'
});

const response = await client.chat.completions.create({
    model: 'zai-org/GLM-4.7-FP8',
    messages: [{role: 'user', content: 'Hello!'}],
    max_tokens: 500
});

console.log(response.choices[0].message.content);

List Models

GET /v1/models

Returns a list of available models.
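
With the OpenAI Python SDK, the same request is a one-liner (continuing the client from the Quick Start):

models = client.models.list()
for m in models.data:
    print(m.id)  # e.g. zai-org/GLM-4.7-FP8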

Response

{
  "data": [{
    "id": "zai-org/GLM-4.7-FP8",
    "name": "GLM-4.7 FP8",
    "created": 1737100800,
    "context_length": 131072,
    "max_completion_tokens": 131072,
    "input_modalities": ["text"],
    "output_modalities": ["text"],
    "quantization": "fp8",
    "pricing": {
      "prompt": "0.000000475",
      "completion": "0.000002",
      "input_cache_read": "0.00000008"
    },
    "supported_sampling_parameters": [
      "temperature", "top_p", "top_k", "frequency_penalty",
      "presence_penalty", "repetition_penalty", "stop", "seed"
    ],
    "supported_features": ["tools", "json_mode", "structured_outputs", "reasoning"],
    "datacenters": [{"country_code": "CZ"}]
  }]
}

Chat Completions

POST /v1/chat/completions

Creates a chat completion for the given messages.

Request Body

Parameter Required Type Description
model Required string Model ID to use: zai-org/GLM-4.7-FP8
messages Required array Array of message objects with role and content
max_tokens Optional integer Maximum tokens to generate
temperature Optional number Sampling temperature (0-2). Default: 0.7
top_p Optional number Nucleus sampling (0-1). Default: 1.0
top_k Optional integer Top-K sampling. Limits to K most likely tokens
stop Optional string/array Stop sequence(s). Up to 4 sequences
seed Optional integer Seed for reproducible outputs
frequency_penalty Optional number Penalize tokens in proportion to how often they have already appeared (-2.0 to 2.0)
presence_penalty Optional number Penalize tokens that have already appeared at least once (-2.0 to 2.0)
repetition_penalty Optional number Alternative repetition control (>1.0 = less repetition)
stream Optional boolean Enable SSE streaming. Default: false
tools Optional array List of tools/functions the model can call
tool_choice Optional string/object "auto", "none", or specific function
response_format Optional object {"type": "json_object"} or {"type": "json_schema", ...}

Example Request

{
  "model": "zai-org/GLM-4.7-FP8",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 500,
  "temperature": 0.7
}
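
The same request with the Python SDK, adding a few of the optional sampling parameters from the table above (the specific values are illustrative, not recommendations):

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
    temperature=0.7,
    top_p=0.95,
    seed=42,        # reproducible sampling
    stop=["\n\n"]   # stop at the first blank line
)

print(response.choices[0].message.content)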

Completions (Legacy)

POST /v1/completions

Legacy text completions endpoint. For new integrations, use Chat Completions instead.

Request Body

Parameter Required Type Description
model Required string Model ID: zai-org/GLM-4.7-FP8
prompt Required string/array The prompt(s) to generate completions for
max_tokens Optional integer Maximum tokens to generate
temperature Optional number Sampling temperature (0-2). Default: 0.7
top_p Optional number Nucleus sampling (0-1). Default: 1.0
stop Optional string/array Stop sequence(s)
stream Optional boolean Enable SSE streaming

Example Request

{
  "model": "zai-org/GLM-4.7-FP8",
  "prompt": "Once upon a time",
  "max_tokens": 200,
  "temperature": 0.7
}
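
With the Python SDK, the legacy endpoint is exposed as client.completions; a minimal sketch:

response = client.completions.create(
    model="zai-org/GLM-4.7-FP8",
    prompt="Once upon a time",
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].text)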

Streaming

Enable streaming by setting stream: true. The response will be sent as Server-Sent Events (SSE).

from openai import OpenAI

client = OpenAI(
    base_url="https://answira.ai/api/v1",
    api_key="your-api-key"
)

stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Usage in final chunk: Usage information (prompt tokens, completion tokens) is included in the final streaming chunk when stream_options.include_usage is set.
SSE Keep-alive: The server sends SSE comment keep-alive messages (: keepalive) during long processing. Your client should ignore SSE comments per the SSE specification.
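
Building on the notes above, a minimal sketch of requesting usage in the final chunk via stream_options (continuing the client from the Quick Start; the final chunk carries usage and an empty choices list):

stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    stream_options={"include_usage": True}  # ask for usage in the final chunk
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:  # only present on the final chunk
        print(f"\n{chunk.usage.prompt_tokens} prompt + {chunk.usage.completion_tokens} completion tokens")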

Tool Calling

The model supports function/tool calling in OpenAI-compatible format.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)
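
If the model decides to call the tool, the call appears in the tool_calls field of the assistant message. A sketch of the round trip (the hard-coded weather result stands in for your own get_weather implementation):

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name)       # "get_weather"
    print(call.function.arguments)  # e.g. '{"city": "Tokyo"}'

    # Execute the tool yourself, then send the result back for a final answer
    followup = client.chat.completions.create(
        model="zai-org/GLM-4.7-FP8",
        messages=[
            {"role": "user", "content": "What's the weather in Tokyo?"},
            message,  # the assistant message containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": '{"temp_c": 18, "condition": "cloudy"}'}
        ],
        tools=tools
    )
    print(followup.choices[0].message.content)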

Reasoning Output

GLM-4.7 is a reasoning model that shows its thinking process. The reasoning is returned in the reasoning_content field.

Note: The model first outputs reasoning tokens, then the final answer. Both count towards your token usage.
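
A short sketch of reading the reasoning next to the final answer. Because reasoning_content is not a typed field in the OpenAI SDK, it is read defensively here; depending on your SDK version you may need to inspect the raw response instead:

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "What is 17 * 24?"}]
)

message = response.choices[0].message
reasoning = getattr(message, "reasoning_content", None)  # provider-specific field
if reasoning:
    print("Reasoning:", reasoning)
print("Answer:", message.content)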

JSON Mode

Set response_format: {"type": "json_object"} to constrain the model to output valid JSON.

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "system", "content": "Respond in JSON format."},
        {"role": "user", "content": "List 3 European capitals with their countries."}
    ],
    response_format={"type": "json_object"}
)
Tip: When using JSON mode, include an instruction in your system or user message asking the model to respond in JSON. This helps the model produce well-structured output.

Structured Outputs

Use JSON Schema to constrain the output to a specific structure. This guarantees the response matches your schema.

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "user", "content": "Extract: John is 30 years old and lives in Prague."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person_info",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"]
            }
        }
    }
)
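
Because the schema guarantees the shape of the reply, it can be parsed directly; a short usage follow-up:

import json

person = json.loads(response.choices[0].message.content)
print(person["name"], person["age"], person["city"])  # keys required by the schema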

Prompt Caching

Repeated prompt prefixes may be served from cache at a reduced price. Cached input tokens cost $0.08/M compared to $0.475/M for regular input tokens.

The usage object in the response includes prompt_tokens_details.cached_tokens showing how many tokens were served from cache:

// In the response usage object:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    }
  }
}
Note: Caching is automatic. When you send requests with the same prompt prefix, subsequent requests may benefit from cached tokens at the lower rate.
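
To check how much of a request was served from cache with the Python SDK, read the usage object defensively (prompt_tokens_details may be absent on some responses):

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = (details.cached_tokens or 0) if details else 0
print(f"{cached} of {usage.prompt_tokens} prompt tokens were served from cache")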

Request Cancellation

You can abort a running request at any time by closing the HTTP connection. The server detects the disconnection and immediately stops generation on the backend, freeing GPU resources.

You are only billed for tokens actually generated before the cancellation.
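
One way to cancel mid-generation is to stream over the raw HTTP API and close the connection once you have seen enough. A minimal sketch using the requests library (the early-exit condition is illustrative):

import requests

resp = requests.post(
    "https://answira.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "zai-org/GLM-4.7-FP8",
        "messages": [{"role": "user", "content": "Write a long story"}],
        "stream": True
    },
    stream=True
)

for i, line in enumerate(resp.iter_lines()):
    if line:
        print(line.decode("utf-8"))
    if i > 20:  # decided we have enough output
        break

resp.close()  # closing the connection stops generation on the backend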

Error Handling

The API returns standard HTTP error codes:

Code Description
400 Bad Request - Invalid parameters
401 Unauthorized - Invalid or missing API key
429 Too Many Requests - Rate limit exceeded or server busy
500 Internal Server Error
503 Service Unavailable - Backend is down

Rate Limits

Rate limits are applied based on server capacity. When the server is at high load, requests may receive a 429 response with a Retry-After header.

Adaptive Throttling: Our system uses adaptive throttling based on current load. During peak times, implement exponential backoff in your client.
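
A common client-side pattern for the 429 case, sketched with the OpenAI Python SDK (the delays are illustrative; prefer the server's Retry-After hint when present):

import time
import openai

def create_with_backoff(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError as e:
            # Honor the server's Retry-After header, otherwise back off exponentially
            retry_after = e.response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else 2 ** attempt
            time.sleep(delay)
    raise RuntimeError("Rate limited after retries")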