API Documentation

The Answira API is fully compatible with the OpenAI API format. You can use any OpenAI SDK or library by simply changing the base URL.

Introduction

Our API provides access to GLM-4.7, a state-of-the-art reasoning model with extended thinking capabilities. The API supports chat completions, legacy text completions, streaming, tool calling, JSON mode, structured outputs, and prompt caching.

Authentication

Authenticate requests using an API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Sign up at the Answira Portal to get your API key.

Base URL

https://answira.ai/api

All API endpoints are relative to this base URL.

Quick Start

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://answira.ai/api/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=500
)

print(response.choices[0].message.content)

cURL

curl https://answira.ai/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 500
  }'

JavaScript

import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'https://answira.ai/api/v1',
    apiKey: 'your-api-key'
});

const response = await client.chat.completions.create({
    model: 'zai-org/GLM-4.7-FP8',
    messages: [{role: 'user', content: 'Hello!'}],
    max_tokens: 500
});

console.log(response.choices[0].message.content);

List Models

GET /v1/models

Returns a list of available models.
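
With the OpenAI Python SDK, the same request is a one-liner (continuing the client from the Quick Start):

models = client.models.list()
for m in models.data:
    print(m.id)  # e.g. zai-org/GLM-4.7-FP8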

Response

{
  "data": [{
    "id": "zai-org/GLM-4.7-FP8",
    "name": "GLM-4.7 FP8",
    "created": 1737100800,
    "context_length": 131072,
    "max_completion_tokens": 131072,
    "input_modalities": ["text"],
    "output_modalities": ["text"],
    "quantization": "fp8",
    "pricing": {
      "prompt": "0.000000475",
      "completion": "0.000002",
      "input_cache_read": "0.00000008"
    },
    "supported_sampling_parameters": [
      "temperature", "top_p", "top_k", "frequency_penalty",
      "presence_penalty", "repetition_penalty", "stop", "seed"
    ],
    "supported_features": ["tools", "json_mode", "structured_outputs", "reasoning"],
    "datacenters": [{"country_code": "CZ"}]
  }]
}

Chat Completions

POST /v1/chat/completions

Creates a chat completion for the given messages.

Request Body

Parameter Required Type Description
model Required string Model ID to use: zai-org/GLM-4.7-FP8
messages Required array Array of message objects with role and content
max_tokens Optional integer Maximum tokens to generate
temperature Optional number Sampling temperature (0-2). Default: 0.7
top_p Optional number Nucleus sampling (0-1). Default: 1.0
top_k Optional integer Top-K sampling. Limits to K most likely tokens
stop Optional string/array Stop sequence(s). Up to 4 sequences
seed Optional integer Seed for reproducible outputs
frequency_penalty Optional number Penalize tokens in proportion to how often they have already appeared (-2.0 to 2.0)
presence_penalty Optional number Penalize tokens that have already appeared at least once (-2.0 to 2.0)
repetition_penalty Optional number Alternative repetition control (>1.0 = less repetition)
stream Optional boolean Enable SSE streaming. Default: false
tools Optional array List of tools/functions the model can call
tool_choice Optional string/object "auto", "none", or specific function
response_format Optional object {"type": "json_object"} or {"type": "json_schema", ...}

Example Request

{
  "model": "zai-org/GLM-4.7-FP8",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 500,
  "temperature": 0.7
}
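
The same request with the Python SDK, adding a few of the optional sampling parameters from the table above (the specific values are illustrative, not recommendations):

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
    temperature=0.7,
    top_p=0.95,
    seed=42,        # reproducible sampling
    stop=["\n\n"]   # stop at the first blank line
)

print(response.choices[0].message.content)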

Completions (Legacy)

POST /v1/completions

Legacy text completions endpoint. For new integrations, use Chat Completions instead.

Request Body

Parameter Required Type Description
model Required string Model ID: zai-org/GLM-4.7-FP8
prompt Required string/array The prompt(s) to generate completions for
max_tokens Optional integer Maximum tokens to generate
temperature Optional number Sampling temperature (0-2). Default: 0.7
top_p Optional number Nucleus sampling (0-1). Default: 1.0
stop Optional string/array Stop sequence(s)
stream Optional boolean Enable SSE streaming

Example Request

{
  "model": "zai-org/GLM-4.7-FP8",
  "prompt": "Once upon a time",
  "max_tokens": 200,
  "temperature": 0.7
}
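
With the Python SDK, the legacy endpoint is exposed as client.completions; a minimal sketch:

response = client.completions.create(
    model="zai-org/GLM-4.7-FP8",
    prompt="Once upon a time",
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].text)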

Streaming

Enable streaming by setting stream: true. The response will be sent as Server-Sent Events (SSE).

from openai import OpenAI

client = OpenAI(
    base_url="https://answira.ai/api/v1",
    api_key="your-api-key"
)

stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Usage in final chunk: Usage information (prompt tokens, completion tokens) is included in the final streaming chunk when stream_options.include_usage is set.
SSE Keep-alive: The server sends SSE comment keep-alive messages (: keepalive) during long processing. Your client should ignore SSE comments per the SSE specification.
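
Building on the notes above, a minimal sketch of requesting usage in the final chunk via stream_options (continuing the client from the Quick Start; the final chunk carries usage and an empty choices list):

stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    stream_options={"include_usage": True}  # ask for usage in the final chunk
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:  # only present on the final chunk
        print(f"\n{chunk.usage.prompt_tokens} prompt + {chunk.usage.completion_tokens} completion tokens")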

Tool Calling

The model supports function/tool calling in OpenAI-compatible format.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)
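
If the model decides to call the tool, the call appears in the tool_calls field of the assistant message. A sketch of the round trip (the hard-coded weather result stands in for your own get_weather implementation):

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name)       # "get_weather"
    print(call.function.arguments)  # e.g. '{"city": "Tokyo"}'

    # Execute the tool yourself, then send the result back for a final answer
    followup = client.chat.completions.create(
        model="zai-org/GLM-4.7-FP8",
        messages=[
            {"role": "user", "content": "What's the weather in Tokyo?"},
            message,  # the assistant message containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": '{"temp_c": 18, "condition": "cloudy"}'}
        ],
        tools=tools
    )
    print(followup.choices[0].message.content)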

Reasoning Output

GLM-4.7 is a reasoning model that shows its thinking process. The reasoning is returned in the reasoning_content field.

Note: The model first outputs reasoning tokens, then the final answer. Both count towards your token usage.
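
A short sketch of reading the reasoning next to the final answer. Because reasoning_content is not a typed field in the OpenAI SDK, it is read defensively here; depending on your SDK version you may need to inspect the raw response instead:

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "What is 17 * 24?"}]
)

message = response.choices[0].message
reasoning = getattr(message, "reasoning_content", None)  # provider-specific field
if reasoning:
    print("Reasoning:", reasoning)
print("Answer:", message.content)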

JSON Mode

Set response_format: {"type": "json_object"} to constrain the model to output valid JSON.

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "system", "content": "Respond in JSON format."},
        {"role": "user", "content": "List 3 European capitals with their countries."}
    ],
    response_format={"type": "json_object"}
)
Tip: When using JSON mode, include an instruction in your system or user message asking the model to respond in JSON. This helps the model produce well-structured output.

Structured Outputs

Use JSON Schema to constrain the output to a specific structure. This guarantees the response matches your schema.

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "user", "content": "Extract: John is 30 years old and lives in Prague."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person_info",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"]
            }
        }
    }
)
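
Because the schema guarantees the shape of the reply, it can be parsed directly; a short usage follow-up:

import json

person = json.loads(response.choices[0].message.content)
print(person["name"], person["age"], person["city"])  # keys required by the schema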

Prompt Caching

Repeated prompt prefixes may be served from cache at a reduced price. Cached input tokens cost $0.08/M compared to $0.475/M for regular input tokens.

The usage object in the response includes prompt_tokens_details.cached_tokens showing how many tokens were served from cache:

// In the response usage object:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    }
  }
}
Note: Caching is automatic. When you send requests with the same prompt prefix, subsequent requests may benefit from cached tokens at the lower rate.
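
To check how much of a request was served from cache with the Python SDK, read the usage object defensively (prompt_tokens_details may be absent on some responses):

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = (details.cached_tokens or 0) if details else 0
print(f"{cached} of {usage.prompt_tokens} prompt tokens were served from cache")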

Request Cancellation

You can abort a running request at any time by closing the HTTP connection. The server detects the disconnection and immediately stops generation on the backend, freeing GPU resources.

You are only billed for tokens actually generated before the cancellation.
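
One way to cancel mid-generation is to stream over the raw HTTP API and close the connection once you have seen enough. A minimal sketch using the requests library (the early-exit condition is illustrative):

import requests

resp = requests.post(
    "https://answira.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "zai-org/GLM-4.7-FP8",
        "messages": [{"role": "user", "content": "Write a long story"}],
        "stream": True
    },
    stream=True
)

for i, line in enumerate(resp.iter_lines()):
    if line:
        print(line.decode("utf-8"))
    if i > 20:  # decided we have enough output
        break

resp.close()  # closing the connection stops generation on the backend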

Error Handling

The API returns standard HTTP error codes:

Code Description
400 Bad Request - Invalid parameters
401 Unauthorized - Invalid or missing API key
429 Too Many Requests - Rate limit exceeded or server busy
500 Internal Server Error
503 Service Unavailable - Backend is down

Rate Limits

Rate limits are applied based on server capacity. When the server is at high load, requests may receive a 429 response with a Retry-After header.

Adaptive Throttling: Our system uses adaptive throttling based on current load. During peak times, implement exponential backoff in your client.
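
A common client-side pattern for the 429 case, sketched with the OpenAI Python SDK (the delays are illustrative; prefer the server's Retry-After hint when present):

import time
import openai

def create_with_backoff(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError as e:
            # Honor the server's Retry-After header, otherwise back off exponentially
            retry_after = e.response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else 2 ** attempt
            time.sleep(delay)
    raise RuntimeError("Rate limited after retries")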