API Documentation
The Answira API is fully compatible with the OpenAI API format. You can use any OpenAI SDK or library by simply changing the base URL.
Introduction
Our API provides access to GLM-4.7, a state-of-the-art reasoning model with extended thinking capabilities. The API supports:
- Chat completions with streaming
- Function/tool calling
- Reasoning output (thinking process)
- JSON mode and structured outputs
- Prompt caching for reduced input costs
- Long context (up to 131K tokens)
- Request cancellation
Authentication
Authenticate requests using an API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
Sign up at the Answira Portal to get your API key.
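If you are not using an SDK, the header can be set directly on each request. A minimal sketch with the Python requests library (the endpoint path follows the Quick Start examples below):
import requests

# Raw HTTP call with the Authorization header; path as in the Quick Start examples.
resp = requests.post(
    "https://answira.ai/api/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",   # key from the Answira Portal
        "Content-Type": "application/json",
    },
    json={
        "model": "zai-org/GLM-4.7-FP8",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])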
Base URL
https://answira.ai/api
All API endpoints are relative to this base URL; the OpenAI-compatible endpoints live under /v1 (for example https://answira.ai/api/v1/chat/completions, as used in the Quick Start below).
Quick Start
from openai import OpenAI

client = OpenAI(
    base_url="https://answira.ai/api/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
curl https://answira.ai/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 500
  }'
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://answira.ai/api/v1',
  apiKey: 'your-api-key'
});

const response = await client.chat.completions.create({
  model: 'zai-org/GLM-4.7-FP8',
  messages: [{role: 'user', content: 'Hello!'}],
  max_tokens: 500
});

console.log(response.choices[0].message.content);
List Models
Returns a list of available models.
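Since the API is OpenAI-compatible, the list can also be fetched with the OpenAI SDK, assuming the endpoint is served at the standard /models path under the base URL:
from openai import OpenAI

client = OpenAI(base_url="https://answira.ai/api/v1", api_key="your-api-key")

# Lists the models exposed by the server (GET /models in the OpenAI-compatible API).
for model in client.models.list():
    print(model.id)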
Response
{
  "data": [{
    "id": "zai-org/GLM-4.7-FP8",
    "name": "GLM-4.7 FP8",
    "created": 1737100800,
    "context_length": 131072,
    "max_completion_tokens": 131072,
    "input_modalities": ["text"],
    "output_modalities": ["text"],
    "quantization": "fp8",
    "pricing": {
      "prompt": "0.000000475",
      "completion": "0.000002",
      "input_cache_read": "0.00000008"
    },
    "supported_sampling_parameters": [
      "temperature", "top_p", "top_k", "frequency_penalty",
      "presence_penalty", "repetition_penalty", "stop", "seed"
    ],
    "supported_features": ["tools", "json_mode", "structured_outputs", "reasoning"],
    "datacenters": [{"country_code": "CZ"}]
  }]
}
Chat Completions
Creates a chat completion for the given messages.
Request Body
| Parameter | Type | Description |
|---|---|---|
| model (Required) | string | Model ID to use: zai-org/GLM-4.7-FP8 |
| messages (Required) | array | Array of message objects with role and content |
| max_tokens (Optional) | integer | Maximum tokens to generate |
| temperature (Optional) | number | Sampling temperature (0-2). Default: 0.7 |
| top_p (Optional) | number | Nucleus sampling (0-1). Default: 1.0 |
| top_k (Optional) | integer | Top-K sampling. Limits sampling to the K most likely tokens |
| stop (Optional) | string/array | Stop sequence(s). Up to 4 sequences |
| seed (Optional) | integer | Seed for reproducible outputs |
| frequency_penalty (Optional) | number | Penalize tokens in proportion to how often they have already appeared (-2.0 to 2.0) |
| presence_penalty (Optional) | number | Penalize tokens that have appeared at least once, encouraging new topics (-2.0 to 2.0) |
| repetition_penalty (Optional) | number | Alternative repetition control (>1.0 = less repetition) |
| stream (Optional) | boolean | Enable SSE streaming. Default: false |
| tools (Optional) | array | List of tools/functions the model can call |
| tool_choice (Optional) | string/object | "auto", "none", or a specific function |
| response_format (Optional) | object | {"type": "json_object"} or {"type": "json_schema", ...} |
Example Request
{
  "model": "zai-org/GLM-4.7-FP8",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 500,
  "temperature": 0.7
}
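The standard parameters map directly onto the OpenAI SDKs. Parameters that are not part of the OpenAI client signature (such as top_k and repetition_penalty) can still be sent through the SDK's extra_body passthrough; a sketch, assuming the server reads them from the top level of the request body as the table above suggests:
# client configured as in the Quick Start
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=500,
    temperature=0.7,
    seed=42,                      # reproducible sampling
    extra_body={                  # passthrough for non-OpenAI parameters
        "top_k": 40,
        "repetition_penalty": 1.05
    }
)
print(response.choices[0].message.content)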
Completions (Legacy)
Legacy text completions endpoint. For new integrations, use Chat Completions instead.
Request Body
| Parameter | Type | Description |
|---|---|---|
| model (Required) | string | Model ID: zai-org/GLM-4.7-FP8 |
| prompt (Required) | string/array | The prompt(s) to generate completions for |
| max_tokens (Optional) | integer | Maximum tokens to generate |
| temperature (Optional) | number | Sampling temperature (0-2). Default: 0.7 |
| top_p (Optional) | number | Nucleus sampling (0-1). Default: 1.0 |
| stop (Optional) | string/array | Stop sequence(s) |
| stream (Optional) | boolean | Enable SSE streaming |
Example Request
{
  "model": "zai-org/GLM-4.7-FP8",
  "prompt": "Once upon a time",
  "max_tokens": 200,
  "temperature": 0.7
}
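With the OpenAI Python SDK, the legacy endpoint is reachable through client.completions; a minimal sketch:
# client configured as in the Quick Start
response = client.completions.create(
    model="zai-org/GLM-4.7-FP8",
    prompt="Once upon a time",
    max_tokens=200,
    temperature=0.7
)

# Legacy completions return plain text in choices[].text rather than a message object.
print(response.choices[0].text)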
Streaming
Enable streaming by setting stream: true. The response will be sent as Server-Sent Events (SSE).
from openai import OpenAI

client = OpenAI(
    base_url="https://answira.ai/api/v1",
    api_key="your-api-key"
)

stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Usage statistics for the request are included in a final chunk when stream_options.include_usage is set.
The server may send SSE comment lines (for example : keepalive) during long processing. Your client should ignore SSE comments per the SSE specification.
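To actually receive that final usage chunk, request it explicitly. A sketch using the OpenAI Python SDK's stream_options parameter:
# client configured as in the Quick Start
stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    stream_options={"include_usage": True}   # ask for a final chunk carrying usage
)

for chunk in stream:
    # Content chunks carry a delta; the final usage-only chunk has an empty choices list.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:
        print("\nprompt tokens:", chunk.usage.prompt_tokens,
              "completion tokens:", chunk.usage.completion_tokens)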
Tool Calling
The model supports function/tool calling in OpenAI-compatible format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)
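When the model decides to call a tool, the assistant message contains tool_calls with JSON-encoded arguments. A sketch of the follow-up round trip (get_weather here stands in for your own implementation):
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)      # e.g. {"city": "Tokyo"}
    result = get_weather(**args)                    # your own function

    # Send the tool result back so the model can produce the final answer.
    followup = client.chat.completions.create(
        model="zai-org/GLM-4.7-FP8",
        messages=[
            {"role": "user", "content": "What's the weather in Tokyo?"},
            message,                                # assistant message containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        ],
        tools=tools
    )
    print(followup.choices[0].message.content)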
Reasoning Output
GLM-4.7 is a reasoning model that shows its thinking process. The reasoning is returned in the reasoning_content field.
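reasoning_content is not a standard field of the OpenAI SDK message type, so read it defensively. A minimal sketch, assuming the field is returned alongside content on the message:
# client configured as in the Quick Start
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "How many prime numbers are below 50?"}],
    max_tokens=2000
)

message = response.choices[0].message
# Extra (non-OpenAI) fields are kept on the response object; fall back to None if absent.
reasoning = getattr(message, "reasoning_content", None)
if reasoning:
    print("Thinking:", reasoning)
print("Answer:", message.content)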
JSON Mode
Set response_format: {"type": "json_object"} to constrain the model to output valid JSON.
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "system", "content": "Respond in JSON format."},
        {"role": "user", "content": "List 3 European capitals with their countries."}
    ],
    response_format={"type": "json_object"}
)
Structured Outputs
Use JSON Schema to constrain the output to a specific structure. This guarantees the response matches your schema.
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[
        {"role": "user", "content": "Extract: John is 30 years old and lives in Prague."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person_info",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"]
            }
        }
    }
)
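Because the response is guaranteed to match the schema, it can be parsed directly:
import json

person = json.loads(response.choices[0].message.content)
print(person["name"], person["age"], person["city"])   # e.g. John 30 Prague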
Prompt Caching
Repeated prompt prefixes may be served from cache at a reduced price. Cached input tokens cost $0.08/M compared to $0.475/M for regular input tokens.
The usage object in the response includes prompt_tokens_details.cached_tokens showing how many tokens were served from cache:
// In the response usage object:
{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200
    }
  }
}
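The same information is available on SDK response objects; a sketch assuming the OpenAI Python SDK's prompt_tokens_details field is populated:
# response from any chat.completions.create() call
usage = response.usage
cached = 0
if usage.prompt_tokens_details:                 # may be absent on some responses
    cached = usage.prompt_tokens_details.cached_tokens or 0

# Cached tokens are billed at the lower cache-read rate ($0.08/M vs $0.475/M).
print(f"{cached}/{usage.prompt_tokens} prompt tokens served from cache")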
Request Cancellation
You can abort a running request at any time by closing the HTTP connection. The server detects the disconnection and immediately stops generation on the backend, freeing GPU resources.
You are only billed for tokens actually generated before the cancellation.
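With a streaming request, cancelling amounts to closing the stream (and with it the HTTP connection); a minimal sketch using the OpenAI Python SDK:
# client configured as in the Quick Start
stream = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Write a very long story"}],
    stream=True
)

received = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        received += len(chunk.choices[0].delta.content)
    if received > 2000:          # stop once we have enough output
        stream.close()           # closes the connection; generation stops server-side
        break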
Error Handling
The API returns standard HTTP error codes:
| Code | Description |
|---|---|
| 400 | Bad Request - Invalid parameters |
| 401 | Unauthorized - Invalid or missing API key |
| 429 | Too Many Requests - Rate limit exceeded or server busy |
| 500 | Internal Server Error |
| 503 | Service Unavailable - Backend is down |
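With the OpenAI SDKs these status codes surface as typed exceptions; for example in Python:
import openai

# client configured as in the Quick Start
try:
    response = client.chat.completions.create(
        model="zai-org/GLM-4.7-FP8",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except openai.AuthenticationError:        # 401 - check your API key
    raise
except openai.BadRequestError as e:       # 400 - invalid parameters
    print("Invalid request:", e)
except openai.APIStatusError as e:        # 429 / 500 / 503 and other HTTP errors
    print("API error", e.status_code, e.message)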
Rate Limits
Rate limits are applied based on server capacity. When the server is at high load, requests may receive a 429 response with a Retry-After header.
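A simple client-side strategy is to back off on 429 responses and honor the Retry-After header when present; a sketch using the OpenAI Python SDK:
import time
import openai

def create_with_retry(client, max_retries=5, **kwargs):
    """Retry on 429, preferring the server's Retry-After header over exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError as e:
            retry_after = e.response.headers.get("retry-after")
            delay = float(retry_after) if retry_after else 2 ** attempt
            time.sleep(delay)
    raise RuntimeError("rate limited after retries")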