Star LLM — Admin

Integration guide

Star LLM exposes internal AI models (chat, transcription, rerank) through one gateway. Access requires an MCP token issued by the platform admin; ask for one scoped to the tools you need. Cost is flat and internal: there is no per-call billing, only rate limits enforced per access token.

1. Authentication

Every request carries your token as a bearer header. Treat it like a password: server-side only, in your secret manager, never in client code or git.

Authorization: Bearer slm_xxxxxxxxxxxxxxxxxxxx
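A minimal sketch of building that header server-side, assuming the token is stored in the STAR_LLM_TOKEN environment variable (the name used by the examples in sections 8 and 9). The helper name auth_headers is illustrative, not part of the gateway.

```python
import os


def auth_headers(env=os.environ) -> dict:
    """Build request headers from the STAR_LLM_TOKEN environment variable.

    Raises early when the token is missing instead of sending an empty
    bearer header, which the gateway would reject with 401.
    """
    token = env.get("STAR_LLM_TOKEN", "").strip()
    if not token:
        raise RuntimeError("STAR_LLM_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Passing env as a parameter keeps the helper testable without touching the real process environment.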

2. MCP endpoint (primary)

JSON-RPC 2.0 over a single POST — the same shape as Star Drive's /api/mcp. Discover what your token can do:

curl -X POST https://api.llm.star.sa/api/mcp \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

tools/list returns only the tools your token is scoped for. Call one:

curl -X POST https://api.llm.star.sa/api/mcp \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{
        "name":"chat.complete",
        "arguments":{"messages":[{"role":"user","content":"Summarize: ..."}],
                     "max_tokens":300,"temperature":0.3}}}'
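The two curl calls above share one JSON-RPC envelope. The sketch below builds that envelope and reads tool names out of a tools/list response. It assumes the gateway follows the standard MCP layout, where the result holds a tools array of objects with a name field; that layout is not confirmed by this guide. The helper names rpc_request and tool_names are illustrative.

```python
import itertools
from typing import List, Optional

_ids = itertools.count(1)  # unique request ids within this process


def rpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request body for the /api/mcp endpoint."""
    request = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


def tool_names(response: dict) -> List[str]:
    """Extract tool names from a tools/list response.

    Assumes the standard MCP shape: {"result": {"tools": [{"name": ...}]}}.
    """
    if "error" in response:
        raise RuntimeError(response["error"])
    return [tool["name"] for tool in response["result"]["tools"]]
```

Checking the returned names before calling a tool avoids a 403 for a scope your token does not have.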

3. Tools

Tool	Arguments	Returns
`chat.complete`	`messages` (role/content array), `model?`, `max_tokens?`, `temperature?`	`content`, `usage`, `model`, and `clamped` if policy adjusted your params
`rerank`	`query`, `candidates` ([{id, text}]), `top_k?`, `model?`	`ranked` ([{id, score}], best first)
`request_audio_upload`	`filename`, `content_type?`	`upload_url` (signed PUT, 15 min), `gcs_uri`
`transcribe_audio`	`gcs_uri` (from upload step), `language?` (e.g. "ar")	`text`, `language`, `duration_s`, `segments`
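As an illustration of the rerank row, the sketch below turns a plain list of texts into the candidates argument and maps the ranked result back to the original texts. It assumes candidate ids are strings and come back unchanged in ranked; both helper names are hypothetical.

```python
from typing import List, Optional


def rerank_args(query: str, texts: List[str], top_k: Optional[int] = None) -> dict:
    """Build the arguments object for the rerank tool.

    Each text gets its list index (as a string) as its id, so results
    can be mapped back to the input list afterwards.
    """
    args = {
        "query": query,
        "candidates": [{"id": str(i), "text": t} for i, t in enumerate(texts)],
    }
    if top_k is not None:
        args["top_k"] = top_k
    return args


def texts_in_rank_order(texts: List[str], ranked: List[dict]) -> List[str]:
    """Map a ranked list ([{id, score}], best first) back to the texts."""
    return [texts[int(item["id"])] for item in ranked]
```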

3b. Models currently serving (live)

Model	Kind	How to use
`Qwen/Qwen2.5-7B-Instruct-AWQ`	llm	pass as `model` in chat.complete
`BAAI/bge-reranker-v2-m3`	reranker	rerank (optionally pass as `model`)
`faster-whisper large-v3`	stt	transcribe_audio

Your token may be restricted to a subset — tools/list is always the authoritative view for you.

4. Transcription flow (3 steps)

1. tools/call request_audio_upload {"filename":"meeting.wav","content_type":"audio/wav"}
2. HTTP PUT your audio bytes to upload_url (same Content-Type)
3. tools/call transcribe_audio {"gcs_uri": "...", "language": "ar"}   # omit language = auto

Your upload is deleted server-side immediately after a successful transcription. Tokens can only transcribe their own uploads.
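The three steps can be sketched as one function. To keep the sketch independent of any HTTP library, it takes two callables as parameters: mcp_call(name, args), which behaves like the helper in section 9, and put_bytes(url, data, content_type), which performs the HTTP PUT. Both parameter names, and the function name transcribe_file, are assumptions made for this example.

```python
from typing import Callable, Optional


def transcribe_file(
    mcp_call: Callable[[str, dict], dict],
    put_bytes: Callable[[str, bytes, str], None],
    filename: str,
    audio: bytes,
    content_type: str = "audio/wav",
    language: Optional[str] = None,
) -> dict:
    """Run the upload-then-transcribe flow and return the transcription."""
    # Step 1: ask the gateway for a signed upload URL (valid 15 minutes).
    upload = mcp_call(
        "request_audio_upload",
        {"filename": filename, "content_type": content_type},
    )
    # Step 2: PUT the raw bytes with the same Content-Type as in step 1.
    put_bytes(upload["upload_url"], audio, content_type)
    # Step 3: transcribe; leaving out language lets the server auto-detect.
    args = {"gcs_uri": upload["gcs_uri"]}
    if language is not None:
        args["language"] = language
    return mcp_call("transcribe_audio", args)
```

With httpx, put_bytes could be a small wrapper around httpx.put(url, content=data, headers={"Content-Type": content_type}) followed by raise_for_status().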

5. REST mirrors (if you prefer plain REST)

Endpoint	Equivalent tool
`POST https://api.llm.star.sa/v1/chat`	chat.complete
`POST https://api.llm.star.sa/v1/rerank`	rerank
`POST https://api.llm.star.sa/v1/uploads`	request_audio_upload
`POST https://api.llm.star.sa/v1/transcribe`	transcribe_audio

Same bearer token, same JSON bodies as the tool arguments.
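A small sketch of the mapping in the table above, using only the Python standard library. It builds the request object without sending it; the function name rest_request is illustrative.

```python
import json
import urllib.request

BASE_URL = "https://api.llm.star.sa"

# Tool name -> REST path, copied from the table above.
REST_PATHS = {
    "chat.complete": "/v1/chat",
    "rerank": "/v1/rerank",
    "request_audio_upload": "/v1/uploads",
    "transcribe_audio": "/v1/transcribe",
}


def rest_request(tool: str, arguments: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) the REST request equivalent to a tool call.

    The JSON body is the tool's arguments object, unchanged.
    Raises KeyError for a tool that has no REST mirror.
    """
    return urllib.request.Request(
        BASE_URL + REST_PATHS[tool],
        data=json.dumps(arguments).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(request, timeout=...) or rebuilt for any other HTTP client.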

6. Policies — what to expect

Clamping, not rejection: if you ask for temperature=1.5 but your policy caps at 0.7, the call runs at 0.7 and the response includes "clamped": {"temperature": 0.7}. Same for max_tokens.
Rate limits return 429 with a retry-after header (seconds) and JSON retry_after_s. Respect it — denied calls don't consume quota.
Long calls are fine: transcription connections are held up to 15 minutes. Set your client timeout above your longest audio, not below.
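One way a client might honor these policies is sketched below. It assumes the calling code raises a RateLimited exception (a hypothetical class defined here) carrying the retry-after value whenever it sees a 429; call_with_retry and clamped_params are also illustrative names.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Raised by the caller's HTTP layer when the gateway answers 429."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"rate limited, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


def call_with_retry(call: Callable[[], dict], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Run call(), waiting the server-provided delay after each 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the 429 to the caller
            sleep(exc.retry_after_s)
    raise AssertionError("unreachable")


def clamped_params(result: dict) -> dict:
    """Return the parameters the policy adjusted, or {} if none were."""
    return result.get("clamped") or {}
```

Logging a non-empty clamped_params result makes it visible when a request ran with different settings than the ones you sent.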

7. Errors

Status	Meaning	What to do
401	missing / unknown / revoked / expired token	check the header; ask admin to reissue
403	token not scoped for that tool	ask admin to add the scope
429	rate / daily / concurrency limit	back off per retry-after
502	model server error	retry once; report if persistent
503	GPU VM down (gateway healthy)	retry with backoff
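The table can be encoded as a small lookup so that calling code makes the same decision everywhere. This is a sketch only; the action labels and the function name action_for are made up for the example and are not gateway values.

```python
def action_for(status: int) -> str:
    """Translate an HTTP status from the gateway into a client action."""
    if 200 <= status < 300:
        return "ok"
    if status == 401:
        return "check_token"     # fix the header or ask admin to reissue
    if status == 403:
        return "request_scope"   # ask admin to add the tool scope
    if status in (429, 503):
        return "backoff"         # 429: wait per retry-after; 503: retry with backoff
    if status == 502:
        return "retry_once"      # report if it keeps happening
    return "fail"                # anything not covered by the table
```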

8. Node.js example

async function mcpCall(name, args) {
  const res = await fetch("https://api.llm.star.sa/api/mcp", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.STAR_LLM_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method: "tools/call",
                           params: { name, arguments: args } }),
    // transcribe can take minutes — do NOT use default fetch timeouts
    signal: AbortSignal.timeout(15 * 60_000),
  });
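  if (res.status === 429) throw new Error(`rate limited, retry after ${res.headers.get("retry-after")}s`);
  // 401/403/502/503 may not carry a JSON-RPC body, so fail on the HTTP status first
  if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);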
  const body = await res.json();
  if (body.error) throw new Error(body.error.message ?? JSON.stringify(body.error));
  return body.result.structuredContent;
}

9. Python example

import os

import httpx

def mcp_call(name: str, args: dict):
    r = httpx.post(
        "https://api.llm.star.sa/api/mcp",
        headers={"Authorization": f"Bearer {os.environ['STAR_LLM_TOKEN']}"},
        json={"jsonrpc": "2.0", "id": 1, "method": "tools/call",
              "params": {"name": name, "arguments": args}},
        timeout=httpx.Timeout(900.0, connect=10.0),  # long audio is normal
    )
    if r.status_code == 429:
        raise RuntimeError(f"rate limited, retry after {r.headers['retry-after']}s")
    r.raise_for_status()  # surface 401/403/502/503 before parsing the body
    body = r.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]["structuredContent"]

Need a token, higher limits, or another tool scope? Contact the platform admin — these are dashboard toggles, not deployments.