Why Claude, and why no SDK
Second Brain is an AI memory companion for iOS. You capture thoughts, ideas, and experiences throughout the day, and the app uses AI to help you explore connections between them and generate periodic reflections on your thinking patterns. The AI is not a chatbot — it's a retrieval and synthesis layer over your personal knowledge base.
We chose Claude over GPT for a straightforward reason: Claude produces better long-form synthesis. When the task is "read 20 of this person's memories and write a thoughtful reflection about recurring themes," Claude consistently generates output that feels insightful rather than generic. The difference is especially noticeable in non-English content, where many of our users write in Chinese.
The more interesting decision was skipping the official Anthropic SDK entirely. The anthropic-sdk-swift package exists, and it's well-maintained. But for a production iOS app, we had specific concerns:
- Binary size. Every dependency adds weight. On iOS, users notice download size. Our entire app is under 15 MB — adding a full SDK with its transitive dependencies would be visible.
- URLSession already streams. The main reason you'd want an SDK is to handle Server-Sent Events (SSE) streaming. But since iOS 15, `URLSession.bytes(for:)` gives you an `AsyncBytes` sequence that's perfect for SSE parsing. The hard part is already solved by the platform.
- Surface area. We use exactly one Anthropic endpoint: `/v1/messages`. We don't need function calling, tool use, batching, or any of the other capabilities the SDK abstracts. A minimal client is simpler to audit, debug, and maintain.
The result: our entire Claude integration is about 80 lines of Swift. No SPM dependency, no version conflicts, no wondering what the SDK is doing under the hood.
The minimal URLSession client
Here's the core of our Claude client. It's a simple struct that wraps the Messages API with proper error handling and type-safe request/response models:
```swift
import Foundation

struct ClaudeClient {
    let apiKey: String

    private let endpoint = URL(string: "https://api.anthropic.com/v1/messages")!
    private let apiVersion = "2023-06-01"
    private let encoder = JSONEncoder()
    private let decoder = JSONDecoder()

    struct Message: Codable {
        let role: String
        let content: String
    }

    struct Request: Codable {
        let model: String
        let max_tokens: Int
        let system: String?
        let messages: [Message]
        let temperature: Double?
        let stream: Bool?
    }

    struct Response: Codable {
        struct Content: Codable { let text: String }
        struct Usage: Codable {
            let input_tokens: Int
            let output_tokens: Int
        }
        let content: [Content]
        let usage: Usage
    }

    func sendMessage(
        model: String,
        system: String? = nil,
        messages: [Message],
        maxTokens: Int = 1024,
        temperature: Double? = nil
    ) async throws -> Response {
        var request = URLRequest(url: endpoint)
        request.httpMethod = "POST"
        request.setValue(apiKey, forHTTPHeaderField: "x-api-key")
        request.setValue(apiVersion, forHTTPHeaderField: "anthropic-version")
        request.setValue("application/json", forHTTPHeaderField: "content-type")

        let body = Request(
            model: model,
            max_tokens: maxTokens,
            system: system,
            messages: messages,
            temperature: temperature,
            stream: false
        )
        request.httpBody = try encoder.encode(body)

        let (data, response) = try await URLSession.shared.data(for: request)
        guard let http = response as? HTTPURLResponse else {
            throw ClaudeError.invalidResponse
        }
        guard http.statusCode == 200 else {
            throw ClaudeError.apiError(
                status: http.statusCode,
                body: String(data: data, encoding: .utf8) ?? ""
            )
        }
        return try decoder.decode(Response.self, from: data)
    }
}

enum ClaudeError: Error {
    case invalidResponse
    case apiError(status: Int, body: String)
}
```
That's the entire non-streaming client. Three headers (`x-api-key`, `anthropic-version`, `content-type`), a JSON body, and standard `Codable` encoding. `JSONEncoder` and `JSONDecoder` can handle the snake_case mapping automatically if you set a `keyDecodingStrategy`, but since our models already match the API's naming, we don't bother.
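If you'd rather keep Swift-style property names, the decoder can do the snake_case mapping for you. A small sketch (this `Usage` mirrors the API's `usage` object):

```swift
import Foundation

// Swift-style names; .convertFromSnakeCase maps input_tokens -> inputTokens.
struct Usage: Codable {
    let inputTokens: Int
    let outputTokens: Int
}

let json = #"{"input_tokens": 300, "output_tokens": 200}"#.data(using: .utf8)!
let decoder = JSONDecoder()
decoder.keyDecodingStrategy = .convertFromSnakeCase
let usage = try! decoder.decode(Usage.self, from: json)
print(usage.inputTokens, usage.outputTokens) // prints "300 200"
```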
A few things worth noting:
- The API key goes in a header, not a query parameter. This matters because query parameters show up in server logs. Anthropic uses `x-api-key` as the header name — not `Authorization: Bearer`, which trips up developers coming from OpenAI.
- The `anthropic-version` header is required. Pin it to a specific date. This protects you from breaking changes — the API behavior is frozen for that version string.
- Error handling is critical. The API returns structured JSON errors with a `type` and `message` field. In production, we parse these and handle rate limits (429) with exponential backoff, and overload errors (529) with a user-facing "busy" message.
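The retry logic itself isn't shown in this post. As a minimal sketch of what exponential backoff for 429/529 can look like (all names and parameters here are illustrative, including a standalone copy of `ClaudeError`):

```swift
import Foundation

// Standalone copy of the client's error type, for a self-contained sketch.
enum ClaudeError: Error {
    case apiError(status: Int, body: String)
}

// Illustrative backoff: base * 2^attempt, capped, with optional jitter.
func backoffDelay(attempt: Int, base: Double = 0.5, cap: Double = 30, jitter: Double = 0) -> Double {
    min(cap, base * pow(2, Double(attempt))) + jitter
}

// Retry loop sketch: retry only on 429/529; other errors surface immediately.
func withRetries<T>(maxAttempts: Int = 4, _ op: () async throws -> T) async throws -> T {
    var attempt = 0
    while true {
        do {
            return try await op()
        } catch ClaudeError.apiError(let status, _)
            where (status == 429 || status == 529) && attempt < maxAttempts - 1 {
            try await Task.sleep(nanoseconds: UInt64(backoffDelay(attempt: attempt) * 1_000_000_000))
            attempt += 1
        }
    }
}
```

A caller would wrap the API call as `try await withRetries { try await client.sendMessage(...) }`.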
Model routing: Haiku vs Sonnet
Not every AI feature needs the most powerful model. This sounds obvious, but it's tempting to just use the best model for everything and optimize later. We started with model routing from day one, and it saved us real money.
Second Brain has two AI features that use the Claude API:
- Explore — the user asks a question about their memories, and the app retrieves relevant entries and generates a concise answer. This runs on Claude Haiku (`claude-haiku-4-5-20251001`). It needs to be fast (under 2 seconds) and cheap (users might run 10+ explores per day).
- Reflections — the app periodically synthesizes the user's recent memories into a thoughtful reflection about patterns, growth, and connections. This runs on Claude Sonnet (`claude-sonnet-4-5-20250514`). Speed is less critical (it runs in the background), and the quality difference justifies the cost.
The router is a simple enum:
```swift
enum AIFeature {
    case explore
    case reflection

    var model: String {
        switch self {
        case .explore: return "claude-haiku-4-5-20251001"
        case .reflection: return "claude-sonnet-4-5-20250514"
        }
    }

    var maxTokens: Int {
        switch self {
        case .explore: return 512
        case .reflection: return 2048
        }
    }

    var temperature: Double {
        switch self {
        case .explore: return 0.3    // factual retrieval
        case .reflection: return 0.7 // creative synthesis
        }
    }
}
```
The cost difference is dramatic. At current Anthropic pricing, Haiku is roughly 3x cheaper per token than Sonnet. For a typical explore query (300 input tokens, 200 output tokens), the cost on Haiku is about $0.001. The same query on Sonnet (300 input, 500 output tokens) would cost about $0.008. That difference is negligible for one request — but across thousands of daily active users, it's the difference between a sustainable business and a cash bonfire.
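The quoted figures can be reproduced from the per-million-token rates used later in the cost tracker (Haiku at $1 in / $5 out, Sonnet at $3 in / $15 out, at the time of writing):

```swift
// Cost per query = input * inputRate + output * outputRate (rates per million tokens).
func cost(input: Int, output: Int, inRate: Double, outRate: Double) -> Double {
    Double(input) * inRate / 1_000_000 + Double(output) * outRate / 1_000_000
}

let haiku = cost(input: 300, output: 200, inRate: 1.0, outRate: 5.0)   // ≈ $0.0013
let sonnet = cost(input: 300, output: 500, inRate: 3.0, outRate: 15.0) // ≈ $0.0084
```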
On-device pre-routing with Apple Foundation Models
Starting with iOS 26, Apple Foundation Models give you on-device language model inference at zero API cost. We use this for the first routing decision: before an explore query ever hits the Claude API, on-device inference classifies the query and selects the relevant memories.
This means the Claude API call receives a pre-filtered, pre-ranked context window — fewer tokens in, better responses out, lower cost. The on-device model handles the grunt work; Claude handles the synthesis.
Design principle: Use the cheapest model that meets the quality bar for each feature. On-device inference for classification, Haiku for retrieval, Sonnet for synthesis. Each layer handles what it's best at.
Streaming on iOS
Without streaming, a Claude API call looks like this to the user: they tap "Explore," stare at a spinner for 3-8 seconds, then a wall of text appears all at once. With streaming, tokens appear in real-time as the model generates them. The perceived latency drops from seconds to milliseconds — the first token typically arrives in under 500ms.
URLSession makes this straightforward with `bytes(for:)`, which returns an `AsyncBytes` sequence. The Anthropic API uses Server-Sent Events (SSE), where each event is a line prefixed with `data: ` followed by a JSON payload. Here's our streaming parser:
```swift
// Lives in the same file as ClaudeClient, so its private members are visible here.
extension ClaudeClient {
    func streamMessage(
        model: String,
        system: String? = nil,
        messages: [ClaudeClient.Message],
        maxTokens: Int = 1024,
        temperature: Double? = nil
    ) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            Task {
                do {
                    var request = URLRequest(url: endpoint)
                    request.httpMethod = "POST"
                    request.setValue(apiKey, forHTTPHeaderField: "x-api-key")
                    request.setValue(apiVersion, forHTTPHeaderField: "anthropic-version")
                    request.setValue("application/json", forHTTPHeaderField: "content-type")

                    let body = ClaudeClient.Request(
                        model: model,
                        max_tokens: maxTokens,
                        system: system,
                        messages: messages,
                        temperature: temperature,
                        stream: true
                    )
                    request.httpBody = try encoder.encode(body)

                    let (bytes, response) = try await URLSession.shared.bytes(for: request)
                    guard let http = response as? HTTPURLResponse else {
                        throw ClaudeError.invalidResponse
                    }
                    guard http.statusCode == 200 else {
                        throw ClaudeError.apiError(status: http.statusCode, body: "Stream failed")
                    }

                    for try await line in bytes.lines {
                        guard line.hasPrefix("data: ") else { continue }
                        let json = String(line.dropFirst(6))
                        if let data = json.data(using: .utf8),
                           let event = try? decoder.decode(StreamEvent.self, from: data) {
                            switch event.type {
                            case "content_block_delta":
                                if let text = event.delta?.text {
                                    continuation.yield(text)
                                }
                            case "message_stop":
                                break // the stream ends after this event
                            case "error":
                                // In-stream errors carry no HTTP status, hence 0.
                                throw ClaudeError.apiError(
                                    status: 0,
                                    body: event.error?.message ?? "Unknown stream error"
                                )
                            default:
                                continue
                            }
                        }
                    }
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }
}

struct StreamEvent: Codable {
    let type: String
    let delta: Delta?
    let error: StreamError?

    struct Delta: Codable { let text: String? }
    struct StreamError: Codable { let message: String }
}
```
The key insight: `bytes.lines` gives you an `AsyncSequence` of lines, so you naturally iterate through SSE events with a `for try await` loop. No manual buffer management, no callback hell.
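Because the parsing is pure string-and-JSON work, you can sanity-check it offline. A toy harness, assuming event payloads shaped like the `StreamEvent` model above:

```swift
import Foundation

struct StreamEvent: Codable {
    let type: String
    let delta: Delta?
    struct Delta: Codable { let text: String? }
}

// Extract text deltas from raw SSE lines, mirroring the streaming loop body.
func textDeltas(from lines: [String]) -> [String] {
    let decoder = JSONDecoder()
    return lines.compactMap { line in
        guard line.hasPrefix("data: "),
              let data = String(line.dropFirst(6)).data(using: .utf8),
              let event = try? decoder.decode(StreamEvent.self, from: data),
              event.type == "content_block_delta"
        else { return nil }
        return event.delta?.text
    }
}

let sample = [
    "event: content_block_delta",
    #"data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hel"}}"#,
    #"data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"lo"}}"#,
    #"data: {"type":"message_stop"}"#,
]
print(textDeltas(from: sample).joined()) // prints "Hello"
```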
Connecting streaming to SwiftUI
On the UI side, SwiftUI's .task modifier is ideal for streaming. It launches an async task tied to the view's lifecycle — when the view disappears, the task is automatically cancelled, which tears down the URLSession stream. No manual cancellation bookkeeping required:
```swift
import SwiftUI

struct ExploreView: View {
    @State private var responseText = ""
    @State private var isStreaming = false
    let query: String
    let claude: ClaudeClient // injected by the parent view

    var body: some View {
        ScrollView {
            Text(responseText)
                .font(.body)
                .padding()
        }
        .task {
            isStreaming = true
            do {
                let stream = claude.streamMessage(
                    model: AIFeature.explore.model,
                    system: ExplorePrompt.system,
                    messages: [.init(role: "user", content: query)],
                    maxTokens: AIFeature.explore.maxTokens,
                    temperature: AIFeature.explore.temperature
                )
                for try await chunk in stream {
                    responseText += chunk
                }
            } catch {
                responseText += "\n\n[Error: \(error.localizedDescription)]"
            }
            isStreaming = false
        }
    }
}
```
Gotcha: If you use `.task(id:)` with a changing identifier, SwiftUI cancels the previous task and starts a new one each time the ID changes. This is usually what you want for explore queries, but make sure to reset `responseText` at the start of each new task — otherwise the old response text bleeds into the new one.
Prompt design for mobile
Mobile prompts are different from desktop or web prompts in three fundamental ways: the input is shorter, the output must be shorter, and latency tolerance is lower. Users are capturing quick thoughts on their phone, not writing essays. They're reading on a 6-inch screen, not a 27-inch monitor. And they expect near-instant responses because every other part of a native iOS app is instant.
System prompts: short and specific
Second Brain's explore system prompt is about 200 tokens. It tells the model what it is (a memory retrieval assistant), what tone to use (concise, warm, direct), and what to avoid (generic platitudes, therapy-speak, bullet-point lists). We found through testing that every additional instruction in the system prompt slightly increases latency and slightly decreases response quality — the model has more constraints to satisfy and gets less creative as a result.
The reflection system prompt is longer (~400 tokens) because the task is more complex. It specifies the structure of a good reflection: start with a thematic observation, reference specific memories as evidence, draw a connection the user might not have noticed, and end with a forward-looking thought. This additional guidance is worth the extra latency for a background task.
Context injection: less is more
When a user asks an explore question, we don't send their entire memory archive to Claude. We use on-device embedding similarity to select the 20 most relevant memories and pass only those as context. This keeps the input token count low (typically 500-800 tokens of context) and focuses the model on what matters.
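The selection code isn't shown in this post, but the idea is standard cosine-similarity ranking over embedding vectors. A minimal sketch with plain arrays standing in for real on-device embeddings:

```swift
import Foundation

// Cosine similarity between two equal-length embedding vectors.
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let na = sqrt(a.reduce(0) { $0 + $1 * $1 })
    let nb = sqrt(b.reduce(0) { $0 + $1 * $1 })
    return dot / (na * nb)
}

// Indices of the k candidate vectors most similar to the query embedding.
func topK(query: [Double], candidates: [[Double]], k: Int) -> [Int] {
    candidates.enumerated()
        .sorted { cosine(query, $0.element) > cosine(query, $1.element) }
        .prefix(k)
        .map { $0.offset }
}
```

In the app, the indices returned by a function like this would pick which `Memory` rows get injected as context.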
The context is injected as a structured block in the user message, not as separate system prompt entries:
```swift
func buildExplorePrompt(query: String, memories: [Memory]) -> String {
    let context = memories
        .prefix(20)
        .enumerated()
        .map { "[\($0.offset + 1)] \($0.element.content) (\($0.element.createdAt.formatted(.dateTime.month().day())))" }
        .joined(separator: "\n")

    return """
    Here are relevant memories:
    \(context)
    Question: \(query)
    """
}
```
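For illustration, here is the same builder exercised against a hypothetical `Memory` shape; the date formatting is simplified to a plain string so the sketch stays self-contained:

```swift
// Hypothetical stand-in for the app's Memory model; the real one holds a Date.
struct Memory {
    let content: String
    let dateLabel: String // e.g. "Jun 3"
}

func buildExplorePrompt(query: String, memories: [Memory]) -> String {
    let context = memories
        .prefix(20)
        .enumerated()
        .map { "[\($0.offset + 1)] \($0.element.content) (\($0.element.dateLabel))" }
        .joined(separator: "\n")
    return """
    Here are relevant memories:
    \(context)
    Question: \(query)
    """
}

let prompt = buildExplorePrompt(
    query: "What am I learning?",
    memories: [Memory(content: "Started learning Swift concurrency", dateLabel: "Jun 3")]
)
print(prompt.contains("[1] Started learning Swift concurrency (Jun 3)")) // prints "true"
```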
Response length control
For explore, we explicitly instruct the model: "Respond in 2-4 sentences." Without this constraint, Claude tends to produce thorough, well-structured responses that are excellent on desktop and overwhelming on mobile. The `max_tokens` parameter is a hard ceiling, but the instruction in the prompt is a soft guide that produces more natural-sounding truncation.
For reflections, we allow up to 2048 tokens and instruct the model to write "a short paragraph, roughly 150-200 words." Users read reflections at their leisure, so length is less of a concern — but we still cap it to keep the tone contemplative rather than exhaustive.
Per-user cost tracking
When every API call costs money, you need to understand your per-user economics. Not in aggregate — per user. The difference between your cheapest and most expensive user can be 100x, and your pricing model needs to account for that.
Every Claude API response includes a `usage` field with `input_tokens` and `output_tokens`. We capture these after every call and store daily aggregates in SwiftData:
```swift
import Foundation
import SwiftData

@Model
final class APIUsageRecord {
    var date: Date = Date()
    var feature: String = ""   // "explore" or "reflection"
    var model: String = ""     // e.g. "claude-haiku-4-5-20251001"
    var inputTokens: Int = 0
    var outputTokens: Int = 0
    var estimatedCostUSD: Double = 0.0

    static func record(
        feature: AIFeature,
        usage: ClaudeClient.Response.Usage
    ) -> APIUsageRecord {
        let record = APIUsageRecord()
        record.feature = String(describing: feature)
        record.model = feature.model
        record.inputTokens = usage.input_tokens
        record.outputTokens = usage.output_tokens
        record.estimatedCostUSD = Self.estimateCost(
            model: feature.model,
            input: usage.input_tokens,
            output: usage.output_tokens
        )
        return record
    }

    private static func estimateCost(
        model: String, input: Int, output: Int
    ) -> Double {
        // Pricing as of 2026 — update when rates change
        let (inputRate, outputRate): (Double, Double) = switch model {
        case let m where m.contains("haiku"):
            (1.00 / 1_000_000, 5.00 / 1_000_000)
        case let m where m.contains("sonnet"):
            (3.00 / 1_000_000, 15.00 / 1_000_000)
        default:
            (3.00 / 1_000_000, 15.00 / 1_000_000)
        }
        return Double(input) * inputRate + Double(output) * outputRate
    }
}
```
All of this runs locally on the device. No server, no analytics service, no third-party SDK. The data stays in SwiftData and syncs via CloudKit if the user has iCloud enabled (as described in the previous post).
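Aggregating those records for a business question is then a local computation. A sketch using plain structs in place of the SwiftData model (the real app would fetch `APIUsageRecord` rows with a `FetchDescriptor`):

```swift
import Foundation

// Plain stand-in for the SwiftData model above.
struct UsageRow {
    let date: Date
    let feature: String
    let estimatedCostUSD: Double
}

// Total estimated spend over the trailing `days` days.
func trailingCost(_ rows: [UsageRow], days: Int, now: Date = .now) -> Double {
    let cutoff = now.addingTimeInterval(-Double(days) * 86_400)
    return rows.filter { $0.date >= cutoff }.reduce(0) { $0 + $1.estimatedCostUSD }
}
```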
Economics in practice
Here's what the numbers look like in production:
- A typical free-tier user runs 2-3 explores per day, all on Haiku. Average daily cost: ~$0.01.
- A heavy Pro user runs 10+ explores per day and gets a daily reflection on Sonnet. Average daily cost: ~$0.15.
- Pro pricing is $4.99/month (about $0.17/day). Even the heaviest Pro users stay profitable, and the margin on a typical Pro user is healthy.
The free tier limits (10 captures/day, 3 explores/day) were set by working backward from acceptable per-user costs. We don't show a usage dashboard to users — there's no reason to make them think about token counts. The limits are enforced silently, and when a user hits the daily cap, they see a gentle nudge toward Pro.
Key insight: Per-user cost tracking doesn't require a server. Store the usage data locally, use it for limit enforcement and business decisions, and keep your architecture simple. You can always build a server-side analytics pipeline later if you need aggregate data — but for an indie app, local tracking is enough.
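For completeness, here is a sketch of what silent cap enforcement can look like as a local count over today's usage records (names here are hypothetical):

```swift
import Foundation

// Minimal stand-in for a persisted usage record.
struct FeatureUse {
    let date: Date
    let feature: String
}

// How many uses of a feature remain today, given the free-tier cap.
func remainingQuota(rows: [FeatureUse], feature: String, cap: Int,
                    calendar: Calendar = .current, now: Date = .now) -> Int {
    let used = rows.filter {
        $0.feature == feature && calendar.isDate($0.date, inSameDayAs: now)
    }.count
    return max(0, cap - used)
}
```

When `remainingQuota` hits zero, the app would show the upgrade nudge instead of running the query.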
Wrapping up
The full Claude integration in Second Brain is roughly 200 lines of Swift: 80 for the client, 40 for the streaming parser, 30 for the model router, and 50 for usage tracking. No SDK, no server, no complexity that doesn't earn its keep.
The key decisions that make this work:
- Pure URLSession over the SDK. Less dependency surface, smaller binary, and you understand every line. The `AsyncBytes` API makes SSE streaming trivial.
- Route by feature, not by whim. Haiku for fast/cheap tasks, Sonnet for quality-critical tasks. The 3x cost difference makes this a business decision, not just a technical one.
- On-device pre-processing. Use Apple Foundation Models for classification and embedding similarity before touching the API. Every token you don't send is money you don't spend.
- Track costs locally. SwiftData stores per-call token counts. No server needed. Use the data to set sustainable free-tier limits.
This approach pairs well with the local-first architecture we covered in the previous post. The AI layer is stateless — it reads from the local SwiftData store, makes an API call, and writes the result back to SwiftData. If the network is down, the app still works; the AI features just aren't available until connectivity returns.
Next in this series
Up next: how we ship iOS apps without Firebase, Mixpanel, or Sentry — relying on Apple's built-in diagnostics, Xcode Organizer, and custom lightweight logging to get the insights we need without any third-party SDKs.