Advanced Chat Patterns
The Foundation Models framework is not meant to be used as a chatbot because of the model’s limited context window and training. But if you do want an assistant in your app with tools that can access your app’s data, you will want to provide the best experience you can: keep the context window from exploding and gracefully handle the errors that inevitably crop up.
This chapter mainly covers managing conversation memory and handling context limits, with examples.
Prerequisites and Context
This chapter builds on all previous Foundation Models concepts - particularly the session patterns from earlier chapters, streaming concepts, and structured generation patterns. You should understand how sessions maintain conversation state, how to handle basic responses, and how streaming works. This chapter also references tool calling patterns, which are covered in detail in the chapter on basic tool use.
What You Will Learn
By the end of this chapter, you will be able to:
- Build UIs directly from transcript entries for natural conversation flow
- Estimate and accurately measure token usage to avoid context window limits
- Implement sliding window context management for indefinite conversations
- Handle conversation persistence and restoration across app sessions
- Build responsive streaming chat interfaces with proper state management
- Create conversation summaries to maintain context while reducing token usage
- Integrate user feedback systems for continuous model improvement
Working with Conversation Memory
The transcript is where Foundation Models stores your entire conversation. You can let the framework handle everything automatically, but you should understand how transcripts work to build anything beyond basic demos.
Understanding the Transcript Structure
Think of a transcript as the conversation’s source of truth. Every interaction gets broken down into structured entries that the model uses to maintain context:
- Instructions: Your system prompt and tool definitions
- Prompts: What users actually type or say
- Responses: What the model generates back
- Tool Calls: When the model decides to use your custom tools
- Tool Output: The results from those tool executions
The transcript is a RandomAccessCollection, which means you can iterate over it, slice it, and access entries by index. It gives you access to the conversation structure that the framework uses internally.
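The three operations mentioned above (iterating, indexing, slicing) work the same way on any RandomAccessCollection. As a minimal sketch, the snippet below uses a plain array of strings as a stand-in for transcript entries so it runs without the framework; the `recentEntries` helper and the sample values are hypothetical and not part of Foundation Models.

```swift
// Stand-in for transcript entries; a real app would work with Transcript.Entry values.
let entries = ["instructions", "prompt 1", "response 1", "prompt 2", "response 2"]

/// Hypothetical helper: returns the last `count` elements of any random-access collection.
func recentEntries<C: RandomAccessCollection>(_ collection: C, count: Int) -> C.SubSequence {
    collection.suffix(count)
}

// Iterate over every entry in order
for (offset, entry) in entries.enumerated() {
    print(offset, entry)
}

// Access an entry by index
let first = entries[entries.startIndex]

// Slice off the most recent entries
let recent = recentEntries(entries, count: 2)
print(first, Array(recent))
```

Because the helper is generic over RandomAccessCollection, the same call shape should apply to a transcript, assuming it conforms as described above.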
Building UI from Transcript Entries
When building chat interfaces, you typically want to display user messages and assistant responses. Since Transcript.Entry is an enum, you can pattern match on the different types to build your UI. Here is how I handle this in my chat apps:
struct TranscriptEntryView: View {
let entry: Transcript.Entry
var body: some View {
switch entry {
case .instructions(let instructions):
SystemMessageView(instructions: instructions)
case .prompt(let prompt):
UserMessageView(prompt: prompt)
case .response(let response):
AssistantMessageView(response: response)
case .toolCalls(let toolCalls):
ToolCallView(calls: toolCalls)
case .toolOutput(let output):
ToolOutputView(output: output)
@unknown default:
EmptyView() // Gracefully handle future types
}
}
}
Each entry type contains different data you can use in your UI:
- Instructions: Contains segments (text or structured content) plus toolDefinitions
- Prompts & Responses: Contain segments for the actual content
- Tool Calls: Include tool names and structured arguments
- Tool Output: Contains execution results in segments
This pattern lets you build rich chat interfaces that show not just the conversation, but also what’s happening behind the scenes with tools.
Managing the Token Budget
The context window is the model’s biggest constraint for chat apps. By default, the context window is 4096 tokens shared between input and output, though starting in iOS 26.4, you can query the actual size dynamically:
let model = SystemLanguageModel.default
let contextSize = try await model.contextSize
The contextSize property is back-deployed to iOS 26.0. On systems before iOS 26.4, it returns 4096. On iOS 26.4 and later, it fetches the real value from the model, which may grow as Apple ships updated models with future OS releases. Use this property instead of hardcoding 4096 throughout your app.
Early on, I learned that you cannot just count characters but need to estimate tokens properly to avoid hitting limits unexpectedly. Here is an example of how I built token counting into my chat system.
extension Transcript.Entry {
var estimatedTokenCount: Int {
switch self {
case .instructions(let instructions):
return instructions.segments.reduce(0) { $0 + $1.estimatedTokenCount }
case .prompt(let prompt):
return prompt.segments.reduce(0) { $0 + $1.estimatedTokenCount }
case .response(let response):
return response.segments.reduce(0) { $0 + $1.estimatedTokenCount }
case .toolCalls(let toolCalls):
// Tool calls are structured, add overhead
return toolCalls.reduce(0) { total, call in
total + estimateTokensAdvanced(call.toolName) +
estimateTokensForStructuredContent(call.arguments) + 5 // Call overhead
}
case .toolOutput(let output):
return output.segments.reduce(0) { $0 + $1.estimatedTokenCount } + 3 // Output overhead
@unknown default:
return 0 // Future entry types count as zero until handled explicitly
}
}
}
You can extend Transcript.Segment to estimate the token count of a single segment:
extension Transcript.Segment {
var estimatedTokenCount: Int {
switch self {
case .text(let textSegment):
return estimateTokensAdvanced(textSegment.content)
case .structure(let structuredSegment):
return estimateTokensForStructuredContent(structuredSegment.content)
@unknown default:
return 0 // Future segment types count as zero until handled explicitly
}
}
}
You can also extend Transcript to estimate the token count of the entire transcript:
extension Transcript {
var estimatedTokenCount: Int {
return self.reduce(0) { $0 + $1.estimatedTokenCount }
}
/// Returns the estimated token count with a larger safety buffer
var safeEstimatedTokenCount: Int {
// Add bigger buffer to account for underestimation
let baseTokens = estimatedTokenCount
let buffer = Int(Double(baseTokens) * 0.25) // 25% buffer
let systemOverhead = 100 // Fixed overhead for system tokens
return baseTokens + buffer + systemOverhead
}
/// Checks if the transcript is approaching the token limit (earlier trigger)
func isApproachingLimit(threshold: Double = 0.70, maxTokens: Int) -> Bool {
let currentTokens = safeEstimatedTokenCount
let limitThreshold = Int(Double(maxTokens) * threshold)
return currentTokens > limitThreshold
}
/// Returns a subset of entries that fit within the token budget
func entriesWithinTokenBudget(_ budget: Int) async -> [Transcript.Entry] {
var result: [Transcript.Entry] = []
// Always include instructions first if they exist
if let instructions = self.first(where: {
if case .instructions(_) = $0 { return true }
return false
}) {
result.append(instructions)
}
// Add other entries from newest to oldest until budget is reached
let nonInstructionEntries = self.filter { entry in
if case .instructions(_) = entry { return false }
return true
}
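let insertionIndex = result.count // Directly after the instructions, if present
for entry in nonInstructionEntries.reversed() {
var candidateEntries = result
// Walking newest to oldest, so insert each older entry ahead of the newer ones
// to keep the window in chronological order
candidateEntries.insert(entry, at: insertionIndex)
let candidateTranscript = Transcript(entries: candidateEntries)
let candidateTokens = await currentTokenCount(for: candidateTranscript)
if candidateTokens > budget { break }
result = candidateEntries
}
return result
}
}
/// Estimates token count using Apple's guidance: 4 characters per token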
func estimateTokensAdvanced(_ text: String) -> Int {
guard !text.isEmpty else { return 0 }
let characterCount = text.count
// Simple: 4 characters per token across all content types
let tokensPerChar = 1.0 / 4.0
return max(1, Int(ceil(Double(characterCount) * tokensPerChar)))
}
/// Estimates token count for structured JSON content
func estimateTokensForStructuredContent(_ content: GeneratedContent) -> Int {
let jsonString = content.jsonString
let characterCount = jsonString.count
// Use same 4 chars per token for JSON
let tokensPerChar = 1.0 / 4.0
return max(1, Int(ceil(Double(characterCount) * tokensPerChar)))
}
These extensions use Apple’s guidance of 3-4 characters per token to estimate usage. I use 4, which is the simpler figure but gives the lower of the two estimates, so it is not the conservative choice on its own. That is why safeEstimatedTokenCount adds a 25% buffer: underestimating tokens is worse than overestimating, as you would rather trigger context management early than hit the hard limit. Cursor performs a similar optimization for their Grok Code model by summarising around 75% of the 256K token context window.
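To make the arithmetic concrete, here is a small worked example of the same heuristic on a plain string. It is a sketch that mirrors the chapter's numbers (4 characters per token, 25% buffer, 100 tokens of fixed overhead, 70% threshold) with standalone helper names that are illustrative only; the 400-character message and the 4096-token limit are assumed inputs.

```swift
import Foundation

/// 4 characters per token, rounded up, with a minimum of 1 for non-empty text.
func estimateTokens(_ text: String) -> Int {
    guard !text.isEmpty else { return 0 }
    return max(1, Int(ceil(Double(text.count) / 4.0)))
}

/// Adds a 25% buffer plus a fixed 100-token overhead to a base estimate.
func safeEstimate(baseTokens: Int) -> Int {
    let buffer = Int(Double(baseTokens) * 0.25)
    return baseTokens + buffer + 100
}

let message = String(repeating: "a", count: 400) // 400 characters
let base = estimateTokens(message)               // 400 / 4 = 100
let safe = safeEstimate(baseTokens: base)        // 100 + 25 + 100 = 225
let limit = Int(Double(4096) * 0.70)             // 70% of 4096, truncated = 2867
print("base: \(base), safe: \(safe), approaching limit: \(safe > limit)")
```

Note how the fixed overhead dominates for short conversations: a 100-token message is reported as 225 tokens, which is deliberate, since the goal is to trigger context management early rather than late.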
The entriesWithinTokenBudget method will be useful for sliding window implementations later discussed in this chapter. It helps you keep the most recent conversation parts within a specific token budget.
Accurate Token Counting with the Token Usage API
The estimation approach above works well enough for iOS 26.0 through 26.3, but it is still guesswork. Starting in iOS 26.4, Apple introduced the TokenUsage API on SystemLanguageModel, which gives you model-accurate token counts instead of approximations. The model itself counts the tokens, so the numbers are exact.
The API provides three overloads, each targeting a different part of your session context:
let model = SystemLanguageModel.default
let instructionTokens = try await model.tokenUsage(
for: Instructions("You are a helpful fitness coach."),
tools: [SearchTool(), HealthDataTool()]
)
print("Instructions + tools: \(instructionTokens.tokenCount) tokens")
This first overload measures how many tokens your instructions and tool definitions consume. Tool definitions take up context space because the model needs their names, descriptions, and argument schemas to decide when to call them. Knowing this number upfront lets you budget the remaining context for actual conversation.
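The budgeting step itself is plain subtraction. The sketch below assumes you already have the context size and the instruction token count as integers; the `remainingConversationBudget` helper, the 350-token instruction cost, and the 512-token response reserve are all hypothetical values chosen for illustration.

```swift
/// Hypothetical helper: tokens left for conversation history after fixed costs.
/// Clamps at zero so an oversized instruction block never yields a negative budget.
func remainingConversationBudget(
    contextSize: Int,
    instructionTokens: Int,
    reservedForResponse: Int
) -> Int {
    max(0, contextSize - instructionTokens - reservedForResponse)
}

let budget = remainingConversationBudget(
    contextSize: 4096,        // Assumed context window
    instructionTokens: 350,   // Assumed cost of instructions plus tool definitions
    reservedForResponse: 512  // Assumed headroom for the model's next reply
)
print("Tokens available for history: \(budget)") // 4096 - 350 - 512 = 3234
```

Reserving space for the response matters because the context window is shared between input and output, so a transcript that fits exactly leaves the model no room to answer.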
let promptTokens = try await model.tokenUsage(
for: Prompt("How did my workouts look this week?")
)
print("Prompt: \(promptTokens.tokenCount) tokens")
The second overload accepts anything conforming to PromptRepresentable. Use it to check whether a user’s message will fit before sending it to the session.
let transcriptTokens = try await model.tokenUsage(
for: session.transcript.map { $0 }
)
print("Transcript: \(transcriptTokens.tokenCount) tokens")
The third overload takes a collection of Transcript.Entry values. This is the one you will reach for most often in chat apps, since it tells you exactly how much of the context window the conversation has consumed so far.
Here is a version-aware helper that uses the accurate API on iOS 26.4 and falls back to the estimation heuristic on earlier systems:
func currentTokenCount(for transcript: Transcript) async -> Int {
if #available(iOS 26.4, *) {
let model = SystemLanguageModel.default
if let usage = try? await model.tokenUsage(
for: transcript.map { $0 }
) {
return usage.tokenCount
}
}
return transcript.safeEstimatedTokenCount
}
In my experience, the estimation heuristic tends to undershoot by 10-20% on conversations that mix structured content with plain text. The Token Usage API removes that uncertainty entirely. If your deployment target is iOS 26.4, you can drop the estimation extensions and rely on the API directly. If you need to support earlier versions, keep both paths and prefer the accurate one when available.
Transcript Persistence and Restoration
Foundation Models transcripts are Codable, allowing you to save and restore conversation state:
class ConversationPersistence {
private let documentsDirectory = FileManager.default.urls(
for: .documentDirectory,
in: .userDomainMask
).first!
func saveTranscript(_ transcript: Transcript, withID id: String) throws {
let url = transcriptURL(for: id)
let data = try JSONEncoder().encode(transcript)
try data.write(to: url)
}
func loadTranscript(withID id: String) throws -> Transcript {
let url = transcriptURL(for: id)
let data = try Data(contentsOf: url)
return try JSONDecoder().decode(Transcript.self, from: data)
}
func deleteTranscript(withID id: String) throws {
let url = transcriptURL(for: id)
try FileManager.default.removeItem(at: url)
}
func listSavedTranscripts() throws -> [String] {
let urls = try FileManager.default.contentsOfDirectory(
at: documentsDirectory,
includingPropertiesForKeys: nil
)
return urls
.filter { $0.pathExtension == "transcript" }
.map { $0.deletingPathExtension().lastPathComponent }
}
private func transcriptURL(for id: String) -> URL {
documentsDirectory.appendingPathComponent("\(id).transcript")
}
}
// Usage
class PersistentChatSession: ObservableObject {
@Published var session: LanguageModelSession
private let persistence = ConversationPersistence()
private let sessionID: String
init(sessionID: String) {
self.sessionID = sessionID
// Try to restore existing session
if let transcript = try? persistence.loadTranscript(withID: sessionID) {
self.session = LanguageModelSession(transcript: transcript)
} else {
self.session = LanguageModelSession()
}
}
func saveCurrentState() {
do {
try persistence.saveTranscript(session.transcript, withID: sessionID)
} catch {
print("Failed to save transcript: \(error)")
}
}
deinit {
saveCurrentState()
}
}
Using Session Transcript
Instead of keeping track of the conversation history yourself, you can directly use transcript to get the actual conversation structure that Foundation Models uses internally.
struct TranscriptBasedChatView: View {
@State private var session: LanguageModelSession?
@State private var currentInput = ""
@State private var isProcessing = false
var body: some View {
VStack {
ScrollViewReader { proxy in
ScrollView {
LazyVStack(spacing: 12) {
ForEach(session?.transcript ?? .init()) { entry in
TranscriptEntryView(entry: entry)
.id(entry.id)
}
}
.padding()
}
.onChange(of: session?.transcript.count ?? 0) { _, _ in
if let lastEntry = session?.transcript.last {
withAnimation(.easeOut(duration: 0.3)) {
proxy.scrollTo(lastEntry.id, anchor: .bottom)
}
}
}
}
HStack {
TextField("Type your message...", text: $currentInput)
.textFieldStyle(RoundedBorderTextFieldStyle())
.onSubmit { Task { await sendMessage() } }
Button("Send") {
Task { await sendMessage() }
}
.disabled(currentInput.isEmpty || isProcessing)
}
.padding()
}
.task { await setupSession() }
}
private func setupSession() async {
// Session setup covered in earlier chapters
guard SystemLanguageModel.default.availability == .available else { return }
session = LanguageModelSession(instructions: Instructions("You are a helpful assistant."))
}
private func sendMessage() async {
guard let session = session, !currentInput.isEmpty else { return }
let prompt = currentInput
currentInput = ""
isProcessing = true
do {
// Streaming provides immediate feedback instead of waiting for complete response
let responseStream = session.streamResponse(to: Prompt(prompt))
for try await _ in responseStream {
// Foundation Models handles transcript updates automatically during streaming
}
} catch {
// Production apps should handle context window and guardrail violations gracefully
}
isProcessing = false
}
}
struct TranscriptEntryView: View {
let entry: Transcript.Entry
var body: some View {
switch entry {
case .prompt(let prompt):
if let text = extractText(from: prompt.segments), !text.isEmpty {
ChatBubble(content: text, isFromUser: true)
}
case .response(let response):
if let text = extractText(from: response.segments), !text.isEmpty {
ChatBubble(content: text, isFromUser: false)
}
case .instructions, .toolCalls, .toolOutput:
// Instructions and tool activity are not part of the visible conversation flow
EmptyView()
@unknown default:
EmptyView()
}
}
private func extractText(from segments: [Transcript.Segment]) -> String? {
let text = segments.compactMap { segment in
if case .text(let textSegment) = segment {
return textSegment.content
}
return nil
}.joined(separator: " ")
return text.isEmpty ? nil : text
}
}
Streaming Responses
Streaming lets you show the response as it is being generated instead of waiting for the whole thing. The model is extremely fast to produce the first token of the response, so take advantage of this feature as much as you can.
Foundation Models has an interesting take on streaming. Instead of individual words or characters, you get snapshots: partial but well-formed responses that get more detailed as the model generates content. Each snapshot is a valid structure with more fields populated than the previous one.
This approach is different from other AI frameworks, but it is actually better for UI development. You do not have to accumulate deltas yourself or worry about parsing incomplete responses.
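The difference is easiest to see side by side. This sketch uses two hard-coded string arrays as stand-ins for a delta-style stream and a snapshot-style stream; neither array comes from the framework, and the sample text is made up.

```swift
// Delta-style stream: each element is only the newly generated fragment.
let deltas = ["Hel", "lo, ", "world"]

// Snapshot-style stream: each element is the whole response so far.
let snapshots = ["Hel", "Hello, ", "Hello, world"]

// With deltas, the UI layer has to accumulate the fragments itself.
var deltaText = ""
for delta in deltas {
    deltaText += delta
}

// With snapshots, the UI layer just replaces what it is showing.
var snapshotText = ""
for snapshot in snapshots {
    snapshotText = snapshot
}

print(deltaText == snapshotText) // Both end at the same final text
```

The snapshot loop holds no accumulation state, which is why binding a snapshot straight to a view property tends to be enough for a streaming UI.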
enum StreamingState {
case idle
case streaming(response: String)
case completed(response: String)
case error(message: String)
var currentResponse: String {
switch self {
case .idle: return ""
case .streaming(let response): return response
case .completed(let response): return response
case .error(let message): return "Error: \(message)"
}
}
var isStreaming: Bool {
if case .streaming = self { return true }
return false
}
var isCompleted: Bool {
if case .completed = self { return true }
return false
}
var errorMessage: String? {
if case .error(let message) = self { return message }
return nil
}
}
struct StreamingChatView: View {
@State private var streamingState: StreamingState = .idle
func streamResponse(to prompt: String) async {
streamingState = .streaming(response: "")
do {
let session = LanguageModelSession()
let stream = session.streamResponse(to: prompt)
for try await snapshot in stream {
await MainActor.run {
if case .streaming = streamingState {
streamingState = .streaming(response: snapshot.content)
}
}
}
// Optional: Get final result with metadata
let finalResponse = try await stream.collect()
await MainActor.run {
streamingState = .completed(response: finalResponse.content)
}
} catch {
await MainActor.run {
streamingState = .error(message: error.localizedDescription)
}
}
}
var body: some View {
VStack {
ScrollView {
Text(streamingState.currentResponse)
.textSelection(.enabled)
.padding()
.foregroundColor(streamingState.errorMessage != nil ? .red : .primary)
}
switch streamingState {
case .idle:
EmptyView()
case .streaming:
HStack {
ProgressView()
.scaleEffect(0.8)
Text("Generating response...")
.font(.caption)
.foregroundColor(.secondary)
}
.padding()
case .completed:
HStack {
Image(systemName: "checkmark.circle.fill")
.foregroundColor(.green)
Text("Response complete")
.font(.caption)
.foregroundColor(.secondary)
}
.padding()
case .error:
HStack {
Image(systemName: "exclamationmark.triangle.fill")
.foregroundColor(.red)
Text("Failed to generate response")
.font(.caption)
.foregroundColor(.secondary)
}
.padding()
}
}
}
}
Sliding Window Context Management
The model’s context window is finite. During longer conversations, you will eventually hit this limit and need to manage it. Simply clearing the conversation loses all context, so you need better approaches to maintain conversation flow while staying within the token budget.
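The sliding window approach keeps the most recent conversation entries within a specific token budget and drops the oldest ones by creating a new session from the trimmed transcript. If the session still exceeds the context window, the fallback is to summarize the conversation and start a new session with that summary as context.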
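Before the full service, here is a framework-free sketch of the windowing step on its own. The `Message` type, its role strings, and the 4-characters-per-token estimate are simplifications assumed for illustration; they stand in for transcript entries and real token counts.

```swift
/// Simplified stand-in for a transcript entry (hypothetical type).
struct Message {
    let role: String // "instructions", "user", or "assistant"
    let text: String
    /// Roughly 4 characters per token, rounded up.
    var tokens: Int { max(1, (text.count + 3) / 4) }
}

/// Keeps the instructions plus as many of the newest messages as fit in `budget`.
func slidingWindow(_ messages: [Message], budget: Int) -> [Message] {
    let pinned = messages.filter { $0.role == "instructions" }
    var used = pinned.reduce(0) { $0 + $1.tokens }
    var kept: [Message] = []
    // Walk from newest to oldest and stop at the first message that does not fit.
    for message in messages.reversed() where message.role != "instructions" {
        if used + message.tokens > budget { break }
        used += message.tokens
        kept.append(message)
    }
    // `kept` is newest-first, so reverse it to restore chronological order.
    return pinned + kept.reversed()
}

let history = [
    Message(role: "instructions", text: "You are a helpful assistant."), // 28 chars, 7 tokens
    Message(role: "user", text: String(repeating: "a", count: 40)),      // 10 tokens
    Message(role: "assistant", text: String(repeating: "b", count: 40)), // 10 tokens
    Message(role: "user", text: String(repeating: "c", count: 40)),      // 10 tokens
    Message(role: "assistant", text: String(repeating: "d", count: 40)), // 10 tokens
]
let window = slidingWindow(history, budget: 30)
print(window.map { $0.role }) // Instructions plus the last user/assistant pair
```

With a budget of 30, the instructions take 7 tokens and the two newest messages take 20, so the older pair is dropped. Restoring chronological order at the end matters, because a transcript handed to a new session should read in the order the conversation happened.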
@Observable
final class ChatBotService {
private(set) var session: LanguageModelSession
var isSummarizing: Bool = false
var isApplyingWindow: Bool = false
var sessionCount: Int = 1
// Sliding Window Configuration
private let windowThreshold = 0.75 // Start windowing at 75%
private let targetWindowRatio = 0.50 // Keep 50% of context after windowing
init() {
self.session = LanguageModelSession(
instructions: Instructions("You are a helpful, friendly AI assistant.")
)
}
@MainActor
func sendMessage(_ content: String) async {
do {
if await shouldApplyWindow() {
await applySlidingWindow()
}
let responseStream = session.streamResponse(to: Prompt(content))
for try await _ in responseStream {
// Framework handles transcript synchronization during streaming
}
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
await handleContextWindowExceeded(userMessage: content)
} catch {
await handleGenerationError(error, userMessage: content)
}
}
// MARK: - Sliding Window Implementation
private func shouldApplyWindow() async -> Bool {
let maxTokens = (try? await SystemLanguageModel.default.contextSize) ?? 4096
let currentTokens = await currentTokenCount(for: session.transcript)
let limitThreshold = Int(Double(maxTokens) * windowThreshold)
return currentTokens > limitThreshold
}
@MainActor
private func applySlidingWindow() async {
isApplyingWindow = true
let currentTokens = await currentTokenCount(for: session.transcript)
debugPrint("Applying sliding window - Current tokens: \(currentTokens)")
let maxTokens = (try? await SystemLanguageModel.default.contextSize) ?? 4096
let targetWindowSize = Int(Double(maxTokens) * targetWindowRatio)
let windowEntries = await session.transcript.entriesWithinTokenBudget(targetWindowSize)
let windowedTranscript = Transcript(entries: windowEntries)
session = LanguageModelSession(transcript: windowedTranscript)
sessionCount += 1
let newTokens = await currentTokenCount(for: windowedTranscript)
debugPrint("Sliding window applied - Reduced to: \(newTokens) tokens (\(windowEntries.count) entries)")
isApplyingWindow = false
}
@MainActor
private func handleContextWindowExceeded(userMessage: String) async {
isSummarizing = true
do {
let summary = try await generateConversationSummary()
createNewSessionWithContext(summary: summary)
isSummarizing = false
// Continue conversation with summarized context
try await respondWithNewSession(to: userMessage)
} catch {
// Fallback to manual conversation restart if summarization fails
isSummarizing = false
}
}
private func generateConversationSummary() async throws -> ConversationSummary {
let summarySession = LanguageModelSession(
instructions: Instructions("""
You are an expert at summarizing conversations.
Create thorough summaries that preserve all important context.
""")
)
let conversationText = createConversationText()
let summaryPrompt = """
Please summarize this entire conversation comprehensively.
Include all key points, topics discussed, user preferences,
and important context:
\(conversationText)
"""
let summaryResponse = try await summarySession.respond(
to: Prompt(summaryPrompt),
generating: ConversationSummary.self
)
return summaryResponse.content
}
private func createConversationText() -> String {
return session.transcript.compactMap { entry in
switch entry {
case .prompt(let prompt):
let text = extractTextFromSegments(prompt.segments)
return "User: \(text)"
case .response(let response):
let text = extractTextFromSegments(response.segments)
return "Assistant: \(text)"
case .toolCalls(let toolCalls):
let calls = toolCalls.map { "\($0.toolName)(\($0.arguments.jsonString))"
}.joined(separator: ", ")
return "Tool Calls: \(calls)"
case .toolOutput(let output):
let text = extractTextFromSegments(output.segments)
return "Tool Output: \(text)"
default:
return nil
}
}.joined(separator: "\n\n")
}
private func extractTextFromSegments(_ segments: [Transcript.Segment]) -> String {
return segments.compactMap { segment in
if case .text(let textSegment) = segment {
return textSegment.content
}
return nil
}.joined(separator: " ")
}
private func createNewSessionWithContext(summary: ConversationSummary) {
let contextInstructions = """
You are a helpful, friendly AI assistant. You are continuing a conversation.
Here is a summary of your previous conversation:
CONVERSATION SUMMARY:
\(summary.summary)
KEY TOPICS DISCUSSED:
\(summary.keyTopics.map { "• \($0)" }.joined(separator: "\n"))
USER PREFERENCES/REQUESTS:
\(summary.userPreferences.map { "• \($0)" }.joined(separator: "\n"))
Continue the conversation naturally, referencing this context when relevant.
"""
session = LanguageModelSession(instructions: contextInstructions)
sessionCount += 1
}
}
// Support model for conversation summaries
@Generable
struct ConversationSummary {
@Guide(description: "A complete summary of the entire conversation")
let summary: String
@Guide(description: "The main topics or themes that were discussed")
let keyTopics: [String]
@Guide(description: "Any specific requests or preferences the user mentioned")
let userPreferences: [String]
}
This lets conversations continue indefinitely without users ever seeing “conversation too long” errors!
Learning from Users
Foundation Models provides a built-in feedback system that helps you understand how well your AI responses are performing. The framework handles structuring feedback data so you can focus on collecting the feedback from your users.
The Feedback API
The main method of the feedback system is the logFeedbackAttachment() method, which creates structured feedback that can be submitted to Apple:
class ChatManager {
private var session = LanguageModelSession()
func provideFeedback(sentiment: LanguageModelFeedback.Sentiment, issues: [LanguageModelFeedback.Issue] = []) {
// Generate structured feedback attachment
let feedbackData = session.logFeedbackAttachment(
sentiment: sentiment,
issues: issues,
desiredOutput: nil // Optional: show what the response should have been
)
// The feedbackData contains the full conversation context and feedback
storeFeedbackLocally(feedbackData)
}
private func storeFeedbackLocally(_ data: Data) {
// Save for your own analytics or submit to Apple via Feedback Assistant
// The data includes the full transcript and structured feedback
}
}
Feedback Types
The framework provides some structured feedback categories that help you understand specific issues:
// Simple sentiment feedback
let positiveFeedback = LanguageModelFeedback.Sentiment.positive
let negativeFeedback = LanguageModelFeedback.Sentiment.negative
let neutralFeedback = LanguageModelFeedback.Sentiment.neutral
// Detailed issue reporting for negative feedback
let issues = [
LanguageModelFeedback.Issue(
category: .incorrect,
explanation: "The model will not accept there is iOS 26 after iOS 18"
),
LanguageModelFeedback.Issue(
category: .didNotFollowInstructions,
explanation: "Asked for 10 years of experience in SwiftUI but the model said no"
)
]
// Submit feedback
session.logFeedbackAttachment(
sentiment: .negative,
issues: issues
)
Issue Categories
The framework also provides predefined issue categories that cover common problems:
- .unhelpful - Response does not address the user’s need
- .incorrect - Contains factual errors or misinformation
- .tooVerbose - Unnecessarily long or repetitive
- .didNotFollowInstructions - Ignored specific user constraints
- .stereotypeOrBias - Contains harmful stereotypes or bias
- .suggestiveOrSexual - Inappropriate sexual content
- .vulgarOrOffensive - Offensive language or content
- .triggeredGuardrailUnexpectedly - Safety measures activated inappropriately
The one thing to take away from this is that logFeedbackAttachment() handles all the complexity of packaging your conversation context into a format that Apple can use to improve the models. You focus on when and how to collect the feedback from users, while the framework handles the technical details.
What’s Next
The patterns in this chapter provide a solid foundation (pun intended again, ha) for building chat UI with Foundation Models. Start with the streaming chat view and token management system, then add conversation persistence and error handling! The next chapter explores safety and best practices for implementing responsible AI features with proper guardrails and user protection.