Retrieval Pipeline

Rosetta retrieval has two stages: ingest and search. Before an agent generates a response, Rosetta pulls supporting passages from your library and the literature, then sends those passages to the agent along with the question.

Ingest

1. Add A Source

Sources come from uploads, pasted text, account sync, or PubMed search.

2. Normalize The Payload

Rosetta normalizes the source into a RagSource record.
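
As a rough illustration of normalization, the sketch below maps the four intake paths onto one record shape. The field names (id, origin, title, text), the RawPayload type, and the normalizeSource helper are assumptions made for this example; the actual RagSource schema is not described in this section.

```typescript
// Hypothetical shapes: the real RagSource fields are not documented here.
type SourceOrigin = "upload" | "paste" | "account-sync" | "pubmed";

interface RagSource {
  id: string;
  origin: SourceOrigin;
  title: string;
  text: string;
}

// Assumed raw payload: each intake path may or may not supply a name or text.
interface RawPayload {
  origin: SourceOrigin;
  name?: string;
  content?: string;
}

let nextId = 0;

function normalizeSource(raw: RawPayload): RagSource {
  nextId += 1;
  // Collapse runs of whitespace so every intake path yields comparable text.
  const text = (raw.content ?? "").replace(/\s+/g, " ").trim();
  // Fall back to the start of the text, then to a placeholder title.
  const title = raw.name?.trim() || text.slice(0, 60) || "Untitled source";
  return { id: `src-${nextId}`, origin: raw.origin, title, text };
}
```

Whatever the real schema looks like, the point of this step is that later stages only ever see one record type, regardless of how the source arrived.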

3. Extract Text When Needed

PDF uploads are parsed in the browser with pdfjs-dist. If Rosetta cannot extract usable text, the source is still stored, but it is flagged so that it is excluded from normal text ingest.
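
The fallback can be sketched as follows. This example starts from per-page strings that have already been extracted (with pdfjs-dist these would typically be assembled from each page's text content) and only shows the store-and-flag decision. The textStatus field, the helper names, and the minimum-length heuristic are assumptions, not Rosetta's actual rule.

```typescript
// Hypothetical record fields; the real flag name is not given in this section.
interface StoredSource {
  name: string;
  text: string;
  textStatus: "ready" | "no-text";
}

// Assumed heuristic: text counts as usable if it has at least this many
// non-whitespace characters.
const MIN_USABLE_CHARS = 20;

function storeExtractedPdf(name: string, pageTexts: string[]): StoredSource {
  const text = pageTexts.join("\n").trim();
  const usableChars = text.replace(/\s/g, "").length;
  const textStatus = usableChars >= MIN_USABLE_CHARS ? "ready" : "no-text";
  // The source is stored either way; only the flag differs.
  return { name, text, textStatus };
}

// Normal text ingest only sees sources whose text was usable.
function eligibleForTextIngest(sources: StoredSource[]): StoredSource[] {
  return sources.filter((s) => s.textStatus === "ready");
}
```

A scanned PDF with no text layer would be kept in the library under this sketch, but filtered out before embedding.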

4. Preserve Scope

Each source is treated as either local or account.

Account sources can be merged into the current folder when the user is signed in.
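
A minimal sketch of scope-preserving merge, assuming each record carries a scope field and an id. The mergeIntoFolder helper and the de-duplication by id are illustrative assumptions.

```typescript
type Scope = "local" | "account";

interface ScopedSource {
  id: string;
  scope: Scope;
  title: string;
}

// Hypothetical helper: returns the sources visible in the current folder.
function mergeIntoFolder(
  folderSources: ScopedSource[],
  accountSources: ScopedSource[],
  signedIn: boolean,
): ScopedSource[] {
  // Signed-out users only see what is already in the folder.
  if (!signedIn) return [...folderSources];
  // Assumed de-duplication by id, in case a source is already present.
  const seen = new Set(folderSources.map((s) => s.id));
  const merged = [...folderSources];
  for (const s of accountSources) {
    if (!seen.has(s.id)) {
      seen.add(s.id);
      merged.push(s); // scope stays "account" after the merge
    }
  }
  return merged;
}
```

Keeping the scope on the record, rather than inferring it from where the record currently lives, is what lets later filters distinguish local from account sources after a merge.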

Search

Search runs in this order:

1. collect eligible text
2. embed sources
3. embed the query
4. score matches
5. return top sources
6. build the cited prompt
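
The first five steps can be sketched as a single function. The embedder is injected so the flow can be followed without a network call; toyEmbed is a keyword-count stand-in for a real embedding model, and every name here is hypothetical. A real embedder would be asynchronous, and prompt assembly is left out of this sketch.

```typescript
interface TextSource {
  id: string;
  text: string;
}

interface ScoredSource {
  source: TextSource;
  score: number;
}

type Embed = (text: string) => number[];

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction, so treat it as no match.
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Toy stand-in for an embedding model: counts of a few fixed keywords.
const VOCAB = ["aspirin", "stroke", "statin", "cholesterol"];
const toyEmbed: Embed = (text) => {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map((v) => words.filter((w) => w === v).length);
};

function search(sources: TextSource[], query: string, embed: Embed, topK: number): ScoredSource[] {
  // 1. collect eligible text
  const eligible = sources.filter((s) => s.text.trim().length > 0);
  // 2. embed sources
  const embedded = eligible.map((s) => ({ source: s, vector: embed(s.text) }));
  // 3. embed the query
  const queryVector = embed(query);
  // 4. score matches
  const scored = embedded.map((e) => ({ source: e.source, score: cosine(e.vector, queryVector) }));
  // 5. return top sources
  scored.sort((x, y) => y.score - x.score);
  return scored.slice(0, topK);
}
```

For example, search(sources, "aspirin stroke prevention", toyEmbed, 2) ranks a source about aspirin and stroke above one about statins, and skips any source with empty text.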

Retrieval Defaults

By default, the semantic search flow:

chunks source documents into passages
embeds text content with text-embedding-004
uses cosine similarity for matching
returns a small result set
builds the prompt with [Source N] labels
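
Two of these defaults, chunking and the labeled prompt, are easy to show in isolation. The sketch below assumes fixed-size word windows for chunking and a simple instruction line in the prompt; the actual chunk size, chunk boundaries, and prompt wording are not specified in this section, and the function names are hypothetical.

```typescript
// Assumed chunking strategy: consecutive windows of at most maxWords words.
function chunkText(text: string, maxWords: number): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

interface Passage {
  title: string;
  text: string;
}

// Labels each retrieved passage as [Source N], numbered from 1 in rank order.
function buildCitedPrompt(question: string, passages: Passage[]): string {
  const context = passages
    .map((p, i) => `[Source ${i + 1}] ${p.title}\n${p.text}`)
    .join("\n\n");
  // The instruction wording below is illustrative only.
  return `Answer using only the sources below and cite them as [Source N].\n\n${context}\n\nQuestion: ${question}`;
}
```

Numbering the labels in rank order means a citation like [Source 1] in the response can be traced back to the highest-scoring passage.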

Retrieval Controls

Source filters: limit retrieval to a specific library, such as local, account, or public sources.
Date windows: limit retrieval to a time range, for example guidelines from the last five years.
Relevance thresholds: drop chunks below a minimum similarity score so weak matches do not dilute the context.
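
The three controls compose naturally as one filter over scored chunks. In this sketch the ScoredChunk and RetrievalControls shapes, including the publishedYear field, are assumptions; an unset control simply does not restrict the results.

```typescript
type Library = "local" | "account" | "public";

// Hypothetical chunk shape after scoring.
interface ScoredChunk {
  library: Library;
  publishedYear: number;
  score: number; // similarity to the query
  text: string;
}

// Hypothetical controls object; every field is optional.
interface RetrievalControls {
  libraries?: Library[];
  minYear?: number;
  maxYear?: number;
  minScore?: number;
}

function applyControls(chunks: ScoredChunk[], c: RetrievalControls): ScoredChunk[] {
  return chunks.filter(
    (ch) =>
      // Source filter
      (c.libraries === undefined || c.libraries.indexOf(ch.library) !== -1) &&
      // Date window
      (c.minYear === undefined || ch.publishedYear >= c.minYear) &&
      (c.maxYear === undefined || ch.publishedYear <= c.maxYear) &&
      // Relevance threshold
      (c.minScore === undefined || ch.score >= c.minScore),
  );
}
```

For instance, applyControls(chunks, { libraries: ["public"], minYear: 2020, minScore: 0.5 }) would keep only public chunks dated 2020 or later that score at least 0.5.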

PubMed Path

PubMed follows a parallel path, in which Rosetta:

searches PubMed
loads summaries and abstracts
converts articles into RAG sources
lets the user add the selected articles to the active context
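
The last two steps can be sketched as a conversion plus a selection. The PubMedArticle and RagSource shapes and both helper names are assumptions for illustration, and the search and loading steps are omitted because they depend on network calls.

```typescript
// Hypothetical article shape once summaries and abstracts are loaded.
interface PubMedArticle {
  pmid: string;
  title: string;
  journal: string;
  year: number;
  abstract: string;
}

// Hypothetical record shape; the real RagSource fields are not shown here.
interface RagSource {
  id: string;
  title: string;
  text: string;
  url: string;
}

function articleToSource(a: PubMedArticle): RagSource {
  return {
    id: `pubmed-${a.pmid}`,
    title: a.title,
    // Title, journal, and year are kept alongside the abstract as retrievable text.
    text: `${a.title}\n${a.journal} (${a.year})\n\n${a.abstract}`,
    url: `https://pubmed.ncbi.nlm.nih.gov/${a.pmid}/`,
  };
}

// Adds only the articles the user selected, skipping ones already in context.
function addSelectedToContext(
  context: RagSource[],
  articles: PubMedArticle[],
  selectedPmids: string[],
): RagSource[] {
  const existing = new Set(context.map((s) => s.id));
  const additions = articles
    .filter((a) => selectedPmids.indexOf(a.pmid) !== -1)
    .map(articleToSource)
    .filter((s) => !existing.has(s.id));
  return [...context, ...additions];
}
```

Once converted, a PubMed article is indistinguishable from any other source as far as search is concerned, which is what makes the path parallel rather than separate.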