The documentToText function converts a hypermedia document, including all of its embeds, to a plain text representation. It recursively resolves embeds, replacing inline embeds with the referenced document's name and block embeds with the embedded content.
Overview
Location: @shm/shared/document-to-text
Purpose: Generate plain text version of documents with resolved embeds
Use Cases:
Text fragment rendering with inline embeds resolved
Document search indexing
Export to plain text
Content preview generation
API
Function Signature
async function documentToText({
  documentId,
  grpcClient,
  options = {},
}: {
  documentId: UnpackedHypermediaId
  grpcClient: GRPCClient
  options?: DocumentToTextOptions
}): Promise<string>
Options
interface DocumentToTextOptions {
  maxDepth?: number // Maximum embed depth (default: 10)
  resolveInlineEmbeds?: boolean // Replace inline embeds with doc names (default: true)
  lineBreaks?: boolean // Add line breaks between blocks (default: true)
}
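The documented defaults can be applied with an object spread. This is a minimal sketch, assuming caller options are merged over the defaults; DEFAULT_OPTIONS and applyDefaults are hypothetical names used for illustration, not part of the package's API.

```typescript
// Mirrors the DocumentToTextOptions interface documented above.
interface DocumentToTextOptions {
  maxDepth?: number
  resolveInlineEmbeds?: boolean
  lineBreaks?: boolean
}

// Default values taken from the option comments in this document.
const DEFAULT_OPTIONS: Required<DocumentToTextOptions> = {
  maxDepth: 10,
  resolveInlineEmbeds: true,
  lineBreaks: true,
}

// Hypothetical helper: caller-supplied values override the defaults.
function applyDefaults(
  options: DocumentToTextOptions = {},
): Required<DocumentToTextOptions> {
  return {...DEFAULT_OPTIONS, ...options}
}
```

With a spread, a key explicitly set to undefined still overrides its default, so callers should omit keys they do not want to change.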
Features
1. Hierarchical Block Processing
Processes document blocks depth-first:
Paragraphs: Extract text content
Headings: Extract heading text
Code blocks: Include code content
Buttons: Extract button labels from attributes.name
Images/Videos/Files: Include captions
Embeds: Recursively fetch and include content
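The per-type extraction above can be sketched as a depth-first walk over a block tree. The SketchBlock shape, blockText, and blocksToText are hypothetical and heavily simplified; embed handling is left out, and two details are assumptions: a block's own text is emitted before its children, and blocks are joined with a space when lineBreaks is false.

```typescript
// Hypothetical, simplified block shape; real blocks carry more fields.
interface SketchBlock {
  type: 'Paragraph' | 'Heading' | 'Code' | 'Button' | 'Image' | 'Video' | 'File'
  text?: string // body text, code content, or caption
  attributes?: {name?: string} // a button's label lives in attributes.name
  children?: SketchBlock[]
}

// Pick the text a single block contributes.
function blockText(block: SketchBlock): string {
  if (block.type === 'Button') return block.attributes?.name ?? ''
  return block.text ?? ''
}

// Depth-first, pre-order walk: a block's own text precedes its children.
function blocksToText(blocks: SketchBlock[], lineBreaks = true): string {
  const parts: string[] = []
  const visit = (block: SketchBlock): void => {
    const text = blockText(block)
    if (text) parts.push(text)
    for (const child of block.children ?? []) visit(child)
  }
  for (const block of blocks) visit(block)
  // Assumption: blocks are joined with a space when lineBreaks is off.
  return parts.join(lineBreaks ? '\n' : ' ')
}
```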
2. Inline Embed Resolution
Replaces invisible character markers (U+FEFF) with document names:
Detects inline embed annotations
Fetches referenced document
Replaces marker with @DocumentName
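The three steps above can be sketched as a pure string transformation once the document names have been fetched. The annotation shape, the replaceInlineEmbedMarkers helper, and the names lookup are assumptions made for illustration; the real annotation format may differ.

```typescript
const EMBED_MARKER = '\uFEFF' // invisible placeholder character

// Hypothetical annotation shape: marker position plus the linked document.
interface InlineEmbedAnnotation {
  type: 'Embed'
  start: number // index of the marker in the text (UTF-16 code units here)
  link: string
}

// Replace each marker with "@" followed by the referenced document's name.
// `names` maps a link to a document name that was fetched beforehand.
function replaceInlineEmbedMarkers(
  text: string,
  annotations: InlineEmbedAnnotation[],
  names: Map<string, string>,
): string {
  // Work right to left so earlier indices stay valid after each replacement.
  const ordered = [...annotations].sort((a, b) => b.start - a.start)
  let result = text
  for (const annotation of ordered) {
    const name = names.get(annotation.link)
    // Leave the text untouched if the marker or the name is missing.
    if (result[annotation.start] !== EMBED_MARKER || name === undefined) continue
    result =
      result.slice(0, annotation.start) +
      '@' +
      name +
      result.slice(annotation.start + 1)
  }
  return result
}
```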
Example (the input contains an invisible U+FEFF marker after "Check out"):
"Check out this post!" → "Check out @Alice's Guide this post!"
3. Block Embed Resolution
Recursively fetches and includes embedded documents:
Full document embeds
Block-specific embeds (blockRef)
Block range embeds (blockRef with range)
4. Fragment Support
Handles blockRef and blockRange:
#blockId - Returns only that block's content
#blockId[start:end] - Returns only children within range
Respects parent-child relationships
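The two fragment forms can be parsed and applied roughly as follows. This sketch assumes that start and end are non-negative integers indexing a block's children and that the range is end-exclusive; ParsedFragment, TreeBlock, parseFragment, and selectFragment are hypothetical names.

```typescript
// Parsed form of "#blockId" or "#blockId[start:end]".
interface ParsedFragment {
  blockId: string
  range?: {start: number; end: number}
}

// Hypothetical minimal block tree.
interface TreeBlock {
  id: string
  children?: TreeBlock[]
}

function parseFragment(fragment: string): ParsedFragment | null {
  const match = /^#([^\[\]]+)(?:\[(\d+):(\d+)\])?$/.exec(fragment)
  if (!match) return null
  const parsed: ParsedFragment = {blockId: match[1]}
  if (match[2] !== undefined && match[3] !== undefined) {
    parsed.range = {start: Number(match[2]), end: Number(match[3])}
  }
  return parsed
}

// Depth-first search for a block by ID.
function findBlock(blocks: TreeBlock[], id: string): TreeBlock | undefined {
  for (const block of blocks) {
    if (block.id === id) return block
    const found = findBlock(block.children ?? [], id)
    if (found) return found
  }
  return undefined
}

function selectFragment(blocks: TreeBlock[], fragment: ParsedFragment): TreeBlock[] {
  const target = findBlock(blocks, fragment.blockId)
  if (!target) return []
  if (!fragment.range) return [target] // "#blockId": only that block
  // Assumption: the range is end-exclusive and indexes the block's children.
  return (target.children ?? []).slice(fragment.range.start, fragment.range.end)
}
```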
5. Safety Features
Circular reference detection: Tracks visited documents
Depth limiting: Prevents infinite recursion
Error handling: Graceful fallbacks for missing content
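How the three safeguards interact can be shown with a small recursive collector. This is a sketch under stated assumptions, not the actual implementation: SketchDoc, FetchDoc, and collectText are hypothetical, and a missing or failing document is simply skipped.

```typescript
// Hypothetical document: its own text plus the IDs of documents it embeds.
interface SketchDoc {
  text: string
  embeds: string[]
}

type FetchDoc = (id: string) => Promise<SketchDoc | undefined>

async function collectText(
  id: string,
  fetchDoc: FetchDoc,
  maxDepth = 10,
  depth = 0,
  visited: Set<string> = new Set(),
): Promise<string[]> {
  // Depth limiting: stop descending once the depth budget is exceeded.
  if (depth > maxDepth) return []
  // Circular reference detection: skip documents that were already visited.
  if (visited.has(id)) return []
  visited.add(id)

  let doc: SketchDoc | undefined
  try {
    doc = await fetchDoc(id)
  } catch {
    doc = undefined // Error handling: treat a failed fetch as missing content.
  }
  if (!doc) return []

  const parts = [doc.text]
  for (const embedId of doc.embeds) {
    const nested = await collectText(embedId, fetchDoc, maxDepth, depth + 1, visited)
    parts.push(...nested)
  }
  return parts
}
```

Because the visited set is shared across the whole traversal, a document embedded twice without any cycle is also included only once. Tracking only the current chain of ancestors would keep such repeats while still breaking cycles.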
Cross-Platform Integration
The function is available in both desktop and web apps through the document content context:
Desktop App
Direct access to grpcClient:
const {getDocumentText} = useDocContentContext()
const text = await getDocumentText(documentId, {
  lineBreaks: false,
  resolveInlineEmbeds: true,
})
Web App
API endpoint at /hm/api/document-text:
const {getDocumentText} = useDocContentContext()
// Same API, but fetches from server
const text = await getDocumentText(documentId, {
  maxDepth: 5,
  resolveInlineEmbeds: true,
})
Implementation Details
Architecture
Desktop:
Component → useDocContentContext() → documentToText(grpcClient) → Text
Web:
Component → useDocContentContext() → API /hm/api/document-text → documentToText(grpcClient) → Text
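The two flows differ only in transport, which is why both providers can expose the same getDocumentText shape. The sketch below illustrates that idea only. The query parameter names (id, options), the use of a plain string ID, and the helpers createDesktopGetter, createWebGetter, and buildDocumentTextUrl are all assumptions; the real endpoint's request format is not described here.

```typescript
interface DocumentToTextOptions {
  maxDepth?: number
  resolveInlineEmbeds?: boolean
  lineBreaks?: boolean
}

// Shape both providers expose; a string ID stands in for UnpackedHypermediaId.
type GetDocumentText = (
  documentId: string,
  options?: DocumentToTextOptions,
) => Promise<string>

// Desktop: call the converter directly (stand-in for documentToText + grpcClient).
function createDesktopGetter(convert: GetDocumentText): GetDocumentText {
  return (documentId, options) => convert(documentId, options)
}

// Hypothetical URL format; the real endpoint's parameters may differ.
function buildDocumentTextUrl(
  documentId: string,
  options: DocumentToTextOptions = {},
): string {
  const query = [
    `id=${encodeURIComponent(documentId)}`,
    `options=${encodeURIComponent(JSON.stringify(options))}`,
  ].join('&')
  return `/hm/api/document-text?${query}`
}

// Web: same signature, but the conversion runs behind the API endpoint.
function createWebGetter(
  request: (url: string) => Promise<string>,
): GetDocumentText {
  return (documentId, options) => request(buildDocumentTextUrl(documentId, options))
}
```

Injecting the request function keeps the web getter independent of any particular HTTP client, so components only ever depend on the shared signature.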
Key Files
frontend/packages/shared/src/document-to-text.ts - Core implementation
frontend/packages/shared/src/document-content-types.ts - Context interface
frontend/apps/desktop/src/pages/document-content-provider.tsx - Desktop provider
frontend/apps/web/app/doc-content-provider.tsx - Web provider
frontend/apps/web/app/routes/hm.api.document-text.tsx - Web API endpoint
Usage Examples
Basic Usage
import {documentToText, hmId} from '@shm/shared'
import {grpcClient} from './grpc-client'

const documentId = hmId('account123', {path: ['my-doc']})
const text = await documentToText({
  documentId,
  grpcClient,
  options: {},
})
console.log(text)
With Options
// Compact text without line breaks
const compactText = await documentToText({
  documentId,
  grpcClient,
  options: {
    lineBreaks: false,
    maxDepth: 5,
    resolveInlineEmbeds: true,
  },
})

// Without inline embed resolution (keep original text)
const rawText = await documentToText({
  documentId,
  grpcClient,
  options: {
    resolveInlineEmbeds: false,
  },
})
In React Components
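import {useEffect, useState} from 'react'

function MyComponent({docId}: {docId: UnpackedHypermediaId}) {
  const {getDocumentText} = useDocContentContext()
  const [text, setText] = useState('')

  useEffect(() => {
    // Ignore results that arrive after docId changes or the component unmounts
    let cancelled = false
    getDocumentText?.(docId, {lineBreaks: false})
      .then((result) => !cancelled && setText(result))
      .catch(console.error)
    return () => { cancelled = true }
  }, [docId, getDocumentText])

  return <pre>{text}</pre>
}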
Testing
17 comprehensive tests covering:
Basic text extraction
Inline embed resolution
Block embed processing
Nested structures
Circular reference detection
Max depth handling
BlockRef/BlockRange fragments
Button and heading extraction
LineBreaks option
Run tests:
NODE_ENV=test yarn workspace @shm/shared test run document-to-text
Performance Considerations
Caching: Consider caching results for frequently accessed documents
Depth limiting: Use maxDepth option for large document trees
Inline embeds: Setting resolveInlineEmbeds to false skips the document fetch that each inline embed would otherwise require
Async: Function is async and may take time for deep embed trees
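One way to act on the caching suggestion is to memoize the returned promises by key. This is a minimal sketch, assuming the caller can derive a stable string key (for example from the document ID plus the options); TextLoader and withCache are hypothetical names, and there is no invalidation, so cached text goes stale when a document changes.

```typescript
// Any async text producer keyed by a string, e.g. a wrapper around getDocumentText.
type TextLoader = (key: string) => Promise<string>

// Memoize in-flight and completed loads so repeated requests share one promise.
function withCache(load: TextLoader): TextLoader {
  const cache = new Map<string, Promise<string>>()
  return (key) => {
    const hit = cache.get(key)
    if (hit) return hit
    const pending = load(key).catch((error) => {
      cache.delete(key) // do not keep failed loads in the cache
      throw error
    })
    cache.set(key, pending)
    return pending
  }
}
```

Caching the promise rather than the resolved string means concurrent requests for the same key trigger only one underlying load.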
Related Documentation
Text Fragment Rendering
Document Blocks
Document Linking