Document to Text Function

The documentToText function converts a hypermedia document, including all of its embeds, to a plain-text representation. It recursively resolves inline and block embeds, replacing inline embeds with the referenced document's name and block embeds with the embedded content.

Overview

Location: @shm/shared/document-to-text

Purpose: Generate a plain-text version of a document with its embeds resolved

Use Cases:

Text fragment rendering with inline embeds resolved

Document search indexing

Export to plain text

Content preview generation

API

Function Signature

async function documentToText({
  documentId,
  grpcClient,
  options = {},
}: {
  documentId: UnpackedHypermediaId
  grpcClient: GRPCClient
  options?: DocumentToTextOptions
}): Promise<string>

Options

interface DocumentToTextOptions {
  maxDepth?: number              // Maximum embed depth (default: 10)
  resolveInlineEmbeds?: boolean  // Replace inline embeds with doc names (default: true)
  lineBreaks?: boolean           // Add line breaks between blocks (default: true)
}
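
The defaults listed in the comments above could be applied as in the following minimal sketch. The helper name withDefaults and the DEFAULT_OPTIONS constant are hypothetical; only the option names and default values come from this document, and the real function may merge options differently.

```typescript
// Mirrors the DocumentToTextOptions interface documented above.
interface DocumentToTextOptions {
  maxDepth?: number
  resolveInlineEmbeds?: boolean
  lineBreaks?: boolean
}

// Documented defaults. The constant itself is an illustration, not project code.
const DEFAULT_OPTIONS: Required<DocumentToTextOptions> = {
  maxDepth: 10,
  resolveInlineEmbeds: true,
  lineBreaks: true,
}

// Hypothetical helper: fill in every option the caller left out.
// `??` also falls back when a key is present but explicitly undefined.
function withDefaults(
  options: DocumentToTextOptions = {},
): Required<DocumentToTextOptions> {
  return {
    maxDepth: options.maxDepth ?? DEFAULT_OPTIONS.maxDepth,
    resolveInlineEmbeds:
      options.resolveInlineEmbeds ?? DEFAULT_OPTIONS.resolveInlineEmbeds,
    lineBreaks: options.lineBreaks ?? DEFAULT_OPTIONS.lineBreaks,
  }
}
```

With this shape, callers only pass the options they want to change, for example withDefaults({lineBreaks: false}).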

Features

1. Hierarchical Block Processing

Processes document blocks depth-first:

Paragraphs: Extract text content

Headings: Extract heading text

Code blocks: Include code content

Buttons: Extract button labels from attributes.name

Images/Videos/Files: Include captions

Embeds: Recursively fetch and include content
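
The per-type rules above can be pictured with a small depth-first walk. This is a sketch under stated assumptions, not the actual implementation: the Block shape is simplified and hypothetical, media captions are assumed to live in the text field, embeds are left out, and a single space is assumed as the separator when lineBreaks is false.

```typescript
// Simplified, hypothetical block shape; the real block type has more fields.
type Block = {
  type: 'Paragraph' | 'Heading' | 'Code' | 'Button' | 'Image' | 'Video' | 'File'
  text?: string
  attributes?: {name?: string}
  children?: Block[]
}

// Text contributed by a single block, following the rules listed above.
function blockText(block: Block): string {
  // Button labels come from attributes.name.
  if (block.type === 'Button') return block.attributes?.name ?? ''
  // Paragraphs, headings and code carry their content in `text`.
  // Assumption: for media blocks, `text` holds the caption.
  return block.text ?? ''
}

// Depth-first walk: emit a block's own text, then its children, in order.
function blocksToText(blocks: Block[], lineBreaks = true): string {
  const parts: string[] = []
  const visit = (block: Block): void => {
    const text = blockText(block)
    if (text) parts.push(text)
    for (const child of block.children ?? []) visit(child)
  }
  blocks.forEach(visit)
  // Assumption: blocks are separated by one space when lineBreaks is false.
  return parts.join(lineBreaks ? '\n' : ' ')
}
```

A heading with a nested paragraph and button, followed by an image with a caption, yields one line per block when lineBreaks is true and a single space-separated line otherwise.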

2. Inline Embed Resolution

Replaces invisible character markers (U+FEFF) with document names:

Detects inline embed annotations

Fetches referenced document

Replaces marker with @DocumentName

Example (the input contains an invisible U+FEFF marker after "Check out"):

"Check out this post!" → "Check out @Alice's Guide this post!"

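The marker replacement described above can be sketched as follows. The InlineEmbed annotation shape and the resolveInlineEmbedMarkers helper are hypothetical, and the document names are passed in as a pre-fetched map rather than fetched here; dropping a marker whose document cannot be resolved is also an assumption.

```typescript
// The invisible character used as an inline embed placeholder.
const EMBED_MARKER = '\uFEFF'

// Hypothetical annotation shape: `start` is the index of the marker in the text.
type InlineEmbed = {type: 'Embed'; link: string; start: number}

function resolveInlineEmbedMarkers(
  text: string,
  embeds: InlineEmbed[],
  names: Map<string, string>, // link -> document name, assumed already fetched
): string {
  let result = text
  // Work right to left so earlier offsets stay valid after each replacement.
  const ordered = [...embeds].sort((a, b) => b.start - a.start)
  for (const embed of ordered) {
    // Skip annotations that do not actually point at a marker.
    if (result[embed.start] !== EMBED_MARKER) continue
    const name = names.get(embed.link)
    // Assumption: an unresolved embed is dropped instead of left invisible.
    const replacement = name ? `@${name}` : ''
    result =
      result.slice(0, embed.start) + replacement + result.slice(embed.start + 1)
  }
  return result
}
```

Replacing from the end of the string backwards avoids recomputing offsets when a one-character marker expands into a longer name.
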
3. Block Embed Resolution

Recursively fetches and includes embedded documents:

Full document embeds

Block-specific embeds (blockRef)

Block range embeds (blockRef with range)

4. Fragment Support

Handles blockRef and blockRange:

#blockId - Returns only that block's content

#blockId[start:end] - Returns only children within range

Respects parent-child relationships
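
One way to read the two fragment forms is sketched below. Everything here is illustrative: the Block and Fragment types, parseFragment, and selectFragment are hypothetical names, the range is assumed to be a half-open slice over the block's direct children, and a plain #blockId is assumed to select the block together with its subtree.

```typescript
// Simplified, hypothetical shapes for this sketch only.
type Block = {id: string; text: string; children?: Block[]}
type Fragment = {blockId: string; range?: {start: number; end: number}}

// Parse "#blockId" or "#blockId[start:end]"; returns null for anything else.
function parseFragment(fragment: string): Fragment | null {
  const match = /^#?([^\[\]#]+)(?:\[(\d+):(\d+)\])?$/.exec(fragment)
  if (!match) return null
  const blockId = match[1]
  if (!blockId) return null
  const start = match[2]
  const end = match[3]
  if (start === undefined || end === undefined) return {blockId}
  return {blockId, range: {start: Number(start), end: Number(end)}}
}

// Depth-first search for a block by id.
function findBlock(blocks: Block[], id: string): Block | undefined {
  for (const block of blocks) {
    if (block.id === id) return block
    const found = findBlock(block.children ?? [], id)
    if (found) return found
  }
  return undefined
}

// Blocks selected by a fragment; empty when the fragment or block is unknown.
function selectFragment(blocks: Block[], fragment: string): Block[] {
  const parsed = parseFragment(fragment)
  if (!parsed) return []
  const block = findBlock(blocks, parsed.blockId)
  if (!block) return []
  // Assumption: without a range, the block itself (with its subtree) is selected.
  if (!parsed.range) return [block]
  // Assumption: the range is a half-open slice [start, end) over direct children.
  return (block.children ?? []).slice(parsed.range.start, parsed.range.end)
}
```

Returning an empty list for an unknown block keeps the caller's fallback path simple, in line with the graceful handling of missing content described in this document.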

5. Safety Features

Circular reference detection: Tracks visited documents

Depth limiting: Prevents infinite recursion

Error handling: Graceful fallbacks for missing content
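
The three safeguards can be combined in a single recursive step, as in this minimal sketch. It uses an in-memory map in place of real document fetches, and the Doc shape and collectText name are hypothetical. Treating the root as depth 0 and stopping once depth exceeds maxDepth is an assumption about how the limit is counted.

```typescript
// Hypothetical in-memory document: its own text plus ids of embedded documents.
type Doc = {text: string; embeds: string[]}

function collectText(
  docId: string,
  docs: Map<string, Doc>,
  maxDepth = 10,
  depth = 0,
  visited: Set<string> = new Set(),
): string[] {
  // Depth limiting: stop once the embed chain is deeper than maxDepth.
  if (depth > maxDepth) return []
  // Circular reference detection: never expand the same document twice.
  if (visited.has(docId)) return []
  // Error handling: a missing document contributes nothing instead of throwing.
  const doc = docs.get(docId)
  if (!doc) return []
  visited.add(docId)
  const parts = [doc.text]
  for (const embeddedId of doc.embeds) {
    parts.push(...collectText(embeddedId, docs, maxDepth, depth + 1, visited))
  }
  return parts
}
```

Because the visited set is shared across the whole walk, a document embedded twice in different places is also expanded only once in this sketch. Tracking only the current chain of ancestors instead would allow repeats while still breaking cycles.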

Cross-Platform Integration

The function is available in both desktop and web apps through the document content context:

Desktop App

The desktop provider has direct access to grpcClient and calls documentToText locally:

const {getDocumentText} = useDocContentContext()

const text = await getDocumentText(documentId, {
  lineBreaks: false,
  resolveInlineEmbeds: true,
})

Web App

The web provider calls the API endpoint at /hm/api/document-text, which runs documentToText on the server:

const {getDocumentText} = useDocContentContext()

// Same API, but fetches from server
const text = await getDocumentText(documentId, {
  maxDepth: 5,
  resolveInlineEmbeds: true,
})

Implementation Details

Architecture

Desktop:
  Component → useDocContentContext() → documentToText(grpcClient) → Text

Web:
  Component → useDocContentContext() → API /hm/api/document-text → documentToText(grpcClient) → Text

Key Files

frontend/packages/shared/src/document-to-text.ts - Core implementation

frontend/packages/shared/src/document-content-types.ts - Context interface

frontend/apps/desktop/src/pages/document-content-provider.tsx - Desktop provider

frontend/apps/web/app/doc-content-provider.tsx - Web provider

frontend/apps/web/app/routes/hm.api.document-text.tsx - Web API endpoint

Usage Examples

Basic Usage

import {documentToText, hmId} from '@shm/shared'
import {grpcClient} from './grpc-client'

const documentId = hmId('account123', {path: ['my-doc']})
const text = await documentToText({
  documentId,
  grpcClient,
  options: {},
})
console.log(text)

With Options

// Compact text without line breaks
const compactText = await documentToText({
  documentId,
  grpcClient,
  options: {
    lineBreaks: false,
    maxDepth: 5,
    resolveInlineEmbeds: true,
  },
})

// Without inline embed resolution (keep original text)
const rawText = await documentToText({
  documentId,
  grpcClient,
  options: {
    resolveInlineEmbeds: false,
  },
})

In React Components

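import {useEffect, useState} from 'react'
// useDocContentContext and UnpackedHypermediaId are assumed to be imported from the shared package.

function MyComponent({docId}: {docId: UnpackedHypermediaId}) {
  const {getDocumentText} = useDocContentContext()
  const [text, setText] = useState('')

  useEffect(() => {
    let cancelled = false // ignore results that arrive after unmount or a docId change
    getDocumentText?.(docId, {lineBreaks: false})
      .then((result) => {
        if (!cancelled) setText(result)
      })
      .catch(console.error)
    return () => {
      cancelled = true
    }
  }, [docId, getDocumentText])

  return <pre>{text}</pre>
}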

Testing

The test suite contains 17 tests covering:

Basic text extraction

Inline embed resolution

Block embed processing

Nested structures

Circular reference detection

Max depth handling

BlockRef/BlockRange fragments

Button and heading extraction

LineBreaks option

Run tests:

NODE_ENV=test yarn workspace @shm/shared test run document-to-text

Performance Considerations

Caching: Consider caching results for frequently accessed documents

Depth limiting: Use maxDepth option for large document trees

Inline embeds: Setting resolveInlineEmbeds to false skips fetching each referenced document for its name, which improves performance

Async: The function is async and can take a while on deep embed trees, since every embedded document has to be fetched
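
The caching suggestion above could look like the following minimal sketch. The withCache wrapper and the string cache key are hypothetical. A real key would need to encode both the document id and the options, and entries would need invalidating when a document changes; neither is handled here.

```typescript
// Any async function that maps a cache key to document text.
type GetText = (docKey: string) => Promise<string>

// Hypothetical wrapper: remember the promise per key so repeated and
// concurrent requests for the same document share one underlying call.
function withCache(getText: GetText): GetText {
  const cache = new Map<string, Promise<string>>()
  return (docKey) => {
    const hit = cache.get(docKey)
    if (hit) return hit
    const pending = getText(docKey).catch((error) => {
      // Do not keep failures around; the next call retries.
      cache.delete(docKey)
      throw error
    })
    cache.set(docKey, pending)
    return pending
  }
}
```

Caching the promise rather than the resolved string means two components asking for the same document at the same time trigger a single conversion.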