
Document to Text Function

    The documentToText function converts a hypermedia document, including all of its embeds, to a plain-text representation. It resolves embeds recursively: inline embeds are replaced with the name of the referenced document, and block embeds with the embedded content.

    Overview

      Location: @shm/shared/document-to-text

      Purpose: Generate plain text version of documents with resolved embeds

      Use Cases:

        Text fragment rendering with inline embeds resolved

        Document search indexing

        Export to plain text

        Content preview generation

    API

      Function Signature

        async function documentToText({
          documentId,
          grpcClient,
          options = {},
        }: {
          documentId: UnpackedHypermediaId
          grpcClient: GRPCClient
          options: DocumentToTextOptions
        }): Promise<string>
        

      Options

        interface DocumentToTextOptions {
          maxDepth?: number              // Maximum embed depth (default: 10)
          resolveInlineEmbeds?: boolean  // Replace inline embeds with doc names (default: true)
          lineBreaks?: boolean           // Add line breaks between blocks (default: true)
        }
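
        The defaults listed above can be illustrated with a small helper. This is a minimal sketch: withDefaults is a hypothetical name and is not taken from @shm/shared; only the interface and the default values come from the documentation above.

```typescript
// Mirrors the DocumentToTextOptions interface documented above.
interface DocumentToTextOptions {
  maxDepth?: number
  resolveInlineEmbeds?: boolean
  lineBreaks?: boolean
}

// Hypothetical helper: fills in the documented default for every omitted option.
function withDefaults(
  options: DocumentToTextOptions = {},
): Required<DocumentToTextOptions> {
  return {
    // `??` only falls back on undefined/null, so 0 and false are preserved.
    maxDepth: options.maxDepth ?? 10,
    resolveInlineEmbeds: options.resolveInlineEmbeds ?? true,
    lineBreaks: options.lineBreaks ?? true,
  }
}
```

        Using ?? rather than || matters here, because lineBreaks: false and maxDepth: 0 are meaningful values that must not be overwritten by the defaults.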
        

    Features

      1. Hierarchical Block Processing

        Processes document blocks depth-first:

          Paragraphs: Extract text content

          Headings: Extract heading text

          Code blocks: Include code content

          Buttons: Extract button labels from attributes.name

          Images/Videos/Files: Include captions

          Embeds: Recursively fetch and include content
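
        The depth-first order described above can be sketched as follows. The SketchBlock type is a hypothetical simplification and not the real Hypermedia block type; embed blocks are left out, and joining with a space when lineBreaks is false is an assumption.

```typescript
// Hypothetical, simplified block shape; the real Hypermedia block types differ.
type SketchBlock = {
  type: 'Paragraph' | 'Heading' | 'Code' | 'Button' | 'Image'
  text?: string
  attributes?: {name?: string}
  children?: SketchBlock[]
}

// Picks the text that represents a single block, following the list above.
function blockText(block: SketchBlock): string {
  if (block.type === 'Button') return block.attributes?.name ?? ''
  // Paragraph or heading text, code content, or a media caption.
  return block.text ?? ''
}

// Depth-first walk: a block's own text is emitted before its children's text.
function blocksToText(blocks: SketchBlock[], lineBreaks = true): string {
  const parts: string[] = []
  const visit = (block: SketchBlock): void => {
    const text = blockText(block)
    if (text) parts.push(text)
    for (const child of block.children ?? []) visit(child)
  }
  blocks.forEach(visit)
  // Assumption: without line breaks, blocks are separated by a single space.
  return parts.join(lineBreaks ? '\n' : ' ')
}
```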

      2. Inline Embed Resolution

        Replaces invisible character markers (U+FEFF) with document names:

          Detects inline embed annotations

          Fetches referenced document

          Replaces marker with @DocumentName

        Example (the marker is written as \uFEFF here because it is otherwise invisible):

        "Check out \uFEFF this post!" → "Check out @Alice's Guide this post!"
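
        The replacement step can be sketched in isolation. This is a simplification under stated assumptions: replaceInlineEmbeds is a hypothetical name, and the names are passed in as a ready-made array, whereas the real function reads the inline embed annotations and fetches each referenced document.

```typescript
// U+FEFF is the invisible placeholder character used for inline embeds.
const INLINE_EMBED_MARKER = '\uFEFF'

// Hypothetical simplification: the n-th marker in the text maps to the n-th name.
function replaceInlineEmbeds(text: string, names: string[]): string {
  let index = 0
  return text.replace(new RegExp(INLINE_EMBED_MARKER, 'g'), (marker) => {
    const name = names[index++]
    // Keep the marker untouched when no name is available for it.
    return name === undefined ? marker : `@${name}`
  })
}

const resolved = replaceInlineEmbeds('Check out \uFEFF this post!', ["Alice's Guide"])
console.log(resolved)
```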
        

      3. Block Embed Resolution

        Recursively fetches and includes embedded documents:

          Full document embeds

          Block-specific embeds (blockRef)

          Block range embeds (blockRef with range)

      4. Fragment Support

        Handles blockRef and blockRange:

          #blockId - Returns only that block's content

          #blockId[start:end] - Returns only children within range

          Respects parent-child relationships
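
        To make the two fragment forms concrete, here is a hedged sketch of a parser for them. parseFragment and SketchFragment are hypothetical names, and the assumption that start and end are non-negative integers is not confirmed by the source.

```typescript
// Hypothetical parsed form of a fragment such as "#abc123" or "#abc123[2:5]".
type SketchFragment = {blockId: string; range?: {start: number; end: number}}

function parseFragment(fragment: string): SketchFragment | null {
  // Optional leading "#", a block id without brackets, then an optional [start:end].
  const match = /^#?([^\[\]]+)(?:\[(\d+):(\d+)\])?$/.exec(fragment)
  if (!match) return null
  const blockId = match[1]
  const start = match[2]
  const end = match[3]
  if (blockId === undefined) return null
  // No range given: the fragment refers to the whole block.
  if (start === undefined || end === undefined) return {blockId}
  return {blockId, range: {start: Number(start), end: Number(end)}}
}
```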

      5. Safety Features

        Circular reference detection: Tracks visited documents

        Depth limiting: Prevents infinite recursion

        Error handling: Graceful fallbacks for missing content
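
        The three safeguards can be combined in one recursive function. The sketch below is illustrative only: SketchDoc, FetchDoc, and resolveText are hypothetical names, and returning an empty string as the fallback is an assumption about what "graceful" means here.

```typescript
// Hypothetical document shape: some text plus the ids of embedded documents.
type SketchDoc = {text: string; embeds: string[]}
type FetchDoc = (id: string) => Promise<SketchDoc | undefined>

async function resolveText(
  id: string,
  fetchDoc: FetchDoc,
  maxDepth = 10,
  depth = 0,
  visited: Set<string> = new Set(),
): Promise<string> {
  if (depth > maxDepth) return '' // depth limit reached
  if (visited.has(id)) return '' // circular reference: id is already on this path
  const doc = await fetchDoc(id).catch(() => undefined)
  if (!doc) return '' // missing or failed content: fall back to empty text
  // Copy the set so sibling embeds do not see each other's entries.
  const path = new Set(visited).add(id)
  const parts = [doc.text]
  for (const embedId of doc.embeds) {
    const text = await resolveText(embedId, fetchDoc, maxDepth, depth + 1, path)
    if (text) parts.push(text)
  }
  return parts.join('\n')
}
```

        Tracking the current path rather than every document seen so far is a design choice of this sketch: it still breaks cycles, but a document embedded twice in different places is included both times.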

    Cross-Platform Integration

      The function is available in both desktop and web apps through the document content context:

      Desktop App

        Direct access to grpcClient:

        const {getDocumentText} = useDocContentContext()
        
        const text = await getDocumentText(documentId, {
          lineBreaks: false,
          resolveInlineEmbeds: true,
        })
        

      Web App

        API endpoint at /hm/api/document-text:

        const {getDocumentText} = useDocContentContext()
        
        // Same API, but fetches from server
        const text = await getDocumentText(documentId, {
          maxDepth: 5,
          resolveInlineEmbeds: true,
        })
        

    Implementation Details

      Architecture

        Desktop:
          Component → useDocContentContext() → documentToText(grpcClient) → Text
        
        Web:
          Component → useDocContentContext() → API /hm/api/document-text → documentToText(grpcClient) → Text
        

      Key Files

        frontend/packages/shared/src/document-to-text.ts - Core implementation

        frontend/packages/shared/src/document-content-types.ts - Context interface

        frontend/apps/desktop/src/pages/document-content-provider.tsx - Desktop provider

        frontend/apps/web/app/doc-content-provider.tsx - Web provider

        frontend/apps/web/app/routes/hm.api.document-text.tsx - Web API endpoint

    Usage Examples

      Basic Usage

        import {documentToText, hmId} from '@shm/shared'
        import {grpcClient} from './grpc-client'
        
        const documentId = hmId('account123', {path: ['my-doc']})
        const text = await documentToText({
          documentId,
          grpcClient,
          options: {},
        })
        console.log(text)
        

      With Options

        // Compact text without line breaks
        const compactText = await documentToText({
          documentId,
          grpcClient,
          options: {
            lineBreaks: false,
            maxDepth: 5,
            resolveInlineEmbeds: true,
          },
        })
        
        // Without inline embed resolution (keep original text)
        const rawText = await documentToText({
          documentId,
          grpcClient,
          options: {
            resolveInlineEmbeds: false,
          },
        })
        

      In React Components

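        import {useEffect, useState} from 'react'
        
        // useDocContentContext and UnpackedHypermediaId are assumed to be imported
        // from the shared package (the exact import path is not shown here).
        function MyComponent({docId}: {docId: UnpackedHypermediaId}) {
          const {getDocumentText} = useDocContentContext()
          const [text, setText] = useState('')
        
          useEffect(() => {
            let stale = false // set on cleanup so an outdated response is ignored
            getDocumentText?.(docId, {lineBreaks: false})
              .then((result) => {
                if (!stale) setText(result)
              })
              .catch(console.error)
            return () => {
              stale = true
            }
          }, [docId, getDocumentText])
        
          return <pre>{text}</pre>
        }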
        

    Testing

      17 comprehensive tests covering:

        Basic text extraction

        Inline embed resolution

        Block embed processing

        Nested structures

        Circular reference detection

        Max depth handling

        BlockRef/BlockRange fragments

        Button and heading extraction

        LineBreaks option

      Run tests:

      NODE_ENV=test yarn workspace @shm/shared test run document-to-text
      

    Performance Considerations

      Caching: Consider caching results for frequently accessed documents

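      Depth limiting: Lower maxDepth (default: 10) for large document trees to bound how many embedded documents are fetched

      Inline embeds: Setting resolveInlineEmbeds to false avoids fetching the referenced documents just to read their names

      Async: The function is async and fetches each embedded document, so deep embed trees take longer to resolve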
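
      As an illustration of the caching suggestion, the sketch below memoizes any async text loader by key. cachedLoader and TextLoader are hypothetical names, and the loader is assumed to be a thin wrapper around documentToText keyed by some stable string form of the document ID.

```typescript
// Any async text producer keyed by a string, e.g. a wrapper around documentToText.
type TextLoader = (key: string) => Promise<string>

// Hypothetical memoizing wrapper: repeated and concurrent calls share one promise.
function cachedLoader(load: TextLoader): TextLoader {
  const cache = new Map<string, Promise<string>>()
  return (key) => {
    const hit = cache.get(key)
    if (hit) return hit
    const pending = load(key).catch((error: unknown) => {
      cache.delete(key) // do not keep failed requests in the cache
      throw error
    })
    cache.set(key, pending)
    return pending
  }
}
```

      This sketch never expires entries. Because documents can change, a real cache would also need an invalidation strategy, which is outside the scope of this example.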

    Related Documentation

      Text Fragment Rendering

      Document Blocks

      Document Linking