Transform

Converting content from one modality to another while preserving intent, constraints, and fidelity is a keystone feature of generative AI. Examples include audio to text, text to audio, text to image, image to text (OCR or captioning), text to video, document to slides, code to diagram, and screenshot to HTML.

Different modalities suit different purposes. Text works best for working through outlines and narrative. Seeing text translated into an image can reveal tone and mood, while transforming text or data into diagrams makes it easier to review. In this way, the transform pattern serves as a creative pipeline between the user and the AI.

Where to find the action

  • Right-click menus or inline actions (Midjourney)
  • Chained together on an open canvas (FloraFauna)
  • Direct generation via open input (Krea)
  • Background helpers, such as transcription (Descript)
  • Built into the core interface (Chronicle)

Design considerations

  • Maintain a connection to the original. Show outlines, prompts, and blended sources and make them accessible for additional iteration, even if the form of the generation has changed.
  • Add stop points between modalities. As transformational actions tend to be more expensive to fulfill, put action plans, sample responses, and verifications in front of the user before committing to a broader generative run.
  • Start small and work up. Show a draft of the transformed generation, such as an outline or snippet, before committing to a full run.
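
The considerations above can be sketched as a staged pipeline: keep the source attached to the result, produce a cheap draft, and run the expensive generation only after the user approves. This is a minimal illustration, not any product's implementation. The names `staged_transform`, `TransformResult`, `outline_preview`, and `full_slides` are hypothetical, and the two generator functions are plain string stand-ins for what would be model calls in a real tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TransformResult:
    source: str                   # original content, kept for further iteration
    target_modality: str          # e.g. "slides", "image", "diagram"
    preview: str                  # cheap draft shown at the stop point
    output: Optional[str] = None  # full generation; None if the user declined


def staged_transform(
    source: str,
    target_modality: str,
    make_preview: Callable[[str], str],
    make_full: Callable[[str], str],
    confirm: Callable[[str], bool],
) -> TransformResult:
    """Generate a draft, pause for approval, then run the full transform."""
    preview = make_preview(source)
    result = TransformResult(source, target_modality, preview)
    # Stop point: the expensive step runs only if the draft is approved.
    if confirm(preview):
        result.output = make_full(source)
    return result


def _non_empty_lines(text: str) -> list:
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


# Hypothetical stand-in for a cheap model call: one title per paragraph.
def outline_preview(text: str) -> str:
    lines = _non_empty_lines(text)
    return "\n".join(f"{i}. {ln.split('.')[0]}" for i, ln in enumerate(lines, 1))


# Hypothetical stand-in for the expensive model call: one slide per paragraph.
def full_slides(text: str) -> str:
    lines = _non_empty_lines(text)
    return "\n".join(f"[Slide {i}] {ln}" for i, ln in enumerate(lines, 1))


doc = (
    "Intro to transforms. They change modality.\n"
    "Keep the source visible. Users iterate on it."
)

# In a real interface, `confirm` would be a button or dialog, not a lambda.
approved = staged_transform(doc, "slides", outline_preview, full_slides, lambda p: True)
declined = staged_transform(doc, "slides", outline_preview, full_slides, lambda p: False)
```

When the draft is declined, the result still carries the source and the preview, so the user can adjust either one and try again without paying for a full run.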

Examples

FloraFauna focuses on seamless exploration, letting transformations flow from one form to the next as the user explores how to visualize a concept.
Julius can transform data tables into charts so users can visualize and interact with the data within the tool instead of moving it elsewhere.
Midjourney supports image-to-text transformation, using an image to generate a prompt that can be used for further generation.
Midjourney’s “suggest a prompt” feature (previously “describe” in Discord) allows users to transform an image to text and back to an image.