Output Formats

This guide covers Transmog’s output format options, including JSON, CSV, and Parquet, with guidance on choosing the right format for different use cases.

Supported Formats

Transmog supports three output formats:

Format   | Use Case                            | Advantages                           | Considerations
JSON     | APIs, web apps, document storage    | Human-readable, preserves structure  | Larger file size
CSV      | Spreadsheets, databases, analytics  | Wide compatibility, compact          | Flat structure only
Parquet  | Big data, analytics, data lakes     | Columnar, compressed, fast queries   | Requires specialized tools

JSON Output

Basic JSON Output

import transmog as tm

data = {
    "product": {
        "name": "Laptop",
        "price": 999.99,
        "reviews": [
            {"rating": 5, "comment": "Excellent"},
            {"rating": 4, "comment": "Good value"}
        ]
    }
}

result = tm.flatten(data, name="products")

# Save as JSON files (default format)
result.save("output")
# Creates:
# output/products.json (main table)
# output/products_reviews.json (reviews table)

JSON File Structure

Each table becomes a separate JSON file:

products.json:

[
  {
    "product_name": "Laptop",
    "product_price": "999.99",
    "_id": "generated_id"
  }
]

products_reviews.json:

[
  {
    "rating": "5",
    "comment": "Excellent",
    "_parent_id": "generated_id"
  },
  {
    "rating": "4",
    "comment": "Good value",
    "_parent_id": "generated_id"
  }
]
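
The _id and _parent_id fields link child rows back to their parent records. The sketch below rejoins the two files with the standard library; it assumes the file names and metadata field names shown above.

import json

# Load the main table and the child table written by result.save("output")
with open("output/products.json") as f:
    products = json.load(f)
with open("output/products_reviews.json") as f:
    reviews = json.load(f)

# Index parent rows by _id, then attach each review via its _parent_id
products_by_id = {row["_id"]: row for row in products}
for review in reviews:
    products_by_id[review["_parent_id"]].setdefault("reviews", []).append(review)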

JSON Advantages

  • Human-readable: Easy to inspect and debug

  • Web-compatible: Direct use in web applications

  • Structure preservation: Maintains complex data types

  • Universal support: Works with all programming languages

JSON Best Practices

# For web APIs
result = tm.flatten(
    data,
    name="api_data",
    preserve_types=True,     # Keep numbers as numbers
    skip_null=False,         # Include null values for completeness
    arrays="separate"        # Enable relationship analysis
)
result.save("api_output", output_format="json")

# For document storage
result = tm.flatten(
    data,
    name="documents",
    arrays="inline",         # Keep arrays as JSON
    preserve_types=True,     # Maintain type information
    add_timestamp=True       # Add processing metadata
)
result.save("document_store", output_format="json")

CSV Output

Basic CSV Output

# Save as CSV files
result.save("output", output_format="csv")
# Creates:
# output/products.csv (main table)
# output/products_reviews.csv (reviews table)

CSV File Structure

Each table becomes a CSV file with headers:

products.csv:

product_name,product_price,_id
Laptop,999.99,generated_id

products_reviews.csv:

rating,comment,_parent_id
5,Excellent,generated_id
4,Good value,generated_id
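
Because CSV stores every value as text, numeric columns generally need to be cast when the files are read back. The sketch below uses pandas (an assumption, not part of Transmog) with the file and column names from the example above.

import pandas as pd

# Read the flattened tables back; every column arrives as a string
products = pd.read_csv("output/products.csv", dtype=str, encoding="utf-8")
reviews = pd.read_csv("output/products_reviews.csv", dtype=str, encoding="utf-8")

# Restore numeric types explicitly
products["product_price"] = products["product_price"].astype(float)
reviews["rating"] = reviews["rating"].astype(int)

# Rejoin child rows to their parents on the generated ID columns
joined = reviews.merge(products, left_on="_parent_id", right_on="_id")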

CSV Advantages

  • Universal compatibility: Opens in Excel, databases, analytics tools

  • Compact size: Efficient storage for large datasets

  • Fast processing: Quick to read and write

  • Database-friendly: Direct import into relational databases

CSV Considerations

  • Flat structure only: Cannot represent nested data

  • String-based: All values become strings

  • Limited metadata: No built-in type information

  • Character encoding: UTF-8 recommended for international data

CSV Best Practices

# For database import
result = tm.flatten(
    data,
    name="db_import",
    id_field="id",           # Use natural IDs for foreign keys
    preserve_types=False,    # Convert all to strings
    skip_null=True,          # Clean data for SQL
    arrays="separate"        # Create relational tables
)
result.save("database_import", output_format="csv")

# For Excel analysis
result = tm.flatten(
    data,
    name="excel_data",
    separator="_",           # Excel-friendly field names
    skip_empty=True,         # Remove empty cells
    arrays="separate"        # One CSV per table, like separate worksheets
)
result.save("excel_analysis", output_format="csv")

Parquet Output

Basic Parquet Output

# Save as Parquet files
result.save("output", output_format="parquet")
# Creates:
# output/products.parquet (main table)
# output/products_reviews.parquet (reviews table)

Parquet Advantages

  • Columnar storage: Efficient for analytics queries

  • Compression: Smaller file sizes than JSON/CSV

  • Type preservation: Maintains data types natively

  • Fast queries: Optimized for analytical workloads

  • Schema evolution: Supports schema changes over time

Parquet Requirements

Parquet support requires the pyarrow library:

pip install pyarrow
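
Once pyarrow is installed, the written files can be inspected directly to confirm that column types survived. A minimal sketch using pyarrow's reader, with the file name from the earlier example:

import pyarrow.parquet as pq

# Read a table written by result.save(..., output_format="parquet")
table = pq.read_table("output/products.parquet")

# The schema shows the preserved column types
print(table.schema)

# Convert to pandas for ad-hoc analysis if needed
df = table.to_pandas()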

Parquet Best Practices

# For analytics workloads
result = tm.flatten(
    data,
    name="analytics",
    preserve_types=True,     # Keep numeric types for analysis
    skip_null=False,         # Include nulls for complete picture
    arrays="separate",       # Enable relational analysis
    add_timestamp=True       # Add processing metadata
)
result.save("analytics_data", output_format="parquet")

# For data lake storage
result = tm.flatten(
    data,
    name="lake_data",
    preserve_types=True,     # Maintain type information
    arrays="separate",       # Normalized structure
    id_field="natural_id"    # Consistent identification
)
result.save("data_lake", output_format="parquet")

Streaming Output

Large Dataset Processing

For large datasets, use streaming to write directly to files:

# Stream large datasets to Parquet
tm.flatten_stream(
    large_dataset,
    output_path="streaming_output/",
    name="large_data",
    output_format="parquet",        # Best for large datasets
    batch_size=1000,
    low_memory=True,
    compression="snappy"     # Format-specific option
)

# Stream to JSON for web processing
tm.flatten_stream(
    web_data,
    output_path="web_output/",
    name="web_data",
    output_format="json",
    batch_size=500,
    preserve_types=True
)

# Stream to CSV for database loading
tm.flatten_stream(
    db_data,
    output_path="db_staging/",
    name="staging_data",
    output_format="csv",
    batch_size=2000,
    preserve_types=False     # Strings for SQL compatibility
)
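
The streamed Parquet output can also be read back lazily as one logical dataset rather than file by file. A sketch using pyarrow.dataset, assuming the streaming_output/ directory from the first example:

import pyarrow.dataset as ds

# Treat every Parquet file under the directory as a single dataset
dataset = ds.dataset("streaming_output/", format="parquet")

# Scan in record batches instead of loading everything into memory
for batch in dataset.to_batches(batch_size=1000):
    handle(batch)  # handle() is a placeholder for downstream processing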

Format Selection Guide

Choose JSON When

  • Building web applications or APIs

  • Need human-readable output for debugging

  • Working with document databases (MongoDB, CouchDB)

  • Preserving complex data structures is important

  • File size is not a primary concern

Choose CSV When

  • Loading data into relational databases

  • Working with Excel or spreadsheet applications

  • Need maximum compatibility across tools

  • Working with legacy systems

  • File size efficiency is important for simple data

Choose Parquet When

  • Building analytical data pipelines

  • Working with big data tools (Spark, Hadoop)

  • Need fast query performance on large datasets

  • Type preservation is critical

  • Working with data lakes or warehouses

  • Compression and storage efficiency are priorities

Format-Specific Optimizations

JSON Optimizations

# Optimize JSON for file size
result = tm.flatten(
    data,
    name="optimized",
    skip_null=True,          # Remove null values
    skip_empty=True,         # Remove empty strings
    preserve_types=False     # Represent all values as strings
)
result.save("compact_json", output_format="json")

# Optimize JSON for processing speed
result = tm.flatten(
    data,
    name="fast_json",
    arrays="inline",         # Fewer files to manage
    low_memory=True,         # Reduce memory pressure
    batch_size=500           # Smaller processing batches
)
result.save("fast_processing", output_format="json")

CSV Optimizations

# Optimize CSV for database loading
result = tm.flatten(
    data,
    name="db_optimized",
    preserve_types=False,    # Consistent string types
    skip_null=True,          # Avoid NULL handling issues
    id_field="natural_id",   # Use natural foreign keys
    separator="_"            # Database-friendly column names
)
result.save("database_ready", output_format="csv")

# Optimize CSV for analytics
result = tm.flatten(
    data,
    name="analytics_csv",
    arrays="separate",       # Enable table joins
    add_timestamp=True,      # Add time dimensions
    preserve_types=False     # Consistent for spreadsheet tools
)
result.save("analytics_ready", output_format="csv")

Parquet Optimizations

# Optimize Parquet for query performance
result = tm.flatten(
    data,
    name="query_optimized",
    preserve_types=True,     # Native type support
    skip_null=False,         # Preserve data completeness
    arrays="separate",       # Normalized for joins
    add_timestamp=True       # Time-based partitioning support
)
result.save("query_ready", output_format="parquet")

# Optimize Parquet for storage efficiency
tm.flatten_stream(
    large_data,
    output_path="efficient_storage/",
    name="compressed_data",
    output_format="parquet",
    preserve_types=True,     # Better compression with types
    batch_size=5000,         # Larger batches for compression
    compression="snappy"     # Fast compression algorithm
)

File Organization Patterns

Single Table Output

# When only main table exists, save as single file
simple_data = {"name": "Product", "price": 99.99}
result = tm.flatten(simple_data, name="simple")

if len(result.tables) == 0:
    result.save("single_product.json")     # Single file
else:
    result.save("product_data")            # Directory with multiple files

Directory Structure

# Multiple tables create directory structure
result.save("product_data", output_format="csv")
# Creates:
# product_data/
#   products.csv
#   products_reviews.csv
#   products_specifications.csv

Naming Conventions

# Use descriptive entity names for clear file names
result = tm.flatten(data, name="customer_orders")
result.save("output", output_format="json")
# Creates:
# output/customer_orders.json
# output/customer_orders_items.json
# output/customer_orders_payments.json

Integration Examples

Database Integration

# Prepare data for PostgreSQL
result = tm.flatten(
    api_data,
    name="customers",
    id_field="customer_id",
    preserve_types=False,
    skip_null=True,
    arrays="separate"
)
result.save("postgres_import", output_format="csv")

# SQL import commands
# COPY customers FROM 'postgres_import/customers.csv' CSV HEADER;
# COPY customers_orders FROM 'postgres_import/customers_orders.csv' CSV HEADER;
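
The COPY step can also be scripted from Python. A minimal sketch using psycopg2 (an assumed dependency; the connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=shop user=loader")  # placeholder credentials
with conn, conn.cursor() as cur, open("postgres_import/customers.csv") as f:
    # Stream the CSV through the client connection instead of a server-side path
    cur.copy_expert("COPY customers FROM STDIN WITH CSV HEADER", f)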

Analytics Pipeline

# Prepare data for Spark/Pandas analysis
tm.flatten_stream(
    analytics_data,
    output_path="spark_input/",
    name="events",
    output_format="parquet",
    preserve_types=True,
    arrays="separate",
    batch_size=10000
)

# Use with Spark
# df = spark.read.parquet("spark_input/events.parquet")
# orders_df = spark.read.parquet("spark_input/events_orders.parquet")
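
The same files can be analyzed without Spark. A short sketch reading them with pandas (which delegates to pyarrow), using the paths from the comments above and the default _id/_parent_id metadata columns:

import pandas as pd

# pandas reads Parquet via pyarrow
events = pd.read_parquet("spark_input/events.parquet")
orders = pd.read_parquet("spark_input/events_orders.parquet")

# Join child records back to their parent events for analysis
merged = orders.merge(events, left_on="_parent_id", right_on="_id")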

Web Application

# Prepare data for web API
result = tm.flatten(
    user_data,
    name="users",
    preserve_types=True,
    arrays="inline",      # Single JSON per user
    skip_null=False       # Complete user profiles
)
result.save("api_data", output_format="json")

# Use in web application
# with open("api_data/users.json") as f:
#     users = json.load(f)
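
A small sketch of serving the saved file from a web endpoint, using Flask as an illustrative (assumed) framework choice:

import json
from flask import Flask, jsonify

app = Flask(__name__)

# Load the flattened users once at startup
with open("api_data/users.json") as f:
    USERS = json.load(f)

@app.route("/users")
def list_users():
    return jsonify(USERS)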

Next Steps