Getting Started
Overview
Transmog transforms nested data structures into flat, tabular formats while preserving relationships between parent and child records.
Installation
pip install transmog # Full install (CSV, Parquet, ORC, Avro output)
pip install transmog[minimal] # CSV only (no pyarrow, fastavro, or cramjam)
Quick Start
Basic Data Transformation
Transform nested data with a single function call:
import transmog as tm
# Sample nested data
data = {
    "company": "TechCorp",
    "location": {
        "city": "San Francisco",
        "country": "USA"
    },
    "employees": [
        {"name": "Alice", "role": "Engineer", "salary": 95000},
        {"name": "Bob", "role": "Designer", "salary": 75000}
    ]
}
# Transform the data
result = tm.flatten(data, name="companies")
# Explore the results
print("Main table:")
print(result.main)
print("\nEmployee table:")
print(result.tables["companies_employees"])
Output:
Main table:
[{
'company': 'TechCorp',
'location_city': 'San Francisco',
'location_country': 'USA',
'_id': 'auto_generated_id',
'_timestamp': '2025-01-15 10:30:00.123456'
}]
The _timestamp field uses a UTC timestamp in YYYY-MM-DD HH:MM:SS.ssssss format.
Note
Timestamp tracking can be disabled by setting time_field=None in
TransmogConfig. See Configuration for details.
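The timestamp layout described above corresponds to the strftime pattern %Y-%m-%d %H:%M:%S.%f in Python's standard datetime module. The following is a minimal sketch for producing or comparing values in that layout; it makes no claim about how Transmog generates the field internally, and utc_timestamp is a hypothetical helper name.

```python
from datetime import datetime, timezone

def utc_timestamp(now=None):
    """Format a time as YYYY-MM-DD HH:MM:SS.ssssss in UTC (hypothetical helper)."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert to UTC first so aware datetimes in other zones format consistently.
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")

fixed = datetime(2025, 1, 15, 10, 30, 0, 123456, tzinfo=timezone.utc)
print(utc_timestamp(fixed))  # 2025-01-15 10:30:00.123456
```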
Employee table:
[
{
'name': 'Alice',
'role': 'Engineer',
'salary': 95000,
'_parent_id': 'auto_generated_id',
'_id': 'auto_generated_id',
'_timestamp': '2025-01-15 10:30:00.123456'
},
{
'name': 'Bob',
'role': 'Designer',
'salary': 75000,
'_parent_id': 'auto_generated_id',
'_id': 'auto_generated_id',
'_timestamp': '2025-01-15 10:30:00.123456'
}
]
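Each employee row carries a _parent_id that refers to the _id of its company row, so the two tables can be joined back together. A minimal sketch using hand-written rows shaped like the output above; the ID values "c1", "e1", and "e2" are placeholders chosen for illustration, not values Transmog would generate.

```python
# Hand-written rows shaped like the output above; the IDs are placeholders.
main = [{"company": "TechCorp", "location_city": "San Francisco", "_id": "c1"}]
employees = [
    {"name": "Alice", "role": "Engineer", "_parent_id": "c1", "_id": "e1"},
    {"name": "Bob", "role": "Designer", "_parent_id": "c1", "_id": "e2"},
]

# Index parent rows by _id, then resolve each child through its _parent_id.
parents_by_id = {row["_id"]: row for row in main}
joined = [
    {"company": parents_by_id[emp["_parent_id"]]["company"], "name": emp["name"]}
    for emp in employees
]
print(joined)
```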
Configuration Examples
# Default: types preserved, optimized for analytics
result = tm.flatten(data)
# CSV: includes empty/null values
config = tm.TransmogConfig(include_nulls=True)
result = tm.flatten(data, config=config)
# Memory: small batches (100)
config = tm.TransmogConfig(batch_size=100)
result = tm.flatten(data, config=config)
Behavior
Default configuration:
Flattens nested objects: location.city becomes location_city
Keeps simple arrays (primitives) as native arrays
Extracts complex arrays (objects) into separate tables
Links parent and child records with generated IDs
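The four default behaviors listed above can be illustrated with a short standard-library sketch. This is not Transmog's implementation: flatten_sketch is a hypothetical function, and the sequential integer IDs are an assumption made to keep the example readable.

```python
from itertools import count

def flatten_sketch(record, table, tables, ids, parent_id=None):
    """Illustrative only: flatten one record into `tables` and return its ID."""
    row = {"_id": next(ids)}
    if parent_id is not None:
        row["_parent_id"] = parent_id  # link the child row to its parent

    def walk(obj, prefix):
        for key, value in obj.items():
            path = f"{prefix}_{key}" if prefix else key
            is_object_array = (
                isinstance(value, list) and bool(value)
                and all(isinstance(item, dict) for item in value)
            )
            if isinstance(value, dict):
                walk(value, path)  # nested object: join keys with "_"
            elif is_object_array:
                for child in value:  # array of objects: rows in a child table
                    flatten_sketch(child, f"{table}_{path}", tables, ids, row["_id"])
            else:
                row[path] = value  # primitives and primitive arrays stay as-is

    walk(record, "")
    tables.setdefault(table, []).append(row)
    return row["_id"]

tables = {}
sample = {
    "company": "TechCorp",
    "location": {"city": "San Francisco"},
    "tags": ["a", "b"],
    "employees": [{"name": "Alice"}, {"name": "Bob"}],
}
flatten_sketch(sample, "companies", tables, count(1))
print(tables["companies"])
print(tables["companies_employees"])
```

The child table name is built from the parent table name and the field path, which mirrors the companies_employees name used in the Quick Start output.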
Working with Files
Process files directly:
# Process a JSON file
result = tm.flatten("data.json", name="products")
# Process JSON Lines / NDJSON
result = tm.flatten("data.jsonl", name="logs")
result = tm.flatten("data.ndjson", name="logs")
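JSON Lines and NDJSON files store one JSON value per line rather than a single top-level document. The sketch below shows that input shape using only the standard library; read_jsonl is a hypothetical helper and not part of Transmog, which reads these files itself when given a path.

```python
import io
import json

def read_jsonl(stream):
    """Yield one parsed record per non-blank line (hypothetical helper)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# An in-memory stream stands in for an open .jsonl file.
sample = io.StringIO(
    '{"level": "info", "msg": "start"}\n'
    "\n"
    '{"level": "error", "msg": "boom"}\n'
)
records = list(read_jsonl(sample))
print(records)
```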
Important
JSON5 and HJSON formats require additional packages:
JSON5: pip install json5
HJSON: pip install hjson
# Process JSON5 (with comments, trailing commas, etc.)
result = tm.flatten("config.json5", name="settings")
# Process HJSON (human-friendly JSON)
result = tm.flatten("data.hjson", name="records")
# Save results as CSV
result.save("output", output_format="csv")
# Save results as Parquet
result.save("output", output_format="parquet")
# Save results as ORC
result.save("output", output_format="orc")
Streaming Large Data
For large datasets that don’t fit in memory:
# Stream process directly to files
tm.flatten_stream(
    large_data,
    output_path="output/",
    name="large_dataset",
    output_format="parquet"
)
Tip
Use flatten_stream() for datasets larger than available RAM. It processes
data in batches and writes directly to disk, using significantly less memory
than flatten().
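The general technique behind batch-wise processing can be sketched with the standard library: pull a fixed number of records at a time from an iterator and write each batch before reading the next, so only one batch is held in memory. This illustrates the idea only and is not Transmog's implementation; iter_batches and write_batches are hypothetical names, and JSON Lines output is used purely for simplicity.

```python
import io
import json
from itertools import islice

def iter_batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch

def write_batches(records, out, batch_size=100):
    """Write records to a text stream one batch at a time; return the batch count."""
    batches = 0
    for batch in iter_batches(records, batch_size):
        out.writelines(json.dumps(rec) + "\n" for rec in batch)
        batches += 1
    return batches

buffer = io.StringIO()  # stands in for a file opened for writing
source = ({"i": i} for i in range(250))  # generator: never fully materialized
n = write_batches(source, buffer, batch_size=100)
print(n, len(buffer.getvalue().splitlines()))  # 3 250
```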
Functions
tm.flatten(data) - Returns a FlattenResult object with data in memory
tm.flatten_stream(data, output_path) - Writes directly to files
Configuration
# Array handling
config = tm.TransmogConfig(array_mode=tm.ArrayMode.SEPARATE)
# ID generation
config = tm.TransmogConfig(id_generation="natural", id_field="product_id")
config = tm.TransmogConfig(id_generation="hash")
config = tm.TransmogConfig(id_generation=["user_id", "date"])
See Array Handling and ID Management for details.
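To show why content-derived IDs are reproducible across runs, here is a minimal sketch that hashes either a whole record or a chosen subset of fields. The function name content_id, the use of SHA-256, and the canonical JSON serialization are all assumptions for illustration; this page does not state which algorithm Transmog uses.

```python
import hashlib
import json

def content_id(record, fields=None):
    """Derive a deterministic ID from a whole record or from selected fields."""
    subset = record if fields is None else {f: record[f] for f in fields}
    # Sorted keys and fixed separators make the serialization order-independent.
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"user_id": 7, "date": "2025-01-15", "amount": 10}
b = {"amount": 99, "date": "2025-01-15", "user_id": 7}

# Whole-record hashes differ because "amount" differs.
print(content_id(a) == content_id(b))  # False
# Hashing only the composite key fields gives the same ID for both records.
print(content_id(a, ["user_id", "date"]) == content_id(b, ["user_id", "date"]))  # True
```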
Results
result = tm.flatten(data, name="products")
# Access main table
main_data = result.main
# Access specific child table
reviews = result.tables["products_reviews"]
# Get all tables including main
all_tables = result.all_tables
# Table information
print(f"Tables: {list(result.all_tables.keys())}")
print(f"Main table records: {len(result.main)}")
# Access main table records
for record in result.main:
    print(record)
Error Handling
Errors are raised as exceptions. See Error Handling for details.
Reference
result = tm.flatten(data, name="table_name")
result = tm.flatten("input.json", name="table_name")
tm.flatten_stream(data, "output/", name="table_name", output_format="parquet")
result.save("output", output_format="csv")
result.save("output.csv")
main_table = result.main
child_tables = result.tables
all_tables = result.all_tables