# ID Management This guide covers how Transmog handles record identification, including automatic ID generation, natural ID fields, and relationship management. ## ID System Overview Transmog uses ID fields to: - Uniquely identify each record - Link child records to their parents - Maintain data relationships during transformation - Enable data reconstruction and joining ## Automatic ID Generation By default, Transmog generates unique IDs for all records: ```python import transmog as tm data = { "product": { "name": "Laptop", "reviews": [ {"rating": 5, "comment": "Great"}, {"rating": 4, "comment": "Good"} ] } } result = tm.flatten(data, name="products") # Automatic IDs are generated print("Main record:", result.main[0]) # {'product_name': 'Laptop', '_id': 'generated_unique_id'} print("Review records:", result.tables["products_reviews"]) # [ # {'rating': '5', 'comment': 'Great', '_parent_id': 'generated_unique_id'}, # {'rating': '4', 'comment': 'Good', '_parent_id': 'generated_unique_id'} # ] ``` ### ID Field Names Default ID field names can be customized: ```python # Custom parent ID field name result = tm.flatten( data, name="products", parent_id_field="parent_ref" ) # Child records use custom parent field name print(result.tables["products_reviews"][0]) # {'rating': '5', 'comment': 'Great', 'parent_ref': 'generated_id'} ``` ## Natural ID Fields Existing ID fields in the data can be used instead of generated ones: ### Single ID Field ```python data = { "product": { "product_id": "PROD123", "name": "Gaming Laptop", "reviews": [ {"review_id": "REV456", "rating": 5}, {"review_id": "REV789", "rating": 4} ] } } # Use existing product_id field result = tm.flatten(data, name="products", id_field="product_id") print("Main record:", result.main[0]) # {'product_id': 'PROD123', 'product_name': 'Gaming Laptop', '_id': 'PROD123'} print("Review records:", result.tables["products_reviews"]) # [ # {'review_id': 'REV456', 'rating': '5', '_parent_id': 'PROD123'}, # {'review_id': 'REV789', 'rating': '4', '_parent_id': 'PROD123'} # ] ``` ### Table-Specific ID Fields Different tables can use different ID fields: ```python data = { "company": { "company_id": "COMP123", "name": "TechCorp", "employees": [ {"employee_id": "EMP001", "name": "Alice"}, {"employee_id": "EMP002", "name": "Bob"} ], "offices": [ {"office_id": "OFF001", "city": "San Francisco"}, {"office_id": "OFF002", "city": "New York"} ] } } # Different ID fields for different tables result = tm.flatten(data, name="company", id_field={ "": "company_id", # Main table uses company_id "company_employees": "employee_id", # Employee table uses employee_id "company_offices": "office_id" # Office table uses office_id }) print("Employee records:", result.tables["company_employees"]) # [ # {'employee_id': 'EMP001', 'name': 'Alice', '_parent_id': 'COMP123', '_id': 'EMP001'}, # {'employee_id': 'EMP002', 'name': 'Bob', '_parent_id': 'COMP123', '_id': 'EMP002'} # ] ``` ### Fallback to Generated IDs When specified ID fields are missing, Transmog falls back to generated IDs: ```python data = [ {"product_id": "PROD123", "name": "Laptop"}, # Has ID field {"name": "Mouse"}, # Missing ID field {"product_id": "PROD456", "name": "Keyboard"} # Has ID field ] result = tm.flatten(data, name="products", id_field="product_id") # Records with missing ID fields get generated IDs for record in result.main: print(f"Name: {record['name']}, ID: {record['_id']}") # Name: Laptop, ID: PROD123 # Name: Mouse, ID: generated_id # Name: Keyboard, ID: PROD456 ``` ## Relationship Management ### Parent-Child Links Child records always reference their parent through the parent ID field: ```python # Build parent-child relationship map def map_relationships(result): parent_map = {} # Index main records by ID for record in result.main: parent_map[record["_id"]] = { "parent": record, "children": {} } # Map child records to parents for table_name, records in result.tables.items(): for record in records: parent_id = record["_parent_id"] if parent_id in parent_map: if table_name not in parent_map[parent_id]["children"]: parent_map[parent_id]["children"][table_name] = [] parent_map[parent_id]["children"][table_name].append(record) return parent_map relationships = map_relationships(result) ``` ### Multi-Level Hierarchies Nested arrays create multi-level ID relationships: ```python data = { "company": { "company_id": "COMP123", "departments": [ { "dept_id": "DEPT001", "name": "Engineering", "teams": [ {"team_id": "TEAM001", "name": "Frontend"}, {"team_id": "TEAM002", "name": "Backend"} ] } ] } } result = tm.flatten(data, name="company", id_field={ "": "company_id", "company_departments": "dept_id", "company_departments_teams": "team_id" }) # Three-level hierarchy print("Company:", result.main[0]["_id"]) # COMP123 print("Department:", result.tables["company_departments"][0]["_id"]) # DEPT001 print("Team parent:", result.tables["company_departments_teams"][0]["_parent_id"]) # DEPT001 ``` ## ID Generation Strategies ### Deterministic IDs Natural IDs provide deterministic, reproducible results: ```python # Same data with same natural IDs produces identical results data1 = {"product_id": "PROD123", "name": "Laptop"} data2 = {"product_id": "PROD123", "name": "Laptop"} result1 = tm.flatten(data1, name="products", id_field="product_id") result2 = tm.flatten(data2, name="products", id_field="product_id") # IDs are deterministic assert result1.main[0]["_id"] == result2.main[0]["_id"] print("Deterministic ID:", result1.main[0]["_id"]) # PROD123 ``` ### Generated ID Consistency Generated IDs are unique within a processing session but not across sessions: ```python # Generated IDs are unique but not deterministic across runs result1 = tm.flatten(data, name="products") # No id_field specified result2 = tm.flatten(data, name="products") # No id_field specified # IDs will be different between runs print("Run 1 ID:", result1.main[0]["_id"]) print("Run 2 ID:", result2.main[0]["_id"]) ``` ## ID Validation and Quality ### Missing ID Detection Check for records with missing natural IDs: ```python def check_id_coverage(result, expected_id_field): """Check how many records have natural vs generated IDs.""" natural_ids = 0 generated_ids = 0 for record in result.main: if expected_id_field in record and record[expected_id_field]: natural_ids += 1 else: generated_ids += 1 return natural_ids, generated_ids natural, generated = check_id_coverage(result, "product_id") print(f"Natural IDs: {natural}, Generated IDs: {generated}") ``` ### ID Uniqueness Validation Verify ID uniqueness across tables: ```python def validate_id_uniqueness(result): """Validate that all IDs are unique.""" all_ids = set() duplicates = [] # Check main table IDs for record in result.main: id_value = record["_id"] if id_value in all_ids: duplicates.append(id_value) all_ids.add(id_value) # Check child table IDs (if they have _id fields) for table_name, records in result.tables.items(): for record in records: if "_id" in record: id_value = record["_id"] if id_value in all_ids: duplicates.append(id_value) all_ids.add(id_value) return duplicates duplicates = validate_id_uniqueness(result) if duplicates: print(f"Duplicate IDs found: {duplicates}") ``` ### Orphaned Record Detection Check for child records without valid parents: ```python def find_orphaned_records(result): """Find child records with invalid parent references.""" main_ids = {record["_id"] for record in result.main} orphaned = {} for table_name, records in result.tables.items(): table_orphans = [] for record in records: parent_id = record["_parent_id"] if parent_id not in main_ids: table_orphans.append(record) if table_orphans: orphaned[table_name] = table_orphans return orphaned orphaned = find_orphaned_records(result) for table, records in orphaned.items(): print(f"Orphaned records in {table}: {len(records)}") ``` ## Advanced ID Scenarios ### Composite Natural IDs When natural IDs are complex, use string concatenation: ```python data = [ {"region": "US", "store": "001", "product": "laptop"}, {"region": "EU", "store": "002", "product": "mouse"} ] # Create composite ID from multiple fields def create_composite_id(record): return f"{record.get('region', '')}_{record.get('store', '')}_{record.get('product', '')}" # Pre-process data to add composite ID for record in data: record["composite_id"] = create_composite_id(record) result = tm.flatten(data, name="sales", id_field="composite_id") print("Composite IDs:", [r["_id"] for r in result.main]) # ['US_001_laptop', 'EU_002_mouse'] ``` ### Conditional ID Fields Use different ID fields based on data conditions: ```python def determine_id_field(data): """Determine appropriate ID field based on data structure.""" if isinstance(data, list) and data: sample = data[0] if "primary_id" in sample: return "primary_id" elif "id" in sample: return "id" elif "uuid" in sample: return "uuid" return None # Use generated IDs # Determine ID field dynamically id_field = determine_id_field(data) result = tm.flatten(data, name="records", id_field=id_field) ``` ## Metadata Enhancement ### Timestamp Addition Add processing timestamps to records: ```python # Add timestamps for audit trails result = tm.flatten( data, name="events", id_field="event_id", add_timestamp=True ) # Records include timestamp metadata print("Record with timestamp:", result.main[0]) # {'event_id': 'EVT123', 'name': 'User Login', '_id': 'EVT123', '_timestamp': '2024-01-15T10:30:00'} ``` ### Custom Metadata Add custom metadata during processing: ```python # Custom metadata can be added post-processing result = tm.flatten(data, name="records", id_field="record_id") # Add processing metadata processing_info = { "processed_at": "2024-01-15T10:30:00", "version": "1.0", "source": "api_import" } for record in result.main: record.update(processing_info) ``` ## Best Practices ### ID Field Selection Choose ID fields based on requirements: ```python # For reproducible results result = tm.flatten(data, name="products", id_field="product_id") # For simplicity when IDs don't matter result = tm.flatten(data, name="products") # Use generated IDs # For complex scenarios with multiple entity types result = tm.flatten(data, name="entities", id_field={ "": "entity_id", "entities_children": "child_id", "entities_metadata": "meta_id" }) ``` ### ID Validation Pipeline Implement ID validation in data pipelines: ```python def validate_and_process(data, id_config): """Validate IDs before processing.""" # Pre-validation if id_config.get("required_fields"): missing = check_required_id_fields(data, id_config["required_fields"]) if missing: raise ValueError(f"Missing required ID fields: {missing}") # Process data result = tm.flatten(data, name="validated", **id_config) # Post-validation duplicates = validate_id_uniqueness(result) if duplicates: raise ValueError(f"Duplicate IDs detected: {duplicates}") orphaned = find_orphaned_records(result) if orphaned: print(f"Warning: Orphaned records found in {list(orphaned.keys())}") return result ``` ## Next Steps - **[Output Formats](output-formats.md)** - Choose formats that preserve ID relationships - **[Error Handling](error-handling.md)** - Handle ID-related processing errors - **[Performance Guide](../developer_guide/performance.md)** - Optimize ID processing for large datasets