Flattening Nested JSON Objects in Python for Data Analysis

Flattening Nested JSON Objects in Python for Data Analysis

Flattening Nested JSON Objects in Python for Data Analysis

 

Introduction

JSON, or JavaScript Object Notation, is one of the most common data formats used in APIs, configuration files, and data pipelines. However, JSON data can often become deeply nested, making it difficult to analyze directly. In this post, we’ll explore how to recursively flatten a deeply nested JSON object into a tabular format that’s easier to handle using Python. You’ll learn how to design a reusable flattening function, manage nested lists and dictionaries, and optimize performance for large datasets.

1. Understanding the Problem: Why Flatten JSON?

Before jumping into the code, it helps to understand what we mean by “flattening” JSON. Imagine a JSON like this:

{
    "user": {
        "id": 1,
        "name": "Alice",
        "address": {
            "city": "Wonderland",
            "details": {
                "zip": "12345",
                "country": "Fictional"
            }
        }
    },
    "plan": "premium"
}

The nested structure is intuitive for humans and APIs, but not so friendly for analysis. For example, in Pandas or SQL-like tables, we prefer flat structures like:

{
    "user.id": 1,
    "user.name": "Alice",
    "user.address.city": "Wonderland",
    "user.address.details.zip": "12345",
    "user.address.details.country": "Fictional",
    "plan": "premium"
}

Flattening transforms the hierarchical structure into a dictionary where each nested key path becomes a single key. This is crucial when loading data into tabular structures like Pandas DataFrames or data warehouses.

2. Writing a Recursive JSON Flattener

The most robust way to handle nested JSON is by using recursion — a function that calls itself until no nested dictionaries remain. Here’s a basic implementation:

def flatten_json(nested_json, parent_key='', sep='.'): 
    items = []
    for key, value in nested_json.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.extend(flatten_json(value, new_key, sep=sep).items())
        else:
            items.append((new_key, value))
    return dict(items)

sample = {
    "user": {"id": 1, "info": {"name": "Alice", "age": 25}},
    "plan": "premium"
}

flat = flatten_json(sample)
print(flat)

This outputs:

{'user.id': 1, 'user.info.name': 'Alice', 'user.info.age': 25, 'plan': 'premium'}

How it works: The function uses a depth-first recursion. Each time it encounters a nested dictionary, it recurses deeper, appending the full key path. The result is a flattened dictionary ready for tabular analysis.

3. Handling Nested Lists inside JSON

Many JSON structures also include lists, which can complicate flattening since lists introduce index-based keys. Let’s modify our function to handle lists gracefully:

def flatten_json(nested_json, parent_key='', sep='.'): 
    items = []
    if isinstance(nested_json, dict):
        for key, value in nested_json.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            items.extend(flatten_json(value, new_key, sep=sep).items())
    elif isinstance(nested_json, list):
        for i, item in enumerate(nested_json):
            new_key = f"{parent_key}[{i}]"
            items.extend(flatten_json(item, new_key, sep=sep).items())
    else:
        items.append((parent_key, nested_json))
    return dict(items)

Now even lists of dictionaries are handled:

example = {
    "users": [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": "Bob"}
    ]
}
print(flatten_json(example))

Output:

{'users[0].id': 1, 'users[0].name': 'Alice', 'users[1].id': 2, 'users[1].name': 'Bob'}

This notation (`users[0].id`) makes it easy to trace back the original data hierarchy while still allowing for a flat, column-friendly output.

4. Converting Flattened JSON to a Pandas DataFrame

Once flattened, data can quickly be transformed into a Pandas DataFrame for analysis. Here’s a practical example combining both steps:

import pandas as pd

data = [
    {"user": {"id": 1, "address": {"city": "NY"}}, "plan": "basic"},
    {"user": {"id": 2, "address": {"city": "LA"}}, "plan": "pro"}
]

flat_data = [flatten_json(record) for record in data]
df = pd.DataFrame(flat_data)
print(df)

This prints:

   user.id user.address.city   plan
0        1               NY  basic
1        2               LA    pro

Why it’s useful: Each flattened key now becomes a DataFrame column — perfect for analysis, filtering, and exporting to CSV. This workflow is common in ETL processes, API response handling, and data validation scripts.

5. Performance and Optimization Tips

Flattening large JSON objects can become computationally expensive if done naively, especially when the structure is deeply nested. Here are a few optimization recommendations:

  • Use Iterative Flattening for Extremely Deep JSONs: For JSONs with more than 1000 levels of nesting, recursion might hit Python’s recursion limit. You can rewrite the function using a stack or queue for an iterative approach.
  • Leverage Multiprocessing: When flattening a list of large JSONs, use the `concurrent.futures` module to parallelize computations.
  • Consider Libraries: Libraries like `flatten_json` or `json_normalize` from Pandas provide optimized implementations, especially useful in production ETL workloads.
  • Memory Awareness: Flattening large datasets can temporarily consume a lot of memory — stream JSON data where possible, or flatten in chunks.

Here’s a snippet of how multiprocessing could speed things up:

from concurrent.futures import ProcessPoolExecutor

records = [large_json_1, large_json_2, large_json_3]

with ProcessPoolExecutor() as executor:
    flattened_list = list(executor.map(flatten_json, records))

This approach spreads the workload across CPU cores, making flattening faster for huge datasets.

6. Wrapping Up

Flattening nested JSON in Python helps bridge the gap between hierarchical data structures and tabular data analysis. With recursion, flexible handling of nested lists, and performance optimizations, you can efficiently prepare complex JSON data for analysis in Pandas or SQL databases. A well-designed flattening function becomes a fundamental tool in your data engineering toolkit.

In summary:

  • Flattening transforms nested JSON into easy-to-analyze dictionaries.
  • Recursion is a powerful pattern for hierarchical data traversal.
  • Proper handling of lists and performance considerations makes your solution production-ready.

With these techniques, you’ll be well-prepared to handle even the most complicated JSON data with confidence and clarity.

 

Useful links: