JSON for Content Translation & Localization: A Practical Guide

· 16 min read

I built Utilitiz to solve a specific problem I kept running into as a CTO: translating and uploading large JSON product catalogs to PIM systems using AI-powered translation APIs. The data was either too long for API limits, or contained too many empty values that prevented clean uploads.

This guide shares everything I've learned about using JSON for multilingual content translation, from structuring data for translation management systems to handling API constraints and automating workflows.

Why JSON for Multilingual Content

When building enterprise AI ecosystems for content translation, I've worked with many formats. JSON emerged as the clear winner for several reasons.

Structured and machine-readable

Unlike plain text or CSV, JSON preserves the relationship between keys and values. When translating product descriptions, you know exactly which field contains what. This is critical for maintaining context during translation.

Universal API format

Every modern translation API accepts JSON. OpenAI, Google Translate, DeepL, Microsoft Translator - they all work with JSON payloads. This means you can switch providers without reformatting your entire dataset.

Easy to version and diff

When managing translations across 10+ languages for e-commerce catalogs, being able to see exactly what changed between versions is invaluable. JSON in Git shows clear diffs.

Direct database integration

Modern databases like PostgreSQL, MongoDB, and Firebase work natively with JSON. I've built translation workflows where JSON goes from API to database to frontend without any transformation.

Common Challenges with JSON in Translation Workflows

Through building AI translation systems for international expansion, I've encountered these challenges repeatedly.

File size limits

The problem that led me to build Utilitiz: AI translation APIs have strict size limits. OpenAI's API has a 4MB payload limit. When translating a 50MB product catalog with 10,000 items, you hit the wall immediately.

I learned this the hard way when uploading translated product data to a PIM system. The entire upload would fail because a single JSON file exceeded the limit.

Empty values and null fields

Product catalogs from legacy systems are full of empty strings, null values, and placeholder text. Translating these wastes API credits and bloats your data. I've seen datasets where 40% of values were empty.

Data integrity during split-translate-merge

When you split a JSON file, translate each part, and merge them back together, one wrong move corrupts the entire dataset. Foreign keys, references, and nested structures need careful handling.

Cost management

Translation APIs charge per character. When translating 10,000 products from English to 10 languages, every byte counts. Cleaning data before translation saved me 35% on translation costs in one project.

Structuring JSON for TMS and PIM Systems

When building translation workflows for Product Information Management systems, structure matters enormously.

Flat vs nested structure

Most TMS (Translation Management Systems) prefer flat structures:

// Good for TMS - flat structure
{
  "product_name": "Wireless Mouse",
  "product_description": "Ergonomic wireless mouse with 6 buttons",
  "product_features_1": "2.4GHz connection",
  "product_features_2": "1200 DPI",
  "product_features_3": "18-month battery life"
}

But PIM systems often require nested structures:

// Better for PIM - nested structure
{
  "product": {
    "name": "Wireless Mouse",
    "description": "Ergonomic wireless mouse with 6 buttons",
    "features": [
      "2.4GHz connection",
      "1200 DPI",
      "18-month battery life"
    ],
    "specifications": {
      "weight": "95g",
      "dimensions": "10 x 6 x 4 cm"
    }
  }
}
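When the same catalog has to feed both a TMS and a PIM, a small helper can derive the flat form from the nested one. Here's a minimal Python sketch; the underscore-joined, 1-based key convention is an assumption taken from the flat example above:

```python
def flatten(obj, prefix=""):
    # Recursively flatten nested dicts and lists into underscore-joined keys,
    # e.g. {"product": {"features": ["a"]}} -> {"product_features_1": "a"}
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}_"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj, start=1):
            flat.update(flatten(value, f"{prefix}{i}_"))
    else:
        flat[prefix.rstrip("_")] = obj
    return flat
```

Running it on the nested product above yields keys like product_name and product_features_1, matching what most TMS imports expect.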

Language file organization

Two common patterns I've used in production:

Pattern 1: Separate files per language

translations/
  en.json
  fr.json
  de.json
  es.json

Each file contains the complete translation for that language.

// en.json
{
  "products": [
    {"id": "P001", "name": "Wireless Mouse"},
    {"id": "P002", "name": "Keyboard"}
  ]
}

// fr.json
{
  "products": [
    {"id": "P001", "name": "Souris sans fil"},
    {"id": "P002", "name": "Clavier"}
  ]
}

Pattern 2: Single file with language keys

{
  "products": [
    {
      "id": "P001",
      "name": {
        "en": "Wireless Mouse",
        "fr": "Souris sans fil",
        "de": "Kabellose Maus"
      }
    }
  ]
}

In my experience, separate files work better for large catalogs because you can process one language at a time without loading the entire dataset.
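Converting from Pattern 2 to Pattern 1 is mechanical. A minimal Python sketch, assuming only the id/name shape from the example above (a real catalog would extend the field list):

```python
def split_by_language(products, languages):
    # Fan the single-file "language keys" shape out into one product list
    # per language, so each language can be processed independently.
    per_lang = {lang: [] for lang in languages}
    for product in products:
        for lang in languages:
            per_lang[lang].append({
                "id": product["id"],
                "name": product["name"].get(lang, ""),
            })
    return per_lang
```

Each resulting list can then be written to en.json, fr.json, and so on.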

Cleaning JSON Before Translation

When uploading product data to PIM systems using AI-powered translation, cleaning the data first is essential. Here's my workflow.

Remove empty values

This is exactly why I built Utilitiz's clean function:

// Before cleaning - 420 characters
{
  "name": "Product A",
  "description": "Great product",
  "category": "",
  "tags": [],
  "price": 29.99,
  "discount": null,
  "stock": 0,
  "metadata": {
    "color": "",
    "size": null
  }
}

// After cleaning - 80 characters (81% reduction)
{
  "name": "Product A",
  "description": "Great product",
  "price": 29.99,
  "stock": 0
}

Deduplicate translatable strings

In product catalogs, the same descriptions appear multiple times. Extract unique strings to avoid translating the same text twice.

// JavaScript example for deduplication
function extractUniqueStrings(products) {
  const strings = new Set();
  const mapping = [];

  products.forEach((product, idx) => {
    Object.entries(product).forEach(([key, value]) => {
      if (typeof value === 'string' && value.trim()) {
        strings.add(value);
        mapping.push({ productIdx: idx, key, value });
      }
    });
  });

  return { unique: Array.from(strings), mapping };
}

Normalize whitespace and encoding

When building content translation workflows, inconsistent encoding breaks everything. I always normalize to UTF-8 and trim whitespace.

# Python example
import json

def clean_for_translation(data):
    if isinstance(data, dict):
        # Clean values first, so whitespace-only strings and containers
        # that become empty after cleaning (like the "metadata" object
        # above) are dropped too
        cleaned = {k: clean_for_translation(v) for k, v in data.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    elif isinstance(data, list):
        return [clean_for_translation(item) for item in data]
    elif isinstance(data, str):
        return data.strip()
    return data

with open('products.json', encoding='utf-8') as f:
    data = json.load(f)

cleaned = clean_for_translation(data)

with open('products_cleaned.json', 'w', encoding='utf-8') as f:
    json.dump(cleaned, f, ensure_ascii=False, indent=2)

Splitting Strategies for API Limits

This is the core problem I faced: how do you translate a 50MB JSON file when your API limit is 4MB?

By file size (recommended)

Split based on actual byte size to stay under API limits. This is what Utilitiz does automatically.

function splitBySize(data, maxSizeKB = 4096) {
  const parts = [];
  let currentPart = [];
  let currentSize = 0;
  const encoder = new TextEncoder();

  data.forEach(item => {
    // Measure actual UTF-8 bytes, not string length: multibyte characters
    // in translated text would otherwise be undercounted
    const itemSize = encoder.encode(JSON.stringify(item)).length / 1024;

    if (currentSize + itemSize > maxSizeKB && currentPart.length > 0) {
      parts.push(currentPart);
      currentPart = [];
      currentSize = 0;
    }

    currentPart.push(item);
    currentSize += itemSize;
  });

  if (currentPart.length > 0) parts.push(currentPart);
  return parts;
}

By item count

Simpler but less precise. Use when items are roughly the same size.

# Python example
import json

def split_by_count(items, chunk_size=100):
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

with open('products.json', encoding='utf-8') as f:
    products = json.load(f)

chunks = split_by_count(products, 100)

for i, chunk in enumerate(chunks):
    with open(f'products_part_{i+1}.json', 'w', encoding='utf-8') as f:
        json.dump(chunk, f, ensure_ascii=False)

By semantic boundaries

When translating hierarchical product categories, split at category boundaries to preserve context.

function splitByCategory(products) {
  const byCategory = {};

  products.forEach(product => {
    const cat = product.category || 'uncategorized';
    if (!byCategory[cat]) byCategory[cat] = [];
    byCategory[cat].push(product);
  });

  return Object.values(byCategory);
}

Merging Translated Data Back Together

After translating split files, you need to reassemble them without corrupting data. I've learned these patterns work reliably in production.

Array concatenation

Simplest case: just concatenate arrays in order.

const fs = require('fs/promises');

async function mergeTranslatedParts(partFiles) {
  const allProducts = [];

  // Sort numerically: a plain sort() would put "part_10" before "part_2"
  const ordered = [...partFiles].sort((a, b) =>
    a.localeCompare(b, undefined, { numeric: true })
  );

  for (const file of ordered) {
    const data = await fs.readFile(file, 'utf8');
    allProducts.push(...JSON.parse(data));
  }

  return allProducts;
}

ID-based merging for updates

When merging translated fields back into original data, match by ID to preserve untranslated fields.

function mergeTranslations(original, translated) {
  const translationMap = new Map(
    translated.map(item => [item.id, item])
  );

  return original.map(item => {
    const translation = translationMap.get(item.id);
    if (!translation) return item;

    return {
      ...item,
      ...translation,
      // Preserve technical fields that shouldn't be translated
      id: item.id,
      sku: item.sku,
      price: item.price
    };
  });
}

Validation after merge

Always validate after merging to catch data corruption early.

function validateMergedData(original, merged) {
  const issues = [];

  if (original.length !== merged.length) {
    issues.push(`Count mismatch: ${original.length} vs ${merged.length}`);
  }

  merged.forEach((item, idx) => {
    if (!item.id) issues.push(`Missing ID at index ${idx}`);
    if (item.id !== original[idx].id) {
      issues.push(`ID mismatch at ${idx}: ${item.id} vs ${original[idx].id}`);
    }
  });

  return issues;
}

Real Case Study: Translating Product Catalogs with AI APIs

Let me walk through a real project where I translated 10,000 products from English to 8 languages using OpenAI's API and uploaded them to a PIM system.

The challenge

  • 10,000 products, average 800 characters each = ~8MB of text
  • Target: 8 languages (French, German, Spanish, Italian, Portuguese, Dutch, Polish, Swedish)
  • OpenAI API limit: 4MB per request
  • PIM system upload limit: 5MB per file
  • Budget constraint: Keep translation costs under $500

The solution

Step 1: Clean the data

Removed empty fields, null values, and duplicate descriptions. This reduced the dataset from 8MB to 5.2MB (35% reduction).

Step 2: Extract translatable strings

Separated translatable text (name, description, features) from technical data (SKU, price, dimensions). Found 3,200 unique strings instead of translating all 10,000 products.
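That separation step can be sketched in a few lines of Python; the field names here are assumptions based on this catalog, so adjust TRANSLATABLE to your schema:

```python
# Assumed field names; technical fields (SKU, price, dimensions) pass through
TRANSLATABLE = {"name", "description", "features"}

def separate_fields(product):
    # Split a product record into the part sent to the translator
    # and the technical part that must remain untouched
    translatable = {k: v for k, v in product.items() if k in TRANSLATABLE}
    technical = {k: v for k, v in product.items() if k not in TRANSLATABLE}
    return translatable, technical
```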

Step 3: Split for API limits

Split 3,200 strings into 2MB chunks, resulting in 3 parts.

Step 4: Batch translate with AI

# Uses the pre-1.0 openai Python SDK (openai.ChatCompletion)
import openai
import json

def translate_batch(texts, target_lang):
    prompt = f"""Translate these product descriptions to {target_lang}.
Return only a JSON array with the translations in the same order.

Input: {json.dumps(texts, ensure_ascii=False)}
"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )

    return json.loads(response.choices[0].message.content)

# Process each chunk
for chunk in chunks:
    for lang in languages:
        translations = translate_batch(chunk, lang)
        save_translations(translations, lang)

Step 5: Merge translations back

Mapped translations back to original product IDs, combined all languages into one file per language.
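The mapping step can be sketched as follows; translation_map and the field names are assumptions standing in for the actual production code:

```python
def apply_translations(products, translation_map, fields=("name", "description")):
    # translation_map: {original English string: translated string},
    # built from the unique-string batches translated in step 4
    translated = []
    for product in products:
        copy = dict(product)
        for field in fields:
            value = product.get(field)
            if isinstance(value, str) and value in translation_map:
                copy[field] = translation_map[value]
        translated.append(copy)
    return translated
```

Because lookup is by string rather than by product, each of the 3,200 unique translations fans out to every product that reuses that text.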

Step 6: Validate and upload

Validated data integrity (all IDs present, no missing translations), then uploaded to PIM system in 3MB chunks.

Results

  • Total cost: $387 (under budget)
  • Time: 2 hours (vs 2 weeks manual translation)
  • Quality: 95% accuracy, required minor fixes on technical terms
  • Data integrity: 100% - no lost or corrupted products

This project is why I built Utilitiz. The manual process of cleaning, splitting, and merging JSON was error-prone and time-consuming. Automating it saved countless hours.

Tools and Automation Patterns

Here are the tools and workflows I use in production translation systems.

Workflow automation with Make/Zapier

For non-technical teams, I set up automated workflows:

  1. Product manager uploads JSON to Dropbox
  2. Webhook triggers cleaning (removes empty values)
  3. Split into API-sized chunks automatically
  4. Send each chunk to translation API
  5. Merge results and validate
  6. Upload to PIM system
  7. Send notification when complete

Version control with Git

I always version control translation files. When a product description changes, Git shows exactly what needs retranslation.

git diff HEAD~1 -- en.json
# Shows which source strings changed and need retranslation

Quality assurance checks

Automated validation before sending to production:

function validateTranslations(original, translated) {
  const checks = {
    countMatch: original.length === translated.length,
    allIDsPresent: translated.every(item => item.id),
    noEmptyTranslations: translated.every(item =>
      item.name && item.description
    ),
    technicalDataIntact: translated.every((item, idx) =>
      item.price === original[idx].price &&
      item.sku === original[idx].sku
    )
  };

  return Object.entries(checks).filter(([_, passed]) => !passed);
}

Cost tracking

Translation APIs charge per character. I track costs per project to optimize spending.

function estimateTranslationCost(text, targetLangs, costPerChar = 0.00002) {
  const charCount = text.length;
  const totalChars = charCount * targetLangs.length;
  const cost = totalChars * costPerChar;

  return {
    characters: totalChars,
    estimatedCost: cost.toFixed(2),
    languages: targetLangs.length
  };
}

Using Utilitiz for translation workflows

I built Utilitiz specifically for these workflows. The tool provides instant cleaning, splitting by size, and merging - all in the browser without uploading sensitive product data to third-party servers.

For teams without developers, it's the fastest way to prepare JSON for translation and reassemble results.