JSON for Content Translation & Localization: A Practical Guide
I built Utilitiz to solve a specific problem I kept running into as a CTO: translating and uploading large JSON product catalogs to PIM systems using AI-powered translation APIs. The data was either too long for API limits, or contained too many empty values that prevented clean uploads.
This guide shares everything I've learned about using JSON for multilingual content translation, from structuring data for translation management systems to handling API constraints and automating workflows.
Why JSON for Multilingual Content
When building enterprise AI ecosystems for content translation, I've worked with many formats. JSON emerged as the clear winner for several reasons.
Structured and machine-readable
Unlike plain text or CSV, JSON preserves the relationship between keys and values. When translating product descriptions, you know exactly which field contains what. This is critical for maintaining context during translation.
Universal API format
Every modern translation API accepts JSON. OpenAI, Google Translate, DeepL, Microsoft Translator - they all work with JSON payloads. This means you can switch providers without reformatting your entire dataset.
Easy to version and diff
When managing translations across 10+ languages for e-commerce catalogs, being able to see exactly what changed between versions is invaluable. JSON in Git shows clear diffs.
Direct database integration
Modern databases like PostgreSQL, MongoDB, and Firebase work natively with JSON. I've built translation workflows where JSON goes from API to database to frontend without any transformation.
Common Challenges with JSON in Translation Workflows
Through building AI translation systems for international expansion, I've encountered these challenges repeatedly.
File size limits
The problem that led me to build Utilitiz: AI translation APIs have strict size limits. OpenAI's API has a 4MB payload limit. When translating a 50MB product catalog with 10,000 items, you hit the wall immediately.
I learned this the hard way when uploading translated product data to a PIM system. The entire upload would fail because a single JSON file exceeded the limit.
Empty values and null fields
Product catalogs from legacy systems are full of empty strings, null values, and placeholder text. Translating these wastes API credits and bloats your data. I've seen datasets where 40% of values were empty.
Data integrity during split-translate-merge
When you split a JSON file, translate each part, and merge them back together, one wrong move corrupts the entire dataset. Foreign keys, references, and nested structures need careful handling.
Cost management
Translation APIs charge per character. When translating 10,000 products from English to 10 languages, every byte counts. Cleaning data before translation saved me 35% on translation costs in one project.
Structuring JSON for TMS and PIM Systems
When building translation workflows for Product Information Management systems, structure matters enormously.
Flat vs nested structure
Most TMS (Translation Management Systems) prefer flat structures:
// Good for TMS - flat structure
{
  "product_name": "Wireless Mouse",
  "product_description": "Ergonomic wireless mouse with 6 buttons",
  "product_features_1": "2.4GHz connection",
  "product_features_2": "1200 DPI",
  "product_features_3": "18-month battery life"
}
But PIM systems often require nested structures:
// Better for PIM - nested structure
{
  "product": {
    "name": "Wireless Mouse",
    "description": "Ergonomic wireless mouse with 6 buttons",
    "features": [
      "2.4GHz connection",
      "1200 DPI",
      "18-month battery life"
    ],
    "specifications": {
      "weight": "95g",
      "dimensions": "10 x 6 x 4 cm"
    }
  }
}
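When the same catalog has to move between a flat TMS export and a nested PIM import, a small converter saves hand-editing. Here is a minimal Python sketch; the dotted key-path convention and function names are my own, not from any particular TMS. For brevity, `unflatten` rebuilds arrays as dicts keyed by index; a production version would restore real lists.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into single-level keys like 'product.features.0'."""
    flat = {}
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def unflatten(flat):
    """Rebuild a nested structure from dotted key paths (arrays come back as dicts)."""
    root = {}
    for path, value in flat.items():
        keys = path.split(".")
        node = root
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return root
```

Round-tripping through `flatten` before sending to a TMS and `unflatten` before a PIM upload keeps one source of truth for the catalog.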
Language file organization
Two common patterns I've used in production:
Pattern 1: Separate files per language
translations/
  en.json
  fr.json
  de.json
  es.json
Each file contains the complete translation for that language.
// en.json
{
  "products": [
    {"id": "P001", "name": "Wireless Mouse"},
    {"id": "P002", "name": "Keyboard"}
  ]
}

// fr.json
{
  "products": [
    {"id": "P001", "name": "Souris sans fil"},
    {"id": "P002", "name": "Clavier"}
  ]
}
Pattern 2: Single file with language keys
{
  "products": [
    {
      "id": "P001",
      "name": {
        "en": "Wireless Mouse",
        "fr": "Souris sans fil",
        "de": "Kabellose Maus"
      }
    }
  ]
}
In my experience, separate files work better for large catalogs because you can process one language at a time without loading the entire dataset.
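Converting between the two patterns is mechanical. Here is a Python sketch that merges per-language files (Pattern 1) into the language-keyed layout (Pattern 2); the function name is mine, and it assumes each file lists the same product IDs under a top-level "products" array:

```python
import json

def merge_language_files(paths):
    """paths: dict mapping language code to file path, e.g. {'en': 'en.json'}."""
    merged = {}  # product id -> {"id": ..., "name": {lang: translation}}
    for lang, path in paths.items():
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        for product in data["products"]:
            entry = merged.setdefault(product["id"], {"id": product["id"], "name": {}})
            entry["name"][lang] = product["name"]
    return {"products": list(merged.values())}
```

The reverse direction (splitting a language-keyed file back into per-language files) is the same loop inverted, which is why the two patterns stay interchangeable.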
Cleaning JSON Before Translation
When uploading product data to PIM systems using AI-powered translation, cleaning the data first is essential. Here's my workflow.
Remove empty values
This is exactly why I built Utilitiz's clean function:
// Before cleaning - ~150 characters when minified
{
  "name": "Product A",
  "description": "Great product",
  "category": "",
  "tags": [],
  "price": 29.99,
  "discount": null,
  "stock": 0,
  "metadata": {
    "color": "",
    "size": null
  }
}
// After cleaning - ~75 characters when minified (roughly 50% smaller)
{
  "name": "Product A",
  "description": "Great product",
  "price": 29.99,
  "stock": 0
}
Deduplicate translatable strings
In product catalogs, the same descriptions appear multiple times. Extract unique strings to avoid translating the same text twice.
// JavaScript example for deduplication (checks top-level string fields only)
function extractUniqueStrings(products) {
  const strings = new Set();
  const mapping = [];
  products.forEach((product, idx) => {
    Object.entries(product).forEach(([key, value]) => {
      if (typeof value === 'string' && value.trim()) {
        strings.add(value);
        mapping.push({ productIdx: idx, key, value });
      }
    });
  });
  return { unique: Array.from(strings), mapping };
}
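The reverse step matters just as much: after translating each unique string once, every occurrence needs its translation written back. Here is a Python sketch of the full round trip; the function names and the hard-coded list of untranslatable keys are my own illustration:

```python
TECHNICAL_KEYS = {"id", "sku"}  # fields that must never be translated

def extract_unique_strings(products):
    """Collect each distinct translatable string once, plus where it occurs."""
    unique, mapping, seen = [], [], set()
    for idx, product in enumerate(products):
        for key, value in product.items():
            if key in TECHNICAL_KEYS:
                continue
            if isinstance(value, str) and value.strip():
                if value not in seen:
                    seen.add(value)
                    unique.append(value)
                mapping.append((idx, key, value))
    return unique, mapping

def apply_translations(products, mapping, translations):
    """translations: dict of source string -> translated string."""
    for idx, key, value in mapping:
        products[idx][key] = translations[value]
    return products
```

With 10,000 products but only a few thousand distinct strings, you pay the API for each string once and fan the result out locally.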
Normalize whitespace and encoding
When building content translation workflows, inconsistent encoding breaks everything. I always normalize to UTF-8 and trim whitespace.
# Python example
import json

def clean_for_translation(data):
    if isinstance(data, dict):
        # Clean children first so nested objects that become empty are also dropped
        cleaned = {k: clean_for_translation(v) for k, v in data.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    elif isinstance(data, list):
        return [clean_for_translation(item) for item in data]
    elif isinstance(data, str):
        return data.strip()
    return data

with open('products.json', encoding='utf-8') as f:
    data = json.load(f)
cleaned = clean_for_translation(data)
with open('products_cleaned.json', 'w', encoding='utf-8') as f:
    json.dump(cleaned, f, ensure_ascii=False, indent=2)
Splitting Strategies for API Limits
This is the core problem I faced: how do you translate a 50MB JSON file when your API limit is 4MB?
By file size (recommended)
Split based on actual byte size to stay under API limits. This is what Utilitiz does automatically.
function splitBySize(data, maxSizeKB = 4096) {
  const parts = [];
  let currentPart = [];
  let currentSize = 0;
  data.forEach(item => {
    // Measure actual UTF-8 bytes, not string length - multi-byte characters
    // (accents, CJK) make the two differ, and API limits are in bytes
    const itemSize = Buffer.byteLength(JSON.stringify(item), 'utf8') / 1024;
    if (currentSize + itemSize > maxSizeKB && currentPart.length > 0) {
      parts.push(currentPart);
      currentPart = [];
      currentSize = 0;
    }
    currentPart.push(item);
    currentSize += itemSize;
  });
  if (currentPart.length > 0) parts.push(currentPart);
  return parts;
}
By item count
Simpler but less precise. Use when items are roughly the same size.
# Python example
import json

def split_by_count(items, chunk_size=100):
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

with open('products.json', encoding='utf-8') as f:
    products = json.load(f)

chunks = split_by_count(products, 100)
for i, chunk in enumerate(chunks):
    with open(f'products_part_{i + 1}.json', 'w', encoding='utf-8') as f:
        json.dump(chunk, f, ensure_ascii=False)
By semantic boundaries
When translating hierarchical product categories, split at category boundaries to preserve context.
function splitByCategory(products) {
  const byCategory = {};
  products.forEach(product => {
    const cat = product.category || 'uncategorized';
    if (!byCategory[cat]) byCategory[cat] = [];
    byCategory[cat].push(product);
  });
  return Object.values(byCategory);
}
Merging Translated Data Back Together
After translating split files, you need to reassemble them without corrupting data. I've learned these patterns work reliably in production.
Array concatenation
Simplest case: just concatenate arrays in order.
const fs = require('fs').promises;

async function mergeTranslatedParts(partFiles) {
  // Sort numerically: a plain sort() is lexicographic and puts
  // "part_10.json" before "part_2.json", scrambling the order
  const partNumber = f => parseInt(f.match(/_(\d+)\.json$/)[1], 10);
  const sorted = [...partFiles].sort((a, b) => partNumber(a) - partNumber(b));
  const allProducts = [];
  for (const file of sorted) {
    const data = await fs.readFile(file, 'utf8');
    allProducts.push(...JSON.parse(data));
  }
  return allProducts;
}
ID-based merging for updates
When merging translated fields back into original data, match by ID to preserve untranslated fields.
function mergeTranslations(original, translated) {
  const translationMap = new Map(
    translated.map(item => [item.id, item])
  );
  return original.map(item => {
    const translation = translationMap.get(item.id);
    if (!translation) return item;
    return {
      ...item,
      ...translation,
      // Preserve technical fields that shouldn't be translated
      id: item.id,
      sku: item.sku,
      price: item.price
    };
  });
}
Validation after merge
Always validate after merging to catch data corruption early.
function validateMergedData(original, merged) {
  const issues = [];
  if (original.length !== merged.length) {
    issues.push(`Count mismatch: ${original.length} vs ${merged.length}`);
  }
  merged.forEach((item, idx) => {
    if (!item.id) issues.push(`Missing ID at index ${idx}`);
    // Guard against out-of-range access when the counts differ
    if (original[idx] && item.id !== original[idx].id) {
      issues.push(`ID mismatch at ${idx}: ${item.id} vs ${original[idx].id}`);
    }
  });
  return issues;
}
Real Case Study: Translating Product Catalogs with AI APIs
Let me walk through a real project where I translated 10,000 products from English to 8 languages using OpenAI's API and uploaded them to a PIM system.
The challenge
- 10,000 products, average 800 characters each = ~8MB of text
- Target: 8 languages (French, German, Spanish, Italian, Portuguese, Dutch, Polish, Swedish)
- OpenAI API limit: 4MB per request
- PIM system upload limit: 5MB per file
- Budget constraint: Keep translation costs under $500
The solution
Step 1: Clean the data
Removed empty fields, null values, and duplicate descriptions. This reduced the dataset from 8MB to 5.2MB (35% reduction).
Step 2: Extract translatable strings
Separated translatable text (name, description, features) from technical data (SKU, price, dimensions). Found 3,200 unique strings instead of translating all 10,000 products.
Step 3: Split for API limits
Split 3,200 strings into 2MB chunks, resulting in 3 parts.
Step 4: Batch translate with AI
# Uses the legacy (pre-1.0) openai SDK interface; newer SDK versions
# call client.chat.completions.create() instead
import openai
import json

def translate_batch(texts, target_lang):
    prompt = f"""Translate these product descriptions to {target_lang}.
Return a JSON array with translations in the same order.
Input: {json.dumps(texts)}
"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return json.loads(response.choices[0].message.content)

# Process each chunk
for chunk in chunks:
    for lang in languages:
        translations = translate_batch(chunk, lang)
        save_translations(translations, lang)
Step 5: Merge translations back
Mapped translations back to original product IDs, combined all languages into one file per language.
Step 6: Validate and upload
Validated data integrity (all IDs present, no missing translations), then uploaded to PIM system in 3MB chunks.
Results
- Total cost: $387 (under budget)
- Time: 2 hours (vs 2 weeks manual translation)
- Quality: 95% accuracy, required minor fixes on technical terms
- Data integrity: 100% - no lost or corrupted products
This project is why I built Utilitiz. The manual process of cleaning, splitting, and merging JSON was error-prone and time-consuming. Automating it saved countless hours.
Tools and Automation Patterns
Here are the tools and workflows I use in production translation systems.
Workflow automation with Make/Zapier
For non-technical teams, I set up automated workflows:
- Product manager uploads JSON to Dropbox
- Webhook triggers cleaning (removes empty values)
- Split into API-sized chunks automatically
- Send each chunk to translation API
- Merge results and validate
- Upload to PIM system
- Send notification when complete
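For teams with a developer on hand, the same pipeline can live in a short script instead of Make/Zapier. Here is a hedged Python sketch of the chain - clean, split to fit the API limit, translate, merge, validate - with the translation call stubbed out so you can swap in your provider's client; the function names and the 4MB default are my own:

```python
import json

def clean(data):
    """Drop empty values recursively before spending API credits on them."""
    if isinstance(data, dict):
        cleaned = {k: clean(v) for k, v in data.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    if isinstance(data, list):
        return [clean(v) for v in data]
    return data

def split_by_size(items, max_bytes):
    """Greedy split keeping each chunk's serialized size under max_bytes."""
    chunks, current, size = [], [], 0
    for item in items:
        item_size = len(json.dumps(item).encode("utf-8"))
        if current and size + item_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(item)
        size += item_size
    if current:
        chunks.append(current)
    return chunks

def run_pipeline(products, translate, max_bytes=4_000_000):
    """translate: callable taking a chunk of products, returning translated ones."""
    cleaned = clean(products)
    translated = []
    for chunk in split_by_size(cleaned, max_bytes):
        translated.extend(translate(chunk))  # one API call per chunk
    assert len(translated) == len(cleaned), "count mismatch after merge"
    return translated
```

The webhook and notification steps bolt onto either end; the middle of the pipeline is the part that must never lose or reorder an item.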
Version control with Git
I always version control translation files. When a product description changes, Git shows exactly what needs retranslation.
git diff HEAD~1 -- en.json
# Shows which source strings changed since the last commit
# and therefore need retranslation
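Git tells you that a file changed; a small script tells you exactly which keys changed and therefore need retranslation. A minimal Python sketch comparing two loaded versions of a source-language file (assuming a flat key-to-string layout; the function name is mine):

```python
def keys_needing_retranslation(old, new):
    """Compare two versions of a source-language dict.

    Returns keys whose text was added or edited (retranslate)
    and keys that disappeared (delete from target languages too).
    """
    changed = [k for k, v in new.items() if old.get(k) != v]
    removed = [k for k in old if k not in new]
    return {"retranslate": changed, "delete": removed}
```

Feeding only the "retranslate" keys back into the translation pipeline keeps incremental updates cheap.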
Quality assurance checks
Automated validation before sending to production:
function validateTranslations(original, translated) {
  const checks = {
    countMatch: original.length === translated.length,
    allIDsPresent: translated.every(item => item.id),
    noEmptyTranslations: translated.every(item =>
      item.name && item.description
    ),
    // Guard original[idx] in case the counts differ
    technicalDataIntact: translated.every((item, idx) =>
      original[idx] &&
      item.price === original[idx].price &&
      item.sku === original[idx].sku
    )
  };
  return Object.entries(checks).filter(([_, passed]) => !passed);
}
Cost tracking
Translation APIs charge per character. I track costs per project to optimize spending.
function estimateTranslationCost(text, targetLangs, costPerChar = 0.00002) {
  const charCount = text.length;
  const totalChars = charCount * targetLangs.length;
  const cost = totalChars * costPerChar;
  return {
    characters: totalChars,
    estimatedCost: cost.toFixed(2),
    languages: targetLangs.length
  };
}
Using Utilitiz for translation workflows
I built Utilitiz specifically for these workflows. The tool provides instant cleaning, splitting by size, and merging - all in the browser without uploading sensitive product data to third-party servers.
For teams without developers, it's the fastest way to prepare JSON for translation and reassemble results.