
Input Validation & Output Encoding

Prevent injection attacks through validation and encoding

TL;DR

Input Validation: Check all user input on the server. Accept only the expected format, length, and type; reject anything else, or sanitize it as a fallback. Use whitelist validation (accept known-good) rather than blacklist validation (reject known-bad). Output Encoding: Encode data for the context where it is used (HTML, JavaScript, URL) so that special characters are treated as literal text rather than markup or code. For SQL, use parameterized queries.

Learning Objectives

  • Implement whitelist-based input validation
  • Distinguish validation from sanitization
  • Apply context-appropriate encoding (HTML, JS, URL)
  • Prevent injection attacks through defense in depth
  • Design validation in API contracts

Motivating Scenario

A user enters <script>alert('XSS')</script> in a comment field. Without validation and encoding:

  1. App stores the raw string in database
  2. Page retrieves and displays the string
  3. Browser parses the <script> tag and executes JavaScript
  4. Attacker's script runs, steals session tokens, redirects user to phishing site

With proper validation and encoding:

  1. Input validation rejects HTML-like strings (if not expected)
  2. Output encoding converts < to &lt; and > to &gt;, so the page source contains &lt;script&gt;alert('XSS')&lt;/script&gt;
  3. Browser renders the encoded markup as literal text: <script>alert('XSS')</script>
  4. No code execution, user sees the string as-is

Real-world impact: Injection, the category that has included cross-site scripting (XSS) since the 2021 edition, is #3 in the OWASP Top 10 (2021). XSS leads to credential theft, malware distribution, and account takeover.

Core Concepts

Validation vs Sanitization vs Encoding

Validation (Accept/Reject)
  • Check input against a whitelist
  • Reject if it doesn't match
  • Example: email must match a pattern
  • Early defense (client and server)
  • Best approach

Sanitization (Clean)
  • Remove dangerous parts
  • Modify input to a safe version
  • Example: strip HTML tags
  • Can be error-prone
  • Secondary defense
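
To illustrate why sanitization "can be error-prone", here is a minimal sketch of a naive tag-stripping sanitizer being bypassed by a nested payload, next to a whitelist check that rejects the same input outright. The function names and the allowed character set are assumptions made for illustration, not part of any library.

```javascript
// Naive sanitizer (hypothetical): removes <script> and </script> tags in a single pass.
function stripScriptTagsNaive(input) {
  return input.replace(/<\/?script>/gi, '');
}

// Nested payload: removing the inner tags reassembles working outer tags.
const payload = '<scr<script>ipt>alert(1)</scr</script>ipt>';
console.log(stripScriptTagsNaive(payload)); // <script>alert(1)</script>

// Whitelist validation (hypothetical rule): plain-text comments of 1-500 characters
// made of word characters, whitespace, and a few punctuation marks.
function isPlainComment(input) {
  return typeof input === 'string' && /^[\w\s.,!?'-]{1,500}$/.test(input);
}

console.log(isPlainComment(payload));              // false
console.log(isPlainComment('Nice post, thanks!')); // true
```

When user-generated HTML really must be accepted, a maintained sanitizer such as DOMPurify is a safer choice than a hand-written regex.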

Encoding (Context-Dependent)

Context matters. The same data needs different encoding depending on where it is used; the encoding is determined by the output context, not by the input.

HTML Context: <div>USER_INPUT</div>
Encode:
< → &lt;
> → &gt;
& → &amp;
" → &quot;
' → &#x27;

Example: "A & B" → "A &amp; B"
Example: "<script>" → "&lt;script&gt;"
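
The mapping above can be sketched as a small helper. This is an illustrative implementation of the five replacements listed, with `escapeHtml` and `HTML_ESCAPES` as assumed names; in a real application, a template engine's auto-escaping or a maintained encoding library is preferable.

```javascript
// Characters that are special in HTML element content and quoted attribute values.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#x27;',
};

// Replace each special character with its entity in a single pass,
// so the "&" produced by one replacement is never re-encoded.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml('A & B'));    // A &amp; B
console.log(escapeHtml('<script>')); // &lt;script&gt;
```

This encoding is only appropriate for HTML text and quoted attribute values; data placed inside a script block or a URL needs a different encoder.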

Practical Examples

// Whitelist-based validation (GOOD)
function validateEmail(email) {
  // Accept only known-good format
  const pattern = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
  return pattern.test(email);
}

function validateUsername(username) {
  // Alphanumeric and underscores, 3-20 chars
  return /^[a-zA-Z0-9_]{3,20}$/.test(username);
}

function validatePostalCode(code) {
  // US postal codes: 5 digits or 5+4
  return /^\d{5}(-\d{4})?$/.test(code);
}

// Blacklist validation (BAD - avoid this)
function validateEmailBad(email) {
  // Try to reject known-bad patterns
  // Attackers bypass with encoding, alternate formats
  if (email.includes('<')) return false;
  if (email.includes('>')) return false;
  return true; // Too permissive!
}

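// Server-side validation (always do this!)
// Assumes Express is installed; createUser is an application-specific helper defined elsewhere.
const express = require('express');
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

app.post('/register', (req, res) => {
  const email = req.body?.email;

  // Validate on the server, not just the client
  if (!validateEmail(email)) {
    return res.status(400).json({ error: 'Invalid email' });
  }

  // Email matches the expected format; it must still be encoded wherever it is output
  createUser(email);
  res.status(201).json({ status: 'created' });
});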
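
The per-field validators above can be gathered into a single declarative contract that also enforces type and length and rejects undeclared fields. The sketch below is one possible shape; `registrationRules` and `validateBody` are hypothetical names, and the length limits are assumptions chosen for illustration.

```javascript
// Hypothetical contract for a registration payload: every accepted field is listed
// with a maximum length and a known-good pattern.
const registrationRules = {
  email: { maxLength: 254, pattern: /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/ },
  username: { maxLength: 20, pattern: /^[a-zA-Z0-9_]{3,20}$/ },
};

// Returns a list of error messages; an empty list means the body satisfies the contract.
function validateBody(body, rules) {
  if (body === null || typeof body !== 'object') {
    return ['body must be an object'];
  }
  const errors = [];
  // Reject fields the contract does not declare.
  for (const key of Object.keys(body)) {
    if (!Object.prototype.hasOwnProperty.call(rules, key)) {
      errors.push(`unexpected field: ${key}`);
    }
  }
  // Check type first, then length, then format.
  for (const [field, rule] of Object.entries(rules)) {
    const value = body[field];
    if (typeof value !== 'string') {
      errors.push(`${field} must be a string`);
    } else if (value.length > rule.maxLength) {
      errors.push(`${field} is too long`);
    } else if (!rule.pattern.test(value)) {
      errors.push(`${field} has an invalid format`);
    }
  }
  return errors;
}

console.log(validateBody({ email: 'a@example.com', username: 'alice_1' }, registrationRules)); // []
```

Keeping the rules in one object makes it straightforward to publish the same constraints in API documentation.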

Patterns and Pitfalls

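Pitfall: Blacklist validation. Trying to reject known-bad patterns (e.g., rejecting '<script>') fails because attackers bypass the filter with HTML encoding, case variation, or alternate syntax.
Pattern: Whitelist validation. Accept only a known-good format (e.g., an email matching a regex). It is simpler and safer, and does not require anticipating every attack.
Pitfall: Wrong-context encoding. Encoding for the HTML context does not protect data placed in a JavaScript context; the same data in a different output location needs different encoding.
Pattern: Context-aware encoding. Determine the output context (HTML, JS, URL, SQL) and encode accordingly, using established libraries rather than custom encoding.
Pitfall: Client-side-only validation. Client-side checks help UX, but an attacker can bypass them by sending requests directly to the server, so the server must validate again.
Pattern: Defense in depth. Combine input validation, output encoding, CSP, parameterized queries, and HttpOnly cookies so that no single failure exposes the application.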
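
As a rough sketch of why one encoder cannot serve every context, the helpers below encode the same value for a URL query parameter and for a data value inside an inline script block. The function names are assumptions; `encodeURIComponent` and `JSON.stringify` are standard built-ins.

```javascript
// URL context: percent-encode a value before placing it in a query string.
function encodeForUrlParam(value) {
  return encodeURIComponent(String(value));
}

// Inline <script> context: JSON.stringify escapes quotes and backslashes, but "<", ">"
// and "&" are escaped as well so the value cannot close the script block or open a comment.
function encodeForInlineScript(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\u003c')
    .replace(/>/g, '\\u003e')
    .replace(/&/g, '\\u0026')
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

console.log(encodeForUrlParam('a&b=c d'));        // a%26b%3Dc%20d
console.log(encodeForInlineScript('</script>'));  // "\u003c/script\u003e"
```

The inline-script output is still valid JSON, so `JSON.parse` recovers the original value, while the HTML parser never sees a literal closing script tag.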

Self-Check

  • What's the difference between validation and encoding?
  • Why doesn't blacklist validation work against injections?
  • If data is HTML-encoded, is it safe to use in JavaScript context?
  • Why do you need different encoding for different output contexts?
  • What's the purpose of Content Security Policy (CSP)?

Design Review Checklist

  • All user input validated against whitelist (not blacklist)?
  • Validation on server-side (not just client)?
  • Output encoded based on context (HTML, JS, URL, SQL)?
  • Parameterized queries used for database access?
  • Input length limits enforced?
  • Special characters handled safely?
  • APIs document validation rules?
  • HTML sanitization used for user-generated HTML (e.g., comments)?
  • No dangerous sinks used with untrusted data (eval, innerHTML, document.write)?
  • CSP header configured?
  • HttpOnly and Secure flags on cookies?
  • Tests verify validation and encoding behavior?
  • Developers trained on injection attacks?

Real-World Breach Examples

Case Study 1: Yahoo Stored XSS (2013)

Vulnerability: User could inject JavaScript in profile
Impact: 3000 accounts compromised before patched
Root Cause: No output encoding in profile display

Fix: HTML-encode all user-generated content before display

Case Study 2: PayPal SQL Injection (2002)

Vulnerability: Login form allowed SQL injection
Impact: Access to all user accounts
Root Cause: String concatenation instead of parameterized queries

Fix: Parameterized queries prevent injection entirely
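
The difference between concatenation and parameterization can be sketched as follows. The `db` argument is assumed to expose a `query(text, values)` method with `$1`-style placeholders, as node-postgres clients do; the table and column names, and the `recordingDb` stand-in, are hypothetical.

```javascript
// UNSAFE: user input becomes part of the SQL text.
function buildLoginQueryUnsafe(email) {
  return `SELECT id FROM users WHERE email = '${email}'`;
}

// SAFER: the SQL text is constant and the value is sent separately as a parameter,
// so the database never parses user input as SQL.
function findUserByEmail(db, email) {
  return db.query('SELECT id FROM users WHERE email = $1', [email]);
}

const attack = "' OR '1'='1";
console.log(buildLoginQueryUnsafe(attack));
// SELECT id FROM users WHERE email = '' OR '1'='1'

// Stand-in for a real client (assumption): records what would be sent to the database.
const recordingDb = {
  calls: [],
  query(text, values) {
    this.calls.push({ text, values });
    return Promise.resolve({ rows: [] });
  },
};

findUserByEmail(recordingDb, attack);
console.log(recordingDb.calls[0].text); // SELECT id FROM users WHERE email = $1
```

With the parameterized form, the attack string is compared against the email column as an ordinary value and matches nothing.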

Case Study 3: Target Breach (2013)

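Vulnerability: Attackers used credentials stolen from a third-party vendor to enter the network and install card-scraping malware on point-of-sale terminals
Impact: About 40M payment cards stolen
Root Cause: Compromised third-party vendor account with excessive network access

Fix: Network segmentation and least-privilege vendor access. This was not a browser-side injection, so Content Security Policy (CSP) would not have applied; CSP is the relevant control when malicious JavaScript is injected into a checkout page.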

Next Steps

  1. Audit current code — Identify all user input points (forms, APIs, uploads, URLs)
  2. Implement whitelist validation — Define expected format for each input
  3. Use encoding libraries — DOMPurify, OWASP Encoder, database drivers (don't write own)
  4. Implement CSP — Restrict inline scripts, limit external sources
  5. Test injection attacks — Manually try XSS, SQL injection, command injection payloads
  6. Security training — Ensure team understands context-aware encoding
  7. Automated testing — Use SAST tools (SonarQube, Snyk) to catch validation gaps
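
For step 4, a starting point might look like the sketch below, which sets a restrictive CSP header and a hardened session cookie on a Node.js response. The policy string, the cookie name, and the helper names are assumptions for illustration; a real policy has to be tuned to the scripts and origins the application actually uses.

```javascript
// Hypothetical policy: same-origin resources only, no inline scripts, no plugins.
const CONTENT_SECURITY_POLICY =
  "default-src 'self'; script-src 'self'; object-src 'none'; base-uri 'self'";

// HttpOnly hides the cookie from JavaScript; Secure restricts it to HTTPS.
function buildSessionCookie(sessionId) {
  return `session=${encodeURIComponent(sessionId)}; HttpOnly; Secure; SameSite=Lax; Path=/`;
}

// Works with any response object exposing setHeader(name, value),
// such as the one passed to a Node.js http.createServer handler.
function applySecurityHeaders(res, sessionId) {
  res.setHeader('Content-Security-Policy', CONTENT_SECURITY_POLICY);
  res.setHeader('Set-Cookie', buildSessionCookie(sessionId));
}

console.log(buildSessionCookie('abc123'));
// session=abc123; HttpOnly; Secure; SameSite=Lax; Path=/
```

Calling `applySecurityHeaders(res, sessionId)` at the top of a request handler keeps the policy in one place, which makes it easier to tighten over time.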

References