Database Optimization
"Indexing frequently accessed database columns can significantly speed up retrieval."
Database optimization is crucial for maintaining high performance as data volume and query complexity grow.
Indexing Strategies
Index Types
B-Tree Index
Characteristics:
- Balanced tree structure: O(log n) search complexity
- Range queries: Efficient for range operations
- Most common: Default index type in most databases
- Ordered storage: Maintains key order
Use Cases:
- Equality queries (WHERE id = 123)
- Range queries (WHERE age BETWEEN 20 AND 30)
- Sorting operations (ORDER BY created_at)
- Prefix searches (WHERE name LIKE 'john%')
Example:
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_created_at ON orders(created_at);
Hash Index
Characteristics:
- Hash table structure: O(1) average lookup
- Equality only: No range queries
- Memory intensive: Requires more memory
- Fast lookups: Excellent for exact matches
Use Cases:
- Exact match queries (WHERE username = 'john')
- Key-value lookups
- In-memory databases
- Caching systems
Composite Index
Characteristics:
- Multiple columns: Single index on multiple fields
- Column order matters: Leftmost prefix rule
- Reduced storage: More efficient than multiple indexes
- Query optimization: Covers more query types
Example:
-- Index on (last_name, first_name)
CREATE INDEX idx_users_name ON users(last_name, first_name);
-- Supports these queries:
SELECT * FROM users WHERE last_name = 'Smith';
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';
-- Does NOT support: WHERE first_name = 'John'
Partial Index
Characteristics:
- Conditional indexing: Index subset of rows
- Smaller storage: Reduced index size
- Faster maintenance: Less overhead
- Targeted optimization: Specific use cases
Example:
-- Index only active users
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
-- Index recent orders
CREATE INDEX idx_recent_orders ON orders(user_id) WHERE created_at > '2023-01-01';
Index Design Principles
Selectivity
- High selectivity: Few rows per value (good for indexing)
- Low selectivity: Many rows per value (poor for indexing)
- Cardinality: Number of unique values
- Rule of thumb: Index columns with high selectivity
Leftmost Prefix Rule
- Composite indexes: Use leftmost columns first
- Query optimization: Match index column order
- Index usage: Must use prefix of index columns
- Planning: Consider common query patterns
Covering Indexes
- All columns in index: No table access needed
- Improved performance: Eliminates lookups
- Storage trade-off: Larger index size
- Query optimization: Ideal for specific queries
Example:
CREATE INDEX idx_order_summary ON orders(user_id, total, status);
-- This query uses only the index:
SELECT user_id, total, status FROM orders WHERE user_id = 123;
Query Optimization
Query Analysis
EXPLAIN Plan
Purpose: Understand query execution strategy Usage: Analyze and optimize query performance Components: Access methods, join strategies, cost estimates
Example:
EXPLAIN ANALYZE
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.status = 'active'
AND o.created_at > '2023-01-01';
Common Performance Issues
- Full table scans: No index usage
- Nested loops: Inefficient join strategies
- Temporary tables: Memory/disk overhead
- Sorting operations: Expensive ORDER BY clauses
Optimization Techniques
Query Rewriting
- **Avoid SELECT ***: Specify needed columns
- Use appropriate JOINs: INNER vs OUTER joins
- Optimize WHERE clauses: SARGable predicates
- Minimize subqueries: Use JOINs instead
Before:
SELECT * FROM users WHERE id IN (
SELECT user_id FROM orders WHERE total > 1000
);
After:
SELECT DISTINCT u.* FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.total > 1000;
Join Optimization
- Join order: Start with most selective table
- Join types: Choose appropriate join algorithm
- Index usage: Ensure join columns indexed
- Join elimination: Remove unnecessary joins
Subquery Optimization
- Correlated subqueries: Often slow
- Derived tables: Materialized subqueries
- Common table expressions: Query organization
- Window functions: Alternative to self-joins
Database Configuration
Memory Configuration
Buffer Pool
- Purpose: Cache frequently accessed data
- Size: 70-80% of available RAM
- Impact: Major performance factor
- Monitoring: Hit ratio tracking
MySQL Example:
-- Set buffer pool size to 70% of RAM
SET GLOBAL innodb_buffer_pool_size = 10737418240; -- 10GB
Query Cache
- Purpose: Cache query results
- Usefulness: Limited in modern applications
- Invalidation: Complex cache invalidation
- Alternative: Application-level caching
Connection Management
Connection Pooling
- Purpose: Reuse database connections
- Benefits: Reduced connection overhead
- Configuration: Pool size, timeout settings
- Monitoring: Pool utilization tracking
Example Configuration:
# SQLAlchemy connection pool
from sqlalchemy import create_engine
engine = create_engine(
'postgresql://user:pass@localhost/db',
pool_size=20,
max_overflow=30,
pool_timeout=30,
pool_recycle=3600
)
Connection Limits
- Max connections: Database connection limit
- Application limits: Application-side limits
- Monitoring: Connection usage tracking
- Optimization: Appropriate limit setting
Performance Monitoring
Key Metrics
Query Performance
- Execution time: Query duration
- Rows examined: Data processed
- Index usage: Index efficiency
- Disk I/O: Storage access patterns
System Metrics
- CPU usage: Processor utilization
- Memory usage: RAM consumption
- Disk I/O: Storage performance
- Network I/O: Network utilization
Monitoring Tools
Database-Specific Tools
- MySQL: Slow query log, Performance Schema
- PostgreSQL: pg_stat_statements, EXPLAIN ANALYZE
- MongoDB: Profiler, explain() command
- Redis: SLOWLOG, INFO command
General Tools
- Prometheus: Metrics collection
- Grafana: Visualization dashboards
- New Relic: APM and monitoring
- Datadog: Infrastructure monitoring
Optimization Examples
Index Optimization
-- Before: No index on user_id
EXPLAIN SELECT * FROM orders WHERE user_id = 123;
-- Result: Full table scan
-- After: Add index
CREATE INDEX idx_orders_user_id ON orders(user_id);
EXPLAIN SELECT * FROM orders WHERE user_id = 123;
-- Result: Index scan
Query Optimization
-- Before: Inefficient subquery
SELECT u.name FROM users u
WHERE u.id IN (
SELECT o.user_id FROM orders o
WHERE o.total > 1000
);
-- After: Efficient JOIN
SELECT DISTINCT u.name
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.total > 1000;
Configuration Optimization
-- PostgreSQL configuration
-- Memory settings
shared_buffers = 256MB
effective_cache_size = 1GB
work_mem = 4MB
-- Connection settings
max_connections = 100
shared_preload_libraries = 'pg_stat_statements'
Best Practices
Index Management
- Monitor usage: Track index effectiveness
- Remove unused indexes: Reduce maintenance overhead
- Regular maintenance: Rebuild fragmented indexes
- Plan for growth: Anticipate query patterns
Query Optimization
- Profile queries: Identify slow operations
- Test changes: Validate improvements
- Document patterns: Share optimization techniques
- Review regularly: Continuous optimization
Configuration Management
- Monitor metrics: Track performance indicators
- Adjust gradually: Make incremental changes
- Test thoroughly: Validate configuration changes
- Document settings: Maintain configuration history
Key Takeaway: Database optimization is an iterative process requiring continuous monitoring, analysis, and refinement based on actual usage patterns and performance requirements.