A database schema defines the logical and structural layout of a database. It describes how data is organized into tables, the fields (or columns) within those tables, the relationships between them, and the rules that govern the data. Schemas are foundational to relational databases, which are built around structured data and governed by the relational model first introduced by Edgar F. Codd in 1970 while working at IBM.
In relational databases like PostgreSQL, MySQL, SQL Server, Oracle, and StarRocks, the schema acts as a contract between the data and its consumers. It specifies what kind of data can exist, how records relate to one another, and which rules are enforced when the data is written and queried.
Database schemas are essential because they enforce order, consistency, and structure. In production environments, where dozens of applications and teams interact with the same data, a schema prevents chaos. Let’s break down their value:
Schemas enforce rules: primary keys, foreign keys, unique constraints, and data types.
For example, in PostgreSQL or MySQL, defining `email VARCHAR(255) NOT NULL UNIQUE` ensures each record has a unique, non-null email value (a minimal DDL sketch appears after this list).
Schemas enable indexing, which speeds up lookups and filtering.
Query optimizers (like the Cost-Based Optimizer in StarRocks, PostgreSQL, or Oracle) rely on schema metadata to generate efficient execution plans.
A clear schema allows developers, data engineers, and analysts to collaborate effectively.
Schema versioning tools like Liquibase, Flyway, or schema registries in data lakes help track changes safely.
A well-designed schema can evolve with business requirements by supporting extensibility without rework.
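To make the integrity rules above concrete, here is a minimal PostgreSQL-style sketch of a table that bakes them into its DDL; the table and column names are illustrative, not taken from any specific system.

```sql
-- Illustrative PostgreSQL-style DDL; table and column names are hypothetical.
CREATE TABLE customers (
    customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- unique row identifier
    name        VARCHAR(100) NOT NULL,                            -- required attribute
    email       VARCHAR(255) NOT NULL UNIQUE,                     -- no duplicates, no NULLs
    signup_date DATE DEFAULT CURRENT_DATE                         -- sensible default when omitted
);
```

Any insert that violates these rules (a missing name, a duplicate email) is rejected by the database itself rather than by application code.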
Tables are the core data containers. Each row represents an entity instance (e.g., a specific customer), and each column represents an attribute (e.g., email address).
Example: A `customers` table may contain:
| customer_id | name       | email            | signup_date |
|-------------|------------|------------------|-------------|
| 1           | Alice Wong | alice@domain.com | 2022-11-01  |
In StarRocks, tables use columnar storage throughout; what varies by workload is the table model (for example, duplicate key tables for append-heavy analytics versus primary key tables for upsert-heavy data).
Each field is defined by:
Name
Data type (`INT`, `VARCHAR`, `DATE`, `BOOLEAN`, etc.)
Constraints (`NOT NULL`, `UNIQUE`, etc.)
Fields serve as the database’s first line of defense against bad data.
In PostgreSQL, a field like `created_at TIMESTAMP DEFAULT now()` automatically records a consistent creation timestamp for every row.
Relationships define how tables are connected:
One-to-One: A user has one profile.
One-to-Many: One customer can place many orders.
Many-to-Many: Students enroll in multiple courses, often modeled with a join table like `enrollments`.
These relationships are expressed with foreign key constraints, which may or may not be enforced depending on the engine (StarRocks supports JOINs but does not enforce foreign keys at the storage layer, unlike PostgreSQL).
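As a sketch of how these patterns look in DDL (PostgreSQL-style syntax; it assumes `customers`, `students`, and `courses` tables already exist, and all names are illustrative):

```sql
-- One-to-many: each order references exactly one customer.
CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers (customer_id),
    order_date  DATE   NOT NULL
);

-- Many-to-many: students and courses linked through a join table.
CREATE TABLE enrollments (
    student_id BIGINT NOT NULL REFERENCES students (student_id),
    course_id  BIGINT NOT NULL REFERENCES courses (course_id),
    PRIMARY KEY (student_id, course_id)   -- one row per student/course pair
);
```

In engines such as StarRocks that do not enforce foreign keys, the relationship exists by convention and is expressed through JOINs at query time.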
Schema: the definition of structure (tables, fields, relationships), written in DDL (Data Definition Language) statements such as `CREATE TABLE`.
Instance: the current snapshot of data stored in the database at a moment in time.
Use Case: In an online store, the schema defines tables like `orders`, `products`, and `users`. The instance is the actual orders and users at 3 PM on April 15, 2025.
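A compact way to see the distinction, using hypothetical names: the DDL below is the schema, while the result of the `SELECT` is the instance at whatever moment it runs.

```sql
-- Schema: the declared structure.
CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    name       VARCHAR(200)   NOT NULL,
    price      NUMERIC(10, 2) NOT NULL
);

-- Instance: whatever rows happen to exist when you look.
SELECT * FROM products;   -- e.g., the catalog as of 3 PM on April 15, 2025
```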
Physical Schema: describes how data is stored on disk — file formats, partitioning, storage engines, compression, indexes.
In PostgreSQL, this includes tablespaces and physical indexes (e.g., B-tree).
In StarRocks, the physical layout centers on columnar storage and primary key indexes, with vectorized execution applied at query time.
In data lakes like Apache Iceberg or Delta Lake, the physical schema controls how Parquet or ORC files are laid out and partitioned.
Example: In Apache Iceberg, data is stored in partitioned Parquet files. The schema includes metadata about data types, nullability, and field IDs for schema evolution.
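A rough sketch of such a table, using Iceberg's Spark SQL DDL (the catalog and table names are hypothetical, and the exact setup depends on how the Iceberg catalog is configured):

```sql
-- Hypothetical Iceberg table; data lands in partitioned Parquet files.
CREATE TABLE lake.events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type STRING,
    event_ts   TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts));   -- hidden partitioning by day via a partition transform
```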
Logical Schema: defines how users see the data — table definitions, field types, relationships, constraints. It abstracts away physical storage.
Example:
Table: `transactions` with `transaction_id`, `user_id`, `amount`, `timestamp`
Foreign key: `user_id` references `users(user_id)`
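A minimal sketch of that logical schema in PostgreSQL-style DDL (the `timestamp` field is renamed `txn_ts` here to avoid a reserved word; all names are illustrative):

```sql
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
);

CREATE TABLE transactions (
    transaction_id BIGINT PRIMARY KEY,
    user_id        BIGINT NOT NULL REFERENCES users (user_id),  -- foreign key to users
    amount         NUMERIC(12, 2) NOT NULL,
    txn_ts         TIMESTAMP NOT NULL                           -- the "timestamp" field
);
```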
Use Case: In banking, the logical schema defines relationships among accounts, transactions, and users for anti-fraud systems.
A view is a virtual table based on a `SELECT` query. It doesn't store data but presents it as if it were a table.
Used for access control, abstraction, or simplified querying.
Often materialized in OLAP systems like Snowflake, BigQuery, or StarRocks as materialized views for performance.
Real Example:
A healthcare system may expose a view that shows anonymized patient data (`patient_id`, `diagnosis`) to researchers but hides PII like names or birthdates.
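Here is a hedged sketch of such a view, assuming an underlying `patients` table with both clinical and PII columns (names are illustrative):

```sql
-- The view exposes only non-identifying columns; PII is simply never selected.
CREATE VIEW research_patients AS
SELECT
    patient_id,
    diagnosis
FROM patients;

-- Researchers query the view like a table; the underlying query runs at read time.
SELECT diagnosis, COUNT(*) AS case_count
FROM research_patients
GROUP BY diagnosis;
```

Granting researchers access to the view, but not the base table, is what turns this into an access-control mechanism rather than just a convenience.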
Designing a schema is one of the most important steps when building a database system. A well-designed schema can save months of engineering time down the road — reducing bugs, improving performance, and making your data easier to work with.
Let’s walk through key design principles and common trade-offs in practice.
Before writing any SQL, you should understand the domain. Ask:
What are the core entities in this system?
How do they relate to each other?
In an e-commerce platform, for example:
You’ll likely have `Customers`, `Products`, `Orders`, and `OrderItems`.
Each `Order` belongs to one `Customer`, and each `Order` contains many `OrderItems`.
This kind of basic mapping — sometimes done with Entity-Relationship (ER) diagrams — helps ensure you’re capturing the real-world model accurately.
Normalization means breaking data into smaller related tables so that:
Each table represents a single concept (e.g., a customer, not a customer + order)
Each fact is stored in only one place
This makes updates safer and keeps the data clean. For example:
Instead of storing the customer’s email in every order, you store it once in the `Customers` table and reference it via `CustomerID` in `Orders`.
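The payoff shows up on writes. A sketch of the difference, with hypothetical names:

```sql
-- Denormalized: the email is copied into every order, so a change must touch all of them.
-- UPDATE orders SET customer_email = 'new@example.com' WHERE customer_id = 42;

-- Normalized: the email lives only in Customers, so the change is a single-row update.
UPDATE customers SET email = 'new@example.com' WHERE customer_id = 42;
```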
This is especially important in systems that update or write data frequently — like banking platforms, inventory systems, or CRM software.
You’ll typically go through a few stages of normalization:
1NF ensures each field has atomic values
2NF removes partial dependencies (non-key columns that depend on only part of a composite key)
3NF removes transitive dependencies (non-key columns that depend on other non-key columns)
There are cases where joining multiple normalized tables becomes a performance bottleneck — especially in analytical systems where queries run over billions of rows.
Denormalization means reintroducing some redundancy to make read queries faster.
This is common in OLAP databases like StarRocks, ClickHouse, or Snowflake, where:
You might pre-join dimension tables into a single wide table (fact table)
Or create materialized views for frequent aggregations (e.g., daily sales by region)
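As an illustration of the second pattern, here is a generic materialized view that pre-aggregates daily sales by region; the exact DDL and refresh options differ between engines, and the table and column names are hypothetical.

```sql
CREATE MATERIALIZED VIEW daily_sales_by_region AS
SELECT
    region,
    CAST(order_ts AS DATE) AS sales_date,
    SUM(amount)            AS total_sales
FROM fact_sales
GROUP BY region, CAST(order_ts AS DATE);
```

Dashboards then read the small pre-aggregated result instead of scanning the raw fact table on every query.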
Real example: In the analytics backend at Demandbase, they moved from ClickHouse to StarRocks and eliminated manual denormalization because StarRocks could handle multi-table joins efficiently in real time — even with millions of rows arriving daily.
Denormalization is also common in recommendation systems, dashboards, and reporting layers where query speed is more important than update frequency.
Your schema won’t stay static. New product features, compliance rules, or integrations will require you to:
Add or rename columns
Add new tables or relationships
Deprecate old fields
Modern systems like Apache Iceberg, Delta Lake, and StarRocks support schema evolution — allowing changes without needing to rewrite data.
To prepare for evolution:
Avoid overly rigid designs (e.g., don’t bake dozens of product options into one table)
Use nullable fields where optional data may be added later
Document column meanings clearly to avoid misuse as the schema grows
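For example, additive changes like the ones below tend to be safe because existing queries keep working; the specific columns and the comment text are illustrative (PostgreSQL-style syntax).

```sql
-- New nullable column: existing rows simply read as NULL.
ALTER TABLE customers ADD COLUMN loyalty_tier VARCHAR(20);

-- New column with a default for rows that do not specify it.
ALTER TABLE customers ADD COLUMN marketing_opt_in BOOLEAN DEFAULT FALSE;

-- Document intent so later readers know why the field exists.
COMMENT ON COLUMN customers.loyalty_tier IS 'Added for the loyalty program; replaces legacy membership_level';
```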
After tables are designed, you need to think about:
Primary keys (to uniquely identify rows)
Foreign keys (to connect related tables — even if not enforced in OLAP)
Indexes (to speed up filtering and sorting)
For transactional systems, indexes must be chosen carefully — too many can slow down writes. In analytical systems like StarRocks, zone maps and column pruning make queries efficient even without user-defined indexes.
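A small sketch of that trade-off in a transactional engine (PostgreSQL-style; names are illustrative):

```sql
-- The primary key is already backed by an index; add secondary indexes
-- only for predicates your queries actually filter or sort on.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);  -- fast lookups by customer
CREATE INDEX idx_orders_order_date  ON orders (order_date);   -- fast date-range filters

-- Every extra index must be maintained on INSERT/UPDATE, so unused ones are pure write cost.
```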
Let’s ground this in real-world systems where schema design plays a critical role.
Use Case: As part of its LSDA (Live Shard Data Archive) project, Airtable needed to validate over 1PB of historical data—trillions of records—against its live transactional state. The goal was to ensure data integrity across archived Apache Iceberg tables on S3 and real-time streams from Kafka.
Used normalized schemas for both historical and live datasets (e.g., record_id, table_id).
Queried Iceberg and Kafka data directly in StarRocks via external catalogs.
Avoided data duplication by joining across live and archived data on the fly.
StarRocks enabled millisecond-scale JOINs at petabyte scale—no denormalization or ETL required.
Delivered high-throughput validation via SQL over federated data sources.
Leveraged primary key indexing and vectorized execution for performance.
This architecture let Airtable validate massive datasets in place, with full schema control and minimal operational overhead.
Use case: TRM Labs, a provider of blockchain intelligence tools for financial crime investigation, migrated from a BigQuery-based architecture to a self-managed lakehouse stack using Apache Iceberg and StarRocks to meet growing demands around performance, cost, and deployment flexibility.
Their analytics platform processes petabytes of blockchain data across 30+ chains, serving over 500 real-time queries per second. BigQuery’s scalability was no longer cost-effective, and Postgres could not keep up with the ingestion rate. The team needed a high-performance, open solution that supported flexible schema design and real-time response.
Stored raw blockchain and entity data in partitioned Apache Iceberg tables on object storage.
Modeled normalized fact and dimension tables (e.g., transactions, wallets, labels) for analytical workflows.
Served query workloads directly from StarRocks via external Iceberg catalogs, avoiding data duplication.
StarRocks handled low-latency lookups and JOINs across Iceberg tables — without needing to precompute views.
Enabled real-time insights on high-cardinality, multi-chain data models with SQL-based schema evolution and high concurrency.
Replaced proprietary and costly BigQuery workloads with an open, scalable lakehouse design.
The combination of Iceberg for scalable, versioned storage and StarRocks for fast, schema-aware OLAP allowed TRM to consolidate their architecture while delivering petabyte-scale analytics with sub-second performance.
Use case: Shopee, one of Southeast Asia’s largest e-commerce platforms, adopted StarRocks across several internal analytics systems — including Data Service, Data Go, and Data Studio — to overcome performance bottlenecks with Presto-on-Hive for JOIN-heavy and high-concurrency queries.
Rather than migrating data, Shopee leveraged StarRocks’ ability to query external Hive tables directly and used materialized views to accelerate key workloads.
Registered external Hive tables in StarRocks via built-in catalogs.
Built materialized views on top of external tables to precompute complex aggregations.
Retained normalized schemas with no need for denormalization, thanks to StarRocks’ efficient JOIN engine and cost-based optimizer.
Data Service APIs ran 10× to 2000× faster after replacing Presto queries with MV-backed StarRocks queries — without changing the API layer or ingestion pipelines.
Data Go saw 3–10× faster JOIN queries and 60% lower CPU usage, even with multiple daily snapshot tables.
Data Studio ad-hoc queries ran 2–3× faster on the same hardware footprint, benefiting from StarRocks’ vectorized execution and parallel processing.
By keeping data in place and relying on StarRocks’ SQL capabilities, Shopee achieved major performance and cost gains without redesigning its data lake architecture.
Use case: Pinterest operates a large-scale advertising platform where marketers use real-time dashboards like Partner Insights to track and optimize ad performance. Each advertiser can define customized metrics across multiple dimensions — such as audience, platform, and time — making schema flexibility and performance essential.
Previously using Apache Druid, Pinterest faced challenges with denormalized data pipelines, limited SQL support, and growing infrastructure costs. As data scale and customization needs increased, they migrated their analytics backend to StarRocks.
With StarRocks, Pinterest was able to:
Maintain normalized schemas while still performing high-speed JOINs and aggregations on the fly
Simplify ingestion pipelines by eliminating MapReduce dependencies
Use standard SQL, enabling easier integration with internal tools
Improve query latency by 50% and achieve 10-second data freshness, while reducing infrastructure footprint by over 60%
This shift allowed them to support highly flexible, multi-dimensional analytics at lower cost and with better performance — critical for a platform serving millions of advertisers in real time.
A database schema is the structure or blueprint of a database. It defines:
What tables exist
What columns each table has
What data types and constraints those columns use
How tables relate to each other
Think of it like the architectural plan for a building. It describes what the rooms are and how they connect, but not what’s inside them. The actual data (e.g., customer names, transactions) is the instance, not the schema.
Primarily, yes. Relational databases (PostgreSQL, MySQL, Oracle, StarRocks, etc.) are schema-driven by design.
However, modern semi-structured and schema-on-read systems (like MongoDB, Elasticsearch, or data lakes) also rely on implicit or external schema definitions:
MongoDB has schema validation rules
Apache Iceberg, Delta Lake, and Hudi manage evolving table schemas even though data is stored as files
So even in flexible environments, schema matters — especially for query performance, consistency, and governance.
There are three main types:
Physical Schema: How data is physically stored (file formats, partitioning, indexes)
Logical Schema: The abstract structure (tables, fields, relationships)
View Schema: Virtualized schemas built from queries (views, materialized views)
These exist in both OLTP (e.g., PostgreSQL) and OLAP (e.g., StarRocks, Snowflake) systems, although implementation details differ.
A view is a virtual table built on a query. It executes the query at runtime.
A materialized view stores the query result physically, updating it periodically or incrementally.
Use Case:
Views are great for abstraction and access control.
Materialized views are ideal for speeding up expensive analytics queries — e.g., daily revenue by region.
Example: StarRocks and Snowflake support automatic query rewrites using materialized views for performance optimization.
Normalize when:
You need strong data integrity (e.g., in banking or ERP systems)
You have frequent writes/updates, and minimizing redundancy matters
You want easier schema evolution and smaller storage footprint
Normalization is essential for OLTP systems like PostgreSQL, MySQL, and SQL Server.
Denormalize when:
You are optimizing for read-heavy analytics, not write-heavy workloads
JOINs become performance bottlenecks at scale
Your engine can’t perform JOINs efficiently (e.g., legacy OLAP engines like Druid or Presto)
For transactional databases:
Liquibase, Flyway: for versioned schema migrations
ERD tools (like dbdiagram.io, DBeaver) for visual modeling
For data lakes:
Apache Iceberg, Delta Lake, Hudi: handle schema evolution with metadata
Amundsen, DataHub, OpenMetadata: help document and discover schemas
Schema affects performance through:
Data types (smaller, fixed-size types = faster reads)
Indexing strategy (primary keys, B-trees, bitmap indexes)
Partitioning (by date, region, etc.)
Column ordering (especially in columnar stores)
In StarRocks, query optimizers use schema information to prune partitions, apply vectorized scans, and avoid unnecessary joins — enabling sub-second response times even on large datasets.
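As one concrete example of partitioning, here is PostgreSQL-style declarative range partitioning by date; the table, partition, and column names are illustrative.

```sql
CREATE TABLE page_views (
    view_id BIGINT      NOT NULL,
    region  VARCHAR(50) NOT NULL,
    view_ts TIMESTAMP   NOT NULL
) PARTITION BY RANGE (view_ts);

CREATE TABLE page_views_2025_04 PARTITION OF page_views
    FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');

-- A filter on the partition key lets the planner prune partitions instead of scanning everything.
SELECT COUNT(*) FROM page_views
WHERE view_ts >= '2025-04-01' AND view_ts < '2025-04-08';
```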
Yes — schema evolution is common. You might:
Add new columns
Rename existing ones
Change data types (with caution)
In data lakes: Iceberg and Delta Lake support safe, versioned schema evolution.
In StarRocks: You can add/modify columns in-place, and materialized views adapt to new columns if refreshed.
Always validate downstream impact — e.g., BI dashboards or ETL jobs that depend on specific column names.
No. Like most OLAP systems, StarRocks does not enforce foreign keys or unique constraints at runtime.
Instead, it:
Supports primary key models for deduplication and upserts
Allows JOINs across related tables, assuming you manage the integrity upstream
Prioritizes performance and analytical flexibility over strict transactional enforcement
This is standard in analytical systems, including ClickHouse, Druid, and BigQuery.
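A rough sketch of a StarRocks primary key table (clause requirements vary by version, and the names here are illustrative): rows that arrive with an existing `user_id` replace the previous version, which is what provides upsert and deduplication behavior without enforced constraints.

```sql
CREATE TABLE user_profiles (
    user_id      BIGINT NOT NULL,        -- key column: must be NOT NULL and listed first
    email        VARCHAR(255),
    last_seen_at DATETIME
)
PRIMARY KEY (user_id)
DISTRIBUTED BY HASH(user_id);
```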
Consistent naming improves maintainability. Here are common practices:
Use `snake_case` or `lowerCamelCase` consistently
Prefix dimension tables (`dim_customer`) and fact tables (`fact_sales`) in analytics schemas
Avoid reserved words (e.g., “order”, “user”)
Use descriptive but concise column names (e.g., `created_at`, not just `date`)
Poorly designed schemas can lead to:
Excessive storage costs (due to redundancy)
Higher query costs (longer scan times)
Slower pipeline performance (more complex ETL logic)
A well-designed schema:
Reduces data duplication
Makes queries faster (fewer joins, smaller scans)
Enables selective materialization, which lowers compute usage
Example: Pinterest reduced infrastructure usage by over 60% after migrating to StarRocks with normalized schemas and efficient JOINs.