Dimensional Data Modeling Cheatsheet¶
AI Assisted (Claude Sonnet 4)
Core Concepts¶
What is Dimensional Modeling?¶
A design technique for data warehouses that organizes data into facts (measurable events) and dimensions (descriptive context) to support fast analytical queries and business intelligence.
Key Benefits¶
- Query Performance: Optimized for analytical workloads
- Business User Friendly: Intuitive structure matching business processes
- Scalable: Handles large data volumes efficiently
- Flexible: Supports drilling down and rolling up through hierarchies
Building Blocks¶
Fact Tables¶
Central tables containing quantitative, measurable data about business events.
Characteristics: - Large number of rows (transactions, events) - Foreign keys to dimension tables - Additive measures (sales amount, quantity) - Sparse data (many null values acceptable)
Types of Facts: - Transactional: Individual business events - Periodic Snapshot: Regular intervals (daily, monthly) - Accumulating Snapshot: Process lifecycle tracking
Dimension Tables¶
Provide descriptive context for facts - the "who, what, when, where, why" of business events.
Characteristics: - Relatively small number of rows - Wide tables with many descriptive attributes - Primary keys (often surrogate keys) - Denormalized for query performance
Common Dimensions: - Customer: Demographics, preferences, segments - Product: Categories, brands, specifications - Time: Hierarchies, fiscal calendars, holidays - Geography: Regions, territories, locations
Schema Patterns¶
Star Schema¶
Structure: Central fact table directly connected to dimension tables
Customer
|
Product — Fact — Store
|
Date
Advantages: - Simple structure, easy to understand - Faster query performance (fewer joins) - Optimized for business intelligence tools
Use When: - Query performance is priority - Business users need simple models - Storage redundancy is acceptable
Snowflake Schema¶
Structure: Normalized dimensions broken into multiple related tables
Customer
|
City — State — Country
|
Product — Fact — Store
| |
Category Region
|
Brand
Advantages: - Reduced data redundancy - Better data integrity - Smaller storage footprint
Use When: - Storage efficiency is critical - Large dimension tables need normalization - Data integrity is paramount
Galaxy Schema¶
Structure: Multiple fact tables sharing common dimensions
Dimension1 — Fact1 — Dimension2
| |
| Fact2 — Dimension3
| |
Dimension4 ————————
Use When: - Multiple business processes - Complex enterprise requirements - Shared dimensions across processes
Essential Design Principles¶
Grain Definition¶
Most Critical Decision: What does each row in the fact table represent?
Examples: - One row per sales transaction - One row per daily inventory snapshot - One row per customer interaction
Rules: - Must be consistent throughout fact table - Determines what questions can be answered - All measures must be at same grain level
Slowly Changing Dimensions (SCD)¶
Handle changes to dimensional attributes over time.
Type 1: Overwrite - Replace old value with new value - No history preserved - Use for: Corrections, low-importance changes
UPDATE Customer SET City = 'New York' WHERE CustomerID = 123
Type 2: Add New Record - Create new record with effective dates - Preserves complete history - Use for: Important attribute changes
-- Original record
CustomerKey | CustomerID | Name | City | EffectiveDate | ExpirationDate
501 | 123 | John | Boston | 2020-01-01 | 2023-12-31
-- New record after move
502 | 123 | John | NYC | 2024-01-01 | 9999-12-31
Type 3: Add New Column - Keep both old and new values - Limited history (usually current + previous) - Use for: Predictable changes, limited tracking
CustomerKey | Name | CurrentCity | PreviousCity
501 | John | NYC | Boston
Surrogate Keys¶
Artificial primary keys (usually integers) for dimension records.
Benefits: - Stable references when business keys change - Better join performance - Handle multiple source systems - Support slowly changing dimensions
Implementation:
-- Instead of natural key
ProductKey | ProductCode | ProductName | Category
201 | SKU-12345 | iPhone 15 | Electronics
-- Fact table references surrogate key
SalesKey | ProductKey | CustomerKey | SalesAmount
1001 | 201 | 501 | 999.00
Common Patterns¶
Conformed Dimensions¶
Shared dimensions across multiple fact tables ensuring consistent analysis.
Example: - Customer dimension used by Sales, Support, and Marketing facts - Date dimension shared across all business processes - Product dimension used by Sales, Inventory, and Returns
Benefits: - Consistent metrics across business areas - Integrated reporting capabilities - Simplified ETL processes
Degenerate Dimensions¶
Dimensional data stored directly in fact table (no separate dimension table).
Common Examples: - Order numbers, invoice numbers - Transaction IDs, ticket numbers - Serial numbers, batch codes
When to Use: - Only identifier exists (no descriptive attributes) - High cardinality with little analytical value - Performance optimization for grouping
-- OrderNumber stored in fact table
SalesID | CustomerKey | ProductKey | OrderNumber | SalesAmount
1001 | 501 | 201 | ORD-12345 | 125.99
1002 | 501 | 203 | ORD-12345 | 89.50
Junk Dimensions¶
Consolidate low-cardinality flags and indicators.
Example:
-- Instead of separate dimensions for each flag
JunkKey | PaymentType | ShippingType | GiftWrap | Priority
901 | Credit | Standard | No | Normal
902 | Credit | Express | Yes | High
903 | Debit | Standard | No | Normal
Benefits: - Reduces fact table width - Eliminates tiny dimension tables - Groups related operational flags
Role-Playing Dimensions¶
Same dimension used multiple times in single fact table.
Example:
-- Date dimension used for multiple date types
SalesID | CustomerKey | OrderDateKey | ShipDateKey | DeliveryDateKey
1001 | 501 | 20240115 | 20240116 | 20240118
Implementation: - Create views or aliases for each role - Use descriptive names in fact table - Maintain same dimension structure
Design Best Practices¶
Fact Table Design¶
- Choose Atomic Grain: Store data at lowest level of detail
- Additive Measures: Prefer measures that can be summed
- Consistent Grain: All measures at same level of detail
- Minimize NULLs: Use dimension attributes instead of fact NULLs
Dimension Design¶
- Descriptive Attributes: Include all relevant business context
- User-Friendly Names: Use business terminology
- Hierarchies: Support drill-down analysis paths
- Surrogate Keys: Always use for dimension primary keys
Performance Optimization¶
-
Indexing Strategy:
- Clustered indexes on fact table date columns
- Foreign key indexes on fact tables
- Covering indexes for common query patterns
-
Partitioning:
- Partition large fact tables by date
- Consider customer-based partitioning for very large customers
- Implement partition elimination in queries
-
Aggregation:
- Pre-aggregate common summary levels
- Create OLAP cubes for frequent analysis patterns
- Maintain incremental refresh processes
Implementation Checklist¶
Planning Phase¶
- Define business requirements and KPIs
- Identify source systems and data quality
- Determine fact table grain and measures
- Design dimension hierarchies and attributes
- Plan for slowly changing dimensions
Development Phase¶
- Create staging area for data cleansing
- Implement surrogate key generation
- Build ETL processes with error handling
- Create data quality validation rules
- Implement SCD processing logic
Testing Phase¶
- Validate business rules and calculations
- Test slowly changing dimension scenarios
- Verify referential integrity
- Performance test with realistic data volumes
- User acceptance testing with business users
Deployment Phase¶
- Set up production environments
- Implement monitoring and alerting
- Create user documentation and training
- Establish maintenance procedures
- Plan for ongoing support and enhancements
Common Pitfalls to Avoid¶
Design Mistakes¶
- Inconsistent Grain: Mixing transaction and summary data in same fact
- Over-normalization: Creating snowflake when star would suffice
- Missing Business Keys: Not storing original system identifiers
- Inadequate SCD Planning: Not considering how dimensions change
Performance Issues¶
- Too Many Dimensions: Overly wide fact tables
- Large Dimension Tables: Not using mini-dimensions for large cardinality
- Poor Partitioning: Not aligning with query patterns
- Missing Indexes: Inadequate indexing strategy
Data Quality Problems¶
- Duplicate Records: Lack of proper matching and deduplication
- Orphaned Facts: Facts without corresponding dimensions
- Inconsistent Formatting: Different formats across source systems
- Late Arriving Data: Not handling out-of-sequence data loads
Quick Reference¶
When to Use Star vs Snowflake¶
Star Schema: - Performance is critical - Business users directly query - Storage cost is not major concern - Simpler maintenance requirements
Snowflake Schema: - Storage efficiency important - Large dimension tables - Strong normalization requirements - Complex data relationships
SCD Type Selection¶
Type 1: Corrections, unimportant changes Type 2: Critical business changes requiring history Type 3: Predictable changes with limited history needs
Grain Selection Guidelines¶
- Start with most atomic level possible
- Consider storage and performance implications
- Ensure all business questions can be answered
- Plan for future analytical requirements
E-commerce Star Schema Example¶
Business Scenario¶
Imagine you're working at an online retailer that sells consumer electronics. The business stakeholders come to you with a requirement: "We need to analyze our sales performance to understand which products are selling well, who our best customers are, and how our different stores are performing."
This is a classic dimensional modeling challenge. Let's walk through the thought process of designing a data warehouse to solve this problem.
Understanding the Business Questions¶
Before we start designing tables, we need to understand what questions the business wants to answer. The stakeholders tell us they want to know:
- Which products are our top sellers by month?
- Who are our highest-value customers by region?
- How do weekend sales compare to weekday sales?
- Which stores generate the most revenue?
- What's the average order value by customer segment?
These questions give us clues about what our dimensional model should look like. We can see that we need to track sales events (facts) and analyze them by product, customer, time, and store (dimensions).
Choosing the Grain¶
The most critical decision in dimensional modeling is selecting the grain - what does each row in our fact table represent? We have several options:
We could track at the order level (one row per order), but this wouldn't let us analyze individual products. We could track at the daily summary level (one row per product per day), but this might lose important detail.
For our e-commerce scenario, we choose line item level - one row per product in each order. This atomic grain gives us maximum flexibility. We can always aggregate up to order level or daily level, but we can't drill down if we start with summarized data.
Designing Our Star Schema¶
erDiagram
FACT_SALES {
bigint sales_key
int customer_key
int product_key
int date_key
int store_key
varchar order_number
int quantity
decimal unit_price
decimal discount_amount
decimal total_amount
}
DIM_CUSTOMER {
int customer_key
varchar customer_id
varchar customer_name
varchar email
varchar city
varchar state
varchar country
varchar customer_segment
date registration_date
}
DIM_PRODUCT {
int product_key
varchar product_id
varchar product_name
varchar category
varchar brand
decimal cost
decimal list_price
varchar status
}
DIM_DATE {
int date_key
date full_date
int year
int quarter
int month
varchar month_name
int day
varchar day_name
char is_weekend
char is_holiday
}
DIM_STORE {
int store_key
varchar store_id
varchar store_name
varchar store_type
varchar city
varchar state
varchar region
varchar manager
}
DIM_CUSTOMER ||--o{ FACT_SALES : "purchases"
DIM_PRODUCT ||--o{ FACT_SALES : "sold"
DIM_DATE ||--o{ FACT_SALES : "on_date"
DIM_STORE ||--o{ FACT_SALES : "sold_at"
Fact Table Design Reasoning¶
Our fact table contains the measurable events - the actual sales transactions. Let's think through each component:
Surrogate Key: sales_key
is an artificial primary key. We don't use the natural order number because multiple line items can exist per order.
Foreign Keys: Each foreign key points to a dimension table. Notice how customer_key
, product_key
, date_key
, and store_key
connect our fact to the descriptive information.
Degenerate Dimension: order_number
stays in the fact table because while it's dimensional (we want to group by order), it has no descriptive attributes that warrant a separate table.
Measures: These are the numbers we want to analyze:
- quantity
: How many items were sold
- unit_price
: Price per item (may differ from list price due to promotions)
- discount_amount
: Discount applied to this line item
- total_amount
: Final amount for this line item
Notice that all measures are at the same grain (per line item) and are mostly additive - we can sum them across different dimensions to get meaningful totals.
Dimension Design Reasoning¶
Customer Dimension - The "Who"¶
We chose SCD Type 1 (overwrite changes) for simplicity. In a real scenario, you might want SCD Type 2 if tracking customer moves is important for regional analysis.
The dimension includes geographic information (city, state, country) because the business wants regional analysis. Customer segmentation helps with targeted analysis.
Product Dimension - The "What"¶
Products are organized in a simple hierarchy: Category → Brand → Product. This denormalized approach (keeping all levels in one table) follows star schema principles - it's faster to query than a normalized snowflake approach.
We include both cost and list price because: - Cost enables profit analysis - List price vs. unit price in facts shows promotional impact
Date Dimension - The "When"¶
The date dimension is crucial for time-based analysis. We include: - Calendar attributes (year, quarter, month) - Day-of-week information for weekday vs. weekend analysis - Holiday flags for seasonal analysis
Each date gets its own row (one row per day), and we'll have thousands of rows spanning several years.
Store Dimension - The "Where"¶
This covers both physical stores and online channels. The store_type
field distinguishes between "Physical", "Online", and "Mobile" channels.
Geographic hierarchy (city, state, region) enables territorial analysis for operations and sales management.
How the Model Answers Business Questions¶
Let's trace through how our design answers the original business questions:
"Which products are our top sellers by month?"
SELECT
d.year, d.month_name, p.product_name,
SUM(f.quantity) as units_sold,
SUM(f.total_amount) as revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, d.month_name, p.product_name
ORDER BY d.month, revenue DESC;
"How do weekend sales compare to weekday sales?"
SELECT
CASE WHEN d.is_weekend = 'Y' THEN 'Weekend' ELSE 'Weekday' END as day_type,
COUNT(*) as transaction_count,
SUM(f.total_amount) as total_sales,
AVG(f.total_amount) as avg_transaction_value
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY CASE WHEN d.is_weekend = 'Y' THEN 'Weekend' ELSE 'Weekday' END;
Notice how the star schema makes these queries intuitive - we join the fact table to the dimensions we need and aggregate the measures.
Design Trade-offs and Alternatives¶
Why Star Instead of Snowflake? We could have normalized the product dimension into separate Category and Brand tables. However, the star schema approach offers: - Simpler queries (fewer joins) - Better performance for analytical workloads - Easier understanding for business users - Most BI tools are optimized for star schemas
Why This Grain? Line-item level gives us maximum analytical flexibility. The trade-off is larger fact table size, but storage is usually less expensive than the loss of analytical capability.
SCD Type Decisions We chose Type 1 for simplicity, but in production you might want: - Type 2 for Customer if regional sales history matters when customers move - Type 2 for Product if price history analysis is important - Type 1 for Store (store characteristics rarely change significantly)
Extending the Model¶
As business requirements evolve, this model can grow:
Additional Dimensions: Promotion, Sales Rep, Payment Method Additional Facts: Returns, Inventory, Customer Service interactions Enhanced Attributes: Customer demographics, product specifications, store performance metrics
The key is starting with a solid foundation that answers the core business questions, then evolving the model based on new requirements.
Sample Data to Visualize the Concept¶
-- Sample fact data showing the grain
sales_key | customer_key | product_key | date_key | store_key | order_number | quantity | unit_price | total_amount
1001 | 501 | 201 | 20240315 | 101 | ORD-12345 | 2 | 99.99 | 199.98
1002 | 501 | 203 | 20240315 | 101 | ORD-12345 | 1 | 49.99 | 49.99
1003 | 502 | 201 | 20240316 | 102 | ORD-12346 | 1 | 99.99 | 99.99
-- Customer dimension sample
customer_key | customer_id | customer_name | city | state | customer_segment
501 | CUST-001 | John Smith | Seattle | WA | VIP
502 | CUST-002 | Jane Doe | Portland| OR | Regular
-- Product dimension sample
product_key | product_id | product_name | category | brand | list_price
201 | SKU-001 | iPhone 15 | Electronics | Apple | 999.99
203 | SKU-003 | AirPods Pro | Electronics | Apple | 249.99
Notice how the same order (ORD-12345) appears in multiple fact records because John Smith bought two different products. This is the line-item grain in action.
Key Takeaways for Interviews¶
When discussing this model in an interview context, emphasize:
- Grain Selection: Explain why line-item level was chosen and what alternatives existed
- Business Alignment: Show how the design directly supports stated business requirements
- Query Performance: Demonstrate how star schema enables fast, intuitive queries
- Scalability: Discuss how the model can evolve with new requirements
- Trade-offs: Acknowledge the storage vs. flexibility trade-off and why it's worth it
This example demonstrates solid dimensional modeling fundamentals while remaining simple enough to explain clearly in an interview setting.
Clinical Trial Galaxy Schema Example¶
Business Scenario¶
You're working as a data architect for a pharmaceutical company conducting a Phase III clinical trial for a new diabetes medication. The study involves 500 patients across 25 clinical sites over 18 months. The regulatory team, biostatisticians, and clinical operations team have come to you with complex analytical requirements.
This scenario differs fundamentally from typical business intelligence challenges because clinical trials operate under strict regulatory requirements (FDA, EMA), require sophisticated statistical analysis, and involve patient safety monitoring. Let's explore how dimensional modeling adapts to these unique challenges.
Understanding Clinical Trial Complexity¶
Clinical trials generate multiple types of data that need to be analyzed together:
Efficacy Data: How well does the drug work? Primary endpoint is reduction in HbA1c (blood sugar control). Secondary endpoints include weight loss, blood pressure improvement, and quality of life scores.
Safety Data: What adverse events occur? Every headache, nausea, or serious event must be tracked, categorized, and analyzed for patterns. Regulatory agencies require detailed safety reporting.
Operational Data: Are we following the protocol? Visit scheduling, protocol deviations, and site performance all impact trial success and regulatory approval.
The key insight is that we have fundamentally different types of business events happening - patient measurements and adverse events - that need separate fact tables but must be analyzed together. This leads us to a galaxy schema design.
Designing the Galaxy Schema¶
erDiagram
FACT_PATIENT_MEASUREMENT {
bigint measurement_key
int patient_key
int visit_key
int measurement_type_key
int date_key
int site_key
varchar measurement_id
decimal measurement_value
varchar measurement_unit
decimal baseline_value
decimal change_from_baseline
char within_normal_range
char measurement_quality_flag
}
FACT_ADVERSE_EVENT {
bigint adverse_event_key
int patient_key
int visit_key
int event_type_key
int date_key
int severity_key
varchar event_id
varchar event_description
int duration_days
char related_to_study_drug
char serious_event_flag
char resolved_flag
}
DIM_PATIENT {
int patient_key
varchar patient_id
varchar site_patient_id
int age_at_enrollment
varchar gender
varchar race
varchar ethnicity
decimal baseline_hba1c
decimal baseline_weight
int diabetes_duration_years
varchar enrollment_status
date enrollment_date
date completion_date
}
DIM_VISIT {
int visit_key
varchar visit_code
varchar visit_name
varchar visit_type
int visit_number
int study_day
varchar visit_window
char baseline_visit_flag
char primary_endpoint_flag
varchar visit_status
}
DIM_MEASUREMENT_TYPE {
int measurement_type_key
varchar measurement_code
varchar measurement_name
varchar measurement_category
varchar body_system
decimal normal_range_min
decimal normal_range_max
varchar standard_unit
char primary_endpoint_flag
char safety_parameter_flag
}
DIM_SITE {
int site_key
varchar site_id
varchar site_name
varchar principal_investigator
varchar city
varchar state
varchar country
varchar site_type
date site_activation_date
char site_status
}
DIM_DATE {
int date_key
date full_date
int year
int quarter
int month
varchar month_name
int week
int day_of_week
varchar day_name
int study_day
varchar study_phase
char is_visit_window
}
DIM_EVENT_TYPE {
int event_type_key
varchar event_code
varchar event_term
varchar system_organ_class
varchar preferred_term
varchar event_category
char expected_event_flag
varchar regulatory_classification
}
DIM_SEVERITY {
int severity_key
varchar severity_code
varchar severity_name
varchar severity_description
int severity_rank
char hospitalization_required
char life_threatening_flag
}
DIM_PATIENT ||--o{ FACT_PATIENT_MEASUREMENT : "measured_for"
DIM_VISIT ||--o{ FACT_PATIENT_MEASUREMENT : "during_visit"
DIM_MEASUREMENT_TYPE ||--o{ FACT_PATIENT_MEASUREMENT : "type_of"
DIM_DATE ||--o{ FACT_PATIENT_MEASUREMENT : "on_date"
DIM_SITE ||--o{ FACT_PATIENT_MEASUREMENT : "at_site"
DIM_PATIENT ||--o{ FACT_ADVERSE_EVENT : "experienced_by"
DIM_VISIT ||--o{ FACT_ADVERSE_EVENT : "during_visit"
DIM_EVENT_TYPE ||--o{ FACT_ADVERSE_EVENT : "type_of"
DIM_DATE ||--o{ FACT_ADVERSE_EVENT : "on_date"
DIM_SEVERITY ||--o{ FACT_ADVERSE_EVENT : "severity_level"
Why Galaxy Schema Instead of Star?¶
The fundamental decision here is using multiple fact tables instead of trying to force everything into a single fact. Here's the reasoning:
Different Grains: Patient measurements happen at scheduled visits with precise protocols. Adverse events can happen anytime and are reported whenever discovered. Trying to combine these into one fact table would create a confusing grain.
Different Measures: Efficacy measurements are quantitative (HbA1c = 7.2%) while adverse events are qualitative (severity = "Moderate"). The measures don't make sense together.
Different Analysis Patterns: Efficacy analysis focuses on statistical change over time. Safety analysis focuses on event rates and patterns. These require different analytical approaches.
Regulatory Requirements: FDA submissions require separate efficacy and safety sections with different statistical methodologies. Our data model should align with regulatory reporting needs.
Fact Table Design Deep Dive¶
Patient Measurement Fact - The Efficacy Story¶
This fact table captures every measurement taken on every patient at every visit. The grain is: one row per measurement type per patient per visit.
Pre-calculated Change from Baseline: Rather than calculating this at query time, we store baseline_value
and change_from_baseline
. This is crucial because:
- Baseline calculations are complex (what if a patient missed their baseline visit?)
- Statistical software expects pre-calculated changes
- Performance is critical for large datasets with frequent analysis
Data Quality Flags: within_normal_range
and measurement_quality_flag
capture data quality issues that are crucial for regulatory submissions. Outlier values need flagging and explanation.
Why This Grain: We could aggregate to visit level, but measurement-level detail enables: - Analysis of specific biomarkers - Missing data pattern analysis - Quality control at the measurement level
Adverse Event Fact - The Safety Story¶
This fact table captures every adverse event reported during the study. The grain is: one row per adverse event occurrence.
Relationship to Study Drug: related_to_study_drug
is a critical regulatory field. Investigators must assess whether each event is related to the study medication.
Duration Tracking: duration_days
enables analysis of how long events last, which impacts quality of life and drug tolerability.
Regulatory Classification: Different event types have different regulatory reporting requirements. Serious events must be reported to authorities within 24 hours.
Dimension Design for Clinical Context¶
Patient Dimension - The Study Population¶
Unlike typical customer dimensions, patient characteristics are mostly fixed at enrollment. We use SCD Type 1 because: - Demographics don't change during the study - Medical history is captured at baseline - Any changes (like diabetes duration) are captured through measurements, not dimension updates
Baseline Medical Information: baseline_hba1c
, baseline_weight
, and diabetes_duration_years
enable subgroup analysis. Regulatory agencies want to know if the drug works differently in different patient populations.
Enrollment Tracking: enrollment_status
tracks patients through the study lifecycle (Screened, Enrolled, Completed, Withdrawn). This is crucial for statistical analysis and regulatory compliance.
Visit Dimension - The Protocol Framework¶
Clinical trials follow strict visit schedules defined in the protocol. This dimension captures that structure:
Study Day: Calculated from enrollment date, enabling analysis aligned with the study timeline rather than calendar dates.
Visit Windows: Protocols allow visits within acceptable windows (e.g., Week 12 visit can occur between Day 80-90). visit_window
enables protocol compliance analysis.
Primary Endpoint Visits: primary_endpoint_flag
identifies visits where the main efficacy measurement occurs. These are crucial for the primary statistical analysis.
Measurement Type Dimension - Clinical Standardization¶
This dimension standardizes all the different measurements taken in the study:
Clinical Hierarchy: body_system
and measurement_category
organize measurements (Endocrine → Glucose Control → HbA1c).
Reference Ranges: normal_range_min
and normal_range_max
enable automated flagging of abnormal values for safety monitoring.
Study Relevance: primary_endpoint_flag
and safety_parameter_flag
distinguish between different analysis categories for regulatory reporting.
Event Type Dimension - Medical Coding¶
Adverse events must be coded using standardized medical terminology (MedDRA - Medical Dictionary for Regulatory Activities):
Hierarchy: system_organ_class
→ preferred_term
→ event_term
follows MedDRA structure. For example: "Gastrointestinal disorders" → "Nausea" → "Nausea after meals".
Regulatory Classification: expected_event_flag
indicates whether the event is listed in the drug's safety profile. Unexpected events require special reporting.
Complex Analysis Examples¶
Primary Efficacy Analysis¶
-- Mean change from baseline in HbA1c by visit
SELECT
v.visit_name,
v.study_day,
COUNT(DISTINCT pm.patient_key) as patient_count,
AVG(pm.change_from_baseline) as mean_change,
STDDEV(pm.change_from_baseline) as std_deviation
FROM fact_patient_measurement pm
JOIN dim_measurement_type mt ON pm.measurement_type_key = mt.measurement_type_key
JOIN dim_visit v ON pm.visit_key = v.visit_key
WHERE mt.measurement_code = 'HBA1C'
AND v.primary_endpoint_flag = 'Y'
GROUP BY v.visit_name, v.study_day, v.visit_number
ORDER BY v.visit_number;
This query answers the fundamental question: "Is the drug working?" Statistical significance testing would be applied to these results.
Safety Signal Detection¶
-- Adverse events by body system - looking for concerning patterns
SELECT
et.system_organ_class,
COUNT(*) as total_events,
COUNT(DISTINCT ae.patient_key) as patients_affected,
COUNT(DISTINCT ae.patient_key) * 100.0 /
(SELECT COUNT(DISTINCT patient_key) FROM dim_patient WHERE enrollment_status = 'Enrolled') as incidence_rate,
SUM(CASE WHEN ae.related_to_study_drug = 'Y' THEN 1 ELSE 0 END) as drug_related_events
FROM fact_adverse_event ae
JOIN dim_event_type et ON ae.event_type_key = et.event_type_key
GROUP BY et.system_organ_class
ORDER BY incidence_rate DESC;
This analysis helps identify safety