Building Serverless ETL Pipelines on AWS

In this post, I'll share my experience building a serverless ETL pipeline that extracts medical entities from clinical documents and maps them to ICD-10 codes, using AWS services including Lambda, SNS, Textract, Translate, Comprehend Medical, Glue, and Athena.

The Challenge

Working at Genentech, I faced the challenge of extracting medical entities from multilingual clinical notes that came in various formats including handwritten images, PDFs, and DOCX files. These entities needed to be mapped to standardized ICD-10 codes for analysis.

Architecture Overview

Our solution leveraged a serverless architecture with these key components:

Document Ingestion: Amazon S3 for storage, with event notifications triggering the pipeline
Text Extraction: Amazon Textract for converting documents to machine-readable text
Translation: Amazon Translate for handling multilingual content
Entity Recognition: Amazon Comprehend Medical for identifying medical terms
Code Mapping: Custom Lambda function to map recognized entities to ICD-10 codes
Data Storage: Processed results stored in S3 and cataloged in AWS Glue
Analysis: Amazon Athena for SQL-based querying of the processed data
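
To make the flow between these components concrete, here is a minimal Python sketch of how the stages could be chained inside a single Lambda handler. It is an illustration under several assumptions, not the production code: it assumes S3 invokes the function directly (rather than through SNS), it uses the synchronous boto3 calls `detect_document_text`, `translate_text`, and `detect_entities_v2`, and the `ICD10_LOOKUP` table, the `min_score` threshold, and all function names are hypothetical. Writing the results back to S3 is left out.

```python
from urllib.parse import unquote_plus

# Hypothetical lookup table for illustration only; a real mapping would be
# loaded from a maintained ICD-10 reference, not hard-coded.
ICD10_LOOKUP = {
    "hypertension": "I10",
    "type 2 diabetes": "E11.9",
}


def s3_objects(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def extract_text(textract, bucket, key):
    """Run synchronous text detection and join the detected lines."""
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)


def to_english(translate, text):
    """Translate to English, letting the service detect the source language."""
    response = translate.translate_text(
        Text=text, SourceLanguageCode="auto", TargetLanguageCode="en"
    )
    return response["TranslatedText"]


def medical_conditions(comprehend_medical, text, min_score=0.8):
    """Return the text of medical-condition entities above a confidence threshold."""
    response = comprehend_medical.detect_entities_v2(Text=text)
    return [
        e["Text"]
        for e in response["Entities"]
        if e["Category"] == "MEDICAL_CONDITION" and e["Score"] >= min_score
    ]


def map_to_icd10(terms):
    """Map each term to a code from the lookup table, or None if unknown."""
    return {t: ICD10_LOOKUP.get(t.strip().lower()) for t in terms}


def process_document(clients, bucket, key):
    """Extract, translate, recognize, and map a single document."""
    text = extract_text(clients["textract"], bucket, key)
    english = to_english(clients["translate"], text)
    terms = medical_conditions(clients["comprehendmedical"], english)
    return {"source": f"s3://{bucket}/{key}", "codes": map_to_icd10(terms)}


def handler(event, context, clients=None):
    """Lambda entry point; clients can be injected to test without AWS."""
    if clients is None:
        import boto3  # imported lazily so the helpers run without AWS access

        names = ("textract", "translate", "comprehendmedical")
        clients = {name: boto3.client(name) for name in names}
    return [process_document(clients, b, k) for b, k in s3_objects(event)]
```

Passing the clients in as a dictionary keeps each stage testable with simple stand-ins. The synchronous calls used here have input size limits, so longer notes would need chunking or the asynchronous variants of these APIs, and file types that Textract does not accept would need their own extraction step before translation.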

Implementation Details

[Content continues with technical implementation details]

Benefits and Results

The serverless approach provided several advantages:

Cost Efficiency: Pay-per-use model reduced operational costs by 40%
Scalability: Automatic scaling handled varying document loads
Maintenance: Reduced operational overhead compared to EC2-based solutions
Accuracy: Achieved 92% accuracy in entity recognition and mapping

Lessons Learned

[Content continues with lessons learned and best practices]

Conclusion

Serverless ETL pipelines offer significant advantages for processing unstructured medical data at scale. By leveraging AWS's managed services, we created a solution that was both cost-effective and powerful.

Feel free to reach out if you have questions about implementing similar solutions in your organization!