Hi, I'm

Md. Ahnaf Tanvir

Data Engineer · GCP · Terraform

Data engineer with 2+ years of experience building scalable data infrastructure on GCP: BigQuery warehousing, streaming pipelines, multi-environment Terraform IaC, and full-stack data tools.

About Me

I’m a data engineer based in Dhaka, Bangladesh. I build the pipelines, data models, and cloud infrastructure that quietly keep a business running, and most of what I know I picked up by actually shipping things and learning from what broke along the way.

I studied biomedical engineering at Bangladesh University of Engineering & Technology (BUET), working on medical imaging and a thesis on lung nodule detection with deep learning. Somewhere along the way I realized I cared more about the systems that move and shape data than the specific domain it belonged to. That curiosity eventually led me into data engineering.

For the past two years at G-Star, I’ve been working on production-grade data platforms on GCP, building BigQuery models in Dataform, developing streaming pipelines with Pub/Sub and Dataflow, and managing infrastructure across multiple environments with Terraform. I enjoy building systems that are reliable, scalable, and designed to reduce manual effort wherever possible.

Data Platform & Warehousing

BigQuery
Dataform
dbt
Data Modeling

Streaming & Pipelines

Pub/Sub
Dataflow
Apache Beam
Cloud Functions

Platform Engineering

Terraform
Multi-env GCP IaC
GitHub Actions
IAM

Secure Ingestion

Cloud IAP
Custom VPC
Managed SSL
Secret Manager

Professional Experience

Data Engineer

Full-time

G-Star · Dhaka, Bangladesh

Jan 2025 – Present

Data Warehouse Engineering: Engineered and maintained 100+ BigQuery data models across 12 business domains in Google Dataform, processing data from 6+ source systems across dev/uat/prd environments. This included the end-to-end integration of a new Order Management System, with staging → intermediate → BI layer models for 8+ event types, enabling real-time omnichannel order visibility.
Query Optimization: Refactored legacy models into incremental models, improving query performance by up to 80% and reducing cloud costs by up to $200 per day.
Streaming Data Pipeline: Designed and delivered an end-to-end GCP event ingestion pipeline from scratch: provisioned Pub/Sub topics and BigQuery sink tables via Terraform, and built an Apache Beam / Dataflow streaming job handling XML → JSON transformation and Hive-partitioned GCS storage.
Multi-Environment GCP Infrastructure: Managed GCP infrastructure-as-code (Terraform) across 4 environments (dev/uat/prd/backend) for the production data platform, covering BigQuery external tables, IAM role bindings, and service account provisioning.
Self-Service File Ingestion Portal: Built a FastAPI-based ingestion tool with schema-driven validation supporting CSV, Excel, and JSON uploads, routing validated files into date-partitioned GCS paths compatible with BigQuery external tables. Deployed it on a Terraform-provisioned GCP environment secured with Cloud IAP, managed SSL, and a custom VPC, so non-technical business users can load data without pipeline disruption or public access exposure.
Tableau Metadata Pipeline: Developed an automated Python/GraphQL pipeline that extracts Tableau Cloud metadata (dashboards, sheets, full data lineage) nightly and loads results to GCS using GCP Secret Manager for secure credential retrieval — reducing manual lineage-tracing effort for the entire data analytics team and enabling identification of 300+ downstream Tableau dashboard dependencies.
Team Enablement: Led setup and training of the data analytics team on SQLFluff, Dataform, and VSCode; authored SQL style guides and dev environment documentation adopted by the full team, with automated CI/CD SQLFluff checks via GitHub Actions reducing PR review cycles for SQL inconsistencies.

BigQuery
Dataform
Pub/Sub
Dataflow
Apache Beam
Cloud Functions
Terraform
FastAPI
Cloud IAP

Junior Data Engineer

Full-time

G-Star · Dhaka, Bangladesh

Feb 2024 – Dec 2024

Automated Partner Commission Report: Designed and implemented a scheduled data pipeline using Dataform, BigQuery, and Google Workflows to automate monthly partner commission calculations. Collaborated with finance teams on requirements and business logic, producing a reliable dataset used for financial reporting.
PLM & Salesforce Data Pipelines: Built scalable Product Lifecycle Management pipelines using Python, GCS, and BigQuery with fact/dimension modeling in Dataform for analytics on product lifecycle and supplier performance; maintained and optimized the Salesforce data ingestion process using Airbyte.

Dataform
Google Workflows
Python
GCS
BigQuery
Airbyte

Technical Skills

Languages & Tools

Python
SQL
FastAPI
Docker
Git

GCP

BigQuery
Pub/Sub
Dataflow
GCS
Cloud Functions
Cloud IAP
Cloud Load Balancing
Secret Manager

Data Engineering

Apache Beam
dbt
Dataform
Airflow
Google Workflows
Airbyte

IaC & DevOps

Terraform
GitHub Actions
SQLFluff

ML & Libraries

Pandas
NumPy
Scikit-learn
PyTorch
PyArrow

Self-hosted Stack

Apache Kafka
MinIO
DuckDB
Apache Superset

Featured Projects

GCP Streaming Ingestion Pipeline

Production

End-to-end event ingestion: Pub/Sub topics → Apache Beam / Dataflow → Hive-partitioned GCS → BigQuery sink. Fully Terraform-provisioned.

Pub/Sub
Apache Beam
Dataflow
GCS
BigQuery
Terraform

Designed and delivered an event-driven ingestion pipeline from scratch as part of a new order-management platform integration.

Provisioned Pub/Sub topics and BigQuery sink tables entirely via Terraform.
Built an Apache Beam / Dataflow streaming job that handles XML → JSON transformation, validates events, and routes failures to a dead-letter topic.
Wrote raw payloads to Hive-partitioned GCS paths for cheap replay, alongside a BigQuery sink that uses WRITE_APPEND and clustering on high-cardinality keys.
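
As an illustration only, the per-event logic of such a job can be sketched in plain Python, without the Beam runner. The required fields, the flat XML shape, and the partition keys (`event_type`, `dt`, `hour`) are assumptions made for this sketch, not the production schema.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical required fields; the real event contract is not shown here.
REQUIRED_FIELDS = ("order_id", "event_type")


def xml_to_record(payload: str) -> dict:
    """Flatten the direct children of the root element into a dict."""
    root = ET.fromstring(payload)
    return {child.tag: (child.text or "").strip() for child in root}


def route(payload: str) -> tuple[str, str]:
    """Return ("ok", json) for valid events and ("dead_letter", json) otherwise."""
    try:
        record = xml_to_record(payload)
    except ET.ParseError as exc:
        # Malformed XML: keep the raw payload so it can be replayed later.
        return "dead_letter", json.dumps({"error": str(exc), "raw": payload})
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        return "dead_letter", json.dumps({"error": f"missing fields: {missing}", "raw": payload})
    return "ok", json.dumps(record, sort_keys=True)


def hive_path(prefix: str, event_type: str, ts: datetime) -> str:
    """Build a Hive-style key=value partition path (assumed layout)."""
    return f"{prefix}/event_type={event_type}/dt={ts:%Y-%m-%d}/hour={ts:%H}"
```

In a Beam pipeline the two return tags would typically map to a main output and a tagged side output that feeds the dead-letter topic.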

FastAPI Self-Service Ingestion Portal

Production

Schema-validated CSV / Excel / JSON uploads via FastAPI, routed to date-partitioned GCS paths backing BigQuery external tables. Deployed behind Cloud IAP with managed SSL inside a custom VPC.

FastAPI
Pandas
YAML
GCS
BigQuery
Cloud IAP
Terraform

A browser-based upload tool that lets business users land data into the warehouse without pipeline disruption or public exposure.

Schema-as-config. Each ingestion target is defined once in YAML (columns, dtypes, nullability, enums, primary keys). Pandas validates every upload before anything hits the warehouse, returning line-numbered failure reports.
Secured perimeter. Cloud IAP fronts the service so access lives in IAM, not app code. Managed SSL handles certs; a custom VPC keeps payloads off the public internet.
Warehouse-ready output. Validated files land in date-partitioned GCS paths compatible with BigQuery external tables — analysts can query loaded data within minutes.
Fully Terraform-provisioned alongside the rest of the GCP environment.
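
The schema-as-config idea can be sketched roughly as follows. To stay self-contained, this sketch uses the standard `csv` module and an inline dict in place of Pandas and a YAML file; the column names, rules, and the `dt=` path layout are hypothetical.

```python
import csv
import io
from datetime import date

# Hypothetical schema; in the portal this would be loaded from a YAML file.
SCHEMA = {
    "store_id": {"type": int, "nullable": False},
    "channel": {"type": str, "nullable": False, "enum": {"online", "retail"}},
    "target": {"type": float, "nullable": True},
}


def validate_csv(text: str, schema: dict) -> list[str]:
    """Return line-numbered error messages; an empty list means the file is valid."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        return [f"line 1: missing columns {sorted(missing)}"]
    errors = []
    for row in reader:
        line_no = reader.line_num  # source line of the row just read
        for col, rule in schema.items():
            value = (row[col] or "").strip()
            if value == "":
                if not rule["nullable"]:
                    errors.append(f"line {line_no}: {col} is required")
                continue
            try:
                rule["type"](value)  # cast check only; the value is not stored
            except ValueError:
                errors.append(f"line {line_no}: {col}={value!r} is not {rule['type'].__name__}")
                continue
            if "enum" in rule and value not in rule["enum"]:
                errors.append(f"line {line_no}: {col}={value!r} not in {sorted(rule['enum'])}")
    return errors


def landing_path(target: str, filename: str, day: date) -> str:
    """Date-partitioned object path (assumed layout) for a validated file."""
    return f"{target}/dt={day.isoformat()}/{filename}"
```

Only a file with an empty error list would be written to the landing path; anything else is returned to the user as a failure report.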

Tableau Metadata Pipeline

Production

Nightly Python / GraphQL job that extracts Tableau Cloud metadata (dashboards, sheets, full lineage) and lands it in GCS. Surfaces 300+ downstream dashboard dependencies for impact analysis.

Python
GraphQL
Tableau Metadata API
GCS
GCP Secret Manager

Before this pipeline, “what breaks if we drop this table?” was a manual half-day question.

Pages through the Tableau Metadata API over GraphQL nightly, collecting every workbook, dashboard, sheet, and upstream datasource.
Flattens the graph into a wide GCS / BigQuery dataset (one row per downstream_dashboard ↔ upstream_object pair) alongside a snapshot timestamp for change tracking.
Reduced manual lineage tracing for the data analytics team and surfaced 300+ downstream Tableau dependencies in a single queryable view.
Tableau credentials are retrieved at runtime from GCP Secret Manager, so rotation is a one-line change.
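
The flattening step can be illustrated with a small sketch. The nested input below is a simplified, assumed shape for already-fetched results, not the exact Tableau Metadata API response, and it attributes each workbook's upstream tables to every dashboard in that workbook, which is an approximation.

```python
from datetime import datetime, timezone


def flatten_lineage(workbooks: list[dict], snapshot_ts: str) -> list[dict]:
    """Emit one row per downstream_dashboard / upstream_object pair."""
    rows = []
    for wb in workbooks:
        for dash in wb.get("dashboards", []):
            for table in wb.get("upstreamTables", []):
                rows.append({
                    "workbook": wb["name"],
                    "downstream_dashboard": dash["name"],
                    "upstream_object": table["name"],
                    "snapshot_ts": snapshot_ts,  # lets snapshots be compared over time
                })
    return rows


def dashboards_affected_by(rows: list[dict], upstream_object: str) -> list[str]:
    """Answer 'what breaks if we drop this table?' from the flattened rows."""
    return sorted({r["downstream_dashboard"] for r in rows if r["upstream_object"] == upstream_object})


# Hypothetical sample data for illustration.
sample = [
    {"name": "Sales", "dashboards": [{"name": "Daily Sales"}, {"name": "Returns"}],
     "upstreamTables": [{"name": "orders"}, {"name": "stores"}]},
    {"name": "Ops", "dashboards": [{"name": "Fulfilment"}],
     "upstreamTables": [{"name": "orders"}]},
]
snapshot = datetime(2025, 1, 1, tzinfo=timezone.utc).isoformat()
lineage_rows = flatten_lineage(sample, snapshot)
```

With the rows in this shape, impact analysis becomes a filter on `upstream_object` rather than a manual trace through workbooks.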

Self-Hosted Real-Time Data Platform

Personal Project

End-to-end medallion lakehouse on a single laptop — Kafka in KRaft mode at 50 RPS, MinIO for object storage, DuckDB for query, Airflow for orchestration, Superset for BI. Zero cloud dependency.

Apache Kafka
MinIO
DuckDB
Airflow
Superset
Docker
Python
PyArrow

A production-grade, end-to-end streaming data platform on self-hosted open-source infrastructure, processing synthetic e-commerce events through a Medallion architecture with zero cloud dependency.

Designed a full Medallion data lakehouse (Bronze → Silver → Gold) stored on MinIO (S3-compatible object storage).
Built a 3-broker Apache Kafka cluster (KRaft mode, RF=3, 50 RPS) with a Python producer and buffered consumer, both instrumented with Prometheus metrics for observability.
Implemented schema validation, type casting, and a dead-letter path in the Silver transformation layer using Pandas + PyArrow; rejected records are preserved for auditing.
Orchestrated the full daily batch pipeline with Apache Airflow 2.8 (Silver → Gold → DQ checks → DuckDB view refresh) using partition-aware execution.
Containerized 10+ services with Docker Compose profiles (infra vs. jobs), health-check-gated initialization order, and a one-command Makefile startup.
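
One common way to build a buffered consumer is to flush on either batch size or batch age, whichever comes first. The sketch below shows that pattern in isolation; the class name, thresholds, and `flush_fn` callback are hypothetical, and the Kafka client and the actual write to MinIO are left out.

```python
import time
from typing import Callable


class BufferedSink:
    """Collect records and flush them as a batch on size or age."""

    def __init__(
        self,
        flush_fn: Callable[[list], None],
        max_records: int = 500,
        max_age_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_fn = flush_fn      # e.g. would write one batch file to object storage
        self._max_records = max_records
        self._max_age_s = max_age_s
        self._clock = clock            # injectable so the age rule can be tested
        self._buffer: list = []
        self._first_ts = 0.0

    def add(self, record) -> bool:
        """Buffer a record; return True if this call triggered a flush."""
        if not self._buffer:
            self._first_ts = self._clock()  # age is measured from the first buffered record
        self._buffer.append(record)
        too_big = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._first_ts >= self._max_age_s
        if too_big or too_old:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Hand the current batch to flush_fn and reset; no-op when empty."""
        if self._buffer:
            self._flush_fn(list(self._buffer))
            self._buffer.clear()
```

Note that the age check only runs when a record arrives, so a real consumer loop would also call `flush()` on idle polls and on shutdown.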

Education

Bachelor of Science in Biomedical Engineering

Bangladesh University of Engineering and Technology (BUET) · Dhaka, Bangladesh

Dean’s List Award for Excellent Scholarly Achievement (2023)
University Stipend Scholarship (2022)

2023 · GPA: 3.73 / 4.00

Higher Secondary Certificate

Ananda Mohan College · Mymensingh, Bangladesh

2017 · GPA: 5.00 / 5.00

Secondary School Certificate

Mymensingh Zilla School · Mymensingh, Bangladesh

2015 · GPA: 5.00 / 5.00

Research

Research Experience

Lung Nodule Detection from CT Scan Images

mHealth Lab, Department of Biomedical Engineering, BUET

Undergraduate thesis under Dr. Taufiq Hasan, Professor, Department of Biomedical Engineering, BUET.
Developed a lung-nodule detection system on volumetric CT scans using deep learning for image segmentation and classification, collaborating with graduate and undergraduate researchers.

Jun 2022 – May 2023