Md. Ahnaf Tanvir

Hi, I'm Md. Ahnaf Tanvir

Data Engineer · GCP · Terraform

Data Engineer with 2+ years of experience building scalable GCP data infrastructure: a BigQuery data warehouse, streaming pipelines, multi-environment Terraform IaC, and full-stack data tools.

About Me

I’m a data engineer based in Dhaka, Bangladesh. I build the pipelines, data models, and cloud infrastructure that quietly keep a business running, and most of what I know I picked up by actually shipping things and watching what broke.

I studied biomedical engineering at BUET (Bangladesh University of Engineering & Technology), working on medical imaging and a thesis on lung nodule detection with deep learning. Somewhere along the way I realized I cared more about the systems that move and shape data than the specific domain it described, and I went looking for work that let me build those systems end to end. I found that at G-Star, where I’ve spent the last two years working on a production GCP data platform: BigQuery models in Dataform, streaming pipelines on Pub/Sub and Dataflow, multi-environment Terraform.

Data Platform & Warehousing

  • BigQuery
  • Dataform
  • dbt
  • Data Modeling

Streaming & Pipelines

  • Pub/Sub
  • Dataflow
  • Apache Beam
  • Cloud Functions

Platform Engineering

  • Terraform
  • Multi-env GCP IaC
  • GitHub Actions
  • IAM

Secure Ingestion

  • Cloud IAP
  • Custom VPC
  • Managed SSL
  • Secret Manager

Professional Experience

Data Engineer

Full-time

G-Star · Dhaka, Bangladesh

Jan 2025 – Present
  • Data Warehouse Engineering: Engineered and maintained 100+ BigQuery data models across 12 business domains in Google Dataform, processing data from 6+ source systems across dev/uat/prd environments. This included the end-to-end integration of a new Order Management System, with staging → intermediate → BI layer models for 8+ event types, enabling real-time omnichannel order visibility.
  • Query Optimization: Refactored legacy models into incremental models, improving query performance by up to 80% and reducing cloud costs by up to $200 per day.
  • Streaming Data Pipeline: Designed and delivered an end-to-end GCP event ingestion pipeline from scratch: provisioned Pub/Sub topics and BigQuery sink tables via Terraform, and built an Apache Beam / Dataflow streaming job handling XML→JSON transformation and Hive-partitioned GCS storage.
  • Multi-Environment GCP Infrastructure: Managed GCP infrastructure-as-code (Terraform) across 4 environments (dev/uat/prd/backend) for the production data platform, covering BigQuery external tables, IAM role bindings, and service account provisioning.
  • Self-Service File Ingestion Portal: Built a FastAPI-based ingestion tool with schema-driven validation supporting CSV, Excel, and JSON uploads, routing validated files into date-partitioned GCS paths compatible with BigQuery external tables. Deployed on a Terraform-provisioned GCP environment secured with Cloud IAP, managed SSL, and a custom VPC, it lets non-technical business users load data without pipeline disruption or public access exposure.
  • Tableau Metadata Pipeline: Developed an automated Python/GraphQL pipeline that extracts Tableau Cloud metadata (dashboards, sheets, full data lineage) nightly and loads the results to GCS, using GCP Secret Manager for secure credential retrieval. This reduced manual lineage-tracing effort for the entire data analytics team and enabled identification of 300+ downstream Tableau dashboard dependencies.
  • Team Enablement: Led setup and training of the data analytics team on SQLFluff, Dataform, and VSCode; authored SQL style guides and dev environment documentation adopted by the full team, with automated CI/CD SQLFluff checks via GitHub Actions reducing PR review cycles for SQL inconsistencies.
  • BigQuery
  • Dataform
  • Pub/Sub
  • Dataflow
  • Apache Beam
  • Cloud Functions
  • Terraform
  • FastAPI
  • Cloud IAP

Junior Data Engineer

Full-time

G-Star · Dhaka, Bangladesh

Feb 2024 – Dec 2024
  • Automated Partner Commission Report: Designed and implemented a scheduled data pipeline using Dataform, BigQuery, and Google Workflows to automate monthly partner commission calculations. Collaborated with finance teams on requirements and business logic, producing a reliable dataset used for financial reporting.
  • PLM & Salesforce Data Pipelines: Built scalable Product Lifecycle Management pipelines using Python, GCS, and BigQuery with fact/dimension modeling in Dataform for analytics on product lifecycle and supplier performance; maintained and optimized the Salesforce data ingestion process using Airbyte.
  • Dataform
  • Google Workflows
  • Python
  • GCS
  • BigQuery
  • Airbyte

Technical Skills

Languages & Tools

  • Python
  • SQL
  • FastAPI
  • Docker
  • Git

GCP

  • BigQuery
  • Pub/Sub
  • Dataflow
  • GCS
  • Cloud Functions
  • Cloud IAP
  • Cloud Load Balancing
  • Secret Manager

Data Engineering

  • Apache Beam
  • dbt
  • Dataform
  • Airflow
  • Google Workflows
  • Airbyte

IaC & DevOps

  • Terraform
  • GitHub Actions
  • SQLFluff

ML & Libraries

  • Pandas
  • NumPy
  • Scikit-learn
  • PyTorch
  • PyArrow

Self-hosted Stack

  • Apache Kafka
  • MinIO
  • DuckDB
  • Apache Superset

Featured Projects

GCP Streaming Ingestion Pipeline

Production

End-to-end event ingestion: Pub/Sub topics → Apache Beam / Dataflow → Hive-partitioned GCS → BigQuery sink. Fully Terraform-provisioned.

  • Pub/Sub
  • Apache Beam
  • Dataflow
  • GCS
  • BigQuery
  • Terraform

Designed and delivered an event-driven ingestion pipeline from scratch as part of a new order-management platform integration.

  • Provisioned Pub/Sub topics and BigQuery sink tables entirely via Terraform.
  • Built an Apache Beam / Dataflow streaming job that handles XML → JSON transformation, validates events, and routes failures to a dead-letter topic.
  • Wrote raw payloads to Hive-partitioned GCS paths for cheap replay, alongside a BigQuery sink that uses WRITE_APPEND and clustering on high-cardinality keys.
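
The transform-and-route step described above can be sketched in plain Python, without the Beam runner. This is a minimal illustration rather than the production job: the required fields (order_id, event_type), the one-level XML shape, the partition keys in hive_path, and the gs://bucket/raw prefix are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical required fields; the real event schema is not given in the text.
REQUIRED_FIELDS = ("order_id", "event_type")


def xml_to_record(xml_payload: str) -> dict:
    """Flatten a one-level XML event into a dict of tag -> text."""
    root = ET.fromstring(xml_payload)
    return {child.tag: (child.text or "").strip() for child in root}


def route_event(xml_payload: str) -> Tuple[str, str]:
    """Return ("main", json) for valid events, ("dead_letter", json) otherwise."""
    try:
        record = xml_to_record(xml_payload)
    except ET.ParseError as exc:
        # Malformed XML: keep the raw payload so it can be inspected or replayed.
        return "dead_letter", json.dumps({"error": str(exc), "raw": xml_payload})
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        error = "missing fields: " + ", ".join(missing)
        return "dead_letter", json.dumps({"error": error, "raw": xml_payload})
    return "main", json.dumps(record, sort_keys=True)


def hive_path(prefix: str, event_type: str, ts: datetime) -> str:
    """Build a Hive-style partitioned path (key=value directories)."""
    return f"{prefix}/event_type={event_type}/dt={ts:%Y-%m-%d}/hour={ts:%H}/"


if __name__ == "__main__":
    sample = "<event><order_id>A1</order_id><event_type>created</event_type></event>"
    print(route_event(sample))
    print(hive_path("gs://bucket/raw", "created", datetime.now(timezone.utc)))
```

In a Beam pipeline, logic like route_event would typically live inside a DoFn that emits the dead-letter branch as a tagged output.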

FastAPI Self-Service Ingestion Portal

Production

Schema-validated CSV / Excel / JSON uploads via FastAPI, routed to date-partitioned GCS paths backing BigQuery external tables. Deployed behind Cloud IAP with managed SSL inside a custom VPC.

  • FastAPI
  • Pandas
  • YAML
  • GCS
  • BigQuery
  • Cloud IAP
  • Terraform

A browser-based upload tool that lets business users land data into the warehouse without pipeline disruption or public exposure.

  • Schema-as-config. Each ingestion target is defined once in YAML (columns, dtypes, nullability, enums, primary keys). Pandas validates every upload before anything hits the warehouse, returning line-numbered failure reports.
  • Secured perimeter. Cloud IAP fronts the service so access lives in IAM, not app code. Managed SSL handles certs; a custom VPC keeps payloads off the public internet.
  • Warehouse-ready output. Validated files land in date-partitioned GCS paths compatible with BigQuery external tables — analysts can query loaded data within minutes.
  • Fully Terraform-provisioned alongside the rest of the GCP environment.
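
The schema-as-config idea can be illustrated with a small validator. This is a sketch under stated assumptions, not the portal's code: the SCHEMA dict stands in for a parsed YAML target definition, its column names are hypothetical, and the standard-library csv module is used in place of Pandas so the example has no dependencies.

```python
import csv
import io
from typing import List

# Hypothetical schema, shaped like what a YAML target definition might parse into.
SCHEMA = {
    "columns": {
        "store_id": {"dtype": "int", "nullable": False},
        "channel": {"dtype": "str", "nullable": False, "enum": ["web", "retail"]},
        "amount": {"dtype": "float", "nullable": True},
    },
    "primary_key": ["store_id", "channel"],
}

CASTERS = {"int": int, "float": float, "str": str}


def validate_csv(text: str, schema: dict) -> List[str]:
    """Return line-numbered error messages; an empty list means the file passed."""
    errors = []
    seen_keys = set()
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in schema["columns"] if c not in (reader.fieldnames or [])]
    if missing:
        return ["line 1: missing columns " + ", ".join(missing)]
    for row in reader:
        line = reader.line_num  # physical line in the file, header is line 1
        for name, rule in schema["columns"].items():
            value = row[name] or ""  # short rows yield None
            if value == "":
                if not rule["nullable"]:
                    errors.append(f"line {line}: {name} is required")
                continue
            try:
                CASTERS[rule["dtype"]](value)
            except ValueError:
                errors.append(f"line {line}: {name}={value!r} is not {rule['dtype']}")
                continue
            if "enum" in rule and value not in rule["enum"]:
                errors.append(f"line {line}: {name}={value!r} not in {rule['enum']}")
        key = tuple(row[c] for c in schema["primary_key"])
        if key in seen_keys:
            errors.append(f"line {line}: duplicate primary key {key}")
        seen_keys.add(key)
    return errors


if __name__ == "__main__":
    upload = "store_id,channel,amount\n1,web,9.5\nx,phone,\n"
    for message in validate_csv(upload, SCHEMA):
        print(message)
```

Keeping the rules in data rather than code means a new ingestion target needs only a new schema entry, which is the point of defining each target once in YAML.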

Tableau Metadata Pipeline

Production

Nightly Python / GraphQL job that extracts Tableau Cloud metadata (dashboards, sheets, full lineage) and lands it in GCS. Surfaces 300+ downstream dashboard dependencies for impact analysis.

  • Python
  • GraphQL
  • Tableau Metadata API
  • GCS
  • GCP Secret Manager

Before this pipeline, “what breaks if we drop this table?” was a manual half-day question.

  • Pages through the Tableau Metadata API over GraphQL nightly, collecting every workbook, dashboard, sheet, and upstream datasource.
  • Flattens the graph into a wide GCS / BigQuery dataset (one row per downstream_dashboard ↔ upstream_object pair) alongside a snapshot timestamp for change tracking.
  • Reduced manual lineage tracing for the data analytics team and surfaced 300+ downstream Tableau dependencies in a single queryable view.
  • Tableau credentials retrieved at runtime from GCP Secret Manager — rotation is a one-line change.
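
The flattening step can be sketched as follows. The input shape here is a simplified, hypothetical stand-in for one page of workbook metadata (the real Tableau Metadata API response has more fields and nesting), and lineage is approximated at the workbook level for brevity.

```python
from typing import Dict, Iterator, List

# Hypothetical, simplified page of workbook metadata used only for illustration.
SAMPLE_PAGE = [
    {
        "name": "Sales Overview",
        "dashboards": [{"name": "Daily Sales"}, {"name": "Returns"}],
        "upstreamTables": [{"name": "orders"}, {"name": "stores"}],
    }
]


def flatten_lineage(workbooks: List[dict], snapshot_ts: str) -> Iterator[Dict[str, str]]:
    """Yield one row per downstream_dashboard / upstream_object pair."""
    for wb in workbooks:
        for dash in wb.get("dashboards", []):
            for table in wb.get("upstreamTables", []):
                yield {
                    "snapshot_ts": snapshot_ts,  # enables change tracking across runs
                    "workbook": wb["name"],
                    "downstream_dashboard": dash["name"],
                    "upstream_object": table["name"],
                }


if __name__ == "__main__":
    for row in flatten_lineage(SAMPLE_PAGE, "2025-01-01T00:00:00Z"):
        print(row)
```

With rows in this shape, "what breaks if we drop this table?" becomes a filter on upstream_object.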

Self-Hosted Real-Time Data Platform

Personal Project

End-to-end medallion lakehouse on a single laptop — Kafka in KRaft mode at 50 RPS, MinIO for object storage, DuckDB for query, Airflow for orchestration, Superset for BI. Zero cloud dependency.

  • Apache Kafka
  • MinIO
  • DuckDB
  • Airflow
  • Superset
  • Docker
  • Python
  • PyArrow

A production-grade, end-to-end streaming data platform on self-hosted open-source infrastructure, processing synthetic e-commerce events through a Medallion architecture with zero cloud dependency.

  • Designed a full Medallion data lakehouse (Bronze → Silver → Gold) stored on MinIO (S3-compatible object storage), processing e-commerce events end-to-end with no cloud dependency.
  • Built a 3-broker Apache Kafka cluster (KRaft mode, RF=3, 50 RPS) with a Python producer and buffered consumer, both instrumented with Prometheus metrics for observability.
  • Implemented schema validation, type casting, and a dead-letter path in the Silver transformation layer using Pandas + PyArrow; rejected records are preserved for auditing.
  • Orchestrated the full daily batch pipeline with Apache Airflow 2.8 (Silver → Gold → DQ checks → DuckDB view refresh) using partition-aware execution.
  • Containerized 10+ services with Docker Compose profiles (infra vs. jobs), health-check-gated initialization order, and a one-command Makefile startup.
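
A buffered consumer of the kind mentioned above might look like the sketch below. It assumes flushing by batch size or batch age; the BufferedSink name, the thresholds, and the flush callback (standing in for a write to the Bronze layer on MinIO) are hypothetical and not taken from the project.

```python
import time
from typing import Callable, List


class BufferedSink:
    """Collect messages and flush them in batches, by size or by age."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_size: int = 500,
        max_age_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_size = max_size
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer: List[dict] = []
        self._first_at = 0.0

    def add(self, message: dict) -> None:
        if not self._buffer:
            # Age is measured from the first message in the current batch.
            self._first_at = self._clock()
        self._buffer.append(message)
        too_big = len(self._buffer) >= self._max_size
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Hand the current batch to the callback; no-op when empty."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)


if __name__ == "__main__":
    sink = BufferedSink(lambda batch: print(f"flushed {len(batch)}"), max_size=3)
    for i in range(7):
        sink.add({"n": i})
    sink.flush()  # drain the remainder on shutdown
```

Batching trades a little latency for fewer, larger object writes, which usually suits object storage better than one write per event.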

Education

Bachelor of Science in Biomedical Engineering

Bangladesh University of Engineering and Technology (BUET) · Dhaka, Bangladesh

  • Dean’s List Award for Excellent Scholarly Achievement (2023)
  • University Stipend Scholarship (2022)
2023 GPA — 3.73 / 4.00

Higher Secondary Certificate

Ananda Mohan College · Mymensingh, Bangladesh

2017 GPA — 5.00 / 5.00

Secondary School Certificate

Mymensingh Zilla School · Mymensingh, Bangladesh

2015 GPA — 5.00 / 5.00

Research

Research Experience

Lung Nodule Detection from CT Scan Images

mHealth Lab, Department of Biomedical Engineering, BUET

  • Undergraduate thesis under Dr. Taufiq Hasan, Professor, Department of Biomedical Engineering, BUET.
  • Developed a lung-nodule detection system on volumetric CT scans using deep learning for image segmentation and classification, collaborating with graduate and undergraduate researchers.
Jun 2022 – May 2023

Get In Touch