Eugene de Beste

Infrastructure & Platform Engineering Leader · Cloud Platforms, Reliability, Automation, GPU/HPC

“The right man in the wrong place can make all the difference in the world.”
— G-man, Half-Life 2.

Professional Summary

Infrastructure and platform engineer turned operations leader, with nearly a decade across production cloud, research computing, and HPC. Comfortable moving between strategy and the command line: OpenStack private cloud, Kubernetes/GitOps, observability, automation, networking, storage, and GPU platforms.

Built the automation and observability backbone for a multi-region GPU cloud, then shaped the infrastructure operations function around clear ownership, practical support enablement, incident response, and reliable delivery. Brings a production-safety mindset (dry runs, idempotency, staged rollouts, least privilege, GitOps, and strong documentation) without losing sight of the teams operating the systems.

Core Skills / Technologies

  • Cloud / Platform: OpenStack, Kolla-Ansible, Kubernetes, Kubespray, Cilium, Argo CD, ApplicationSet, Helm, Ceph, MAAS, NetBox, InfraHub, PowerDNS
  • Automation / IaC / Tooling: Ansible, Python, Go, GitOps, CI/CD, Windmill, Packer, Terraform
  • Systems / Virtualisation: Linux, QEMU, KVM, libvirt, OVMF/EDK2, Open vSwitch, SR-IOV
  • GPU / HPC: H100, H200, B200, GH200, GPU virtualisation, NUMA/hugepages, InfiniBand, RoCEv2, GPUDirect RDMA, DCGM
  • Networking / Integration: VLANs, VRRP, MetalLB, BGP, FRR, L2/L3 fabrics, Redfish/IPMI/SNMP
  • Observability / Ops: Prometheus, VictoriaMetrics, Grafana, Alertmanager, incident response & RCA, runbooks, capacity planning, SRE
  • Security / Secrets: IAM & access scopes, OIDC / Authentik, Sealed Secrets, least-privilege design
  • Working style: Production safety, documentation, cross-team enablement, AI-assisted engineering workflows

Acronyms

  • CSIR: Council for Scientific and Industrial Research
  • CHPC: Centre for High Performance Computing
  • SANBI: South African National Bioinformatics Institute
  • UWC: University of the Western Cape
  • UCT: University of Cape Town
  • HISP: Health Information Systems Programme
  • NICD: National Institute for Communicable Diseases

Professional Experience

NexGen Cloud

Infrastructure Operations Manager (Secondment), previously Head of Infrastructure Operations

Multi-region GPU cloud (OpenStack-based). Title updated during internal restructuring; scope unchanged.

  • Built the infrastructure operations function around a clear operating model, escalation paths, and permission/IAM scopes, separating L1/L2 support from infrastructure engineering and reducing repeat escalations into the engineering team.
  • Owned observability platform strategy: designed a unified monitoring architecture feeding a new Network Operations Centre (NOC) and led build-vs-buy / total-cost-of-ownership selection (open-source Prometheus/VictoriaMetrics + DCGM vs commercial), scaling the approach toward a large-scale NVIDIA B200 SuperPOD region.
  • Primary engineer for centralised bare-metal observability, building a NetBox-driven stack where in-region collectors feed a central VictoriaMetrics and Grafana deployment with an alert suite tuned for signal over noise. Currently centralising monitoring across the EU bare-metal region (two clusters).
  • Built CX and L2 enablement across OpenStack, Linux, and networking: training tracks, runbooks, decision trees, and scoped self-service workflows.
  • Coordinated data-centre, hardware, and partner engagement, and led the observability procurement process through vendor evaluation, scenario presentation, and partner justification.
  • Established incident-response and root-cause-analysis (RCA) practice; led major-incident response and authored RCAs for customer-impacting outages.

Senior Infrastructure Engineer
  • Diagnosed and remediated deep GPU virtualisation issues across H100, H200, B200, and GH200 fleets, including NUMA, CPU-pinning, and hugepage scheduling, plus a libvirt XML-marker fix that resolved a modify-restart event race.
  • Enabled GPUDirect RDMA over RoCE/InfiniBand inside VMs (PCIe relaxed-ordering, ATS/ACS, IOMMU), and ran a fleet-wide firmware audit after detecting a faulty H100 VBIOS.
  • Cut large-BAR GPU VM boot times with OVMF/EDK2 and libvirt XML changes on pre-6.14 kernels, a projected 80%+ reduction in affected boots.
  • Built the GPU stock reporter, the platform's capacity source of truth: an event-driven service with leader election, a health-gating killswitch, and ground-truth reconciliation against the Nova database so unhealthy GPUs stay out of sellable capacity.
  • Standardised OpenStack region deployment by building custom Ansible and Python tooling around Kolla-Ansible, supporting major platform releases and accelerating node and region bring-up across four regions.
  • Designed and ran the multi-region Kubernetes platform for internal services on Kubespray, Argo CD / ApplicationSet GitOps, and Cilium, spanning both BGP/L3 and L2-only fabrics and integrating with existing VRRP and MetalLB patterns.
  • Productionised Windmill as an audited, least-privilege self-service automation platform with 10,000+ lines of Python, consolidating six support workflows into two idempotent, state-tracked flows.
  • Built supporting platform tooling: a plugin-based NetBox sync tool reconciling inventory and DNS across NetBox, MAAS, and PowerDNS, and a highly available billing-metering stack (Gnocchi + Ceph + MySQL InnoDB Cluster + Redis Sentinel + HAProxy/BGP).
  • Migrated NFS workloads to Ceph RBD and tuned RBD performance for VMs; planned and executed a largely automated migration of 300+ virtual machines, with hands-on workload troubleshooting.

CSIR / CHPC

Senior Cloud and HPC Technologist II
  • Led the OpenStack Research Cloud and ACE Lab at the CHPC, part of the CSIR, as R&D platforms for cloud/HPC experimentation, while also contributing to Sebowa production-cloud operations serving hundreds of researchers.
  • Deployed the CHPC Pretoria region for Sebowa and trained the local team on OpenStack operations, support, and policy alignment.
  • Architected and operated multi-petabyte Ceph storage for HPC workloads, building monitoring and inventory systems that improved operational visibility.
  • Co-led the South African Student Cluster Competition programme, mentoring teams and developing training material for cohorts progressing to the ISC Student Cluster Challenge.

UCT / ILIFU

External Consultant
  • Planned and delivered a private OpenStack cloud for astronomy and bioinformatics with multi-petabyte Ceph storage.
  • Prototyped the platform on test hardware, then moved it into production on OpenStack and Ceph, evaluating Manila for file services.
  • Supported handover and ongoing technical guidance during deployment, including cloud and storage troubleshooting.

SANBI / UWC

Systems Developer
  • Trained and supported student teams for the CHPC Student Cluster Competition, redesigning delivery for fully remote operation during the pandemic.
  • Migrated in-house VM management onto OpenStack and supported HISP, NICD, and UWC HPC deployments across cloud and bare-metal environments.
  • Automated bare-metal and cloud builds with MAAS, PXE, Ansible, and Terraform, introducing FreeIPA, monitoring, and change-management practices.
  • Supported Ceph storage, networking, and helpdesk operations across research and IT teams.

Selected Projects

GPU Support Diagnostics

  • Built a Go collector and a React/TypeScript diagnostics dashboard for L2 support, with GPU log analysis (NVRM/uvm/Xid) and automatic bare-metal-vs-VM detection.
  • Tuned detection to cut false positives, and designed a secure, no-auto-upload run path so customer diagnostics stay under operator control.

Open-source GPU PCIe Hotfix

  • Published a remediation for a recurring GPU "falling off the bus" PCIe fault encountered in production.

African Pathogen Archive

  • Helped secure the CHPC and SANBI MoU and shaped the Infrastructure Automation Lead role.
  • Built Flux CD automation for repeatable deployment across Kubernetes and OpenStack.

Education

SANBI / UWC

M.Sc. Bioinformatics (Cum Laude)

UCT

B.Sc. Hons Information Technology (Cum Laude)

UWC

B.Sc. Computer Science (Cum Laude)

Scholarly Impact & Recognition

Publications

Awards

First Place Overall Prize for the ISC'14 Student Cluster Challenge, International Supercomputing Conference, Leipzig, Germany
First Place for the CHPC Student Cluster Competition 2013, Centre for High Performance Computing, Council for Scientific and Industrial Research