Data Engineer
Posted 2025-08-23
Remote, USA
Full Time
Immediate Start
<b>Description</b><br><p>Fetcherr, a team of experts in deep learning, algorithms, e-commerce, and digitization, is disrupting traditional systems with its cutting-edge AI technology. At its core is the Large Market Model (LMM), an adaptable AI engine that forecasts demand and market trends with precision, empowering real-time decision-making. Specializing initially in the airline industry, Fetcherr aims to revolutionize other industries with dynamic, AI-driven solutions.</p><p>Fetcherr is seeking a <strong>Data Engineer</strong> to build large-scale, optimized data pipelines using cutting-edge technology and tools. We're looking for someone with advanced Python skills and a deep understanding of memory and CPU optimization in distributed environments. This is a high-impact role with responsibilities that directly influence the company's strategic decisions and data-driven initiatives.</p><p><strong>Key Responsibilities:</strong></p><ul><li>Design and build scalable, cross-client data pipelines and transformation workflows using modern ELT tools, ensuring high performance, reusability, and cost-efficiency across diverse data products. Leverage orchestration frameworks like Dagster to manage dependencies, retries, and monitoring.</li><li>Develop and operate distributed data processing systems that handle large-scale workloads efficiently, adapting to dynamic data volumes and infrastructure constraints. Apply frameworks such as Dask or Spark to unlock parallelism and optimize compute resource utilization.</li><li>Deliver robust, maintainable Python solutions by applying sound software engineering principles, including modular architecture, reusable components, and shared libraries. Ensure code quality and operational resilience through CI/CD best practices and containerized deployments.</li><li>Collaborate with data scientists, engineers, and product teams to deliver validated, analytics-ready data that aligns with business requirements. Support team-wide adoption of data modeling standards and efficient data access patterns.</li><li>Proactively safeguard data quality and reliability by implementing anomaly detection, validation frameworks, and statistical or ML-based techniques to forecast trends and catch regressions early. Enforce backward compatibility and data contract integrity across pipeline changes.</li><li>Document workflows, interfaces, and architectural decisions in a clear and structured manner to support long-term maintainability. Maintain up-to-date data contracts, system runbooks, and onboarding guides for effective cross-team collaboration.</li></ul><br><b>Requirements</b><br><p><strong>You’ll be a great fit if you have...</strong></p><ul><li>7+ years of hands-on experience building and maintaining production-grade data pipelines at scale</li><li>Expertise in Python, with a strong grasp of data structures, performance optimization, and modern data processing libraries (e.g., pandas, NumPy)</li><li>Practical experience with distributed computing frameworks such as Dask or Spark, including performance tuning and memory management</li><li>Proficiency in SQL, with a deep understanding of query optimization, analytical functions, and cost-efficient query design</li><li>Experience designing and managing transformation logic using dbt, with a focus on modular development, testability, and scalable performance across large datasets</li><li>Strong understanding of ETL/ELT architecture, data modeling principles, and data validation</li><li>Familiarity with cloud platforms (e.g., GCP, AWS) and modern data storage formats and platforms (e.g., Parquet, BigQuery, Delta Lake)</li><li>Experience with CI/CD workflows, Docker, and orchestrating workloads in Kubernetes</li></ul><p><strong>Nice to Have:</strong></p><ul><li>Experience with Dagster or similar workflow orchestration tools</li><li>Familiarity with automated testing frameworks for data workflows, such as pytest, Great Expectations, Pandera, Behave, Hamilton, dbt tests, or similar</li><li>Deep interest in performance optimization and vectorized computation, especially in Dask/pandas-based pipelines</li><li>Ability to design cross-client, cost-efficient solutions that prioritize scalability, modularity, and minimal resource consumption</li><li>Strong grounding in software architecture best practices, including adherence to principles such as SOLID, YAGNI, KISS, DRY, CoC, OOP, CoI, and LOD, and code reuse through shared libraries (<strong>a strong plus</strong>)</li></ul><br>