Ashraf Miah
May 6, 2023

--

You mentioned it briefly, but this is an apples-and-pears comparison. You're taking a distributed computing framework and using it on a job that can run on a single node anyway.

In that regard, why use dbt and DuckDB at all? You could use any of the single-node-optimised tools such as Pandas, Polars or Dask.

The real conclusion appears to be: use the right tool for the job, and don't use Spark where it's not needed.

--

Ashraf Miah

CTO, Data Scientist & Chartered Engineer (MEng CEng EUR ING MRAeS) with over 20 years' experience in the Aerospace, Rail & Energy industries.