Ashraf Miah
Jan 30, 2022

How big is the dataset you were looking at? The advantage of pandas (where it can load the data in memory) is that it's still quicker for exploration, as you don't have the distribution overhead.

The difficulty I've had with the new pandas API on Spark is that getting the data into one place to generate the plot can take a long time, almost irrespective of the cluster size.
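
One way to reduce that cost is to shrink the data before it is gathered, for example by counting histogram bins per partition and merging only the counts. The following is a minimal sketch of that idea in plain Python, not Spark code: the `partitions` list is made-up data standing in for chunks held on different workers, and `bin_counts` and `merged_histogram` are hypothetical helper names.

```python
from collections import Counter


def bin_counts(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        if lo <= v <= hi:
            # Clamp the index so that v == hi lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


def merged_histogram(partitions, lo, hi, n_bins):
    """Merge per-partition bin counts into a single list of n_bins totals."""
    total = Counter()
    for part in partitions:
        # On a cluster this step would run where the data lives;
        # here it is an ordinary loop (an assumption of the sketch).
        total.update(bin_counts(part, lo, hi, n_bins))
    return [total[i] for i in range(n_bins)]


# Hypothetical data: three small "partitions".
partitions = [[0.5, 1.5, 2.5], [2.7, 3.1, 9.9], [10.0, 4.2]]
hist = merged_histogram(partitions, lo=0.0, hi=10.0, n_bins=5)
print(hist)  # [2, 3, 1, 0, 2]
```

Only `n_bins` numbers leave each partition, so the amount of data brought together for the plot depends on the number of bins rather than the number of rows. Whether this helps in practice will depend on the plot type and on how the data is laid out.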

Ashraf Miah

CTO, Data Scientist & Chartered Engineer (MEng CEng EUR ING MRAeS) with over 20 years' experience in the Aerospace, Rail & Energy industries.