Here are some items to consider when thinking about efficiency, processing speed, and cost, and where inefficiency tends to come from. I'll just point out that shuffling data from one stage to another for the purposes of grouping can be a source of inefficiency. That's not as easy to see or detect as, say, slow input or slow output. Long-running jobs are a symptom, so you might want to measure resources and time between stages, or run tests on successively larger samples to verify how the pipeline scales.

Remember to avoid using the SELECT wildcard (SELECT *) in SQL statements. Filter with a WHERE clause, not LIMIT. LIMIT only limits the output, not the work it took to get there; WHERE filters the input.

We already discussed shuffling, but a more hidden cause of inefficiency is data skew. Skewed data causes most of the work to be allocated to one worker while the rest of the workers sit and wait for that worker to complete. The GROUP BY clause works best when the number of groups is small and the data is easily divided among them. A large number of groups won't scale well. For example, grouping on a high-cardinality key such as an ID can perform increasingly poorly as the data grows and the number of possible groups grows with it. Understand the fields you're using as keys when you join. Limit the use of user-defined functions and use native SQL whenever possible.

There are tools available, such as the query explanation map, which shows how processing occurred at each stage. This is a great way to diagnose performance issues and narrow down the specific parts of the query that might be the cause. You'll also find overall statistics and ratios that can be instructive, for example, the time the slowest workers spent reading input data, doing CPU-bound computation, or writing output data, which you can compare to the average. I'll sketch a few of these points below.
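For the idea of testing on successively larger samples to see how the pipeline scales, here is one way to sketch it, assuming an engine that supports TABLESAMPLE; the table and column names (events, event_type) are hypothetical.

    -- Run the same query over roughly 1%, then 10%, then all of the data and compare timings.
    -- If runtime grows much faster than the sample size, something isn't scaling linearly.
    SELECT event_type, COUNT(*) AS n
    FROM events TABLESAMPLE SYSTEM (1 PERCENT)
    GROUP BY event_type;

    SELECT event_type, COUNT(*) AS n
    FROM events TABLESAMPLE SYSTEM (10 PERCENT)
    GROUP BY event_type;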
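To make the WHERE-versus-LIMIT point concrete, here is a minimal sketch. The table and column names (orders, order_id, order_date, total) are made up for illustration.

    -- Inefficient: scans every column and every row, then throws most of it away.
    -- LIMIT trims the output, but the engine still did all the work to produce it.
    SELECT *
    FROM orders
    LIMIT 100;

    -- Better: name only the columns you need and filter the input with WHERE,
    -- so the engine can skip rows (and, in columnar stores, columns) up front.
    SELECT order_id, total
    FROM orders
    WHERE order_date >= DATE '2024-01-01';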
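For the GROUP BY cardinality and skew point, a hedged sketch: the names (page_views, user_id, country) are hypothetical, and the exact behavior depends on your engine, but the shape of the problem is the same.

    -- Grouping on a high-cardinality key such as a user ID creates one group per user.
    -- As the data grows, the number of groups grows with it and the shuffle gets expensive.
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    GROUP BY user_id;

    -- Grouping on a low-cardinality key divides the data into a small, stable set of groups
    -- that spreads more evenly across workers (assuming country values aren't heavily skewed).
    SELECT country, COUNT(*) AS views
    FROM page_views
    GROUP BY country;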
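And on preferring native SQL over user-defined functions, one illustrative sketch, assuming a BigQuery-style engine that supports JavaScript UDFs; the clean_name function and the customers table are invented for the example.

    -- A JavaScript UDF forces row-at-a-time evaluation outside the SQL engine.
    CREATE TEMP FUNCTION clean_name(s STRING)
    RETURNS STRING
    LANGUAGE js AS "return s.trim().toUpperCase();";

    SELECT clean_name(customer_name) FROM customers;

    -- The native equivalent stays inside the engine's optimized execution path.
    SELECT UPPER(TRIM(customer_name)) FROM customers;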