
Solving Hidden Performance Challenges in Big Data: How Reveal Addressed Data Skew 

By Janardhan Kasireddy 

In large-scale data processing, certain performance issues often go unnoticed until they begin to significantly impact outcomes. One such issue is data skew—a silent but powerful factor that can degrade performance and inflate processing costs. At Reveal, we recently tackled a complex case of data skew and implemented a solution that dramatically improved efficiency. 
 
Here’s how we diagnosed the problem using statistical methods, resolved it with a targeted fix, and delivered measurable results for Spark jobs running on Amazon Web Services Elastic MapReduce (AWS EMR).

Understanding Data Skew 

Data skew occurs when data is unevenly distributed across partitions in a distributed computing environment. When some partitions contain far more data than others, the tasks assigned to process them take considerably longer, while others remain idle. This imbalance can lead to inefficient resource utilization and extended processing times. 
 
This issue is especially prominent in systems like Apache Spark, where performance depends on the even distribution of data across nodes. 
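
To make the imbalance concrete, here is a minimal, self-contained sketch (our illustration, not code from the job in question) that hash-partitions a toy DataFrame by key and counts the records per partition; the hot key piles all of its rows into a single partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Toy data: one "hot" key (AAPL) dominates the row count.
rows = (
    [("AAPL", i) for i in range(900)]
    + [("MSFT", i) for i in range(50)]
    + [("GOOG", i) for i in range(50)]
)
df = spark.createDataFrame(rows, ["ticker", "trade_id"]).repartition(4, "ticker")

# Count records per partition; a large gap between the biggest and smallest
# partition is the signature of data skew.
sizes = df.rdd.glom().map(len).collect()
print(sizes)  # e.g. one partition holding ~900 rows while the others stay small
```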

Diagnosing the Problem with Kurtosis and Skewness 

To detect and understand the skew, we analyzed Spark job logs using two statistical metrics: 

  • Kurtosis, which measures the “tailedness” of a distribution, i.e., how prone it is to producing extreme values. A high kurtosis value often indicates a concentration of outliers that can distort processing load. 

  • Skewness, which assesses the asymmetry of the distribution. A strongly skewed dataset typically signals that a disproportionate amount of data is concentrated in one area. 

These metrics enabled us to pinpoint the partitions that were causing delays due to imbalanced data. 
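
For reference, the standard sample formulas behind these two metrics (a textbook refresher in our notation, not something taken from the original analysis) are:

```latex
\text{skewness: } g_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{3}
\qquad
\text{excess kurtosis: } g_2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{4} - 3
```

Here the x_i are task durations, x̄ is their mean, and s their standard deviation; g_2 is 0 for a normal distribution, positive for heavy tails, and negative for light ones.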

Analyzing Kurtosis and Skewed Partitions 

We used the following approach to detect skewed partitions: 

  • Kurtosis Threshold: We defined a kurtosis threshold and flagged values above it as a sign of skew, focusing on partitions with high kurtosis, as they showed signs of data concentration. 

  • Task Runtime Analysis: We analyzed the runtime of each stage and correlated long runtimes with high kurtosis values. This analysis helped pinpoint the skewed partitions causing performance bottlenecks. 

Here’s the code we used to parse the event logs and calculate kurtosis: 

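The original listing was published as an image, so the sketch below is a minimal Python reconstruction. It assumes Spark’s standard JSON-lines event log format, in which task-end events carry launch and finish timestamps, and it uses scipy for the statistics; the log path is hypothetical.

```python
import json
from scipy.stats import kurtosis, skew

def task_durations_ms(event_log_path):
    """Extract per-task durations (in ms) from a Spark event log (JSON lines)."""
    durations = []
    with open(event_log_path) as f:
        for line in f:
            event = json.loads(line)
            # Each task-end event records when the task launched and finished.
            if event.get("Event") == "SparkListenerTaskEnd":
                info = event.get("Task Info", {})
                launch, finish = info.get("Launch Time"), info.get("Finish Time")
                if launch is not None and finish is not None:
                    durations.append(finish - launch)
    return durations

durations = task_durations_ms("eventlog/application_1234_0001")  # hypothetical path
print("kurtosis:", kurtosis(durations))  # excess kurtosis: high -> heavy tails, outliers
print("skewness:", skew(durations))      # asymmetry of the duration distribution
```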

This code parses the Spark event logs, extracts the task-duration data, and calculates the kurtosis to detect signs of data skew. A kurtosis value that is significantly high or low indicates a potential issue with how the data is distributed across partitions. 


Case Study: One Ticker Dominating the Pipeline 

In one example, a Spark job processing financial trading data was significantly delayed. Upon investigation, we found that the partition containing Apple (AAPL) stock trading data took far longer to process than the rest: while most partitions completed within 20 minutes, the AAPL partition took over 64 minutes. 
 
The kurtosis for this partition was 19.68, a clear indicator of a distribution with extreme values. This is known as a leptokurtic distribution, one with heavy tails and a higher likelihood of outliers. The imbalance caused a bottleneck, delaying the entire job. 

In an ideal scenario, where the data is evenly distributed, the task runtimes in one of our EMR jobs ranged between roughly 19 and 21 minutes. For example, with the following task durations (in seconds): 

task_data = [1183, 1223, 1260, 1155, 1210, 1287, 1230, 1215, 1160, 1197, 1252, 1220, 1185, 1173, 1245, 1182, 1241, 1265, 1198, 1201, 1190, 1210, 1168, 1250, 1205, 1208, 1234, 1213, 1175, 1181, 1200, 1216, 1236, 1288, 1254, 1196, 1176, 1154] 

The kurtosis of this dataset shows a platykurtic distribution, meaning the data is more spread out, with lighter tails than a normal distribution. 
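
That claim is easy to check (a sketch using scipy; the original analysis may have used different tooling):

```python
from scipy.stats import kurtosis

# Task durations in seconds, from the evenly distributed job above.
task_data = [1183, 1223, 1260, 1155, 1210, 1287, 1230, 1215, 1160, 1197,
             1252, 1220, 1185, 1173, 1245, 1182, 1241, 1265, 1198, 1201,
             1190, 1210, 1168, 1250, 1205, 1208, 1234, 1213, 1175, 1181,
             1200, 1216, 1236, 1288, 1254, 1196, 1176, 1154]

# scipy reports excess kurtosis (0 for a normal distribution); a negative
# value indicates a platykurtic distribution, consistent with the text.
print(kurtosis(task_data))
```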

Figure: Uniform distribution of task runtimes.

 

Implementing a Targeted Solution: Salting 

To resolve the issue, we applied a technique known as salting, which involves temporarily altering the dominant key (in this case, "AAPL") by appending a random suffix (e.g., "AAPL_0", "AAPL_1"). This allowed the records to be distributed across multiple partitions. 
 
Once processing was complete, we removed the salt and regrouped the records under their original key. 
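
Here is a minimal sketch of the pattern in PySpark (the table, columns, and salt count are illustrative, not the production code):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

# Hypothetical stand-in for the trading data; in the real job, "AAPL" rows
# vastly outnumbered every other ticker.
trades = spark.createDataFrame(
    [("AAPL", 100) for _ in range(1000)] + [("MSFT", 50), ("GOOG", 75)],
    ["ticker", "volume"],
)

NUM_SALTS = 16  # how many sub-keys to spread the hot key across

# Step 1: salt the key, e.g. "AAPL" becomes "AAPL_0" ... "AAPL_15".
salt = (F.rand(seed=42) * NUM_SALTS).cast("int").cast("string")
salted = trades.withColumn("salted_ticker", F.concat_ws("_", F.col("ticker"), salt))

# Step 2: aggregate on the salted key, so the hot key's work is split
# across up to NUM_SALTS tasks instead of one.
partial = salted.groupBy("salted_ticker").agg(F.sum("volume").alias("volume"))

# Step 3: strip the salt and regroup under the original key.
result = (
    partial
    .withColumn("ticker", F.regexp_extract("salted_ticker", r"^(.+)_\d+$", 1))
    .groupBy("ticker")
    .agg(F.sum("volume").alias("total_volume"))
)
result.show()
```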
 
This approach: 

  • Significantly reduced processing time for the skewed partition; 

  • Balanced the workload across nodes; and 

  • Improved overall job performance without requiring major architectural changes. 

Results and Benefits 

Following implementation, the performance improvements were immediate and measurable. The processing time for the problematic partition dropped from over an hour to under 20 minutes, bringing it in line with the rest of the data pipeline. In addition to accelerating job completion, the optimization reduced costs by making more efficient use of AWS EMR resources. 

Key Takeaways 

Organizations working with distributed systems should consider the following best practices: 

  • Monitor for Data Skew: Regularly analyze performance metrics such as task duration and resource utilization. 

  • Leverage Descriptive Statistics: Tools like kurtosis and skewness can offer valuable insights into how data is distributed. 

  • Apply Targeted Optimizations: Techniques like salting, broadcast joins, and strategic repartitioning can effectively address skew without significant rework (see the broadcast-join sketch after this list). 

  • Focus on Incremental Improvement: Often, small adjustments guided by clear diagnostics yield substantial performance gains. 
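
To illustrate one of those alternatives, here is a minimal broadcast-join sketch (our illustration, with hypothetical tables, not code from the original pipeline). When one side of a join is small enough to fit in executor memory, broadcasting it ships the small table to every executor, so the large and possibly skewed side is never shuffled by the join key:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical tables: a large, skewed fact table and a small dimension table.
trades = spark.createDataFrame([("AAPL", 100), ("MSFT", 50)], ["ticker", "volume"])
sectors = spark.createDataFrame(
    [("AAPL", "Tech"), ("MSFT", "Tech")], ["ticker", "sector"]
)

# F.broadcast hints Spark to replicate the small table to all executors,
# turning a shuffle join into a local map-side join.
joined = trades.join(F.broadcast(sectors), on="ticker", how="left")
joined.show()
```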

Conclusion 

At Reveal, we specialize in identifying and resolving performance bottlenecks in distributed computing environments. Our approach combines statistical analysis with engineering expertise to deliver scalable, cost-effective solutions. By addressing data skew proactively, we help organizations maximize the performance and efficiency of their data platforms. 
 
If your Spark workloads are underperforming or becoming cost-prohibitive, consider the possibility of data skew—and let’s discuss how Reveal can help you optimize. 
