Why Choose Spark for Data Science?
Even though everyone talks about big data, it usually takes a while in your career before you actually work with it. At Wix.com we have well over 160M users, so we generate a lot of data and need to scale our data operations. For me, that moment came sooner than I expected.
Although there are other options (such as Dask), we chose Spark for two key reasons: (1) it is the most advanced and most widely used big data technology, and (2) we already had the infrastructure in place to support Spark.
How to Write PySpark When You Know pandas
You’re probably familiar with pandas, and by familiar I mean fluent, like your mother tongue :)
The title of the talk Data Wrangling with PySpark for Data Scientists Who Know Pandas says it all, and it’s an excellent talk.
It’s worth pointing out that while learning the syntax is a nice place to start, understanding how Spark works is essential for writing effective PySpark code. Spark is hard to get working properly, but when it does work, it performs fantastically!
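To make the syntax point concrete, here is a minimal side-by-side sketch of the same aggregation in pandas and in PySpark. It assumes an active SparkSession named spark and a hypothetical CSV file users.csv with age and country columns.

```python
import pandas as pd
from pyspark.sql import functions as F

# pandas: the whole dataset lives in the memory of a single machine
pdf = pd.read_csv("users.csv")
result_pd = (pdf[pdf["age"] > 21]
             .groupby("country")["age"]
             .mean()
             .reset_index())

# PySpark: the same logic, but the DataFrame is distributed across the cluster
sdf = spark.read.csv("users.csv", header=True, inferSchema=True)
result_spark = (sdf.filter(F.col("age") > 21)
                .groupBy("country")
                .agg(F.avg("age").alias("age")))
result_spark.show()
```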
I won’t go into much detail here, but for a more thorough explanation I encourage you to read the MapReduce description in the article The Hitchhiker’s guide to handling Big Data with Spark.
The concept we’re trying to understand here is horizontal scaling.
It’s easier to start with vertical scaling: if we have pandas code that works perfectly but the data grows too large for it, we can move to a stronger machine with more memory and hope it copes. Even though we scaled up, a single machine still handles all the data at once. If, instead, we use MapReduce, chunk the data, and give each chunk its own machine to process, we are scaling horizontally.
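As a toy illustration of the MapReduce idea behind horizontal scaling, here is a hedged sketch that assumes an active SparkSession named spark: the data is split into partitions, each partition is mapped independently (potentially on a different machine), and the partial results are then reduced into one answer.

```python
# split one million numbers into 8 partitions; each partition can be
# processed by a different executor/machine
nums = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# map: every partition squares its own chunk independently
squared = nums.map(lambda x: x * x)

# reduce: the partial results from all partitions are combined into one value
total = squared.reduce(lambda a, b: a + b)
print(total)
```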
5 Best Practices for Spark
These five Spark best practices enabled me to scale our project and cut runtime by a factor of ten.
1. Start off modestly by sampling the data
If we want to make big data work, we first need to confirm that we’re headed in the right direction using a tiny slice of it. In my project I sampled 10% of the data and verified that the pipelines worked properly; this let me use the SQL section of the Spark UI and watch the numbers grow through the entire flow, without waiting too long for the process to finish.
In my experience, if you can hit your target runtime on a small sample, scaling up is usually straightforward.
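Here is a minimal sketch of this practice, assuming an active SparkSession named spark and a hypothetical Parquet dataset at events/ with a user_id column; the aggregation is just a stand-in for your real pipeline.

```python
df = spark.read.parquet("events/")

# develop against ~10% of the data first; sample() is approximate, since
# `fraction` is a per-row probability rather than an exact size
df_sample = df.sample(fraction=0.1, seed=42)

result = df_sample.groupBy("user_id").count()   # stand-in for the real logic
result.write.mode("overwrite").parquet("events_sample_output/")
```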
2. Have a fundamental understanding of tasks, partitions, and cores
When using Spark, this is arguably the most important thing to understand: each partition is processed by a single task, and each task runs on a single core.
Keeping track of the number of tasks in each stage and matching it to the number of cores in your Spark session will help you stay aware of how many partitions you have. Here are a few rules of thumb to help you do this (all of them need to be tested against your own case; a short sketch follows the list below):
- Aim for a task-to-core ratio of about 2-4 tasks per core.
- Depending on the worker’s memory, each partition should be between 200MB and 400MB in size. Adjust the size to suit your needs.
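Here is the short sketch promised above, assuming an active SparkSession named spark and an existing DataFrame df; treat the numbers as a starting point to tune, not a rule.

```python
# cores available to the application
total_cores = spark.sparkContext.defaultParallelism

# aim for roughly 2-4 tasks (i.e. partitions) per core
target_partitions = total_cores * 3
df = df.repartition(target_partitions)

# check how many partitions the DataFrame actually has
print(df.rdd.getNumPartitions())
```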
3. Spark debugging
Spark uses lazy evaluation, which means it doesn’t start processing the graph of computation instructions until an action is invoked; show() and count() are two examples of actions.
This makes it very hard to pinpoint bugs, or the parts of our code that need optimization. One technique I found useful was breaking the code into sections with df.cache() and then calling df.count() to make Spark compute the df at each section.
Now you can examine the computation of each section in the Spark UI and spot the issues. It’s important to remember that using this technique without the sampling described in (1) will probably result in an extremely long runtime that is hard to debug.
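Here is a minimal sketch of the cache()/count() trick, assuming an existing DataFrame df with hypothetical country and revenue columns; the two transformations just stand in for your own pipeline steps.

```python
from pyspark.sql import functions as F

# segment 1
df = df.filter(F.col("revenue") > 0)
df = df.cache()   # keep this intermediate result in memory
df.count()        # action: forces Spark to compute this segment now

# segment 2
df = df.groupBy("country").agg(F.sum("revenue").alias("total_revenue"))
df = df.cache()
df.count()        # each segment now shows up as its own job in the Spark UI
```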
4. Identifying and addressing skewness
Let’s start by defining skewness. As we already discussed, our data is split into partitions, and as transformations are applied, the size of each partition is likely to change. This can create a large size difference between partitions, which means our data is skewed.
You can spot skewness by looking at the stage details in the Spark UI and checking for a substantial gap between the max and median task durations: a big gap means a few of the tasks were significantly slower than the others.
Why is this bad? Because it can leave cores sitting idle while the next stages wait for these few tasks to finish.
Ideally, if you know where the skewness comes from, you can adjust the partitioning and address it directly. If you have no clue, or no direct way to fix it, try the following:
- Adjusting the ratio of tasks to cores
As noted above, by having more tasks than cores we expect the other cores to stay busy with other tasks while the longer task runs. Even so, the 2-4:1 ratio described earlier can’t fully absorb such a large difference in task duration. We can try pushing the ratio to 10:1 and see whether it helps, but this approach may have other drawbacks.
- Data salting
During salting, the data is repartitioned using a random key so that the new partitions are balanced; a minimal sketch follows below.
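This is a hedged salting sketch, assuming an existing DataFrame df that is heavily skewed on a hypothetical user_id column; the number of salts is something to tune to your level of skew.

```python
from pyspark.sql import functions as F

N_SALTS = 32  # tune to the amount of skew you see

# add a random salt and repartition on (key, salt) so that hot keys are
# spread across several partitions instead of landing in a single one
df_salted = (df
             .withColumn("salt", (F.rand(seed=42) * N_SALTS).cast("int"))
             .repartition("user_id", "salt"))
```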
5. Issues with Spark’s iterative code
This was a really tough one. As noted above, Spark uses lazy evaluation, so running the code only builds a computational graph, a DAG. But when you have an iterative process, this approach becomes a real problem, because the DAG includes all the previous iterations and grows extremely large, I mean extremely large. This may be more than the driver can hold in memory. The problem is hard to pinpoint because the application appears to be stuck: for a while the Spark UI shows no jobs running, which is in fact the case, until the driver eventually fails.
This is currently an inherent issue with Spark, and the workaround I found was to call df.checkpoint() or df.localCheckpoint() every 5 to 6 iterations (find your own number by experimenting a bit). This works because checkpoint(), unlike cache(), breaks the lineage and the DAG, saves the results, and starts again from the new checkpoint. The drawback is that if something bad happens, you no longer have the complete DAG available to recreate the df.
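Here is a hedged sketch of this workaround, assuming an active SparkSession named spark, an existing DataFrame df, and a hypothetical iterate() function that represents one step of your iterative algorithm.

```python
# checkpoint() needs a directory to write to (localCheckpoint() does not)
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

CHECKPOINT_EVERY = 5  # find your own number by experimenting a bit

for i in range(100):
    df = iterate(df)  # hypothetical: one iteration of your algorithm
    if (i + 1) % CHECKPOINT_EVERY == 0:
        # checkpoint() breaks the lineage/DAG, materializes the result, and
        # continues from the saved data, so the plan stops growing
        df = df.checkpoint()
```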
Summary
As I mentioned earlier, it takes time to learn how to make Spark work its magic, but these five practices really moved my project forward and sprinkled some Spark magic over my code.