
Share Latest Nov-2022 Associate-Developer-Apache-Spark Test Practice Test Questions, Exam Dumps
Certification Topics of the Databricks Associate Developer Apache Spark Exam:
To begin with, Spark Architecture: Conceptual understanding (17%)
Then, Spark Architecture: Applied understanding (11%)
Lastly, Spark DataFrame API Applications (72%)
NEW QUESTION 80
Which of the following describes a narrow transformation?
- A. A narrow transformation is an operation in which no data is exchanged across the cluster.
- B. A narrow transformation is a process in which data from multiple RDDs is used.
- C. A narrow transformation is an operation in which data is exchanged across the cluster.
- D. A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
- E. A narrow transformation is an operation in which data is exchanged across partitions.
Answer: A
Explanation:
A narrow transformation is an operation in which no data is exchanged across the cluster.
Correct! In narrow transformations, no data is exchanged across the cluster, since these transformations do not require any data from outside of the partition they are applied on. Typical narrow transformations include filter, drop, and coalesce.
A narrow transformation is an operation in which data is exchanged across partitions.
No, that would be one definition of a wide transformation, but not of a narrow transformation. Wide transformations typically cause a shuffle, in which data is exchanged across partitions, executors, and the cluster.
A narrow transformation is an operation in which data is exchanged across the cluster.
No, see explanation just above this one.
A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
No, type conversion has nothing to do with narrow transformations in Spark.
A narrow transformation is a process in which data from multiple RDDs is used.
No. A resilient distributed dataset (RDD) can be described as a collection of partitions. In a narrow transformation, no data is exchanged between partitions. Thus, no data is exchanged between RDDs.
One could say though that a narrow transformation and, in fact, any transformation results in a new RDD being created. This is because a transformation results in a change to an existing RDD (RDDs are the foundation of other Spark data structures, like DataFrames). But, since RDDs are immutable, a new RDD needs to be created to reflect the change caused by the transformation.
More info: Spark Transformation and Action: A Deep Dive | by Misbah Uddin | CodeX | Medium
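The difference between narrow and wide transformations can be illustrated with a minimal plain-Python sketch. This is a model only, not Spark code: the partition layout and the helper names narrow_filter and wide_group_by are assumptions made for illustration.

```python
from collections import defaultdict

# Plain-Python model: a dataset is a list of partitions, each a list of rows.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def narrow_filter(parts, predicate):
    """Filter each partition using only that partition's own rows."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_by(parts, key_fn, num_partitions):
    """Group rows by key. Rows with the same key must meet in one output
    partition, so rows move between partitions (this models a shuffle)."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for row in part:
            key = key_fn(row)
            shuffled[hash(key) % num_partitions][key].append(row)
    return [dict(groups) for groups in shuffled]


evens = narrow_filter(partitions, lambda x: x % 2 == 0)
by_remainder = wide_group_by(partitions, lambda x: x % 3, num_partitions=3)

print(evens)         # each output partition derives from exactly one input partition
print(by_remainder)  # each group collects rows from all three input partitions
```

In the filter, every output partition depends on a single input partition, which is what makes the transformation narrow. In the grouping, every output partition needs rows from several input partitions, which is why Spark has to shuffle.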
NEW QUESTION 81
Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
- A. transactionsDf.distinct().select("storeId").count()
- B. transactionsDf.select("storeId").dropDuplicates().count()
- C. transactionsDf.select(count("storeId")).dropDuplicates()
- D. transactionsDf.dropDuplicates().agg(count("storeId"))
- E. transactionsDf.select(distinct("storeId")).count()
Answer: B
Explanation:
transactionsDf.select("storeId").dropDuplicates().count()
Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.
transactionsDf.select(count("storeId")).dropDuplicates()
No. transactionsDf.select(count("storeId")) just returns a single-row DataFrame showing the number of non-null values in column storeId. dropDuplicates() does not have any effect in this context.
transactionsDf.dropDuplicates().agg(count("storeId"))
Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not do so taking only column storeId into consideration, but eliminates full row duplicates instead.
transactionsDf.distinct().select("storeId").count()
Wrong. transactionsDf.distinct() identifies unique rows across all columns, but not only unique rows with respect to column storeId. This may leave duplicate values in the column, making the count not represent the number of unique values in that column.
transactionsDf.select(distinct("storeId")).count()
False. There is no distinct method in pyspark.sql.functions.
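Why the order of deduplicating and selecting matters can be seen in a small plain-Python sketch. The rows below are made up for illustration, and sets stand in for Spark's dropDuplicates() and distinct().

```python
# Plain-Python model of transactionsDf: each row is (transactionId, storeId).
rows = [(1, 25), (2, 2), (3, 25), (4, 3), (5, 25)]

# Models transactionsDf.select("storeId").dropDuplicates().count():
# project the column first, then deduplicate, then count.
select_then_dedupe = len({store_id for _, store_id in rows})

# Models transactionsDf.distinct().select("storeId").count():
# deduplicate whole rows first, then project, then count.
dedupe_then_select = len([store_id for _, store_id in set(rows)])

print(select_then_dedupe)  # 3 distinct store IDs: 25, 2, 3
print(dedupe_then_select)  # 5, because every full row is already unique
```

Because transactionId differs in every row, deduplicating full rows removes nothing, and the subsequent count simply returns the number of rows.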
NEW QUESTION 82
Which of the following code blocks reads all CSV files in directory filePath into a single DataFrame, with column names defined in the CSV file headers?
Content of directory filePath:
_SUCCESS
_committed_2754546451699747124
_started_2754546451699747124
part-00000-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-298-1-c000.csv.gz
part-00001-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-299-1-c000.csv.gz
part-00002-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-300-1-c000.csv.gz
part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz
- A. spark.read.format("csv").option("header",True).load(filePath)
- B. spark.read.load(filePath)
- C. spark.read.format("csv").option("header",True).option("compression","zip").load(filePath)
- D. spark.read().option("header",True).load(filePath)
- E. spark.option("header",True).csv(filePath)
Answer: A
Explanation:
The files in directory filePath are partitions of a DataFrame that have been exported using gzip compression.
Spark automatically recognizes this situation and imports the CSV files as separate partitions into a single DataFrame. It is, however, necessary to specify that Spark should load the file headers in the CSV with the header option, which is set to False by default.
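What such a read has to do can be sketched in plain Python with the standard library: skip the marker files, decompress each gzip part file, and take the column names from the header row. This is only a model of the behavior, not Spark's implementation, and the shortened file names and sample values are invented for the example.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Build a small directory resembling the export above (file names are made up).
directory = Path(tempfile.mkdtemp())
(directory / "_SUCCESS").write_text("")
for index, data_rows in enumerate([[("1", "25")], [("2", "3")]]):
    with gzip.open(directory / f"part-{index:05d}.csv.gz", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["transactionId", "storeId"])  # header in every part file
        writer.writerows(data_rows)

# Read every data file into one list of records, taking names from the headers.
records = []
for path in sorted(directory.iterdir()):
    if path.name.startswith("_"):
        continue  # marker files such as _SUCCESS hold no data
    with gzip.open(path, "rt", newline="") as f:
        records.extend(csv.DictReader(f))

print(records)
```

Without the header handling, the first line of each part file would be treated as data, which corresponds to leaving Spark's header option at its default of False.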
NEW QUESTION 83
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient executor memory is available, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
- A. Caching is not supported in Spark, data are always recomputed.
- B. The code block uses the wrong operator for caching.
- C. The DataFrameWriter needs to be invoked.
- D. The storage level is inappropriate for fault-tolerant storage.
- E. Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
Answer: D
Explanation:
The storage level is inappropriate for fault-tolerant storage.
Correct. Typically, when thinking about fault tolerance and storage levels, you would want to store redundant copies of the dataset. This can be achieved by using a storage level such as StorageLevel.MEMORY_AND_DISK_2.
The code block uses the wrong operator for caching.
Wrong. In this case, DataFrame.persist() needs to be used, since this operator supports passing a storage level. DataFrame.cache() does not support passing a storage level.
Caching is not supported in Spark, data are always recomputed.
Incorrect. Caching is an important component of Spark, since it can help to accelerate Spark programs to a great extent. Caching is often a good idea for datasets that need to be accessed repeatedly.
Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
No. Caching is either accessed through DataFrame.cache() or DataFrame.persist().
The DataFrameWriter needs to be invoked.
Wrong. The DataFrameWriter can be accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, we find keywords such as "cache" and "executor memory" that point us away from using external data stores. We aim to save data to memory to accelerate the reading process, since reading from disk is comparatively slower. The DataFrameWriter does not write to memory, so we cannot use it here.
More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science
NEW QUESTION 84
Which of the following describes the role of the cluster manager?
- A. The cluster manager schedules tasks on the cluster in client mode.
- B. The cluster manager schedules tasks on the cluster in local mode.
- C. The cluster manager allocates resources to the DataFrame manager.
- D. The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
- E. The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
Answer: E
Explanation:
The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
Correct. The cluster manager runs on the cluster, on a node other than the client machine. From there it allocates resources and starts and ends executor processes on the cluster nodes as required by the Spark application running on the Spark driver. This holds in client mode just as in cluster mode; the two modes differ in where the driver runs.
The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
Wrong, there is no "remote" execution mode in Spark. Available execution modes are local, client, and cluster.
The cluster manager allocates resources to the DataFrame manager.
Wrong, there is no "DataFrame manager" in Spark.
The cluster manager schedules tasks on the cluster in client mode.
No, in client mode, the Spark driver schedules tasks on the cluster - not the cluster manager.
The cluster manager schedules tasks on the cluster in local mode.
Wrong: In local mode, there is no "cluster". The Spark application is running on a single machine, not on a cluster of machines.
NEW QUESTION 85
Which of the following describes how Spark achieves fault tolerance?
- A. Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.
- B. If an executor on a worker node fails while calculating an RDD, that RDD can be recomputed by another executor using the lineage.
- C. Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.
- D. Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.
- E. Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.
Answer: B
Explanation:
Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.
Wrong - Between transformations, DataFrames are immutable. Given that Spark also records the lineage, Spark can reproduce any DataFrame in case of failure. These two aspects are the key to understanding fault tolerance in Spark.
Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.
Wrong. RDD stands for Resilient Distributed Dataset, and it is at the core of Spark and not a "legacy system". It is fault-tolerant by design.
Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.
This is not true. For supporting recovery in case of worker failures, Spark provides "_2", "_3", and so on, storage level options, for example MEMORY_AND_DISK_2. These storage levels are specifically designed to keep duplicates of the data on multiple nodes. This saves time in case of a worker fault, since a copy of the data can be used immediately, vs. having to recompute it first.
Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.
No, Spark is fault-tolerant by design.
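Lineage-based recovery, as described in the correct answer, can be modeled in a few lines of plain Python. This is a sketch under simplifying assumptions: the names source_partitions, lineage, and compute_partition are hypothetical, and a dictionary entry stands in for the result held by an executor.

```python
# Plain-Python model of lineage-based recovery; the names are illustrative only.
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage is the ordered list of transformations applied to the source data.
lineage = [
    lambda part: [x * 10 for x in part],       # like a map
    lambda part: [x for x in part if x > 10],  # like a filter
]


def compute_partition(partition_id):
    """Rebuild one partition by replaying the lineage on its source data."""
    data = source_partitions[partition_id]
    for transformation in lineage:
        data = transformation(data)
    return data


# Results held by "executors"; losing one entry models an executor failure.
computed = {pid: compute_partition(pid) for pid in source_partitions}
del computed[1]  # the executor holding partition 1 fails

# Another executor recomputes only the lost partition from the lineage.
computed[1] = compute_partition(1)
print(computed)
```

The recomputation works because the source data and the recorded transformations never change, which is the role immutability plays in Spark's fault tolerance.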
NEW QUESTION 86
Which of the following describes the role of tasks in the Spark execution hierarchy?
- A. Stages with narrow dependencies can be grouped into one task.
- B. Tasks with wide dependencies can be grouped into one stage.
- C. Tasks are the smallest element in the execution hierarchy.
- D. Tasks are the second-smallest element in the execution hierarchy.
- E. Within one task, the slots are the unit of work done for each partition of the data.
Answer: C
Explanation:
Stages with narrow dependencies can be grouped into one task.
Wrong, tasks with narrow dependencies can be grouped into one stage.
Tasks with wide dependencies can be grouped into one stage.
Wrong, since a wide transformation causes a shuffle which always marks the boundary of a stage. So, you cannot bundle multiple tasks that have wide dependencies into a stage.
Tasks are the second-smallest element in the execution hierarchy.
No, they are the smallest element in the execution hierarchy.
Within one task, the slots are the unit of work done for each partition of the data.
No, tasks are the unit of work done per partition. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
NEW QUESTION 87
The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.
from pyspark import StorageLevel
transactionsDf.__1__(StorageLevel.__2__).__3__
- A. 1. persist
2. DISK_ONLY_2
3. count()
- B. 1. persist
2. MEMORY_ONLY_2
3. select()
- C. 1. cache
2. MEMORY_ONLY_2
3. count()
- D. 1. cache
2. DISK_ONLY_2
3. count()
- E. 1. persist
2. MEMORY_ONLY_2
3. count()
Answer: E
Explanation:
Correct code block:
from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY_2).count()
Only persist takes different storage levels, so any option using cache() cannot be correct. persist() is evaluated lazily, so an action needs to follow this command. select() is not an action, but count() is - so all options using select() are incorrect.
Finally, the question states that "the executors' memory should be utilized as much as possible, but not writing anything to disk". This points to a MEMORY_ONLY storage level. In this storage level, partitions that do not fit into memory will be recomputed when they are needed, instead of being written to disk, as with the storage option MEMORY_AND_DISK. Since the data need to be duplicated across two executors, _2 needs to be appended to the storage level.
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 88
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
Sample of DataFrame itemsDf:
+------+--------------------+--------------------+-------------------+
|itemId|            itemName|          attributes|           supplier|
+------+--------------------+--------------------+-------------------+
|     1|Thick Coat for Wa...|[blue, winter, cozy]|Sports Company Inc.|
|     2|Elegant Outdoors ...|[red, summer, fre...|              YetiX|
|     3|   Outdoors Backpack|[green, summer, t...|Sports Company Inc.|
+------+--------------------+--------------------+-------------------+
- A. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()
- B. itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()
- C. itemsDf.select(~col('supplier').contains('X')).distinct()
- D. itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()
- E. itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()
Answer: A
Explanation:
Output of correct code block:
+-------------------+
|           supplier|
+-------------------+
|Sports Company Inc.|
+-------------------+
The key to this question is understanding which operator negates a condition: the ~ (not) operator. In addition, you should know that DataFrames have no unique() method.
Static notebook | Dynamic notebook: See test 1
NEW QUESTION 89
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
- A. 1. where
2. col(supplier).contains("Sports")
3. explode(attributes)
4. itemName
- B. 1. where
2. "Sports".isin(col("Supplier"))
3. "itemName"
4. array_explode("attributes")
- C. 1. filter
2. col("supplier").contains("Sports")
3. "itemName"
4. explode("attributes")
- D. 1. filter
2. col("supplier").isin("Sports")
3. "itemName"
4. explode(col("attributes"))
- E. 1. where
2. col("supplier").contains("Sports")
3. "itemName"
4. "attributes"
Answer: C
Explanation:
Output of correct code block:
+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways to solve gap questions and filter out wrong answers. You do not always have to start from the first gap; you can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are actually synonyms in PySpark, so using either of those is fine. The answer options to this gap therefore do not help us in selecting the right answer.
The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python's string does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. Then, you are left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here. We would use the isin operator if we wanted to filter for supplier names that match any entries in a list of supplier names.
Finally, we are left with two answers that fill the third gap both with "itemName" and the fourth gap either with explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row - this is what the explode() operator does.
One answer option also includes array_explode(), which is not a valid operator in PySpark.
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
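The combined effect of the filter and the explode step can be sketched in plain Python using the sample rows from the question. This is a model rather than Spark code: a substring test stands in for contains(), and a nested comprehension stands in for explode().

```python
# Plain-Python model of itemsDf rows, using the sample data from the question.
items = [
    {"itemName": "Thick Coat for Walking in the Snow",
     "attributes": ["blue", "winter", "cozy"], "supplier": "Sports Company Inc."},
    {"itemName": "Elegant Outdoors Summer Dress",
     "attributes": ["red", "summer", "fresh", "cooling"], "supplier": "YetiX"},
    {"itemName": "Outdoors Backpack",
     "attributes": ["green", "summer", "travel"], "supplier": "Sports Company Inc."},
]

# filter(col("supplier").contains("Sports")) keeps rows by substring match;
# explode("attributes") emits one output row per array element.
exploded = [
    (item["itemName"], attribute)
    for item in items
    if "Sports" in item["supplier"]
    for attribute in item["attributes"]
]

for row in exploded:
    print(row)
```

The two matching items carry three attributes each, which gives the six rows shown in the output table above.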
NEW QUESTION 90
The code block displayed below contains an error. The code block should return all rows of DataFrame transactionsDf, but including only columns storeId and predError. Find the error.
Code block:
spark.collect(transactionsDf.select("storeId", "predError"))
- A. The take method should be used instead of the collect method.
- B. Instead of collect, collectAsRows needs to be called.
- C. The collect method is not a method of the SparkSession object.
- D. Instead of select, DataFrame transactionsDf needs to be filtered using the filter operator.
- E. Columns storeId and predError need to be represented as a Python list, so they need to be wrapped in brackets ([]).
Answer: C
Explanation:
Correct code block:
transactionsDf.select("storeId", "predError").collect()
collect() is a method of the DataFrame object.
More info: pyspark.sql.DataFrame.collect - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 91
Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?
- A. spark.mode("parquet").read("/FileStore/imports.parquet")
- B. spark.read.path("/FileStore/imports.parquet", source="parquet")
- C. spark.read().format('parquet').open("/FileStore/imports.parquet")
- D. spark.read.parquet("/FileStore/imports.parquet")
- E. spark.read().parquet("/FileStore/imports.parquet")
Answer: D
Explanation:
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/23.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 92
Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
- A. transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
- B. transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
- C. transactionsDf.max('value').min('value')
- D. transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
- E. transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Answer: D
Explanation:
transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
Correct. groupby and aggregate is a common pattern to investigate aggregated values of groups.
transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
Wrong. While DataFrame.agg() accepts dictionaries, the syntax of the dictionary in this code block is wrong. If you use a dictionary, the syntax should be like {"value": "max"}, using the column name as the key and the aggregating function as the value.
transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Incorrect. While this is valid Spark syntax, it does not achieve what the question asks for. The question specifically asks for values to be aggregated per value in column productId - this column is not considered here. Instead, the max() and min() values are calculated as if the entire DataFrame were a single group.
transactionsDf.max('value').min('value')
Wrong. There is no DataFrame.max() method in Spark, so this command will fail.
transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
No. While this may work if the column names are expressed as strings, this will not work as is. Python will interpret the column names as variables and, since those variables are not defined, the code fails before PySpark can determine which columns to aggregate.
More info: pyspark.sql.DataFrame.agg - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
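The result that the correct code block should produce for the sample data can be worked out with a plain-Python sketch. It is a model, not Spark code, and it assumes Spark's behavior that max and min skip null values and return null for a group that contains only nulls.

```python
# (productId, value) pairs from the sample of transactionsDf; None models null.
rows = [(1, 4), (2, 7), (3, None), (2, None), (2, None), (2, 2)]

# Collect the non-null values per productId, like groupby('productId').
groups = {}
for product_id, value in rows:
    groups.setdefault(product_id, [])
    if value is not None:
        groups[product_id].append(value)

# max/min per group; a group with only nulls yields None (assumed Spark behavior).
result = {
    product_id: {
        "highest": max(values) if values else None,
        "lowest": min(values) if values else None,
    }
    for product_id, values in groups.items()
}
print(result)
```

Product 2 appears in four rows, but only two of them carry a value, so its highest and lowest values come from those two rows alone.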
NEW QUESTION 93
Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?
- A. transactionsDf.groupBy("storeId").avg(col("value"))
- B. transactionsDf.groupBy("storeId").agg(avg("value"))
- C. transactionsDf.groupBy(col(storeId).avg())
- D. transactionsDf.groupBy("value").average()
- E. transactionsDf.groupBy("storeId").agg(average("value"))
Answer: B
Explanation:
This question tests your knowledge about how to use the groupBy and agg pattern in Spark. Using the documentation, you can find out that there is no average() method in pyspark.sql.functions.
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 94
Which of the following describes characteristics of the Spark UI?
- A. Via the Spark UI, stage execution speed can be modified.
- B. There is a place in the Spark UI that shows the property spark.executor.memory.
- C. Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.
- D. The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.
- E. Via the Spark UI, workloads can be manually distributed across executors.
Answer: B
Explanation:
There is a place in the Spark UI that shows the property spark.executor.memory.
Correct, you can see Spark properties such as spark.executor.memory in the Environment tab.
Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.
Wrong - Jobs, Stages, Storage, Executors, and SQL are all tabs in the Spark UI. DAGs can be inspected in the "Jobs" tab in the job details or in the Stages or SQL tab, but are not a separate tab.
Via the Spark UI, workloads can be manually distributed across executors.
No, the Spark UI is meant for inspecting the inner workings of Spark which ultimately helps understand, debug, and optimize Spark transactions.
Via the Spark UI, stage execution speed can be modified.
No, see above.
The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.
No, there is no Scheduler tab.
NEW QUESTION 95
Which of the following statements about garbage collection in Spark is incorrect?
- A. Manually persisting RDDs in Spark prevents them from being garbage collected.
- B. Serialized caching is a strategy to increase the performance of garbage collection.
- C. In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
- D. Garbage collection information can be accessed in the Spark UI's stage detail view.
- E. Optimizing garbage collection performance in Spark may limit caching ability.
Answer: A
Explanation:
Manually persisting RDDs in Spark prevents them from being garbage collected.
This statement is incorrect, and thus the correct answer to the question. Spark's garbage collector will remove even persisted objects, albeit in an "LRU" fashion. LRU stands for least recently used.
So, during a garbage collection run, the objects that were used the longest time ago will be garbage collected first.
See the linked StackOverflow post below for more information.
Serialized caching is a strategy to increase the performance of garbage collection.
This statement is correct. The more Java objects Spark needs to collect during garbage collection, the longer it takes. Storing a collection of many Java objects, such as a DataFrame with a complex schema, through serialization as a single byte array thus increases performance. This means that garbage collection takes less time on a serialized DataFrame than an unserialized DataFrame.
Optimizing garbage collection performance in Spark may limit caching ability.
This statement is correct. A full garbage collection run slows down a Spark application. When talking about "tuning" garbage collection, we mean reducing the number or duration of these slowdowns.
A full garbage collection run is triggered when the Old generation of the Java heap space is almost full. (If you are unfamiliar with this concept, check out the link to the Garbage Collection Tuning docs below.) Thus, one measure to avoid triggering a garbage collection run is to prevent the Old generation share of the heap space from becoming almost full.
To achieve this, one may decrease its size. Objects with sizes greater than the Old generation space will then be discarded instead of being cached (stored) in the space and contributing to it being "almost full".
This will decrease the number of full garbage collection runs, increasing overall performance.
Inevitably, however, objects will need to be recomputed when they are needed. So, this mechanism only works when a Spark application needs to reuse cached data as little as possible.
Garbage collection information can be accessed in the Spark UI's stage detail view.
This statement is correct. The task table in the Spark UI's stage detail view has a "GC Time" column, indicating the garbage collection time needed per task.
In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
This statement is correct. The G1 garbage collector, also known as garbage first garbage collector, is an alternative to the default Parallel garbage collector.
While the default Parallel garbage collector divides the heap into a few static regions, the G1 garbage collector divides the heap into many small regions that are created dynamically. The G1 garbage collector has certain advantages over the Parallel garbage collector which improve performance particularly for Spark workloads that require high throughput and low latency.
The G1 garbage collector is not enabled by default, and you need to explicitly pass an argument to Spark to enable it. For more information about the two garbage collectors, check out the Databricks article linked below.
NEW QUESTION 96
The code block displayed below contains at least one error. The code block should return a DataFrame with only one column, result. That column should include all values in column value from DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s).
Code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    return x**5

spark.udf.register(pow_5, 'power_5_udf', T.LongType())
spark.sql('SELECT power_5_udf(value) FROM transactions')
- A. The pow_5 method is unable to handle empty values in column value, the UDF function is not registered properly with the Spark driver, and the name of the column in the returned DataFrame is not result.
- B. The pow_5 method is unable to handle empty values in column value and the name of the column in the returned DataFrame is not result.
- C. The returned DataFrame includes multiple columns instead of just one column.
- D. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and the SparkSession cannot access the transactionsDf DataFrame.
- E. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and Spark driver does not call the UDF function appropriately.
Answer: A
Explanation:
Correct code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    if x:
        return x**5
    return x

spark.udf.register('power_5_udf', pow_5, T.LongType())
spark.sql('SELECT power_5_udf(value) AS result FROM transactions')
Here it is important to understand how the pow_5 method handles empty values. In the wrong code block above, the pow_5 method is unable to handle empty values and will throw an error, since Python's ** operator cannot deal with any null value Spark passes into method pow_5.
The order of arguments for registering the UDF function with Spark via spark.udf.register matters. In the code snippet in the question, the arguments for the SQL method name and the actual Python function are switched. You can read more about the arguments of spark.udf.register and see some examples of its usage in the documentation (link below).
Finally, you should recognize that the original code block is missing an expression to rename the column created by the UDF. The renaming is done with SQL's AS result clause. If you omit that clause, the column is named power_5_udf(value) and not result.
More info: pyspark.sql.functions.udf - PySpark 3.1.1 documentation
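The null-handling point can be checked in plain Python without a Spark session. The sketch below assumes that Spark passes a SQL NULL into a Python UDF as None; the names pow_5_unsafe and pow_5_safe are made up for this illustration. The safe variant uses an explicit `is None` check, which is a slightly stricter alternative to the `if x:` guard in the corrected code block.

```python
def pow_5_unsafe(x):
    # Mirrors the function from the question: no guard against None.
    return x ** 5


def pow_5_safe(x):
    # Explicit None check, so the result does not depend on 0 being falsy.
    if x is None:
        return None
    return x ** 5


# Assumption: a SQL NULL reaches the Python function as None.
try:
    pow_5_unsafe(None)
    unsafe_raised = False
except TypeError:
    # Python's ** operator is not defined for NoneType.
    unsafe_raised = True

results = [pow_5_safe(v) for v in (2, 0, None)]
print(unsafe_raised, results)
```

Either guard could be registered with spark.udf.register('power_5_udf', ..., T.LongType()) in the same way as the corrected pow_5 above.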
NEW QUESTION 97
Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?
- A. transactionsDf.sort(col("value").asc()).print(10)
- B. transactionsDf.sort(col("value")).show(10)
- C. transactionsDf.sort(col("value").desc()).head()
- D. transactionsDf.sort(asc(value)).show(10)
- E. transactionsDf.orderBy("value").asc().show(10)
Answer: B
Explanation:
show() is the correct method to look for here, since the question specifically asks for displaying the rows in a nicely formatted way. Here is the output of show (only a few rows shown):
+-------------+---------+-----+-------+---------+----+---------------+
|transactionId|predError|value|storeId|productId|   f|transactionDate|
+-------------+---------+-----+-------+---------+----+---------------+
|            3|        3|    1|     25|        3|null|     1585824821|
|            5|     null|    2|   null|        2|null|     1575285427|
|            4|     null|    3|      3|        2|null|     1583244275|
+-------------+---------+-----+-------+---------+----+---------------+
With regard to sorting, the order must be ascending, since the smallest values should be shown first. The following expressions are all valid:
- transactionsDf.sort(col("value")) ("ascending" is the default sort direction in the sort method)
- transactionsDf.sort(asc(col("value")))
- transactionsDf.sort(asc("value"))
- transactionsDf.sort(transactionsDf.value.asc())
- transactionsDf.sort(transactionsDf.value)
Also, orderBy is just an alias of sort, so all of these expressions work equally well using orderBy.
NEW QUESTION 98
Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?
- A. transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
- B. transactionsDf.createOrReplaceTempView('transactionsDf')
     itemsDf.createOrReplaceTempView('itemsDf')

     spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
- C. transactionsDf \
         .drop(col('value'), col('storeId')) \
         .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
- D. transactionsDf.createOrReplaceTempView('transactionsDf')
     itemsDf.createOrReplaceTempView('itemsDf')

     statement = """
     SELECT * FROM transactionsDf
     INNER JOIN itemsDf
     ON transactionsDf.productId==itemsDf.itemId
     """
     spark.sql(statement).drop("value", "storeId", "attributes")
- E. transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")
Answer: D
Explanation:
This question offers you a wide variety of answers for a seemingly simple question. However, this variety reflects the variety of ways that one can express a join in PySpark. You need to understand some SQL syntax to get to the correct answer here.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
Correct - this answer uses SQL correctly to perform the inner join and afterwards drops the unwanted columns. This is totally fine. If you are unfamiliar with the triple-quote """ in Python: This allows you to express strings as multiple lines.
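As a quick illustration of the triple-quote remark, the following plain-Python sketch compares the multi-line statement with the same string written using explicit newline escapes. The variable name same_statement is introduced only for this comparison.

```python
# A triple-quoted string may span several lines; each line break becomes "\n".
statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""

# The same text built from single-line string literals with explicit "\n".
same_statement = (
    "\n"
    "SELECT * FROM transactionsDf\n"
    "INNER JOIN itemsDf\n"
    "ON transactionsDf.productId==itemsDf.itemId\n"
)

print(statement == same_statement)
```

Both forms produce an identical string, so the choice between them is purely about readability.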
transactionsDf \
    .drop(col('value'), col('storeId')) \
    .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
No, this answer option is a trap, since DataFrame.drop() does not accept several Column objects at once. In the PySpark 3.1 API that this question targets, it takes either column names as strings or a single Column. You could use transactionsDf.drop('value', 'storeId') instead.
transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")
Incorrect - Spark does not evaluate the string "transactionsDf.productId==itemsDf.itemId" as a valid join expression, because a string passed as the join argument is interpreted as a column name. The condition would work if it were passed as a Column expression instead of a string.
transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
Wrong, this statement incorrectly uses itemsDf.select instead of itemsDf.drop.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
No, here the SQL expression syntax is incorrect. Simply specifying -columnName does not drop a column.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
NEW QUESTION 99
Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?
- A. transactionsDf.repartition(transactionsDf._partitions+2)
- B. transactionsDf.coalesce(10)
- C. transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
- D. transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
- E. transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
Answer: E
Explanation:
transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
Correct. The repartition operator is the correct one for increasing the number of partitions. Calling getNumPartitions() on DataFrame.rdd returns the current number of partitions.
transactionsDf.coalesce(10)
No, after this command transactionsDf will continue to have only 8 partitions. This is because coalesce() can only decrease the number of partitions, but not increase it.
transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
Incorrect, there is no getNumPartitions() method for the DataFrame class.
transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
Wrong, coalesce() can only be used for reducing the number of partitions and there is no getNumPartitions() method for the DataFrame class.
transactionsDf.repartition(transactionsDf._partitions+2)
No, DataFrame has no _partitions attribute. You can find out the current number of partitions of a DataFrame with the DataFrame.rdd.getNumPartitions() method.
More info: pyspark.sql.DataFrame.repartition - PySpark 3.1.2 documentation, pyspark.RDD.getNumPartitions - PySpark 3.1.2 documentation
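The partition-count rule from the explanations above can be summarized in a toy model. This is plain Python and not Spark code: the helper names coalesce_count and repartition_count are hypothetical and only mimic how the resulting number of partitions is determined.

```python
def coalesce_count(current: int, requested: int) -> int:
    # Toy model: coalesce() can only reduce the partition count,
    # so asking for more partitions leaves the count unchanged.
    return min(current, requested)


def repartition_count(current: int, requested: int) -> int:
    # Toy model: repartition() shuffles the data into exactly the
    # requested number of partitions, whether that is more or fewer.
    return requested


current = 8  # transactionsDf has 8 partitions in the question
after_coalesce = coalesce_count(current, 10)
after_repartition = repartition_count(current, current + 2)
print(after_coalesce, after_repartition)
```

With a real DataFrame, the equivalent check would be to call .rdd.getNumPartitions() on the result of transactionsDf.coalesce(10) and of transactionsDf.repartition(10).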
NEW QUESTION 100
Which of the following statements about Spark's execution hierarchy is correct?
- A. In Spark's execution hierarchy, a stage comprises multiple jobs.
- B. In Spark's execution hierarchy, executors are the smallest unit.
- C. In Spark's execution hierarchy, tasks are one layer above slots.
- D. In Spark's execution hierarchy, a job may reach over multiple stage boundaries.
- E. In Spark's execution hierarchy, manifests are one layer above jobs.
Answer: D
Explanation:
In Spark's execution hierarchy, a job may reach over multiple stage boundaries.
Correct. A job is a sequence of stages, and thus may reach over multiple stage boundaries.
In Spark's execution hierarchy, tasks are one layer above slots.
Incorrect. Slots are not a part of the execution hierarchy. Tasks are the lowest layer.
In Spark's execution hierarchy, a stage comprises multiple jobs.
No. It is the other way around - a job consists of one or multiple stages.
In Spark's execution hierarchy, executors are the smallest unit.
False. Executors are not a part of the execution hierarchy. Tasks are the smallest unit!
In Spark's execution hierarchy, manifests are one layer above jobs.
Wrong. Manifests are not a part of the Spark ecosystem.
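The hierarchy described in these explanations (a job consists of one or more stages, and a stage consists of tasks) can be pictured with a small data model. The classes Job, Stage, and Task below are illustrative stand-ins, not Spark classes, and the partition counts are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    # Smallest unit in this model: one task works on one partition.
    partition_id: int


@dataclass
class Stage:
    tasks: List[Task]


@dataclass
class Job:
    # A job is a sequence of stages.
    stages: List[Stage]


# Hypothetical job: one stage over 8 partitions, then one over 10 partitions.
job = Job(stages=[
    Stage(tasks=[Task(p) for p in range(8)]),
    Stage(tasks=[Task(p) for p in range(10)]),
])

stage_boundaries = len(job.stages) - 1
total_tasks = sum(len(stage.tasks) for stage in job.stages)
print(stage_boundaries, total_tasks)
```

Slots and executors do not appear in the model on purpose, since the explanations above place them outside the execution hierarchy.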
NEW QUESTION 101
The code block displayed below contains an error. The code block should return a DataFrame where all entries in column supplier contain the letter combination et in this order. Find the error.
Code block:
itemsDf.filter(Column('supplier').isin('et'))
- A. The expression inside the filter parenthesis is malformed and should be replaced by isin('et', 'supplier').
- B. Instead of isin, it should be checked whether column supplier contains the letters et, so isin should be replaced with contains. In addition, the column should be accessed using col['supplier'].
- C. The Column operator should be replaced by the col operator and instead of isin, contains should be used.
- D. The expression only returns a single column and filter should be replaced by select.
Answer: C
Explanation:
Correct code block:
itemsDf.filter(col('supplier').contains('et'))
A mixup can easily happen here between isin and contains. Since we want to check whether a column "contains" the value et, contains is the operator we should use here. Note that both are methods of Spark's Column object. See below for documentation links.
A specific Column object can be accessed through the col() method and not the Column() method or through col[], which is an essential thing to know here. In PySpark, Column references a generic column object. To use it for queries, you need to link the generic column object to a specific DataFrame. This can be achieved, for example, through the col() method.
More info:
- isin documentation: pyspark.sql.Column.isin - PySpark 3.1.1 documentation
- contains documentation: pyspark.sql.Column.contains - PySpark 3.1.1 documentation
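The difference between isin and contains can be mirrored with a plain-Python analogy. This is not PySpark code, and the supplier names are made up for the illustration: isin corresponds to an exact membership test against a list of values, while contains corresponds to a substring test.

```python
# Made-up supplier values for the analogy.
suppliers = ["Sports Company Inc.", "YetiX", "et"]

# Analogue of isin('et'): keep values that exactly equal one of the listed items.
isin_like = [s for s in suppliers if s in ("et",)]

# Analogue of contains('et'): keep values that include 'et' as a substring.
contains_like = [s for s in suppliers if "et" in s]

print(isin_like, contains_like)
```

Only the substring test keeps a value such as YetiX, which is why the question's requirement calls for contains rather than isin.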
