July 21, 2022

Associate Spark Developer Certification

Yesterday I took (and passed with more than 90% yay!) the Associate Spark Developer Certificaton. And before I forget I want to share my experience: In general: First of all, I needed to install Windows as there was no Linux support for the control software used during the exam. Secondly, you need to disable both the antivirus and the firewall before joining. I didn’t disable the antivirus and the technician contacted me as there was a problem with the webcam despite I was able to see myself. Read more

July 1, 2022

Reading firebase data

Firebase is a common component nowadays for most mobile apps. And it can provide some useful insights, for example in my previous company we use it to detect where the people left at the initial app wizard. (We could measure it). It is quite simple to export your data to BigQuery: https://firebase.google.com/docs/projects/bigquery-export But maybe your lake is in AWS or Azure. In the next lines, I will try to explain how to load the data in your lake and some improvements we have applied. Read more

June 30, 2022

Qbeast

A few days ago I ran into Qbeast which is an open-source project on top of delta lake I needed to dig into. This introductory post explains it quite well: https://qbeast.io/qbeast-format-enhanced-data-lakehouse/ The project is quite good and it seems helpful if you need to write your custom data source as everything is documented. And well as I’m in love with note-taking I want to dig into the following three topics: Read more

June 29, 2022

Spark Dates

I can perfectly describe this as the scariest part of the exam. I’m used to working with dates but I’m especially used to suffering from the typical UTC / not UTC / summer time hours difference. I will try to make some simple exercises for this, the idea would be: We have some sales data and god knows how the business people love to refresh super fast their dashboards on Databricks SQL. Read more

June 28, 2022

Spark Cert Exam Practice

--- primary_color: orange secondary_color: lightgray text_color: black shuffle_questions: false --- ## Which of the following statements about the Spark driver is incorrect? - [ ] The Spark driver is the node in which the Spark application's main method runs to ordinate the Spark application. - [X] The Spark driver is horizontally scaled to increase overall processing throughput. - [ ] The Spark driver contains the SparkContext object. - [ ] The Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode. Read more

June 19, 2022

Spark User Defined Functions

Sometimes we need to execute arbitrary Scala code on Spark. We may need to use an external library or so on. For that, we have the UDF, which accepts and return one or more columns. When we have a function we need to register it on Spark so we can use it on our worker machines. If you are using Scala or Java, the udf can run inside the Java Virtual Machine so there’s a little extra penalty. Read more

June 11, 2022

Spark DataSources

As estated in the structured api section, Spark supports a lot of sources with a lot of options. There is no other goal for this post than to clarify how the most common ones work and how they will be converted to DataFrames. First, all the supported sources are listed here: https://spark.apache.org/docs/latest/sql-data-sources.html And we can focus on the typical ones: JSON, CSV and Parquet (as those are the typical format on open-source data). Read more

June 10, 2022

Spark Dataframes

Spark was initially released for dealing with a particular type of data called RDD. Nowadays we work with abstract structures on top of it, and the following tables summarize them. Type Description Advantages Datasets Structured composed of a list of where you can specify your custom class (only Scala) Type-safe operations, support for operations that cannot be expressed otherwise. Dataframes Datasets of type Row (a generic spark type) Allow optimizations and are more flexible SQL tables and views Same as Dataframes but in the scope of databases instead of programming languages Let’s dig into the Dataframes. Read more

2017-2022 Adrián Abreu powered by Hugo and Kiss Theme