#Spark | #Data Engineer | #Delta | #SQL

March 20, 2023

Duplicates with Delta, how can it be?

It's been a long time since I last wrote! The highlights: I left my job at Schwarz IT in December last year, and now I’m a full-time employee at Wallbox! I’m really happy with my new job, and I’ve run into some interesting stuff. This was one of those strange cases where you start doubting the compiler. Context: One of my main tables stores sensor measurements from our chargers with millisecond precision. The volume is quite high: we are talking about over 2 billion rows per day. Read more

#Spark | #Databricks | #Data Engineer

August 15, 2022

Optimizing Spark

Lately I've been focusing on improving my Spark skills, and I've taken the opportunity to do some Databricks trainings. (Databricks, by the way, has launched Beacons, a recognition program for its contributors, and has mentioned some very big names there.) One of these courses is Optimizing Spark, which simplifies and explains in a fairly straightforward way the performance problems that occur in the big data world. These problems are known as the 5 Ss: Read more

#Spark | #Databricks | #Data Engineer

August 15, 2022

Optimizing Spark II

Continuing with the list of Spark optimizations, we have spill. Spilling is nothing more than persisting an RDD to disk because its data does not fit in memory. There are several causes; the easiest one to picture is exploding an array, where the number of rows multiplies. When spill occurs, it can be identified by two values that always go hand in hand: Spill (Memory) and Spill (Disk). (These columns only appear in the Spark UI if there is spill.) Read more
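To illustrate why exploding an array can inflate a dataset enough to cause spill, here is a minimal plain-Python sketch. It is not Spark code and not how Spark implements `explode`; the `explode` helper, the `charger_id` and `readings` names, and the sample rows are all hypothetical. It only shows that each input row turns into one output row per array element.

```python
from typing import Any


def explode(rows: list[dict[str, Any]], column: str) -> list[dict[str, Any]]:
    """Return one output row per element of the array stored in `column`."""
    return [{**row, column: item} for row in rows for item in row[column]]


# Hypothetical sample data: two rows, each holding an array of readings.
rows = [
    {"charger_id": 1, "readings": [10, 11, 12]},
    {"charger_id": 2, "readings": [20, 21]},
]

exploded = explode(rows, "readings")

# Every other column is copied onto each new row, so the data volume
# grows with the total number of array elements.
print(len(rows), "->", len(exploded))  # 2 -> 5
```

With large arrays the copied columns dominate the size of the result, which is one way a partition can stop fitting in memory.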

#Spark | #Databricks | #Photon | #Data Engineer

August 12, 2022

Testing Databricks Photon

I was a bit skeptical about Photon since I realized that it costs about twice as many DBUs, requires specifically optimized machines, and does not support UDFs (my main target). From the official Databricks docs:

Limitations

- Does not support Spark Structured Streaming.
- Does not support UDFs.
- Does not support RDD APIs.
- Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of data.

Photon runtime Read more

#Spark | #Databricks | #Data Engineer

July 30, 2022

Databricks Cluster Management

For the last few months, I’ve been into ETL optimization. The changes ranged from as dramatic as moving tables from ORC to Delta and revamping the partition strategy to as simple as upgrading the runtime version to 10.4 so the ETL starts using low-shuffle merge. But at my job, we have a lot of jobs. Each ETL can easily be launched 30 times with different parameters, so I wanted to dig into the most effective strategy for it. Read more

#Spark | #Certification | #Data Engineer

July 21, 2022

Associate Spark Developer Certification

Yesterday I took (and passed with more than 90%, yay!) the Associate Spark Developer Certification. Before I forget, I want to share my experience. In general: First of all, I needed to install Windows, as there was no Linux support for the proctoring software used during the exam. Secondly, you need to disable both the antivirus and the firewall before joining. I didn’t disable the antivirus, and the technician contacted me because there was a problem with the webcam, even though I was able to see myself. Read more

#Spark | #Certification | #Data Engineer

June 29, 2022

Spark Dates

I can safely describe this as the scariest part of the exam. I’m used to working with dates, but I’m especially used to suffering from the typical UTC / non-UTC / summer time offsets. I will try to put together some simple exercises for this. The idea would be: we have some sales data, and God knows how much business people love to refresh their dashboards super fast on Databricks SQL. Read more
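As a small illustration of the UTC versus summer-time problem mentioned above, here is a minimal sketch in plain Python (not Spark). The fixed UTC+2 offset, the `to_local` helper, and the sale timestamp are assumptions made up for the example; a real pipeline would use a named time zone so that the offset changes with the season.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offset standing in for a UTC+2 summer-time zone.
SUMMER_OFFSET = timezone(timedelta(hours=2), name="UTC+02:00")


def to_local(ts_utc: datetime, tz: timezone = SUMMER_OFFSET) -> datetime:
    """Convert a timezone-aware UTC timestamp to the given fixed-offset zone."""
    return ts_utc.astimezone(tz)


# A made-up sale recorded late in the evening, UTC.
sale_utc = datetime(2022, 6, 28, 23, 30, tzinfo=timezone.utc)
sale_local = to_local(sale_utc)

# Same instant, but a different calendar day depending on the zone,
# so a "sales per day" dashboard can disagree with the raw UTC data.
print(sale_utc.date())    # 2022-06-28
print(sale_local.date())  # 2022-06-29
```

Grouping by the UTC date and grouping by the local date would put this sale in different daily buckets, which is the kind of mismatch that tends to show up in dashboards.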

#Spark | #Certification | #Data Engineer

June 28, 2022

Spark Cert Exam Practice

---
primary_color: orange
secondary_color: lightgray
text_color: black
shuffle_questions: false
---

## Which of the following statements about the Spark driver is incorrect?

- [ ] The Spark driver is the node in which the Spark application's main method runs to coordinate the Spark application.
- [X] The Spark driver is horizontally scaled to increase overall processing throughput.
- [ ] The Spark driver contains the SparkContext object.
- [ ] The Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode.

Read more

2017-2022 Adrián Abreu powered by Hugo and Kiss Theme