January 26, 2024

Querying the Databricks API

Exploring Databricks SQL usage. At my company, we adopted Databricks SQL for most of our users. Some users have developed applications that use the JDBC connector, some have built dashboards, and some write plain ad-hoc queries. We wanted to know what they queried, so we tried Unity Catalog’s insights, but they weren’t enough for our case: we work with IoT, and we are interested in which filters they apply to our tables. Read more
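One way to dig deeper (a minimal sketch, not necessarily the post’s actual solution) is to pull the raw query text from the Databricks SQL Query History API and inspect it yourself. The workspace URL and token below are placeholders:

```python
import requests

HOST = "https://my-workspace.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapi..."  # placeholder personal access token

def fetch_query_texts(max_results=100):
    """Page through the SQL Query History API, yielding each query's text."""
    headers = {"Authorization": f"Bearer {TOKEN}"}
    params = {"max_results": max_results}
    while True:
        resp = requests.get(
            f"{HOST}/api/2.0/sql/history/queries", headers=headers, params=params
        )
        resp.raise_for_status()
        page = resp.json()
        for query in page.get("res", []):
            yield query.get("query_text", "")
        if not page.get("has_next_page"):
            break
        params = {"page_token": page["next_page_token"]}

# Crude starting point: keep only statements that filter our tables.
for text in fetch_query_texts():
    if "WHERE" in text.upper():
        print(text)
```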

October 27, 2023

Tweaking Spark Kafka

Well, I’m facing a hugely interesting case. I’m working at Wallbox, where we need to deal with billions of rows every day. Now we need to use Spark for some Kafka filtering and publish the results into different topics according to some rules. I won’t dig deep into the logic except for the performance-related parts; let’s try to increase the processing speed. When reading from Kafka you usually get one task per partition, so if you have 6 partitions and 48 cores, you are not using 87.5% of your cluster. Read more
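One lever here, as a small sketch (broker and topic names are made up): the Kafka source accepts a minPartitions option that asks Spark to split Kafka partitions into smaller slices, so more than six cores get work.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-filtering").getOrCreate()

# With 6 topic partitions you would normally get 6 tasks; minPartitions
# hints Spark to split them into smaller offset ranges so all 48 cores
# get work. It is a hint, not a hard guarantee.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .option("minPartitions", 48)
    .load()
)
```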

October 2, 2023

Repairing Unity Catalog metadata

I’ve been subscribed to https://www.dataengineeringweekly.com/p/data-engineering-weekly-148 for years. The latest issue included several on-call posts from Medium, which I found quite useful. Today, I got an alert from Metaplane that a cost-monitoring dashboard was out of date. I checked the processes, and everything was fine. I ran a query to check the freshness of the data, and it was OK too. Metaplane checks our Delta table freshness by querying the table information available in Unity Catalog. Read more
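To illustrate the mismatch (my own sketch, not the post’s fix), run from a Databricks notebook where spark is predefined: compare what the Delta log reports with what Unity Catalog exposes to tools like Metaplane. The table name is hypothetical, and I’m assuming information_schema’s last_altered is the relevant column:

```python
from delta.tables import DeltaTable

table = "main.finance.cost_monitor"  # hypothetical table

# What the Delta transaction log itself says was the last commit...
actual = DeltaTable.forName(spark, table).history(1).first()["timestamp"]

# ...versus what Unity Catalog reports to external tools.
reported = spark.sql(
    "SELECT last_altered FROM system.information_schema.tables "
    "WHERE table_catalog = 'main' AND table_schema = 'finance' "
    "AND table_name = 'cost_monitor'"
).first()["last_altered"]

print(f"Delta log: {actual} / Unity Catalog: {reported}")
```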

July 28, 2023

Adding extra params to DatabricksRunNowOperator

With the new Databricks Jobs API 2.1 you have different parameters depending on the kind of tasks in your workflow, like jar_params, sql_params, python_params, notebook_params… And the Airflow operator is not always ready to handle all of them. If we check the current release of the DatabricksRunNowOperator, we can see that there is only support for notebook_params, python_params, python_named_params, jar_params, and spark_submit_params, but not the sql_params mentioned earlier. There is a way of combining both, though: a param called json lets you write the payload of a run-now call yourself, and the operator will merge the content of that JSON with your named params! Read more
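A minimal sketch of the trick (job ID and parameter values are made up): put sql_params inside json, keep using the dedicated arguments for what the operator supports natively, and the operator merges both into one run-now payload.

```python
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)

# sql_params has no dedicated operator argument, so it rides along in the
# `json` payload; notebook_params is merged into that same payload by the
# operator. Job ID and values are hypothetical.
run_job = DatabricksRunNowOperator(
    task_id="run_sql_job",
    databricks_conn_id="databricks_default",
    json={
        "job_id": 1234,
        "sql_params": {"start_date": "{{ ds }}"},
    },
    notebook_params={"env": "prod"},
)
```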

May 23, 2023

Enabling Unity Catalog

I’ve spent the last few weeks setting up Unity Catalog for my company. It’s been an extremely tiring process, and there are several concepts to cover here. My main point is to have a clear view of the requirements. Disclaimer: as of today, with https://github.com/databricks/terraform-provider-databricks release 1.17.0, some steps must be done in an “awkward” way; that is, the account API does not expose the catalog endpoints, so those steps have to go through a workspace. Read more

August 15, 2022

Optimizing Spark

Lately I’ve focused on improving my Spark skills and have taken the opportunity to do some Databricks trainings. (Databricks, by the way, has launched Beacons, a recognition program for its contributors, and has mentioned some very big names there.) Among these courses is Optimizing Spark, which simplifies and explains in a fairly accessible way the performance problems that occur in the big data world. These problems are known as the 5 Ss: Read more

August 15, 2022

Optimizing Spark II

Continuing with the list of Spark optimizations, we have spill. Spilling is nothing more than persisting an RDD to disk because its data does not fit in memory. There are several possible causes; the simplest one to picture is exploding an array, where our data grows exponentially. When spill occurs, it can be identified by two values that always go hand in hand: Spill (Memory) and Spill (Disk). (These columns only appear in the Spark UI if there is spill.) Read more
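To make the explode cause concrete, a small PySpark sketch (sizes are arbitrary): each input row fans out into 1,000 rows, so partitions that fit in memory before the explode may spill to disk afterwards.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, lit, sequence

spark = SparkSession.builder.appName("spill-demo").getOrCreate()

# Each of the 1M rows carries a 1,000-element array, so the explode
# multiplies the data 1,000x; partitions that no longer fit in executor
# memory are spilled, and Spill (Memory) / Spill (Disk) show up in the UI.
df = spark.range(1_000_000).withColumn("xs", sequence(lit(0), lit(999)))
exploded = df.select("id", explode("xs").alias("x"))

# The noop sink forces full execution without writing anywhere.
exploded.groupBy("x").count().write.format("noop").mode("overwrite").save()
```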

August 12, 2022

Testing Databricks Photon

I was a bit skeptical about Photon, since I realized that it cost about double the amount of DBUs, required specifically optimized machines, and did not support UDFs (which were my main target). From the Databricks official docs, the limitations: does not support Spark Structured Streaming; does not support UDFs; does not support RDD APIs; not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of data. Read more
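For reference, a rough sketch of how a Photon test cluster can be requested through the Clusters API; the workspace URL, token, runtime version, and node type are placeholders, and runtime_engine is the field that switches between the standard runtime and Photon:

```python
import requests

HOST = "https://my-workspace.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapi..."  # placeholder personal access token

# Minimal cluster spec with Photon toggled on via runtime_engine.
cluster_spec = {
    "cluster_name": "photon-test",
    "spark_version": "10.4.x-scala2.12",  # placeholder runtime version
    "node_type_id": "i3.xlarge",          # pick a Photon-supported type
    "num_workers": 2,
    "runtime_engine": "PHOTON",
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```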
