August 12, 2022

Testing Databricks Photon

I was a bit skeptical about Photon since I realized that it cost about double the amount of DBUs, required specific memory-optimized machines, and did not support UDFs (which were my main target; more on that below).

From the Databricks Official Docs:

Limitations

  • Does not support Spark Structured Streaming.
  • Does not support UDFs.
  • Does not support RDD APIs.
  • Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of data.

Photon runtime
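Since UDFs were the reason I cared in the first place, here is a minimal sketch of the kind of query that is affected. The data and names are hypothetical, just to illustrate the shape: a plan containing an opaque JVM UDF like this one runs on the regular Spark engine instead of Photon.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical data, just to illustrate the shape of the problem.
val events = Seq((1, 0.9), (2, 0.3)).toDF("user_id", "score")

// An opaque JVM closure: Photon cannot vectorize it, so the part of
// the plan that evaluates it falls back to the classic engine.
val bucketize = udf((score: Double) => if (score >= 0.5) "high" else "low")

events.withColumn("bucket", bucketize($"score")).show()
```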

But I needed to build an aggregate of user behavior in my work’s app, dealing with hundreds of millions of rows, so I decided to give it a try.

I ran the calculation for the last two months. The machine type was memory optimized, since that was needed to be able to run Photon on top of it.

  • 4 workers: 256 GB memory, 32 cores
  • 1 driver: 64 GB memory, 8 cores
  • Runtime: 10.4.x-scala2.12

Month  Run Time
July   11m 50s
June   12m 12s

With a cost of 20 DBU/h ($0.15/DBU being the price for premium jobs workloads).
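For reference, that puts each of those monthly runs at roughly (12 / 60) h × 20 DBU/h × $0.15/DBU ≈ $0.60 in DBU charges, on top of the underlying VM cost.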

I launched it over three months of data. The process is extremely simple: read some partitions, filter on some value, group by, and count (there’s a sketch of the job after the metrics below). As for volume, one month of data has this many rows:

  • number of output batches: 2,774,302
  • cumulative time total (min, med, max): 41.3 m (685 ms, 962 ms, 7.0 s)
  • rows output: 11,200,056,905

And it is aggregated as:

  • num batches aggregated in sparse mode: 898,423
  • rows output: 1,243,774,432
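To make the job concrete, here is a minimal sketch of that read–filter–group-by–count pipeline; the table and column names are hypothetical stand-ins for the real ones.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Hypothetical table/columns standing in for the app's event data.
spark.read.table("app.user_events")
  .where(col("event_date").between("2022-06-01", "2022-06-30")) // read some partitions
  .where(col("event_type") === "click")                         // filter on some value
  .groupBy(col("user_id"))                                      // group by
  .count()                                                      // and count
  .write
  .mode("overwrite")
  .saveAsTable("app.user_behavior_monthly")
```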

I didn’t feel happy about this: twice the cost, and it forced me to use a kind of machine that I didn’t find necessary. Maybe it wasn’t enough data, so I ran a multi-month calculation with Photon and without it.

Period              Run Time  Photon
December - October  23m 53s   Yes
September - June    ~1h 26m   No

Damn, that was unexpected. The computation time didn’t increase that much for Photon but went wild on the non-Photon workload. We are talking about the 3x speedup they state in their slogan. In money terms: (24 / 60) h × 20 DBU/h × $0.15/DBU = $1.20 with Photon vs. (86 / 60) h × 10 DBU/h × $0.15/DBU = $2.15 without.
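If it helps, here is that cost arithmetic as a tiny helper, using the same inputs (minutes of runtime, the cluster’s DBU rate, and the premium jobs price per DBU):

```scala
// Cost = runtime in hours × cluster DBU rate × price per DBU.
def jobCostUsd(runtimeMinutes: Double, clusterDbuPerHour: Double, usdPerDbu: Double): Double =
  (runtimeMinutes / 60.0) * clusterDbuPerHour * usdPerDbu

val withPhoton    = jobCostUsd(24, 20, 0.15) // ≈ $1.20
val withoutPhoton = jobCostUsd(86, 10, 0.15) // ≈ $2.15
```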

Well, as far as I can tell, Photon works smoothly but only pays off under a heavy workload. I’d mainly use it for specific KPIs or for repopulating tables after changes, but for those tasks it will be the de facto choice.

Maybe others have a much better experience with it!
