Often companies interested in TimeXtender (https://www.timextender.com/) have substantial investments in Data Lakes in which they have developed complex transformations using analytics engines such as Spark. Also there have been various requests regarding processing data with Databricks (https://databricks.com/) and using Delta Lake (https://delta.io/) in a TimeXtender environment. This requests often arise when data is to be analyzed directly from the Data Lake rather than from a relational database.

TimeXtender relies on Data Lake on the Azure Cloud as a repository to store raw data from source systems, before transforming and loading it into a data warehouse.

The data lake can be accessed directly with solutions like Databricks or Azure Synapse Analytics. The only challenge is that TimeXtender stores the processed data in versioned folders, so that when reading the main directory duplicates would occur, as seen in in the following example using the entity person_address.

With PySpark (PySpark Documentation — PySpark 3.2.0 documentation (apache.org)), however, this can be solved very easily by determining the last version for each entity. The * in the path serves as a placeholder to read all subdirectories with the name DATA into a Spark DataFrame. The input_file_name() function returns the full path and the split() function extracts the versioned directory name so it can be added to the DataFrame.

from pyspark.sql import functions as sql_func

# Read entity and add path and version

df_person_adr = (
spark.read.parquet(«/mnt/TimeXtenderLake/AdventureW/person_address/*/DATA/»)
.withColumn(«path», sql_func.input_file_name())
.withColumn(«version», sql_func.split(«path»,»/»)[5])
)

# Get last version

max_version = df_person_adr.agg({«version»: «max»}).collect()[0][0]

# Filter DataFrame by last version and store in published folder

df_person_adr_last = df_person_adr.where((sql_func.col(«version») == max_version)).drop(«path», «version»)
display(df_person_adr_last.sort(«AddressID»))
df_person_adr_last.write.format(«delta»).mode(«overwrite»).parquet(«/mnt/TimeXtenderLake/AdventureW_published/person_address/»)

Using this simple solution, the data can be read from the TimeXtender repository and made available in Data Lake or Delta Lake for further processing with any data science tool.

A similar solution using the Azure Synapse Serverless Pool was published in the following video by TimeXtender:

Building the Bridge from TimeXtender to Spark

# Read entity and add path and version

# Get last version

Kategorien

Kontakt

Adresse