Hi, I’m Owen.
I finished my second year of study at The University of Sheffield in June and have spent my summer interning at The Data Refinery. After spending a week at The Data Shed (the parent company to The Data Refinery) last year, I was excited to return and have a much longer stint, contributing to The Data Refinery product and learning a lot along the way.
Over the course of my internship with The Data Refinery, I worked on creating a new orchestration framework with the aim of consolidating all the platform's scheduling and monitoring activities centrally. Before I arrived, the team had completed several discovery exercises to evaluate the best orchestration library. A Python-based orchestration framework called Prefect appeared to meet many of the requirements, but further validation was still required.
With the end goal defined, a list of deliverables was agreed:
- Research the tool
- Determine suitability
- Document the benefits and drawbacks
- Produce a proof-of-concept
- Orchestrate existing data services
- Promote changes to the production environment
With a longer-term overview of where the project would be going, and an understanding of our goals, I had plenty to get stuck into.
So, why use a standalone orchestration tool?
The Data Refinery runs a large range of data services across a number of different data platforms (Databricks, Azure, Snowflake, plus others), and historically each data service was monitored and managed entirely within the platform it ran on.
The plan was to move away from this distributed scheduling model to a centralised one, allowing specific data services to be executed on demand and speeding up overall data processing by parallelising as many services as possible.
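The gain from parallelising independent services can be sketched with Python's standard library alone (Prefect's task runners provide the same idea at the orchestration layer; the service names and timings here are invented for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an independent, I/O-bound data service.
def sync_source(name: str) -> str:
    time.sleep(0.2)  # simulate waiting on an external platform
    return f"{name}: synced"

services = ["crm", "web_analytics", "billing"]

# Sequential: total time is the sum of all service runtimes.
start = time.perf_counter()
sequential = [sync_source(s) for s in services]
sequential_time = time.perf_counter() - start

# Parallel: total time approaches the slowest single service.
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(sync_source, services))
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, parallel: {parallel_time:.2f}s")
```

With three services of roughly equal duration, the parallel run finishes in about a third of the sequential time, which is the whole motivation for moving scheduling out of the individual platforms.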
Whilst this change would undoubtedly improve, simplify, and standardise how teams build services, the core goal for The Data Refinery was to ensure customers have access to their data and high-value insights in the fastest possible time.
The data services
Every day, The Data Refinery platform services run on a defined schedule or via specific events or triggers. The services have a broad range of responsibilities including:
- Ensuring that data is synchronised from our customers’ data sources
- Sending information to data destinations
- Completing data cleansing and standardisation operations
- Generating various segmentations
- Performing ML and AI workloads
- Handling notifications and alerts
Meeting customer needs
One of the long-term goals for The Data Refinery is to provide real-time analytics for customers (should they need it). To meet this goal, data services should be scheduled so that the total execution time of all dependent workflows is limited to that of our most complex or longest-running tasks.
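Put in scheduling terms: with enough parallelism, end-to-end latency is bounded by the critical path through the dependency graph, not by the sum of all task durations. A small sketch (the task graph and durations are made up for illustration):

```python
from functools import cache

# Invented durations (minutes) and dependencies for a handful of services.
durations = {"sync": 5, "cleanse": 3, "segment": 2, "ml": 8, "notify": 1}
depends_on = {
    "sync": [],
    "cleanse": ["sync"],
    "segment": ["cleanse"],
    "ml": ["cleanse"],
    "notify": ["segment", "ml"],
}

@cache
def finish_time(task: str) -> int:
    """Earliest finish time for a task, assuming unlimited parallelism."""
    return durations[task] + max(
        (finish_time(dep) for dep in depends_on[task]), default=0
    )

critical_path = max(finish_time(t) for t in durations)  # longest chain
total_work = sum(durations.values())  # what a purely serial run would cost
print(critical_path, total_work)  # 17 vs 19
```

Here `segment` and `ml` both depend only on `cleanse`, so they run side by side and the serial cost of the shorter one disappears from the end-to-end time.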
Getting analytical output into the hands of users in real time ensures that decisions based on that data can be made sooner, without waiting for scheduled or batch-based processes. In some cases, customers don't need anything more than hourly or daily updates to analytical feeds, but for others, the quicker they can react to a change in the data, the better.
How did my intern project shape up?
A big part of joining a new team was understanding exactly how The Data Refinery's platform has been architected. That meant learning how the Refinery works at a high level and gradually working down into the details (often into a specific service). Having an overview of everything when I joined was invaluable. While it was difficult to gain a solid understanding straight away, that high-level view helped me fit the things I picked up along the way into my mental map of The Data Refinery product and its workings.
Research and suitability
As a second-year computer science student, I had never had much experience of evaluating a technical offering, nor had I worked with Platform as a Service (PaaS) frameworks. Researching a new technology and deciding whether to use it was something I had never had to do before, and it was an interesting process to go through.
The Prefect platform has a great mission statement that helps to describe why it is a good fit for the use cases in question: “Orchestrate and observe all of your workflows, like air traffic control for your data.”
Even with Prefect as a seemingly good fit for the use case described, a key decision was required in relation to the version of the platform to use. The emergence of Prefect 2.0 meant there was a choice to use the old stable platform or take advantage of the new platform and the raft of improvements promised.
The Data Refinery typically uses timeboxed tasks (spikes) to evaluate tooling or to prove or disprove theories. In this case, a 3-day spike was defined to let me get to grips with the Prefect platform, produce some very basic example data flows, and confirm much of the initial research the team had conducted previously. Most importantly, it allowed me to validate that Prefect 2.0 was the version of choice.
Proving the concept
After deciding on Prefect, I started putting together some proofs of concept. For a good chunk of my internship, Prefect 2.0 was still in beta, which meant rapid releases and frequent changes to the underlying code base. This caused a few headaches, as the initial services I built often required modification. There were also occasional errors in the Prefect code itself, meaning I wasn't always sure whether an issue came from my own code or the Prefect library.
This beta experience was really rewarding, as it gave me the opportunity to engage with the Prefect community and developers. It also gave me the opportunity to dive into the Prefect codebase and build an intimate enough understanding of Prefect that I could quickly work out where problems were coming from and report those issues directly to the Prefect team.
Migrating data services
To complete a lightweight migration of the various platform data services, it was decided that the services would be updated to no longer include any notion of scheduling or peer dependencies. This decoupling ensured that the services, which were already mostly built to address a single concern, could be managed by an external scheduler.
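Decoupling of this kind can be as simple as reducing each service to a plain entry point with no scheduling logic, so any orchestrator can invoke it. A minimal sketch (the service and field names are hypothetical, not The Data Refinery's actual code):

```python
# Before: a service that owned its own schedule and downstream call.
#
#   while True:
#       cleanse(batch)
#       trigger_segmentation()   # peer dependency baked into the service
#       time.sleep(24 * 60 * 60)  # scheduling baked into the service
#
# After: a single-concern entry point. When it runs, and what runs next,
# becomes the orchestrator's responsibility.
def cleanse(records: list[dict]) -> list[dict]:
    """Standardise one batch of records and return the result."""
    return [{**r, "email": r["email"].strip().lower()} for r in records]

batch = [{"email": "  Alice@Example.COM "}]
print(cleanse(batch))  # [{'email': 'alice@example.com'}]
```

Because the function no longer knows about timers or sibling services, the same code can be triggered on a schedule, on demand, or as one node in a larger dependency graph.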
This process was interesting as it meant reading other people’s code and doing a lot of refactoring and removal of now redundant code. This mini project didn’t take too long but carried good value as well as being useful for me personally.
With a subset of the core data services readied for centralised scheduling, I set about deploying Prefect within The Data Refinery platform. The Data Refinery has a real focus on deployment automation and infrastructure as code, which meant the resources Prefect needed in order to execute had to be deployed in each environment, with access restricted to just the target services for scheduling.
The most challenging aspect of this work was the creation of a virtual agent to be used as the base instance of the Prefect job runner. Setting up the virtual machine with the correct settings, framework dependencies, and permissions required a lot of testing and research to meet both Prefect's requirements and our internal security requirements.
By the end of my internship, I had successfully migrated a subset of core data services to be orchestrated by Prefect. This provided more control over what is running and when, with much more flexibility to offer other features in the future.
A high-level view of the changes is captured below.
With the whole focus of my time working with The Data Refinery being on one well-defined project, I knew it would be great experience - I had no idea I would learn so much along the way.
From the project itself, I have had to improve how I both read and write code, having had little previous experience of working in a highly collaborative team. I have also learned a lot more about Linux and the importance of deployment automation through automating the creation of the VM for our Prefect agent. The importance of ensuring deployment steps are repeatable and written as code is not really promoted in the early years of university, but it cannot be overstated in a working engineering environment.
I have also had much more exposure to cloud computing; something that is almost impossible to avoid in industry, but has not yet been covered in detail in my university course.
Aside from the code and technical aspects of my work, I learned new strategies for working effectively within a larger team, as well as how larger-scale projects are generally structured. I have also seen first-hand the importance of good workplace culture, which is a big part of why I really enjoyed my time with The Data Refinery and hope I'll be back.