AWS Startups Blog
How claimsforce Built a Future-Proof Lake House with AWS
Guest post by Robert Kossendey, Team Lead Data Chapter, Johannes Kotsch, Data Engineer, and Hans Hartmann, Data Engineer, of claimsforce
In July 2021, Germany and several other European countries were hit by a 100-year flood, which devastated property and killed more than 180 people. As a result, our partner insurers and damage adjusters received an unforeseen number of claims in a short amount of time from people in dire need of help.
At claimsforce, this spike in demand resulted in a 300% load increase on our systems, but we had no issues scaling our infrastructure. AWS’s seamless scalability meant that we could continue operating with no downtime or maintenance effort, which enabled us to focus fully on responding to the natural disaster, providing our partners with valuable insights, and offering assistance to the people whose lives had been affected by this horrible tragedy.
This catastrophic flooding is a striking example of how claimsforce can have meaningful real-world impact. As an InsurTech startup based in Hamburg, Germany, our mission is to create great claims experiences. Our products enable insurers and damage adjusters to process claims more efficiently by digitally enhancing their processes, as well as the customer experience. This makes us one of the key players in the digital transformation of the European insurance industry. We differentiate ourselves by using data to tackle difficult problems like fleet routing or recommendations and maximize the experience for our customers.
As a fast-growing startup, we deliver products to multiple insurers and damage adjuster networks, which leads to an ever-increasing amount of data stored in our application databases and external sources. Initially, we had no dedicated infrastructure in place to process this data. But performing tasks on data distributed across many sources has become harder, while the growing volume of data offers great opportunities to gain insights that benefit our customers.
Why AWS?
We decided to build our data infrastructure on AWS because it offers serverless and fully-managed services. In contrast to traditional products, AWS services require no additional administrative effort on our part, giving us the ability to focus fully on data analysis and product development—something that has been especially valuable to us as a startup with limited resources. We also appreciate the seamless scalability and pay-per-use cost structure. At the moment, we use a relatively small amount of data compared to large enterprises, so we benefit from the flat cost structure of AWS services. However, we are also able to scale for the future and account for load peaks without a problem.
Choosing a Lake House architecture
We chose the Lake House architecture for our data infrastructure because we think it combines the best of both worlds from data lakes and data warehouses.
We regularly face rapid changes of requirements, leading to changes in available data and its structure. A data warehouse with a static schema was therefore not an option for us. Meanwhile, a pure data lake lacks analytic capabilities and can easily become a “data swamp” through a lack of schema enforcement, metadata management, and documentation.
AWS offers a range of services for implementing a Lake House, which can be integrated smoothly. We use Amazon Simple Storage Service (Amazon S3) as a standardized object store for data. AWS Glue is the go-to service for all kinds of data migration and ETL jobs. At the same time, AWS Glue provides a metadata store that documents and helps us enforce a schema on our object data. That process is very convenient and requires no manual maintenance to keep schemas up to date. The data can be queried directly using the Amazon Athena query engine, providing an ideal “single source of truth.”
Our architecture
Our main data sources are our Amazon DynamoDB application databases, external APIs, and user tracking data. We use AWS Lambda functions as a universal and adaptable way to gather data and write it to a dedicated Amazon S3 bucket. This S3 bucket provides a landing area for the raw data; no transformations are applied, which eliminates any possibility of data loss. This also allows for future changes in analytics requirements, as the original data remains intact.
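For illustration, a minimal sketch of such an ingestion function might look like the following. The bucket name, key layout, and the assumption of a DynamoDB Streams batch as the trigger are all placeholders; the actual functions differ per data source.

```python
import json
import os
from datetime import datetime, timezone

import boto3

# Hypothetical bucket name for illustration only.
RAW_BUCKET = os.environ.get("RAW_BUCKET", "claims-raw-landing")

s3 = boto3.client("s3")


def handler(event, context):
    """Write incoming records unchanged into the raw S3 landing area."""
    # The event is assumed to be a DynamoDB Streams batch; external APIs and
    # tracking events would follow the same write-through pattern.
    timestamp = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S")
    key = f"dynamodb/claims/{timestamp}-{context.aws_request_id}.json"

    s3.put_object(
        Bucket=RAW_BUCKET,
        Key=key,
        Body=json.dumps(event["Records"]).encode("utf-8"),
    )
    return {"written": len(event["Records"]), "key": key}
```

Because nothing is transformed at this point, the function stays trivial and any downstream change only requires re-running later processing steps on the untouched raw objects.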
This raw data gets crawled by an AWS Glue crawler, scheduled daily. The crawler determines the schema of all our data and persists it in the AWS Glue data catalog. Schema changes are detected and applied at each crawler execution.
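Setting up such a crawler is straightforward. The sketch below uses boto3 with placeholder names, a placeholder IAM role ARN, and an assumed daily cron schedule; the real configuration differs.

```python
import boto3

glue = boto3.client("glue")

# Crawler name, role ARN, database, and S3 path are placeholders.
glue.create_crawler(
    Name="raw-claims-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw",
    Targets={"S3Targets": [{"Path": "s3://claims-raw-landing/dynamodb/claims/"}]},
    # Run once per day; schema changes are picked up and written to the catalog.
    Schedule="cron(0 3 * * ? *)",
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)
```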
Cleansing in AWS Glue
The raw data from S3 is processed with several cleansing jobs in AWS Glue. These cleansing jobs use the data definitions in the AWS Glue data catalog as descriptions for the input data. During the cleaning process, we enforce different data standards, which we have defined together with all stakeholders, such as business analysts. These data standards include:
- Unified date format and time zone
- Standardized column-naming conventions (e.g., snake case)
- Handling of empty values (null instead of empty strings)
- Transformation of complex data types (arrays, maps, JSON-strings) into atomic data types
After enforcing these data standards, our cleansing jobs write the data into another S3 bucket that contains our processed data. The schema of our processed tables is automatically updated in the AWS Glue data catalog by the AWS Glue job. To prevent our jobs from processing the same data twice, we use AWS Glue's job bookmark feature, which lets AWS Glue process only the data that has been added since the last job execution.
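As a rough sketch (with placeholder database, table, and column names), a cleansing job of this kind could look like the following PySpark script. It reads the raw table through the data catalog, applies the standards listed above, writes Parquet with catalog updates enabled, and commits the job bookmark.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate; initializing and committing the Job object is
# what lets the job bookmark track data that has already been processed.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table via the data catalog; the transformation_ctx is the key
# the bookmark uses to remember progress. Names are placeholders.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="claims",
    transformation_ctx="raw_claims",
).toDF()

cleansed = (
    raw
    # Unified date format and time zone.
    .withColumn("created_at", F.to_timestamp("createdAt"))
    # Standardized column naming (snake case).
    .withColumnRenamed("claimId", "claim_id")
    # Empty strings become nulls.
    .withColumn("assignee",
                F.when(F.col("assignee") == "", None).otherwise(F.col("assignee")))
    # Complex types flattened into atomic columns (hypothetical array column).
    .withColumn("first_photo_url", F.col("photos").getItem(0))
    .drop("createdAt", "claimId", "photos")
)

# Write Parquet to the processed bucket and let the job update the table
# schema in the data catalog.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://claims-processed/claims/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="processed", catalogTableName="claims")
sink.writeFrame(DynamicFrame.fromDF(cleansed, glue_context, "cleansed_claims"))

job.commit()
```

With the bookmark enabled on the job, re-running this script on a schedule only picks up newly landed raw objects instead of reprocessing the whole bucket.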
We use Athena to perform SQL queries on data saved in our two S3 buckets. Athena is very well-suited to ad hoc requests because it can run SQL queries directly on data in S3; no additional movement of data into a data warehouse is needed. By keeping our metadata up to date in the AWS Glue data catalog, we always have a defined schema for the data in our buckets. Our data analysts and data scientists can perform data exploration for ad hoc requests, even on dark data.
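An ad hoc query can also be issued programmatically through the Athena API. The snippet below is a sketch with assumed database, table, column, and result-location names.

```python
import time

import boto3

athena = boto3.client("athena")

# Database, table, columns, and output location are placeholders.
QUERY = """
SELECT region, COUNT(*) AS claim_count
FROM claims
WHERE created_at >= date '2021-07-01'
GROUP BY region
ORDER BY claim_count DESC
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "processed"},
    ResultConfiguration={"OutputLocation": "s3://claims-athena-results/adhoc/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows (skipping the header).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:
        print([field.get("VarCharValue") for field in row["Data"]])
```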
We store data in the Parquet file format, a columnar storage format. This allows us to process and query data more efficiently, because a query only needs to read the columns it references instead of scanning every full row.
Since every cleansing job run results in a single file, and because we run our cleansing jobs on an hourly basis, we end up with a relatively high number of files per day and an increased number of file reads from S3. To reduce the number of S3 GET requests (and the associated costs), we wrote a Parquet compaction job that runs daily. The job compacts all of each day's files so they adhere to the most efficient Parquet file size, 512 MB to 1024 MB.
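A compaction job along these lines could be sketched as follows. The bucket, prefix, and target file size are placeholders, and the script is assumed to run as a Glue or EMR Spark job with S3 access.

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()
s3 = boto3.client("s3")

# Hypothetical bucket and day prefix; in practice derived from the run date.
BUCKET = "claims-processed"
PREFIX = "claims/2021/07/15/"
TARGET_FILE_BYTES = 768 * 1024 * 1024  # aim for the 512 MB to 1024 MB range

# Sum the sizes of the day's Parquet files via an S3 listing.
total_bytes = 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    total_bytes += sum(obj["Size"] for obj in page.get("Contents", []))

# Derive the number of output files from the total size and rewrite the data.
num_files = max(1, int(total_bytes // TARGET_FILE_BYTES))
df = spark.read.parquet(f"s3://{BUCKET}/{PREFIX}")
df.coalesce(num_files).write.mode("overwrite").parquet(
    f"s3://{BUCKET}-compacted/{PREFIX}"
)
```

Writing the compacted output to a separate location (here a hypothetical "-compacted" bucket) avoids overwriting the input while it is still being read.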
Our two use cases for data: business analytics and machine learning
For business analytics, a frequent use case is supporting our customers with valuable insights about the efficient use of their resources. For that, we use Amazon Redshift, Amazon's fully managed data warehouse, which allows us to perform efficient SQL queries on large amounts of data.
The data loaded into Redshift is further aggregated to meet business-level requirements. We perform these aggregations with AWS Glue aggregation jobs, which read from multiple tables in the processed bucket, using the schemas defined in the AWS Glue data catalog, and perform the needed operations such as joins and aggregations.
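Sketched with placeholder table, column, and connection names, such an aggregation job might look like this: it joins claims with assignments, aggregates per region and adjuster, and loads the result into Redshift through a Glue connection.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog tables backed by the processed bucket.
claims = glue_context.create_dynamic_frame.from_catalog(
    database="processed", table_name="claims").toDF()
assignments = glue_context.create_dynamic_frame.from_catalog(
    database="processed", table_name="assignments").toDF()

# Business-level view: claims per region and adjuster, average response time,
# and rejection rate. Column names are placeholders.
region_metrics = (
    claims.join(assignments, "claim_id")
    .groupBy("region", "adjuster_id")
    .agg(
        F.count("claim_id").alias("claim_count"),
        F.avg("response_time_hours").alias("avg_response_time_hours"),
        F.avg(F.col("rejected").cast("double")).alias("rejection_rate"),
    )
)

# Load the aggregate into Redshift via a Glue connection (names are placeholders).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(region_metrics, glue_context, "region_metrics"),
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "analytics.region_metrics", "database": "analytics"},
    redshift_tmp_dir="s3://claims-glue-temp/redshift/",
)
```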
With those analytical capabilities, we cluster past claims by region and search for regions with many claims per damage adjuster, long response times, or high rejection rates from damage adjusters. With the help of those insights, our partners can identify regions where more resources are needed to improve their customer experience.
We also use the data in Amazon Redshift as the basis for our visualization tools, Tableau and Amazon QuickSight. Those tools provide dashboards and analyses for our customers and for internal use.
In another use case, we use Amazon SageMaker, Amazon's fully managed machine learning platform, to build our disposition engine. This feature recommends the best damage adjuster for a specific claim. We used historical data from claim assignments that we could extract directly from the processed S3 bucket. Having Parquet as a standardized file format made it easier to ingest large datasets into our ML models than querying large amounts of data via SQL.
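As a simplified sketch, the training data can be pulled straight from the processed bucket as Parquet and fed into a model. The path, feature columns, label, and the generic scikit-learn classifier below are all stand-ins for illustration; the actual disposition engine uses a richer feature set and is trained on SageMaker.

```python
import awswrangler as wr  # AWS SDK for pandas, commonly available in SageMaker
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical Parquet location and columns.
df = wr.s3.read_parquet("s3://claims-processed/assignments/")

features = ["distance_km", "adjuster_workload", "claim_type_encoded"]
label = "assignment_successful"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[label], test_size=0.2, random_state=42
)

# A generic classifier standing in for the actual disposition model.
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```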
Using our disposition engine resulted in better assignment decisions and more efficient use of resources. With the help of its predictions, the driving times of damage adjusters could be reduced, and their workload could be distributed more evenly.
Conclusion
Ultimately, our decision to build a Lake House with AWS has offered us clear benefits. For one, storing all ingested data in a raw bucket paid off. There were some cases where we needed to change the data structure in the processed bucket, but with all the raw data in place, we could simply change the transformation logic, delete the data in the bucket, reset the job bookmark, and execute the AWS Glue job again. Moreover, the AWS Glue jobs have no difficulty processing large amounts of data. This is a huge advantage compared to other services that could be considered for data movement.
Overall, even after just a couple of months of using our Lake House, it’s clear that it’s had a positive impact on our business, allowing us to improve our efficiency and flexibility and focus on delivering the best possible customer experience. That’s especially important when dealing with stressful circumstances like a flood or other natural disaster, when speed, convenience, and reliability are key.
Robert Kossendey is the Team Lead Data Chapter at claimsforce. His focus areas are highly distributed cloud architectures and big data processing. In his free time, Robert likes to dive into data science and machine learning topics.
Johannes Kotsch is working as a Data Engineer in our disposition product team, which is in charge of matching the best damage adjuster to a claim. He is passionate about DynamoDB and everything related to distributed data in the cloud. Currently, he is also pursuing his Master's degree in Computer Science with a focus on Big Data.
Hans Hartmann is a Data Engineer in one of our product teams, responsible for claim assessment. In addition, he is currently pursuing his Master's degree in Data Science. He's deeply passionate about everything related to data.