Collecting premiums is, along with settling claims, one of the key operations of an insurance company. Having a billing system that's accurate, transparent and easy to understand is essential to maintaining our clients' trust. And in the insurance world, trust is everything!
In 2018, Alan built its billing system from scratch, after relying on a provider for the first two years of the company's life. This post relates how we built a very critical piece of our infrastructure and managed to greatly increase the accuracy and automation of our billing.
As a health insurance company, we bill for the periods for which we provide coverage. The price depends on a few factors: age, number of dependents, level of coverage. This sounds super simple, right?
It turns out it's a bit more complex, in large part because we allow changes to be made both in the future and in the past. Some examples:
This flexibility on dates is partly a legal requirement, and also something we pride ourselves in offering to our clients. It was unfortunately something our first provider did not handle very well, which led to errors and quite a bit of manual verification. This flexibility, and the need to control proration rules and the timing of billing cycles, also made it difficult for us to use something like Stripe Billing (we do use Stripe for payments, and are very happy with their service).
Here are the main properties we wanted to get out of our billing system:
At its core, a billing system is just two things: 1. Computing premiums: they are the amounts due for a single user’s coverage. 2. At regular intervals, grouping these premiums into invoices.
Our Premium model looks roughly like this:
The premium table is append-only and all columns are immutable, except for invoice_id which is populated when the premium is included in an invoice. This immutability, associated with creation timestamps, allows us to see exactly the state at any point in time.
Creating invoices is incredibly simple: sum all premiums with no invoice_id for the current and prior months, then populate the invoice_id. This link between premiums and invoices makes it very easy for our support team to explain why our invoice amounts are what they are.
When the amounts for a given month change, we create new rows that adjust the total amount for the month. Instead of just adding the "delta", we create a "cancelling" row with a negative amount equal to the original amount, then a third row with the new amount. This gives us the property that the latest row is the correct premium amount, while the previous rows sum up to 0. This is helpful when debugging because we're all familiar with our pricing, so we can eyeball these amounts more easily than if we had to sum up a lot of deltas. It also makes our data analysts' lives easier.
Every day, we recompute all the premiums for all our users over all time in a big batch!
This may seem wasteful at first, but keep in mind that users usually have just one Premium row per month. It's also a workload that's very easy to parallelize, because one client's contract doesn't interact in any way with another's contract.
For a while, we did explore whether we should make this premium computation “event-based”, and only insert Premium rows in response to certain events. We quickly realized that the complexity of listing all the events that affected billing, and their effects, and ensuring that nothing was missed as our product evolved, was not worth the performance gain. Recomputing everything all the time provides us with a lot of confidence that any coverage change or code fix is reflected correctly in billing.
In order to update the Premiums, we compute a "new set" of Premiums, based on the current state of users and we compare it with an "old set" of Premiums, currently in the database table. Any difference between the sets leads to the "cancellation" of old rows and the creation of new ones (remember, the table is append-only).
If the premiums for a month are updated, it's very easy to re-invoice that particular month, or include the adjustment in the following month's invoice: all you need to do is select the newly created rows and mark them with an invoice_id.
We do want to avoid situations where multiple threads are inserting Premium rows at the same time, which could lead to the same rows being computed and added twice. We control this very simply, without any locking. All we do is have a "version" column to the Premium table, and a unique constraint on (user_id, month, version). The version is computed by incrementing the previous version for the (user_id, month) pair. Two threads updating the same premiums would compute the same next version and be prevented from both committing.
The remarkable feature of this simple design is that it is very easy to deal with billing bugs. On a few occasions, when we rolled out product changes, we made mistakes in how those changes affected Premiums.
For example, when we allowed individual clients (ie not on a company plan) to add their spouses as dependents, we computed the price for the spouses based not on their age but on the age of the primary user. This caused a lot of incorrect Premium rows to be created.
However, after quickly fixing the initial bug, all we had to do was wait for the next daily job to run, and all the Premiums were fixed. We didn't have to do any deep dive into the billing data to figure out who was affected. We didn't have to manual edit any billing data. It was all "self-healing". And in the event that some incorrect invoices had already been generated, we knew that the adjustments would automatically be on the next invoices.
This gives us a lot of peace of mind about the accuracy of our billing.
At Alan, we try to leverage database constraints to enforce a lot of properties we expect about our data. However, there is only so much you can enforce with database constraints. You certainly can't do table scans on each update, nor join to other tables. So we've supplemented our database constraints with other consistency checks. For example, we verify that the number of days billed for each user is equal to the number of covered days for that user. These checks are run daily and we get alerts in Slack when they detect problems. There is a nifty feature of Metabase (the open-source BI tool we use at Alan) that makes this alerting trivial to set up.
It took me a while to get around to writing this post, in large part because I felt the system was too simple to be worth writing about. Yet this simplicity is deceptive: it's actually the result of a lot of work and lessons learned by the team. The system has been running in production for the past 6 months, and it has given back to our engineers, ops team and most importantly our users an enormous amount of time: for example, it took us 2-3 days per month to agree with our provider on amounts to debit, and this was with 5x fewer users than we have today! It is a good example of how we are working hard to bring simplicity and transparency to a complex and opaque industry.
If you too want to radically simplify the French healthcare system and improve the health of millions, reach out to us, we're always hiring skilled engineers.