So You've Decided to Build a Bioregistry
This post walks through the steps that can trap therapeutic discovery teams and CROs into building their own bioregistry without weighing the tradeoffs against buying an existing system.
So you've decided to build your own bioregistry. Congratulations! A bioregistry is a critical piece of infrastructure that supports FAIR (findable, accessible, interoperable, reusable) data and avoids the pitfalls of science-ing by spreadsheet. You've got your text editor open, you've picked your web framework of choice, and you've written SQL statements to build out the data model in your database for your company's particular therapeutic. You'll add a web form for your colleagues to enter all of their amazing new molecules, fire up an AWS machine (or Docker container) for hosting, and the data will come pouring in!
You remembered to capture information about the other critical reagents, right? You didn't? That's okay - you'll ask everyone for a Type and add a few extra fields to your web form to cover all of the different things. Now you're ready - scientists just pick a type and fill out whatever fields they need. Data entry couldn't be easier!
You remembered to support bulk registration in your web form, right? Scientists hate having to enter everything one-by-one. You didn't? That's okay - there are lots of options for spreadsheet-like forms that you can use for bulk registration. You'll just swap out your web form for one that uses this new widget and your scientists will be all set.
What's that? Entering sequence information is getting confusing with all of the different types of things folks are recording? That's fine, it's a bit more work on your end but it's no big deal: you'll create a separate database model for every reagent type and have a different form for each one, too. This will make entering and searching through the data so much easier, anyway! It's only one additional SQL script to migrate the data, after all.
Speaking of which, you remembered to differentiate between DNA and amino acid sequences, right? And validate what scientists are recording? So that everyone enters the same kinds of things? No? That's okay, you've got a bit of time - you'll do a one-off project to standardize on one kind of sequence or another for all of your database models.
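For what it's worth, the validation itself can start small. Here is a minimal sketch in Python, assuming you only accept unambiguous DNA bases and the 20 standard amino acids (plus `*` for a stop); the function name and the two `kind` labels are made up for illustration, and a real registry would likely also need RNA, IUPAC ambiguity codes, and modified residues.

```python
import re

# Assumption: only unambiguous bases and the 20 standard residues are allowed.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY*")  # '*' marks a stop codon

ALPHABETS = {"dna": DNA_ALPHABET, "protein": PROTEIN_ALPHABET}


def validate_sequence(raw: str, kind: str) -> str:
    """Return a normalized sequence, or raise ValueError if it is invalid."""
    if kind not in ALPHABETS:
        raise ValueError(f"unknown sequence kind: {kind!r}")

    # Strip whitespace and uppercase so pasted FASTA-style text still works.
    cleaned = re.sub(r"\s+", "", raw).upper()
    if not cleaned:
        raise ValueError("sequence is empty")

    # Report every unexpected character at once, not just the first one.
    unexpected = sorted(set(cleaned) - ALPHABETS[kind])
    if unexpected:
        raise ValueError(f"invalid {kind} characters: {''.join(unexpected)}")
    return cleaned
```

Running every form submission through something like `validate_sequence(text, "dna")` before it reaches the database is far cheaper than the one-off cleanup project.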
Wait, nobody can figure out which plasmid expresses which protein? They were trying to do it with naming conventions but nobody's following the convention? That's okay, you can just figure it out in your web application. When scientists are looking at information about a protein, you'll just check for the plasmids that translate to its amino acid sequence, and ...
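In case you're wondering what that check looks like, here is a rough sketch in Python. It uses the standard genetic code and makes some loud simplifying assumptions: it only reads the three forward frames, treats the plasmid as linear rather than circular, and ignores the reverse strand. The function names are invented for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate one reading frame; a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for anything unrecognized
        for i in range(0, len(dna) - 2, 3)
    )


def encodes_protein(plasmid_dna: str, protein: str) -> bool:
    """True if the protein appears in any of the three forward frames.

    Simplification: linear sequence, forward strand only.
    """
    plasmid_dna = plasmid_dna.upper()
    protein = protein.upper()
    return any(protein in translate(plasmid_dna[frame:]) for frame in range(3))
```

Note that each call translates the whole plasmid three times, so checking one protein against every plasmid in the registry does that work once per plasmid, on every page view.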
The site just crashed? It's crashing pretty regularly? It looks like everyone really took advantage of those bulk registration forms, and now you have a lot of plasmids and proteins. Doing all of that translating is more than your web server can bear! But that's okay, you're a savvy programmer - you can just store the translations with the plasmid to make it easier to look everything up!
Uh oh - someone says the translation you're storing doesn't match the protein they wanted to express, and now they've made the wrong protein! How could that happen? You remembered to update all of the stored values whenever anyone updated a sequence, right? You didn't?!?! Well, hopefully you can squeeze in the time to do an audit and get everything properly aligned. And don't forget to refresh those stored translations whenever a sequence is updated!
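One way to keep stored values from drifting is to record which sequence each one was derived from. The sketch below is a hypothetical in-memory illustration, not a prescription: the `Plasmid` class and its fields are made up, and the translation function is passed in so the example stays self-contained. The idea is to store a checksum of the source sequence next to the cached translation and recompute whenever the two no longer match.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def checksum(sequence: str) -> str:
    """Fingerprint of a sequence, used to detect that it has changed."""
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()


@dataclass
class Plasmid:
    name: str
    sequence: str
    cached_translation: str = ""
    cached_for: str = ""  # checksum of the sequence the cache was built from

    def translation(self, translate: Callable[[str], str]) -> str:
        """Return the cached translation, recomputing it if it is stale."""
        current = checksum(self.sequence)
        if self.cached_for != current:
            # The sequence changed (or nothing was cached yet): refresh.
            self.cached_translation = translate(self.sequence)
            self.cached_for = current
        return self.cached_translation
```

With this shape, nobody has to remember to refresh anything: a read either finds a cache that matches the current sequence or rebuilds it on the spot.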
Who made those sequence changes, anyway? You remembered to add an audit log, right? No? Well, better late than never - you may never know how those sequence changes were made, but at least you'll have captured everyone's changes going forward.
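If you do end up bolting one on, the core of it is small. Here is a minimal sketch using Python's built-in `sqlite3` module; the table and column names are assumptions made for the example, and a real system would probably log more than sequence edits. The important part is that the change and its log entry are written in the same transaction, so one can never land without the other.

```python
import sqlite3
from datetime import datetime, timezone


def create_registry() -> sqlite3.Connection:
    """Create an in-memory database with a plasmid table and an audit log."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE plasmid (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            sequence TEXT NOT NULL
        );
        CREATE TABLE audit_log (
            id INTEGER PRIMARY KEY,
            plasmid_id INTEGER NOT NULL,
            changed_by TEXT NOT NULL,
            changed_at TEXT NOT NULL,
            old_sequence TEXT NOT NULL,
            new_sequence TEXT NOT NULL
        );
        """
    )
    return conn


def update_sequence(conn: sqlite3.Connection, plasmid_id: int,
                    new_sequence: str, user: str) -> None:
    """Update a plasmid's sequence and record who changed it, atomically."""
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            "SELECT sequence FROM plasmid WHERE id = ?", (plasmid_id,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no plasmid with id {plasmid_id}")
        conn.execute(
            "UPDATE plasmid SET sequence = ? WHERE id = ?",
            (new_sequence, plasmid_id),
        )
        conn.execute(
            "INSERT INTO audit_log (plasmid_id, changed_by, changed_at,"
            " old_sequence, new_sequence) VALUES (?, ?, ?, ?, ?)",
            (plasmid_id, user, datetime.now(timezone.utc).isoformat(),
             row[0], new_sequence),
        )
```

This only helps if every edit goes through `update_sequence`; anything that writes to the table directly still slips past the log.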
Wait, now everyone wants to know where their reagents are kept? And how much they have left? Why didn't anyone mention this to you at the start?! That's a whole new layer to the data model that you'll have to design - and you're already busy maintaining the bioregistry the way it is now!
Who decided to build this bioregistry, anyway?!