As more applications of artificial intelligence and machine learning methods to the discovery, development, and engineering of therapeutic antibodies emerge, companies seeking to take advantage of these new methods will have to adapt their existing software and database systems. There are five key considerations to effect such a transformation, and this article will touch on all five.
1. Clean, consistent, centralized data
Different software systems organize information in different ways, but to enable AI/ML support, companies need to ensure they have a centralized, consistently formatted source of high-quality data. The shape this centralized data store takes will ultimately depend upon the needs of the data science team, but the steps required to provide it will be the same: extract, transform, and load. In the extract step, information is retrieved from a particular software system using either its API or a connection to its underlying database. In the transform step, the data is cleaned and converted into a format appropriate for the centralized data store. In the load step, the data is finally added to the appropriate locations in the centralized data store. Extract, transform, and load - ETL - should be automated steps in a data processing pipeline that runs regularly to keep the centralized data store up to date.
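To make this concrete, here is a minimal sketch of an ETL script in Python. The source endpoint, the field names (antibody_id, binding_affinity_nm), and the use of SQLite as a stand-in for the centralized data store are all illustrative assumptions; a production pipeline would typically run under an orchestrator and write to a dedicated warehouse.

```python
import sqlite3
import requests

SOURCE_URL = "https://lims.example.com/api/v1/assays"  # hypothetical source system endpoint
STORE_PATH = "central_store.db"                        # stand-in for the centralized data store


def extract():
    """Pull raw records from the source system's API."""
    response = requests.get(SOURCE_URL, timeout=30)
    response.raise_for_status()
    return response.json()  # assume a list of dicts


def transform(records):
    """Clean and normalize records into the central store's schema."""
    cleaned = []
    for r in records:
        if r.get("binding_affinity_nm") is None:
            continue  # drop incomplete measurements
        cleaned.append((
            r["antibody_id"].strip().upper(),   # normalize identifiers
            float(r["binding_affinity_nm"]),    # enforce numeric type
            r.get("assay_date", "unknown"),
        ))
    return cleaned


def load(rows):
    """Insert the cleaned rows into the centralized data store."""
    with sqlite3.connect(STORE_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS affinities "
            "(antibody_id TEXT, affinity_nm REAL, assay_date TEXT)"
        )
        conn.executemany("INSERT INTO affinities VALUES (?, ?, ?)", rows)


if __name__ == "__main__":
    load(transform(extract()))
```

In practice, each source system would get its own extract and transform functions, with the load step shared so everything lands in one consistent schema.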
2. ... and lots of it
Data is the raw material that powers AI and ML algorithms, and the remarkable (some might say unreasonable) effectiveness of data requires that it be available in large quantities. Researchers at Microsoft demonstrated back in 2001 that ever-larger amounts of data improve model accuracy more than the choice of any particular algorithm. The success of large language models trained on massive datasets further validates this finding. This means the centralized data store and the processes that supply it with data must be able to scale to accommodate millions of datapoints, and this kind of scale should be considered early in the planning stages of the centralized data store.
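One practical consequence of planning for scale is moving data in bounded chunks rather than loading everything into memory at once. The sketch below assumes a hypothetical CSV export from a source system and reuses SQLite as a stand-in store; the same pattern applies to any bulk transfer into the centralized data store.

```python
import sqlite3
import pandas as pd

STORE_PATH = "central_store.db"       # stand-in for the centralized data store
EXPORT_FILE = "sequence_export.csv"   # hypothetical bulk export from a source system
CHUNK_SIZE = 50_000                   # rows per batch; tune to available memory

with sqlite3.connect(STORE_PATH) as conn:
    # Stream the file in fixed-size chunks so memory use stays flat
    # no matter how many millions of rows the export contains.
    for chunk in pd.read_csv(EXPORT_FILE, chunksize=CHUNK_SIZE):
        chunk.to_sql("sequences", conn, if_exists="append", index=False)
```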
3. Metadata - data about your data
Providing data about the data you've collected - its metadata - can help data scientists use that data more effectively. Knowing the age, original source, and relative quality or certainty of each piece of data can help weight its contribution to a model, and that information can also help data scientists troubleshoot issues with particular data sources. Some algorithms may also require that the data being modeled include the expected outcome from the model - that is, that the data be labeled. Labeled data is generally more expensive to obtain than unlabeled data, so the metadata attached to unlabeled data may be an important source of additional information. Plan for capturing and recording metadata during development of the ETL pipelines, since this information may be expensive to recover or reproduce later on.
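As an illustration, metadata can be attached at the moment each record passes through the ETL pipeline. The field names and the quality measure below (for example, a coefficient of variation across assay replicates) are hypothetical; the point is that provenance, timing, quality, and label status travel with the data from the start.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class DataPoint:
    """A measurement plus the metadata needed to weight or audit it later."""
    antibody_id: str
    affinity_nm: float
    source: str        # which instrument or system produced it
    retrieved_at: str  # when the ETL pipeline picked it up
    quality: float     # relative confidence, e.g. from assay replicates
    labeled: bool      # whether an outcome label accompanies the value


def annotate(record, source_name):
    """Wrap a raw record (hypothetical field names) with its metadata at ETL time."""
    return DataPoint(
        antibody_id=record["antibody_id"],
        affinity_nm=float(record["binding_affinity_nm"]),
        source=source_name,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        quality=record.get("replicate_cv", 1.0),
        labeled="outcome" in record,
    )


row = annotate({"antibody_id": "AB-001", "binding_affinity_nm": "3.2"}, "lims")
print(asdict(row))
```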
4. Stream it or batch it?
Designing the centralized data store and the ETL processes that will populate it is a critical step, but when and how to run the ETL and model-building processes are also important decisions. The two approaches are batch processing, where the processes run at intervals, and stream processing, where the processes run continuously. Batch processing is often the easier approach: simply pick a time to run each process, then let it run in its entirety. However, large amounts of data can cause a batch ETL process to take hours or even days to complete, and the computational demands of AI and ML algorithms may mean a new model takes just as long or longer to produce. Streaming updates mitigate this risk by incrementally adding data or adjusting models, but the infrastructure demands may be more complex, and not all AI and ML algorithms are amenable to this approach. Make sure your models, processes, and timeframes are aligned with your team's goals, systems, and expertise.
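The sketch below contrasts the two approaches using scikit-learn's SGDRegressor, one of the estimators that supports incremental updates via partial_fit. The toy data and the scheduling comments are placeholders; many model families can only be rebuilt in batch.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def retrain_batch(X_all, y_all):
    """Batch approach: rebuild the model from the full dataset at a scheduled interval."""
    return SGDRegressor().fit(X_all, y_all)


def update_stream(model, X_new, y_new):
    """Streaming approach: fold new datapoints into the existing model as they arrive."""
    model.partial_fit(X_new, y_new)
    return model


# Toy data standing in for featurized antibody measurements.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.normal(size=1000)

model = retrain_batch(X, y)                    # e.g. run as a nightly job
model = update_stream(model, X[:10], y[:10])   # e.g. triggered per new record
```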
5. Design appropriate APIs
A clean, consistent, centralized data store that scales well is only useful if it's accessible to the team that will be using it. The design of the APIs that provide that access should flow from the needs of the data scientists who will be building the AI and ML models. In addition to API methods that provide the required data in appropriate formats, consider the processes the centralized data store will feed into. If models are generated via batch processing, APIs may need to provide data in digestible chunks or, for updateable models, only the data collected after a certain point in time. For stream processing, the centralized data store - or a background program that monitors it - may need to push data as it's received so models can be updated continuously.
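As a rough illustration, a single endpoint can serve both patterns: pagination parameters for batch pulls and a since parameter for incremental updates. The Flask route, table, and column names below are assumptions carried over from the earlier sketches, not a prescribed design.

```python
import sqlite3
from flask import Flask, jsonify, request

STORE_PATH = "central_store.db"  # same stand-in store as above
app = Flask(__name__)


@app.route("/datapoints")
def datapoints():
    """Serve paged data for batch jobs, or only records newer than `since` for incremental updates."""
    limit = int(request.args.get("limit", 1000))
    offset = int(request.args.get("offset", 0))
    since = request.args.get("since")  # ISO date string; omit for a full paged pull

    query = "SELECT antibody_id, affinity_nm, assay_date FROM affinities"
    params = []
    if since:
        query += " WHERE assay_date > ?"
        params.append(since)
    query += " LIMIT ? OFFSET ?"
    params += [limit, offset]

    with sqlite3.connect(STORE_PATH) as conn:
        rows = conn.execute(query, params).fetchall()
    return jsonify([
        {"antibody_id": a, "affinity_nm": f, "assay_date": d} for a, f, d in rows
    ])


if __name__ == "__main__":
    app.run()
```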
Taking these five considerations into account will help your team effect a successful transition to an AI- and ML-supporting infrastructure. In future posts, we'll dive into more specifics around each consideration, including technologies and processes that can help with each one. If your team finds itself struggling in any particular area, we're happy to assist, and if there are specific topics you'd like to know more about, please feel free to reach out with requests.