Design and Architecture considerations for storing time series data
Know your Access patterns in advance
- Are we going to do analysis on a full day of data, or data for just one hour? Knowing in advance the use cases for which the data will be consumed is highly recommended.
- The granularity of information required by the client application helps decide the underlying data model for storing the information.
- Frequency with which data is generated
- Identify the speed at which data is being produced by the source system. Do you produce multiple data points every second?
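To make the first point concrete, here is a minimal sketch of how an access pattern can drive the data model. The composite-key scheme and the `row_key` helper are hypothetical, not tied to any particular store: the idea is that if clients always query one stock for one hour, the key should let a single lookup fetch exactly that slice.

```python
# Hypothetical sketch: the access pattern ("one stock, one hour at a time")
# drives a composite row key of (symbol, day, hour), so one lookup returns
# exactly the slice the client asked for.
def row_key(symbol: str, day: str, hour: int) -> str:
    """Build a composite key like 'AAPL#2024-01-15#09'."""
    return f"{symbol}#{day}#{hour:02d}"

print(row_key("AAPL", "2024-01-15", 9))  # AAPL#2024-01-15#09
```

If the dominant query were instead "all stocks for one minute", the key would be inverted to lead with the timestamp, which is why knowing the access pattern up front matters.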
Though we might need to persist all the time series data, more often than not we don't need to store each data point as a separate record in the database.
Most time series problems are similar in nature, and the predominant issues arise when we need to scale the system. Making the system evolve with a changing schema is another dimension that adds to the complexity. All these problems show similar patterns, with variations only in the data model.
If we can define a time window and store all the readings for that period of time as an array, then
we can significantly cut the number of actual records persisted in the database, improving the performance of the system.
Stock tick information is generated once a second for each stock, i.e. roughly 86K ticks per stock per day.
If we store each tick as a separate row, accessing this information would be expensive, so we can instead group 5 minutes, 1 hour, or one day worth of records into a single vector record.
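The grouping above can be sketched in a few lines. This is a store-agnostic illustration (the `bucket_ticks` helper and the `(timestamp, price)` tick shape are assumptions for the example): per-second ticks are collapsed into fixed windows, and each window's array would then be persisted as one record.

```python
from collections import defaultdict

# Sketch, not tied to any specific NoSQL store: group per-second ticks
# into 5-minute buckets so that each bucket becomes one stored record
# whose value is an array of readings.
WINDOW_SECONDS = 5 * 60  # 5-minute window; tune per access pattern

def bucket_ticks(ticks):
    """ticks: iterable of (epoch_seconds, price) pairs.
    Returns {window_start: [price, ...]} with one entry per window."""
    buckets = defaultdict(list)
    for ts, price in ticks:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to window boundary
        buckets[window_start].append(price)
    return dict(buckets)

# One tick per second for 10 minutes collapses into just 2 records:
ticks = [(i, 100.0 + i * 0.01) for i in range(600)]
buckets = bucket_ticks(ticks)
print(len(buckets))  # 2
```

Each bucket of 300 readings is written once, instead of 300 individual rows.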
The benefit of storing information in larger chunks is obvious: you do far fewer lookups into the
NoSQL store to fetch the information for a specific period of time. Another point to remember is that
if your window size is very small you will be doing a lot of read/write operations, while if it is
too big then durability becomes a concern, as you can lose information in the event of a system failure.
So you need to balance both forces.
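The trade-off can be made concrete with back-of-the-envelope arithmetic, assuming the one-tick-per-second rate from the stock example (the `records_per_day` helper is illustrative only):

```python
# Records stored per stock per day as a function of window size,
# assuming one tick per second (86,400 ticks per day).
TICKS_PER_DAY = 24 * 60 * 60  # 86,400

def records_per_day(window_seconds: int) -> int:
    """Number of stored records when each window becomes one record."""
    return -(-TICKS_PER_DAY // window_seconds)  # ceiling division

for window in (1, 5 * 60, 60 * 60, 24 * 60 * 60):
    print(window, records_per_day(window))
# 1 second -> 86,400 records (many small writes, but very durable)
# 5 min    ->    288 records
# 1 hour   ->     24 records
# 1 day    ->      1 record (fewest lookups, most data at risk on failure)
```

Somewhere between the two extremes lies the window size that balances lookup cost against the amount of unflushed data you can afford to lose.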
There is no single size that fits all; each time series problem is different. Fine-tune the system based on the requirements and the access patterns.
If the access pattern changes in the future, you might have to re-index and re-calculate the array size to optimize your queries.
So each time-series application is very much custom made: you can apply the best practices, but you cannot simply import the data modelling templates from a different time-series problem.