In today’s digital age, data is often seen as a valuable asset, and organizations routinely accumulate and store large amounts of it. But all of that data costs money to keep.
Let’s say we create a new web page and want to determine the number of page views for each page on our website. To do so, we will analyze data from an Apache HTTP server access log. To protect the privacy of our users, we will not store their IP addresses, since those are considered personally identifiable information (PII). Instead, we will focus on the pages visited, the time of each visit, and the type of device used (mobile or desktop).
Our example line from the Apache server access log:
xxx.xxx.xxx.xxx - - [31/Dec/2022:05:04:36 +0000] "GET /subjects/development/web-development HTTP/1.1" 200 24701 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
Parsing the Apache-style log, we can extract the following (the device is derived from the user agent):
"31/Dec/2022:05:04:36 +0000","/subjects/development/web-development","Desktop"
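The parsing step above can be sketched in a few lines of Python. The regex field names and the very simple substring-based device check are my own assumptions for illustration, not a production-grade user-agent parser:

```python
import re

# Minimal sketch: parse one Apache combined-format log line into
# (time, path, device). Field names are illustrative assumptions.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d+ \d+ "[^"]*" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # line did not match the expected format
    agent = m.group("agent")
    # Naive device detection from the user agent, for illustration only.
    if "Mobile" in agent:
        device = "Mobile"
    elif "Tablet" in agent or "iPad" in agent:
        device = "Tablet"
    else:
        device = "Desktop"
    return m.group("time"), m.group("path"), device

line = ('xxx.xxx.xxx.xxx - - [31/Dec/2022:05:04:36 +0000] '
        '"GET /subjects/development/web-development HTTP/1.1" 200 24701 "-" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"')
print(parse_line(line))
```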
The initial log line is 235 bytes, while the data we care about is only 78 bytes: 3 times smaller without any additional optimization. We can still go one step further: convert the time to a Unix timestamp and encode Mobile, Desktop, and Tablet as 1, 2, and 3.
"1672463076","/subjects/development/web-development","2"
This reduces the row to 57 bytes (4 times smaller than the original) without losing any of its value.
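This compaction step can be sketched as follows, assuming the 1/2/3 device codes from the text and the standard Apache time format:

```python
from datetime import datetime

# Device codes as described in the text.
DEVICE_CODES = {"Mobile": 1, "Desktop": 2, "Tablet": 3}

def compact(time_str, path, device):
    # Apache access-log time format, e.g. "31/Dec/2022:05:04:36 +0000".
    # %z parses the offset, so .timestamp() is timezone-correct.
    ts = int(datetime.strptime(time_str, "%d/%b/%Y:%H:%M:%S %z").timestamp())
    return str(ts), path, str(DEVICE_CODES[device])

print(compact("31/Dec/2022:05:04:36 +0000",
              "/subjects/development/web-development", "Desktop"))
```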
Let’s say we have 10 000 000 views per day across 5 different pages from 3 different devices.
Our original log file will be 10 000 000 x 235 bytes / 1024^3, which comes to about 2.1 GB per day, while our optimized data set will be 10 000 000 x 78 bytes / 1024^3, about 0.72 GB per day.
If we store the data in S3 for 12 months, the raw logs will cost us: 2.1 GB x 365 days -> 0.766 TB, at $23 per TB-month that is $17.61 per month, or $211 per year. Our optimized data set would cost us about $70 per year.
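The arithmetic above can be reproduced with a short back-of-the-envelope script. The $23 per TB-month rate is the S3 figure assumed in the text; the small differences from the rounded numbers above come from carrying full precision:

```python
# Rough reproduction of the storage-cost math, not an exact S3 bill.
VIEWS_PER_DAY = 10_000_000
PRICE_PER_TB_MONTH = 23.0  # assumed S3 rate from the text

def yearly_cost(bytes_per_row):
    # A full year of rows, held for 12 months at the per-TB-month rate.
    tb_per_year = VIEWS_PER_DAY * bytes_per_row * 365 / 1024**4
    return tb_per_year * PRICE_PER_TB_MONTH * 12

print(f"raw log rows (235 B): ${yearly_cost(235):.0f} per year")
print(f"optimized rows (78 B): ${yearly_cost(78):.0f} per year")
```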
Since it is 3 times cheaper, for the same budget we can keep 3 times more data (3 years instead of 1).
However, this is not all. Suppose we do not care about the time of day, only the date. Our final data set can be even smaller: with 1 day, 3 devices, and 5 pages, we end up with just 15 rows. If each row is 100 bytes, that is 15 000 bytes per day (about 0.000014 GB), and storing each day for a year would cost roughly $0.0014 per year…
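The daily aggregation step can be sketched like this: collapse individual views into one count per (date, page, device) combination. The rows here are the compact (timestamp, path, device code) tuples from earlier; the sample data is made up for illustration:

```python
from collections import Counter
from datetime import datetime, timezone

# Made-up sample of compact rows: (unix timestamp, path, device code).
rows = [
    ("1672463076", "/subjects/development/web-development", "2"),
    ("1672463100", "/subjects/development/web-development", "1"),
    ("1672463150", "/subjects/development/web-development", "2"),
]

# Count views per (date, page, device): 15 rows/day max for 5 pages x 3 devices.
counts = Counter()
for ts, path, device in rows:
    date = datetime.fromtimestamp(int(ts), tz=timezone.utc).date().isoformat()
    counts[(date, path, device)] += 1

for (date, path, device), views in sorted(counts.items()):
    print(f'"{date}","{path}","{device}",{views}')
```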
I’m not suggesting you convert all your data to the most compact format. I’m only proposing a lifecycle policy that defines how long data may live in each format:
Raw log - 1 month - total storage 60GB - cost per year $16
More details (keeping time information) - 6 months - total storage 140GB - cost per year $35
Aggregate per day - last 10 years - 0.05GB - cost per year $0.013
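A rough check of the tier costs listed above, assuming the same $0.023 per GB-month rate; the sizes are the steady-state totals kept in each tier per the text, and the output differs slightly from the rounded figures above:

```python
# Rough lifecycle-tier cost check, not an exact S3 bill.
PRICE_PER_GB_MONTH = 0.023  # assumed S3 rate

# Steady-state GB kept in each tier, per the text.
tiers = {
    "raw log (1 month)": 60,
    "detailed rows (6 months)": 140,
    "daily aggregates (10 years)": 0.05,
}

total = 0.0
for name, gb in tiers.items():
    yearly = gb * PRICE_PER_GB_MONTH * 12
    total += yearly
    print(f"{name}: ${yearly:.2f} per year")
print(f"total: ${total:.2f} per year")
```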
The total cost would be roughly $50 per year to store 10 years’ worth of data. If we kept everything in raw log format it would be $2,110, and in the optimized row format $700. This analysis lets us decide how much we are willing to pay for storage of this one metric (page views): roughly $2,000, $700, or $50 per year.
Do you know how much you spend on each metric, and whether it is worth it?
Overall, it’s important to be mindful of how much data you store and to review it regularly. By doing so, you can not only save money but also improve security and do your part to protect the environment.