DynamoDB discoveries
A lot of preparation needs to go into a DynamoDB design.
- normalised fields: store lowercased copies of strings so lookups are case-insensitive (see the first sketch after this list)
- composite keys can be quite "loaded": several attributes concatenated into a single key value
- for the tag use case, BatchWriteItem might come into play (see the BatchWriteItem sketch after this list)
- versioning
- getting design support from AWS
- SDK pain, like getting to know the dynamodbattribute and expression builder packages in the Go SDK, and v1/v2 confusion
- Are projections premature optimisation?
- a code struct doesn't document the table design, so people end up describing their setup in a README
- Are key names capitalised? Rich likes them generic (PK/SK style) for AppSync
- Scan versus Query... is it ever OK to do a Scan?
- How to store time? In UTC, in an ISO 8601 style format rather than an epoch number, so timestamps are readable and range-queryable as strings (see the first sketch after this list)
- designs are rarely captured anywhere: not in code, a README, docs, or an example
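
As a rough illustration of the composite-key, normalisation, and time-format points above, here is a minimal sketch using the Ruby SDK; the table name, key names, and values are all hypothetical:

require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new

# A hypothetical item: generic key names (PK/SK), with a composite ("loaded")
# SK that embeds an ISO 8601 UTC timestamp so items sort chronologically.
client.put_item(
  table_name: 'jobs_example',                # assumed table name
  item: {
    'PK'    => 'CUSTOMER#42',
    'SK'    => 'JOB#2021-06-01T09:30:00Z',   # type prefix + ISO 8601 time
    'email' => 'Jo@Example.com'.downcase     # normalised (lowercased) copy
  }
)

# A range query over a time window works because ISO 8601 strings compare
# in chronological order.
client.query(
  table_name: 'jobs_example',
  key_condition_expression: 'PK = :pk AND SK BETWEEN :from AND :to',
  expression_attribute_values: {
    ':pk'   => 'CUSTOMER#42',
    ':from' => 'JOB#2021-06-01T00:00:00Z',
    ':to'   => 'JOB#2021-06-30T23:59:59Z'
  }
)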
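
And for the tag use case, a hedged sketch of BatchWriteItem writing one item per tag; again, the table and key names are assumptions:

require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new

# One item per (post, tag) pair; BatchWriteItem takes up to 25 writes per call.
tags = %w[aws dynamodb nosql]

requests = tags.map do |tag|
  { put_request: { item: { 'PK' => 'POST#123', 'SK' => "TAG##{tag}" } } }
end

resp = client.batch_write_item(request_items: { 'posts_example' => requests })

# A production caller should retry anything left in resp.unprocessed_items.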
Operational considerations:
- Use on-demand scaling (see the sketch below), but that's roughly 6x more expensive per request than provisioned capacity
- Predicting cost isn't easy until you know roughly how many requests per second you'll serve. And some people think you should build limits in while developing.
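
The capacity mode is a table-level setting. A sketch, assuming the Ruby SDK and a hypothetical table:

require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new

# On-demand ("PAY_PER_REQUEST") needs no capacity planning; provisioned mode
# is cheaper per request but requires throughput estimates up front.
client.create_table(
  table_name: 'jobs_example',
  attribute_definitions: [
    { attribute_name: 'PK', attribute_type: 'S' },
    { attribute_name: 'SK', attribute_type: 'S' }
  ],
  key_schema: [
    { attribute_name: 'PK', key_type: 'HASH' },
    { attribute_name: 'SK', key_type: 'RANGE' }
  ],
  billing_mode: 'PAY_PER_REQUEST'
)

You can switch an existing table between modes later via update_table with a new billing_mode.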
Good resources
Who to follow on Twitter when it comes to sharing DynamoDB knowledge
Case study
Via #databases on https://awsdevelopers.slack.com/, Martin asks:
Hello, everyone. I'm pretty new to DynamoDB and I need some advice on what the best partition key and sort key would be for my use case. We are migrating this table, jobs, which is not complex but gets thousands of queries per hour by date; basically it is always #date between :start_time and :end_time. Below is the specification:
- table name: jobs
- Id: unique string
- created_at: date time, ISO 8601 format: 20150311T122706Z
- Data: string
This is an example of how we need to access the data:
require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new

# Desired access pattern: every job created between two timestamps.
params = {
  table_name: 'jobs',
  key_condition_expression: '#date BETWEEN :start_time AND :end_time',
  expression_attribute_names: { '#date' => 'created_at' },
  expression_attribute_values: {
    ':start_time' => '20150311T122706Z',
    ':end_time'   => '20150313T122700Z'
  }
}

client.query(params)
Michael says:
Depending on how many days you query at a time, you could even make the day (or month) the hash key, and the rest of the timestamp (plus a job ID?) the range key. The advantage is that, assuming an even distribution of job creation times, no hash key should get too "hot" (i.e. each hash only holds as many jobs as you create in a day, or month). The disadvantages are that querying multiple days (or months) means multiple queries (you can run them in parallel, but if you have hundreds, bleh), and if you're creating millions a day (or month) this wouldn't be suitable either.
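
A minimal sketch of Michael's suggestion, assuming hypothetical attribute names job_day (hash key) and ts_id (range key, timestamp plus job ID):

require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new

# All jobs created on 2015-03-11; for a longer window, repeat this query
# once per day (possibly in parallel), as Michael notes.
client.query(
  table_name: 'jobs',
  key_condition_expression: 'job_day = :day AND ts_id BETWEEN :from AND :to',
  expression_attribute_values: {
    ':day'  => '20150311',
    ':from' => '20150311T000000Z',
    ':to'   => '20150311T235959Z'
  }
)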
Rich says:
I'd be inclined to use the job ID as the PK for the table, because changing a key value requires writing a new record and removing the old one, and a job ID never changes. I'd then create a secondary index whose PK is based on what you're likely to query. For example: date (PK) + date/time (SK); date + status (PK) + time (SK); date + hour + status (PK) + time (SK). Now if you want all the pending jobs in the last hour, you query the key for that hour, e.g. "2019090906pending" (date 2019-09-09, hour 06, status pending).
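
And a sketch of Rich's approach; the attribute names (id, date_hour_status) and the index name (by_date_hour_status) are assumptions for illustration:

require 'aws-sdk-dynamodb'

client = Aws::DynamoDB::Client.new

# Writing a job: the packed index key is just another attribute on the item.
client.put_item(
  table_name: 'jobs',
  item: {
    'id'               => 'job-123',           # table PK: a value that never changes
    'created_at'       => '20190909T061500Z',
    'status'           => 'pending',
    'date_hour_status' => '2019090906pending'  # GSI PK: date + hour + status
  }
)

# Reading: all pending jobs in the 06:00 hour of 2019-09-09.
client.query(
  table_name: 'jobs',
  index_name: 'by_date_hour_status',           # assumed GSI name
  key_condition_expression: 'date_hour_status = :k',
  expression_attribute_values: { ':k' => '2019090906pending' }
)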