The idea of the changes summary file is to be a random-access way to find the basic metadata associated with a certain change. It’s an almost–append-only binary file with a fixed record size, so that seek(change_id * record_size) and change_id = tell() / record_size work without off-by-one errors, as do things like iterating backwards to get recent changes and other clever tricks like that. It can have new changes added to it without rewriting the whole thing; but if it ever goes wrong, it can be painlessly, safely, atomically rebuilt from the changes.log.
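A minimal sketch of that arithmetic in Python, assuming change ids start at 1 so that position 0 is not a change record. The function names and the last_id argument are made up for illustration; they aren't part of any existing code.

```python
def seek_to_change(f, change_id, record_size):
    """Position f at the start of the record for a 1-indexed change id."""
    if change_id < 1:
        raise ValueError("change ids start at 1")
    f.seek(change_id * record_size)


def change_id_here(f, record_size):
    """Change id of the record that f is currently positioned at."""
    return f.tell() // record_size


def recent_records(f, last_id, record_size, count):
    """Yield (change_id, raw_bytes) for the newest `count` records, newest first.

    last_id is the id of the newest change; the caller is assumed to know it
    (for example from the file size).
    """
    for change_id in range(last_id, max(last_id - count, 0), -1):
        seek_to_change(f, change_id, record_size)
        yield change_id, f.read(record_size)
```

Because every record has the same size, walking backwards is just a countdown over change ids; nothing has to be scanned or parsed to find a record boundary.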
I deliberately chose 1-indexing for change ids in changeview so that the summary file can have record_size bytes of header information without running into off-by-one errors. There should also be a small amount of footer data so that the summary's fidelity to the original changes.log can be verified. (Basically, the last known modification date and size of changes.log.) The footer is overwritten when new changes are added. (Hence, ‘almost–append-only.’)
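Here is a rough sketch of how the footer could be checked and overwritten. It assumes the footer is a single big-endian uint64 holding the last known size of changes.log (the modification date is left out to keep it short), and both function names are hypothetical.

```python
import os
import struct

# Assumed footer layout: one big-endian uint64, the last known size of changes.log.
FOOTER = struct.Struct(">Q")


def footer_matches_log(summary_path, log_path):
    """True if the size stored in the summary footer equals the log's current size."""
    with open(summary_path, "rb") as f:
        f.seek(-FOOTER.size, os.SEEK_END)
        (last_size,) = FOOTER.unpack(f.read(FOOTER.size))
    return last_size == os.path.getsize(log_path)


def append_raw_record(summary_path, raw_record, new_log_size):
    """Write one new record over the old footer, then write a fresh footer after it."""
    with open(summary_path, "r+b") as f:
        f.seek(-FOOTER.size, os.SEEK_END)   # start of the old footer
        f.write(raw_record)                 # replaces the footer and extends the file
        f.write(FOOTER.pack(new_log_size))  # new footer at the new end of file
```

The only bytes ever rewritten are the old footer, which is what makes the file "almost" append-only. If footer_matches_log returns False and no new changes are expected, that is the signal to rebuild the summary from changes.log.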
Strictly speaking, the only metadata we need for the summary file to do its job is the 'offset' of each change. (That is, the count of bytes from the start of changes.log to the first '<' of the change entry header.) But then if we want to know anything about the change entry itself, we have to actually open changes.log and read the entry. We can make lookups much faster by storing more information about each change in the summary.
Actually, the best trade-off between speed and size is probably to include "parsing hints" giving the offset and length of the author and path parts. Since the author name always starts at entry offset + 4 bytes (counting from the start of the file), and the path offset can be calculated just as easily once you know the length of the author's name, it's probably best to store just the length of each author name and path. This keeps each record much shorter than including the actual header info, while hopefully retaining roughly the same speed.
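To make the hint arithmetic concrete, here is a small sketch. The 4-byte author offset comes from the text above; the gap parameter (the number of bytes between the end of the author name and the start of the path) is a placeholder, since the exact entry header syntax isn't spelled out here.

```python
AUTHOR_START = 4  # from the text: the author name begins 4 bytes into the entry


def author_span(offset, authorlen):
    """(start, end) byte positions of the author name within changes.log."""
    start = offset + AUTHOR_START
    return start, start + authorlen


def path_span(offset, authorlen, pathlen, gap):
    """(start, end) byte positions of the path within changes.log.

    gap is hypothetical: however many separator bytes the entry header puts
    between the author name and the path.
    """
    start = offset + AUTHOR_START + authorlen + gap
    return start, start + pathlen
```

With the spans in hand, reading the author or path is a single seek and a read of a known length, with no scanning for delimiters.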
So, the format I'm proposing looks something like this (all big-endian, 'cos it makes sense):
    offset       := uint64           # 8 bytes
    authorlen    := uint16           # 2 bytes
    pathlen      := uint16           # 2 bytes
    timeinfo     := time_t           # 8 bytes (int64)
    record       := offset authorlen pathlen timeinfo   # 20 bytes

    header       := "NWChangeSum"    # magic number
                    formatversion    # 0x0001
                    padding          # 7 * 0x00

    footer       := lastsize
    lastsize     := uint64

    SUMMARYFILE  := header *record footer
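For what it's worth, this layout maps directly onto Python's struct module. The sketch below assumes formatversion is a big-endian uint16 (it is written as 0x0001 above), which makes the header 11 + 2 + 7 = 20 bytes, the same as one record. The helper names are invented for the example.

```python
import struct

MAGIC = b"NWChangeSum"   # 11 bytes
FORMAT_VERSION = 0x0001

# ">" means big-endian with standard sizes and no alignment padding.
RECORD = struct.Struct(">QHHq")    # offset, authorlen, pathlen, timeinfo
HEADER = struct.Struct(">11sH7x")  # magic, formatversion (assumed uint16), 7 zero bytes
FOOTER = struct.Struct(">Q")       # lastsize


def build_summary(records, lastsize):
    """Serialise (offset, authorlen, pathlen, timeinfo) tuples into summary bytes."""
    parts = [HEADER.pack(MAGIC, FORMAT_VERSION)]
    parts += [RECORD.pack(*r) for r in records]
    parts.append(FOOTER.pack(lastsize))
    return b"".join(parts)


def read_record(data, change_id):
    """Unpack the record for a 1-indexed change id; the header occupies slot 0."""
    if change_id < 1:
        raise ValueError("change ids start at 1")
    return RECORD.unpack_from(data, change_id * RECORD.size)


def check_header(data):
    """Raise ValueError unless data starts with the expected magic and version."""
    magic, version = HEADER.unpack_from(data, 0)
    if magic != MAGIC or version != FORMAT_VERSION:
        raise ValueError("not a recognised changes summary file")
```

Since HEADER.size equals RECORD.size, read_record needs no special-case offset for the header: change 1 simply lives at byte 20.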