Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amend RFC 1000: Revision ids should be hashes of the revision contents #2

Open
tailhook opened this issue Feb 3, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@tailhook
Copy link
Contributor

tailhook commented Feb 3, 2020

Motivation

Each revision is a file in dbschema/migrations/* per spec. Files are stored in revision control, so technically can be edited after being registered in the database server. Even if we explicitly document that editing files is supported, users can still fail on fixing merge conflicts. The erroneous migration is hard to fix if we don't have hashes because it's unclear which contents is the source of truth if two branches were merged (i.e. id is opaque, and any revision could be the right one depending on which branch was applied first, and order might be different on different staging servers or even canary deployments).

Assumptions

It should be possible to hash revision contents based solely on tokenization of the file without understanding the semantics of the data.

Alternatively, it might be possible to implement full AST or simplified AST (parse tree) of the revision file and hash that, if some forms of statements are ambiguous.

@tailhook tailhook added the enhancement New feature or request label Feb 3, 2020
@elprans
Copy link
Member

elprans commented Jun 15, 2020

The current spec explicitly avoids talking about the schema contents. The only thing that is currently enforced is the order of migrations. The main reason is that we want the migration scripts to be editable.

That said, your comment made me realize that the RFC does not explicitly talk about the "latest migration" record, or the migration history "HEAD" if you will. This could be a simple file with the id of the topmost migration.

@tailhook
Copy link
Contributor Author

The current spec explicitly avoids talking about the schema contents. The only thing that is currently enforced is the order of migrations. The main reason is that we want the migration scripts to be editable.

Sure. That's why it's important to have hashes. it's fine to edit migrations until you applied them to all persistent servers (i.e. perhaps production, but maybe also staging that you can't restore from earlier backup if you need to). And editing later means revisions are out of sync in different installations of the database (i.e. we will never reapply revision with id already applied, even if that is edited). And this can lead to the revision mess that goes unnoticed quite some time and which is hard to fix later on.

It's not clear from the initial proposal, but my current idea is that hash of the migration is computed by migration tool, at the time of execution. So text is editable, but if you edited migration that already applied, it conflicts. You have two options then: revert that file or restore the database from the backup.

I know that we are used to editing that in alembic (an presumably all other existing tools), but alembic does not store history in the database that can go out of sync. So if you need to fix something in alembic you can run DDLs manually and change the revision. In our case, as long as I understand, migrations in the edgedb and in files will be out of sync forever at this point, with no way to find out what was actually applied at any given point.

That said, all of it is very important for schema migration, an might be much less important for the data migrations. Applying different data migrations on staging and production might even be useful, but not at the cost of editing migration history.

That said, your comment made me realize that the RFC does not explicitly talk about the "latest migration" record, or the migration history "HEAD" if you will. This could be a simple file with the id of the topmost migration.

Not sure it's important. I think alembic treats as "HEAD" anything that that has no child revisions.

@elprans
Copy link
Member

elprans commented Jun 17, 2020

OK, we can make a note of the migration content hash when the migration is committed. If such integrity is the only concern, then this can be a field in the migration object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants