Optimise bitcoin export so it doesn't query all partitions every day #6

Open
medvedev1088 opened this issue Sep 22, 2020 · 1 comment

Comments


medvedev1088 commented Sep 22, 2020

Right now load_dag scans data in all partitions every day. In particular, the enrich transactions SQL (https://github.com/blockchain-etl/bitcoin-etl-airflow/blob/master/dags/resources/stages/enrich/sqls/transactions.sql) needs to join inputs and outputs, which requires scanning all past data.

An alternative is to enrich transactions in export_dag using the enrich_transactions command: https://github.com/blockchain-etl/bitcoin-etl#enrich_transactions.

This will reduce the BigQuery costs significantly.
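To illustrate what enriching at export time means, here is a minimal sketch, not the actual bitcoin-etl implementation. It assumes each input carries `spent_transaction_hash` and `spent_output_index` fields and fills in `value` and `addresses` from the output it spends. The function `enrich_transaction` and the in-memory `outputs_by_key` index are hypothetical; a real export would have to look the spent outputs up from the node or another source.

```python
from typing import Dict, Tuple

# Hypothetical index of known outputs, keyed by (transaction hash, output index).
OutputKey = Tuple[str, int]


def enrich_transaction(tx: dict, outputs_by_key: Dict[OutputKey, dict]) -> dict:
    """Return a copy of tx whose inputs carry the value and addresses
    of the outputs they spend. Field names are assumptions."""
    enriched_inputs = []
    for tx_input in tx.get("inputs", []):
        key = (tx_input["spent_transaction_hash"], tx_input["spent_output_index"])
        spent = outputs_by_key.get(key)
        enriched = dict(tx_input)  # copy so the original input is not mutated
        if spent is not None:
            enriched["value"] = spent["value"]
            enriched["addresses"] = spent["addresses"]
        enriched_inputs.append(enriched)

    enriched_tx = dict(tx)
    enriched_tx["inputs"] = enriched_inputs
    # Inputs whose spent output is unknown contribute 0 to the total.
    enriched_tx["input_value"] = sum(i.get("value") or 0 for i in enriched_inputs)
    return enriched_tx
```

Because each transaction is enriched once when it is exported, the load step would no longer need to join inputs against every historical output.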


This might also require changing the timestamp field type from int to ISO 8601 in the export jobs (a breaking compatibility change, so the version will be bumped to 2.*), so that the raw tables can be partitioned by this field. Currently the raw tables are not partitioned, which makes the enrich job scan the whole table.
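As a rough sketch of the timestamp change, assuming the exported int is a Unix timestamp in seconds, the conversion to an ISO 8601 string could look like this. The helper name `unix_to_iso8601` is hypothetical and not part of bitcoin-etl.

```python
from datetime import datetime, timezone


def unix_to_iso8601(ts: int) -> str:
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


# Timestamp of the Bitcoin genesis block.
print(unix_to_iso8601(1231006505))  # 2009-01-03T18:15:05+00:00
```

Whether the load job accepts this exact string form for a partitioning column depends on the table schema and load format, so that would need checking against the BigQuery documentation.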


sfsf9797 commented Apr 27, 2022

Hi @medvedev1088, just checking whether this issue is still open, since it has been a while since it was created.

Anyway, I think I might be able to contribute to this by:

  1. Add the enrich_transactions command to build_export_dag.py.
  2. Remove the code that does the enrichment in build_load_dag.py and make any other necessary changes.

Finally, I would validate the results.
What do you think? Please let me know if you have any concerns or suggestions. Thanks!
