Fetch a property for a list of QIDs in CSV format #148

Open
tuukka opened this issue May 6, 2021 · 6 comments

tuukka commented May 6, 2021

Is there a good way to fetch one or multiple properties for a (potentially long) list of QIDs in CSV format?

Here's example code for what I have so far using wd convert, but would it make sense for it to support --format csv and fetching more than one property at a time?

$ echo Q3572332 Q98407233 Q10428420 | wd convert --subjects --property P6375 | jq -r '
  to_entries
    | .[]
    | .key as $qcode
    | .value[] as $address
    | [$qcode,$address]
    | @csv'
"Q3572332","Eläintarhantie 1"
"Q3572332","Siltasaarenkatu 18"
"Q98407233","Agricolankatu 1-3"
"Q10428420","Fleminginkatu 1"
"Q10428420","Porthaninkatu 12"
"Q10428420","Viides linja 11"

Or the same using wd data (but does it fetch all item data, and would it be more difficult to implement --format csv?):

$ echo Q3572332 Q98407233 Q10428420 | wd data --simplify --props claims.P6375 | jq -r '
  .id as $qcode 
    | .claims.P6375[] as $address
    | [$qcode,$address]
    | @csv'
"Q3572332","Eläintarhantie 1"
"Q3572332","Siltasaarenkatu 18"
"Q98407233","Agricolankatu 1-3"
"Q10428420","Viides linja 11"
"Q10428420","Fleminginkatu 1"
"Q10428420","Porthaninkatu 12"
maxlath added a commit that referenced this issue May 6, 2021
Owner

maxlath commented May 6, 2021

> Is there a good way to fetch one or multiple properties for a (potentially long) list of QIDs

For one property, wd convert seems to do the job, but it currently does not work for multiple properties. You could write a SPARQL query extending what wd convert does, but you would need to handle the split into batches yourself (wd convert uses batches of 1000 at once).
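
The batching mentioned above can be sketched as follows. This is a minimal illustration, not what wd convert actually generates: the helper names `batches` and `build_query` are hypothetical, the query shape (a `VALUES` clause plus one `OPTIONAL` pattern per property) is an assumption, and only the batch size of 1000 comes from the comment above.

```python
from typing import Iterator, List


def batches(ids: List[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_query(batch: List[str], properties: List[str]) -> str:
    """Build one SPARQL query for one batch (assumed shape, for illustration)."""
    values = " ".join(f"wd:{qid}" for qid in batch)
    variables = " ".join(f"?{prop}" for prop in properties)
    patterns = " ".join(
        f"OPTIONAL {{ ?item wdt:{prop} ?{prop} }}" for prop in properties
    )
    return f"SELECT ?item {variables} WHERE {{ VALUES ?item {{ {values} }} {patterns} }}"


qids = [f"Q{n}" for n in range(1, 2502)]  # 2501 made-up ids
sizes = [len(batch) for batch in batches(qids)]
print(sizes)  # [1000, 1000, 501]
print(build_query(["Q3572332", "Q98407233"], ["P6375", "P4595"]))
```

Each batch would then be sent as its own request and the results concatenated. Note that a query of this shape returns one row per combination of values, so the row count grows with the number of multi-valued properties.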

> in CSV format

Converting JSON with deeply nested objects to CSV can get tricky, but it could work for some basic cases.
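
To make the difficulty concrete, here is a minimal sketch of one way to flatten nested JSON into dotted column names such as `claims.P6375`. The `flatten` helper is hypothetical and is not taken from wikibase-cli; it deliberately leaves lists untouched, because how to render several values in one cell is exactly the open question.

```python
from typing import Any, Dict


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            # Lists and scalars are kept as-is; rendering lists is a separate choice
            flat[path] = value
    return flat


entity = {"id": "Q98407233", "claims": {"P6375": ["Agricolankatu 1-3"]}}
print(flatten(entity))
# {'id': 'Q98407233', 'claims.P6375': ['Agricolankatu 1-3']}
```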

> but does it fetch all item data

No, but almost: when you specify --props claims.P6375, the smallest amount of data we can request from the API is basic info plus all the claims, by setting props=claims.
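
As a rough illustration of that limitation, assuming the underlying call is the MediaWiki `wbgetentities` action (an assumption; the thread does not show the request), the request can only narrow the response down to `props=claims`, and the single wanted property has to be picked out afterwards. The helper names are hypothetical and no network call is made.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def wbgetentities_url(ids: List[str]) -> str:
    """Build a wbgetentities URL asking for claims only."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(ids),
        "props": "claims",  # cannot be narrowed down to a single property here
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def keep_properties(entity: Dict[str, Any], wanted: List[str]) -> Dict[str, Any]:
    """Drop every claim except the wanted properties, once the response has arrived."""
    claims = entity.get("claims", {})
    return {"id": entity["id"], "claims": {p: claims[p] for p in wanted if p in claims}}


print(wbgetentities_url(["Q3572332", "Q98407233"]))
```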

> would it be more difficult to implement --format csv?

I gave it a try in this branch. The proposed syntax would be:

echo Q3572332 Q98407233 Q10428420 | wd data --props claims.P6375 --format csv

and the output:

id,claims.P6375
Q3572332,"Eläintarhantie 1,Siltasaarenkatu 18"
Q98407233,Agricolankatu 1-3
Q10428420,"Viides linja 11,Fleminginkatu 1,Porthaninkatu 12"

Note that P6375 values are grouped per entity: we could generate several rows per entity as in your version, but I'm not sure how we could make it work for cases where there are several properties (generating all combinations seems unnecessarily verbose). Would that work for your use case?

Author

tuukka commented May 6, 2021

Thank you for the quick implementation!

I was thinking this would be useful in lots of use cases, but my current use case is trying to find matches between certain Wikidata items and another big dataset (OpenStreetMap) based on street addresses. In this case, I need separate rows for each address to see if any of them match, and if I matched on multiple properties, it would be preferable to get all the combinations to see if any of them match a combination present in the other dataset. Could it make sense to do that by default and have an option like --join-values , to get your current output?
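
The "all combinations" behaviour requested above can be sketched as a Cartesian product over the value lists of each property. This is only an illustration under stated assumptions: `combination_rows` is a hypothetical helper, the input is assumed to be simplified claims (property ID mapped to a list of plain values), and the sample values are the ones quoted in this thread.

```python
from itertools import product
from typing import Dict, List


def combination_rows(
    entity_id: str, claims: Dict[str, List[str]], props: List[str]
) -> List[List[str]]:
    """Return one row per combination of values across the requested properties."""
    # A property without values contributes one empty cell rather than dropping the entity
    value_lists = [claims.get(prop) or [""] for prop in props]
    return [[entity_id, *combo] for combo in product(*value_lists)]


claims = {
    "P6375": ["Eläintarhantie 1", "Siltasaarenkatu 18"],
    "P4595": ["Helsinki"],
}
for row in combination_rows("Q3572332", claims, ["P6375", "P4595"]):
    print(row)
```

The number of rows per entity is the product of the value counts, so two properties with 3 and 4 values each already yield 12 rows; that is the verbosity concern raised earlier in the thread.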

Multiple values are the difficult part also in the sense that, before today, I had no idea how to do the above in jq. I can manage now, but I would not want to suggest that anyone learn this. 😅 (This made it click in the manual: "Thus as functions as something of a foreach loop.")

Owner

maxlath commented May 6, 2021

I'm very grateful that you posted those jq commands. I use jq a lot but had never encountered that "as" syntax before; it's quite powerful ^^

Author

tuukka commented May 6, 2021

I should add that, for now, the solution could simply be to include examples like these in wikibase-cli's documentation, so that people can use them as templates for what they need.
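
One such template, sketched in Python for readers who prefer it to jq. It assumes the JSON printed by wd convert --subjects is an object mapping each QID to a list of values, which is the shape the first jq filter in this thread relies on. `convert_to_csv` is a hypothetical helper name, and the sample input is hard-coded rather than fetched.

```python
import csv
import io
import json


def convert_to_csv(raw_json: str) -> str:
    """Turn {"Q1": ["a", "b"], ...} into one CSV row per (QID, value) pair."""
    data = json.loads(raw_json)
    buffer = io.StringIO()
    # QUOTE_ALL mirrors jq's @csv, which quotes every string field
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for qid, values in data.items():
        for value in values:
            writer.writerow([qid, value])
    return buffer.getvalue()


# Sample input in the assumed shape (not fetched from Wikidata here)
sample = '{"Q98407233": ["Agricolankatu 1-3"], "Q10428420": ["Fleminginkatu 1", "Porthaninkatu 12"]}'
print(convert_to_csv(sample), end="")
```

In a pipeline, the same function could read `sys.stdin.read()` instead of the hard-coded sample.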

maxlath added a commit that referenced this issue May 7, 2021
Owner

maxlath commented May 7, 2021

I pushed more commits on that branch: now echo Q3572332 Q98407233 Q10428420 | wd data --props claims.P6375 --format csv outputs

id,claims.P6375
Q3572332,Eläintarhantie 1
Q3572332,Siltasaarenkatu 18
Q98407233,Agricolankatu 1-3
Q10428420,Viides linja 11
Q10428420,Fleminginkatu 1
Q10428420,Porthaninkatu 12

but the previous behaviour can, as suggested, be recovered with --join. Ready to merge, or do you see any missing feature?

Author

tuukka commented May 7, 2021

I tested the current version briefly. I would have liked to be able to pass a custom separator as an argument to --join instead of the default comma, since addresses, for example, often contain commas.
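
A small sketch of why the separator matters. The first address below is made up to contain a comma, `csv_line` is a hypothetical helper, and `|` is just one possible custom separator. CSV quoting keeps the cell itself intact either way; the ambiguity only appears when the joined cell is split back into individual values.

```python
import csv
import io
from typing import List


def csv_line(cells: List[str]) -> str:
    """Serialize one row, letting the csv module quote cells that need it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(cells)
    return buffer.getvalue().rstrip("\n")


# Made-up values: the first address contains a comma
addresses = ["Viides linja 11, 00530 Helsinki", "Fleminginkatu 1"]

comma_joined = ",".join(addresses)
pipe_joined = "|".join(addresses)

print(comma_joined.split(","))  # three pieces: the original values cannot be recovered
print(pipe_joined.split("|"))   # two pieces, identical to the input list
print(csv_line(["Q10428420", pipe_joined]))
```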

Also, I expected that adding a second property would just add a column to the non-joined results, but instead it switched on the joined mode. I understand this avoids combinatorial explosions, but is it more important than consistency? echo Q3572332 Q98407233 Q10428420 | PATH=bin:$PATH wd data --props claims.P6375,claims.P4595 --format csv:

id,claims.P6375,claims.P4595
Q3572332,"Eläintarhantie 1,Siltasaarenkatu 18",Helsinki
Q98407233,Agricolankatu 1-3,Helsinki
Q10428420,"Viides linja 11,Fleminginkatu 1,Porthaninkatu 12",Helsinki

(By the way, I also noticed that the argument to format is not validated as I sometimes typed "CSV" instead of "csv".)
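
The validation meant here could look like the sketch below. It is written in Python purely as an illustration and is not wikibase-cli's implementation: the list of supported formats is hypothetical, and accepting "CSV" case-insensitively rather than rejecting it is one possible design choice.

```python
SUPPORTED_FORMATS = ("json", "csv")  # hypothetical list, for illustration only


def normalize_format(value: str) -> str:
    """Lower-case the requested format and reject anything unknown."""
    normalized = value.strip().lower()
    if normalized not in SUPPORTED_FORMATS:
        expected = ", ".join(SUPPORTED_FORMATS)
        raise ValueError(f"unknown format {value!r}; expected one of: {expected}")
    return normalized


print(normalize_format("CSV"))  # csv
try:
    normalize_format("xml")
except ValueError as error:
    print(error)
```

Failing loudly on an unknown value avoids silently falling back to the default output, which is what makes a typo like "CSV" hard to notice.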
