
Remove use of deprecated record_batches_to_json_rows function from arrow-json #24981

Open
hiltontj opened this issue May 9, 2024 · 6 comments

@hiltontj
Contributor

hiltontj commented May 9, 2024

We are using a function from the arrow-json crate that is deprecated: record_batches_to_json_rows

It is used in the following places:

  1. &arrow_json::writer::record_batches_to_json_rows(batches.as_slice())?,
  2. let json_rows = record_batches_to_json_rows(&[&batch])

The deprecation warning indicates that we should be using Writer instead. Doing so in the latter case should be easier, since there we are serializing directly to the bytes output in the API response. The former may be a bit trickier, as we modify the serialized JSON values before serializing them to bytes in the response.

@hiltontj hiltontj added the v3 label May 9, 2024
@JeanArhancet
Contributor

I am interested in taking on this task. Is it possible to assign this ticket to me?

@hiltontj
Contributor Author

hiltontj commented Jun 6, 2024

Hi @JeanArhancet - that would be awesome, thank you! This issue is not in our immediate backlog, so if you would like to tackle it by opening a PR, you are welcome to do that.

My recommendation is to start by solving (1.) from the issue description, and to open a PR for only that, since that call uses arrow-json's writer for its intended purpose: serializing and writing JSON. Specifically, I would look at arrow-json's ArrayWriter, which serializes to a JSON array, as opposed to JSON Lines. Separately, we do intend to support JSON Lines (see #24654).

As for (2.), the use of record_batches_to_json_rows is a bit of a hack there, so the solution is not to use the arrow-json writer, but will be something different.

Please keep in mind that you will need to read and sign our CLA in order for contributions to be accepted. You can find that here: https://www.influxdata.com/legal/cla/

@JeanArhancet
Contributor

Thanks for your feedback, @hiltontj.

I've opened a PR for the first part: #25046

@hiltontj
Contributor Author

hiltontj commented Jun 7, 2024

Awesome, thank you @JeanArhancet! I had some comments on that PR.

Re: (2.), I will clarify a bit, in case you want to tackle that in a separate PR. The record_batches_to_json_rows function is being used in the v1 query API handler here:

// See https://github.com/influxdata/influxdb/issues/24981
#[allow(deprecated)]
let json_rows = record_batches_to_json_rows(&[&batch])
    .context("failed to convert RecordBatch to JSON rows")?;
for json_row in json_rows {
    let mut row = vec![Value::Null; self.column_map.len()];
    for (k, v) in json_row {
        if k == INFLUXQL_MEASUREMENT_COLUMN_NAME
            && (self.buffer.current_measurement_name().is_none()
                || self
                    .buffer
                    .current_measurement_name()
                    .is_some_and(|n| *n != v))
        {
            // we are on the "iox::measurement" column, which gives the name of the time series
            // if we are on the first row, or if the measurement changes, we push into the
            // buffer queue
            self.buffer
                .push_next_measurement(v.as_str().with_context(|| {
                    format!("{INFLUXQL_MEASUREMENT_COLUMN_NAME} value was not a string")
                })?);
        } else if k == INFLUXQL_MEASUREMENT_COLUMN_NAME {
            // we are still working on the current measurement in the buffer, so ignore
            continue;
        } else {
            // this is a column value that is part of the time series, add it to the row
            let j = self.column_map.get(&k).unwrap();
            row[*j] = if let (Some(precision), TIME_COLUMN_NAME) = (self.epoch, k.as_str()) {
                // specially handle the time column if `epoch` parameter provided
                convert_ns_epoch(v, precision)?
            } else {
                v
            };
        }
    }
    self.buffer.push_row(Row(row))?;
}

It converts the arrow RecordBatches to serde_json::Values so that it can manipulate them and then convert them into the locally defined Row type.

The objective here would be to remove the use of record_batches_to_json_rows and serde_json::Value, and operate on the RecordBatches directly.

This would likely involve a bit more refactoring effort than solving (1.) did, but if you're up for giving it a shot, then go for it!

@mgattozzi
Contributor

Closing since #25046 is merged

@mgattozzi mgattozzi reopened this Jun 11, 2024
@mgattozzi
Contributor

I realized there were more parts still to be done, so I've reopened this. My bad!
