[QST] aggregate function that operates on vector(array of numeric) data #15741
Comments
This kind of operation is not natively supported, unfortunately. The fundamental issue is that pandas allows you to put arbitrary objects into a Series/DataFrame and will run Python operations on them. In this case, since you put numpy arrays in, pandas will happily leave them as numpy arrays and apply numpy's binary operations to them, so this works as expected. cudf does not support arbitrary objects in this way, so we have to be a bit more clever about rearranging the data ourselves to handle this kind of operation. Per-row array data is supported through the list dtype, which is what you're getting in the
This outputs:
@vyasr Thanks for your suggestion. So the approach you gave above is equivalent to splitting each array into separate columns, applying sum()/mean() to each column, and merging the output back into an array?
Yes, that is basically equivalent. You cannot operate on the numpy arrays directly, but assuming they are all of the same length you could split them into multiple columns if you have control of that on construction. Otherwise the list-based approach I showed is the way you could process it if you have to take the numpy array-based inputs from pandas as-is.
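To make the split-into-columns idea concrete, here is a minimal sketch. It is written against pandas and numpy so it runs without a GPU; the column names `key` and `emb` and the sample values are made up for illustration, and whether each call behaves the same under cudf is an assumption you would need to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one fixed-length embedding per row, plus a group key.
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "emb": [
        np.array([1.0, 2.0, 3.0]),
        np.array([3.0, 4.0, 5.0]),
        np.array([10.0, 20.0, 30.0]),
    ],
})

# Split the arrays into one scalar column per vector position.
# This only works because every array has the same length.
wide = pd.DataFrame(np.stack(df["emb"].tolist()), index=df.index)
wide.insert(0, "key", df["key"])

# Aggregate every position independently with ordinary column aggregations.
means = wide.groupby("key").mean()
maxes = wide.groupby("key").max()

# Merge the per-position results back into one array per group.
result_mean = {key: row.to_numpy() for key, row in means.iterrows()}
result_max = {key: row.to_numpy() for key, row in maxes.iterrows()}

print(result_mean["a"])  # element-wise mean of the two "a" embeddings
```

If you control how the data is constructed, building the wide layout directly avoids the stacking step entirely.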
@Rhett-Ying does the above solution address your needs?
For more advanced operations, yes, it will depend. If operators already exist in other libraries like cuVS, those will almost certainly be faster than any apply-based solution you come up with in just cudf. In general, if you are trying to do vectorized operations on homogeneous vectors (i.e. something that would fit in a rectangular matrix or a higher-order tensor, without needing a ragged list), you will likely have better luck implementing those types of operations performantly in cupy. That's also true on the host: you would probably get better performance using numpy operations than pandas operations for something like a manual kNN implementation, since with numpy you can drop down directly to its vectorized operations (implemented in C), whereas with pandas you introduce some extra layers of Python.
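As a rough illustration of the matrix-based route, here is a brute-force kNN sketch in numpy. The function name `knn_indices` and the sample data are hypothetical. The assumption is that the same calls (broadcasting, `sum`, `argsort`) are available in cupy, so swapping the import would move the work to the GPU; this is not a substitute for a dedicated library such as cuVS.

```python
import numpy as np  # assumption: `import cupy as np` would run the same calls on a GPU


def knn_indices(data, queries, k):
    """Return the indices of the k nearest rows of `data` for each query row."""
    # Pairwise differences via broadcasting: shape (n_queries, n_data, dim).
    # This materializes the full difference tensor, so it only suits small inputs.
    diff = queries[:, None, :] - data[None, :, :]
    # Squared Euclidean distance; the square root is not needed for ranking.
    dist2 = (diff ** 2).sum(axis=-1)
    # Sort each row of distances and keep the k closest indices.
    return np.argsort(dist2, axis=1)[:, :k]


# Hypothetical embeddings stored as one dense matrix, one row per item.
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
queries = np.array([[0.9, 0.1]])

neighbors = knn_indices(embeddings, queries, k=2)

# Element-wise aggregates over the same matrix are single vectorized calls.
column_mean = embeddings.mean(axis=0)
column_max = embeddings.max(axis=0)
```

Keeping the embeddings as a single 2-D array is what makes both the aggregates and the distance computation a handful of vectorized calls rather than per-row Python work.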
What is your question?
I am wondering if cudf has native or built-in support for aggregate functions that run against vector data. Namely, text/image embeddings are stored in a column of a csv/parquet file, and I'd like to run various aggregate functions such as mean, max and so on. All these operations are element-wise: for example, mean returns the mean of all the values at the same index, producing an array of the same length. What's more, I'd like to run K-Nearest-Neighbor search as well. If these are not natively supported, how can I achieve them efficiently?
example code: