Support Chinese characters in prompt generation stage #1168

Open
Tu-Zhenzhao opened this issue May 20, 2024 · 1 comment
Labels
bug Something isn't working

Tu-Zhenzhao commented May 20, 2024

System Info

pandasai == 2.0.43
python == 3.11

🐛 Describe the bug

I was trying to use the Field Descriptions feature to help LLMs better understand my dataset. I wrote a data description function that builds a dictionary of information about the dataset, then passed it to pandasai through Field Descriptions like this:

data = preview_data(df)
# define a connector
connector = PandasConnector({"original_df": df}, name='My Connector', field_descriptions=data)

Part of my data looks like this:

{'时间': 'The 时间 column contains string values. The unique values are: 2023-6-14, 2022-4-22, 2022-11-5.'}

As you can see, there are some Chinese characters. In the prompt generation stage, however, the Chinese characters are left as Unicode escape sequences instead of being rendered, so the prompt looks like this:

dfs[0]:
  name: My Connector
  description: null
  type: pd.DataFrame
  rows: 28
  columns: 18
  schema:
    fields:
    - name: "\u65F6\u95F4"
      type: object
      samples:
      - 2022-4-22
      - 2022-11-5
      - 2023-6-14

This makes the LLM much more confused, because it sees "\u65F6\u95F4" instead of the actual column name.

Is there any way to solve this problem? Any suggestions would be appreciated!

dosubot bot commented May 20, 2024

To address the issue with Chinese characters not being properly decoded during the prompt generation stage in pandasai, it's crucial to understand the encoding of your original data source and ensure it's UTF-8 or compatible. Additionally, knowing the Python version you're using and whether this issue occurs with other non-ASCII characters can help pinpoint the problem. A minimal, reproducible example of the code causing this issue would also be invaluable for finding a solution. This information will enable a more targeted approach to solving the decoding issue with Chinese characters.
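
For reference, the `"\u65F6\u95F4"` in the prompt above is the escaped form of the original column name, so the data itself is probably not corrupted. Output of this shape is what a YAML serializer produces when it is set to escape non-ASCII text (PyYAML's `yaml.dump`, for example, does this unless `allow_unicode=True` is passed). Whether pandasai 2.0.43 builds this part of the prompt that way is an assumption and has not been confirmed in this thread. As a stopgap, here is a minimal sketch that turns the escapes back into characters in an already-generated prompt string. The helper name `unescape_unicode` is hypothetical and not part of pandasai, and how to hook it into the pipeline is not shown.

```python
import re

# Matches a literal backslash-u escape with four hex digits, e.g. \u65F6.
# Characters outside the Basic Multilingual Plane (usually written as
# \UXXXXXXXX) are not handled by this sketch.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text: str) -> str:
    """Hypothetical helper: replace literal \\uXXXX sequences with their characters."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The escaped field name exactly as it appears in the generated prompt.
escaped_line = r'- name: "\u65F6\u95F4"'
print(unescape_unicode(escaped_line))  # - name: "时间"
```

Text without escape sequences, such as the sample values `2022-4-22`, passes through unchanged, so the helper can be applied to the whole prompt string.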


@dosubot dosubot bot added the bug Something isn't working label May 20, 2024