enforce_privacy does not work? #1145

Open
gDanzel opened this issue May 4, 2024 · 3 comments

gDanzel commented May 4, 2024

System Info

OS version: win11
Python version: 3.11
The current version of pandasai being used: 2.0.36

🐛 Describe the bug

The sample data still appears in the prompt even when enforce_privacy is set to True.

The code is below:

import pandasai.pandas as pd
from pandasai import Agent
from pandasai.helpers import get_openai_callback
from pandasai.llm import OpenAI, GoogleGemini

from data.sample_dataframe import dataframe

llm = OpenAI()

agent = Agent([pd.DataFrame(dataframe)], config={"llm": llm, "enforce_privacy": True, "verbose": True})
with get_openai_callback() as cb:
    response = agent.chat("Get the top 3 GDP countries.")
    print(response)
    print(cb)

In the printed prompt, the dataframe still contains data:

2024-05-04 15:08:41 [INFO] Question: Get the top 3 GDP countries.
2024-05-04 15:08:42 [INFO] Running PandasAI with openai LLM...
2024-05-04 15:08:42 [INFO] Prompt ID: 50302077-57f3-482a-a823-64e2be596f5d
2024-05-04 15:08:42 [INFO] Executing Pipeline: GenerateChatPipeline
2024-05-04 15:08:42 [INFO] Executing Step 0: ValidatePipelineInput
2024-05-04 15:08:42 [INFO] Executing Step 1: CacheLookup
2024-05-04 15:08:42 [INFO] Executing Step 2: PromptGeneration
2024-05-04 15:08:46 [INFO] Using prompt: <dataframe>
dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22
</dataframe>




Update this initial code:
\```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: 
type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }

### QUERY
Get the top 3 GDP countries.

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:
2024-05-04 15:08:46 [INFO] Executing Step 3: CodeGenerator
2024-05-04 15:08:49 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-04 15:08:49 [INFO] Prompt used:

dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22

Update this initial code:

# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: 
type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }

QUERY

Get the top 3 GDP countries.

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:

2024-05-04 15:08:49 [INFO] Code generated:
```
# TODO: import the required dependencies
import pandas as pd

# Write code here

top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')

# Declare result var

result = {
    "type": "dataframe",
    "value": top_3_gdp_countries
}
```

2024-05-04 15:08:49 [INFO] Executing Step 4: CachePopulation
2024-05-04 15:08:49 [INFO] Executing Step 5: CodeCleaning
2024-05-04 15:08:49 [INFO]
Code running:

top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')
result = {'type': 'dataframe', 'value': top_3_gdp_countries}
        ```
2024-05-04 15:08:49 [INFO] Executing Step 6: CodeExecution
2024-05-04 15:08:49 [INFO] Executing Step 7: ResultValidation
2024-05-04 15:08:49 [INFO] Answer: {'type': 'dataframe', 'value':          country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87}
2024-05-04 15:08:49 [INFO] Executing Step 8: ResultParsing
         country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87
Tokens Used: 340
	Prompt Tokens: 270
	Completion Tokens: 70
Total Cost (USD): $ 0.000240

Process finished with exit code 0
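
A quick way to confirm the leak without reading the log by eye is to count the data rows inside each `<dataframe>` block of the logged prompt. The following is only a sketch: `count_sample_rows` is a hypothetical helper, not part of pandasai, and it assumes the CSV-style layout shown in the log above (a shape line, a header line, then sample rows).

```python
import re


def count_sample_rows(prompt):
    """Count data rows inside each <dataframe>...</dataframe> block of a prompt.

    Hypothetical helper; assumes the CSV layout seen in the log above.
    """
    counts = []
    for block in re.findall(r"<dataframe>\n(.*?)</dataframe>", prompt, flags=re.DOTALL):
        lines = [line for line in block.splitlines() if line.strip()]
        # lines[0] is the "dfs[i]:RxC" shape line, lines[1] is the CSV header.
        counts.append(max(len(lines) - 2, 0))
    return counts


# Prompt fragment copied from the log above.
prompt = """<dataframe>
dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22
</dataframe>
"""

print(count_sample_rows(prompt))
```

With enforce_privacy honored, one would expect every count to be zero; any non-zero count means sample rows reached the prompt.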

Hrishikesh-Dutta0078 commented

Facing the same issue. enforce_privacy worked up to v2.0.28.


patlac commented May 30, 2024

I think it's because convert_df_to_csv() in pandasai/helpers/dataframe_serializer.py ignores the enforce_privacy config setting entirely: it never checks for it, and it doesn't check for custom_head either.

It just adds the dataframe details unconditionally:

# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"

Until this gets properly fixed, I replaced the above code with:

# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"

In contrast, convert_df_to_json() in pandasai/helpers/dataframe_serializer.py properly checks for enforce_privacy and custom_head.
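
To illustrate the behavior one would expect from a privacy-aware CSV path, here is a minimal standalone sketch using only the standard library. It is not pandasai code: `serialize_csv` and its parameters are hypothetical, and the rule that an explicit custom head takes precedence over enforce_privacy is an assumption, not something confirmed in this thread.

```python
import csv
import io


def serialize_csv(columns, rows, enforce_privacy=False, custom_head=None):
    """Hypothetical stand-in for a privacy-aware CSV serializer.

    columns: list of column names
    rows: list of row tuples (the real data)
    custom_head: optional rows the user explicitly allows in the prompt
    """
    if custom_head is not None:
        # Assumption: an explicit custom head is safe to show, so it wins.
        sample = custom_head
    elif enforce_privacy:
        # Privacy enforced and no custom head: header only, no data rows.
        sample = []
    else:
        # Default: a few real rows as a preview.
        sample = rows[:3]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(sample)
    # The shape line still reports the real dimensions, as in the log above.
    return f"dfs[0]:{len(rows)}x{len(columns)}\n{buffer.getvalue()}"


columns = ["country", "gdp", "happiness_index"]
rows = [
    ("United States", 19294482071552, 6.94),
    ("China", 14631844184064, 5.12),
]

private = serialize_csv(columns, rows, enforce_privacy=True)
print(private)
```

The temporary fix above has the same effect as the `enforce_privacy` branch here: the column names and shape are kept, and the sample rows are dropped.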

Related: #1147


patlac commented May 31, 2024

After some more digging, it seems you can get enforce_privacy and custom_head to work by forcing the YML/JSON serialization path. You just need to specify field descriptions.

If you provide field descriptions, convert_df_to_yml() will be used...

# If field descriptions are added always use YML. Other formats don't support field descriptions yet
if self.field_descriptions or self.connector_relations:
    serializer = DataframeSerializerType.YML

...and then...

    def serialize(
        self,
        df: pd.DataFrame,
        extras: dict = None,
        type_: DataframeSerializerType = DataframeSerializerType.YML,
    ) -> str:
        if type_ == DataframeSerializerType.YML:
            return self.convert_df_to_yml(df, extras)
        elif type_ == DataframeSerializerType.JSON:
            return self.convert_df_to_json_str(df, extras)
        elif type_ == DataframeSerializerType.SQL:
            return self.convert_df_sql_connector_to_str(df, extras)
        else:
            return self.convert_df_to_csv(df, extras)

convert_df_to_yml() serializes the field descriptions in YML and internally uses convert_df_to_json() to do the rest (respecting enforce_privacy and custom_head).
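
The selection logic described above can be condensed into a small standalone sketch. `SerializerType`, `choose_serializer`, and `PRIVACY_AWARE` are hypothetical names that only mirror the snippets quoted in this comment; they are not the library's API. The sketch also assumes the configuration would otherwise select CSV, as the logged prompt suggests.

```python
from enum import Enum


class SerializerType(Enum):
    # Hypothetical mirror of the DataframeSerializerType values quoted above.
    CSV = "csv"
    YML = "yml"
    JSON = "json"
    SQL = "sql"


def choose_serializer(requested, field_descriptions=None, connector_relations=None):
    """Return the serializer type that would actually be used."""
    # Field descriptions or connector relations force YML, whatever was requested.
    if field_descriptions or connector_relations:
        return SerializerType.YML
    return requested


# Which paths respect enforce_privacy / custom_head, according to this thread.
# SQL is left out because the thread does not say.
PRIVACY_AWARE = {
    SerializerType.CSV: False,
    SerializerType.JSON: True,
    SerializerType.YML: True,  # delegates to the JSON path
}

without_descriptions = choose_serializer(SerializerType.CSV)
with_descriptions = choose_serializer(
    SerializerType.CSV,
    field_descriptions={"gdp": "Gross domestic product"},
)

print(without_descriptions, PRIVACY_AWARE[without_descriptions])
print(with_descriptions, PRIVACY_AWARE[with_descriptions])
```

This is why adding field descriptions works as a workaround: it moves serialization off the CSV path and onto one that checks the privacy settings.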
