pyspark

Here are 3,387 public repositories matching this topic...

aehabV / Indeed-fake-job-posting-prediction

A machine learning model is built using PySpark's MLlib library to automatically flag suspicious job postings on Indeed.com. The dataset includes 18,000 job descriptions, out of which about 800 are fake.

nlp natural-language-processing pyspark indeed pyspark-mllib fake-jobposts-prediction job-postings

Updated May 18, 2023
Jupyter Notebook

basel-ay / Hands-on-Apache-Spark

Dummy code snippets that read and manipulate data and build a simple ML model with PySpark.

apache-spark linear-regression pyspark

Updated Jul 18, 2023
Jupyter Notebook

zuliani99 / All-Pairs-Docs-Similarity

Given a set of documents and a minimum required similarity threshold, find the number of document pairs whose similarity exceeds the threshold.

sklearn pyspark tf-idf cosine-similarity document-similarity beir

Updated May 26, 2023
Jupyter Notebook

phricardorj / pyspark-study

🐍 | My PYSPARK studies. PySpark is an interface for Apache Spark in Python.

pyspark phyton

Updated Nov 11, 2022
Jupyter Notebook

JonathanPollyn / Spark

This notebook contains detailed code for Spark, machine learning, and Databricks.

python spark pyspark spark-sql pyspark-python

Updated Mar 15, 2023
Jupyter Notebook

data-miner00 / spark

A laboratory to carry out experiments with PySpark

python pyspark databricks

Updated Nov 5, 2023
Jupyter Notebook

khaledshabasy / Data-Modeling-Spark-udacity-capstone

An ETL pipeline for I94 immigration, global land temperatures and US demographics datasets is created to form an analytics database on immigration events. A data model is established with pandas and pyspark to find patterns of immigration to the United States.

aws s3 pandas pyspark sas7bdat

Updated Apr 6, 2023
Jupyter Notebook

furkancets / PrescreiberPipelineSpark

Exploring a best-case Apache Spark working environment for robust data pipelines.

spark apache-spark hadoop pyspark

Updated Apr 1, 2023
Python

simonediluna / Distributed-Data-Analysis-and-Mining

An academic project carried out for the Distributed Data Analysis and Mining course (academic year 2022/2023).

distributed-systems data-science pyspark

Updated May 18, 2023
Jupyter Notebook

milesgranger / pontem

Treat Spark like pandas.

pandas pyspark dataframes dataframe-api spark-dataframes distributed-dataframe

Updated Sep 3, 2017
Python

SreekarJammula / tf-idf-

The assignment is to write Python scripts for Apache Spark. The tasks are divided into three parts: WordCount: count the occurrences of words on a per-book basis and compare the results with those of Assignment1. pyspark.ml.feature: compute the tf-idf values for unigrams and bigrams using the pyspark.ml.feat…

apache-spark pyspark tf-idf spark-ml