Siuba: fast, flexible data science with python
This project has two major goals:
- help people learn data science with python.
- build siuba into a fast and friendly tool for data analysis.
It can be hard to get started analyzing data, so a major piece of the work will be finding a good format for teaching beginners (e.g. weekly analysis sessions or workshops).
Background
siuba is an open source python library designed for quick, interactive data analysis. It's a port of the tidyverse from R to python, and supports a tabular data analysis workflow centered on 5 common actions:
- select() - keep certain columns of data.
- filter() - keep certain rows of data.
- mutate() - create or modify an existing column of data.
- summarize() - reduce one or more columns down to a single number.
- arrange() - reorder the rows of data.
These actions can be preceded by group_by(), which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as distinct(), count(), and joins, are implemented. siuba can operate on both pandas DataFrames and SQL databases (e.g. postgres, duckdb, snowflake, bigquery).
For more information, see the siuba guide, or this 2020 RStudioConf talk.
Roles Needed
Anyone interested in data analysis, hop aboard!
- User Researchers - data analysis tools are often built by engineers with little user feedback. Help surface what people need to analyze data quickly.
- Data Scientists - we need people willing to try siuba on practice data (like tidytuesday), and senior folks willing to help out.
- Analytics Engineers - we need folks who use dbt, so we can figure out how data scientists can move their work back into it.
- Software Engineers - siuba as a python library has a lot of tricky problems, like needing to execute code against different backends (pandas, SQL), and test each backend.