fbdesignpro / sweetviz
- Tuesday, July 7, 2020, 00:21:47
Python
Visualize and compare datasets, target values and associations, with one line of code.
Sweetviz is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. Output is a fully self-contained HTML application.
The system is built around quickly visualizing target values and comparing datasets. Its goal is to speed up the analysis of target characteristics, training vs. testing data, and other such data characterization tasks.
Note: Sweetviz is in the ALPHA TESTING PHASE. Core functionality is complete; please let me know if you run into any data, compatibility or install issues! Thank you for reporting any BUGS in the issue tracking system here, and I welcome your feedback and questions on usage/features in our Discourse server (you should be able to log in with your GitHub account!).
See an example report from the Titanic dataset HERE
Sweetviz currently supports Python 3.6+ and Pandas 0.25.3+. Reports are output using the base "os" module, so custom environments such as Google Colab which require custom file operations are not yet supported, although I am looking into a solution.
The best way to install sweetviz (other than from source) is to use pip:
pip install sweetviz
Create a DataframeReport object, then use a show_xxx function to render the report.
Note: Currently the only rendering supported is to a standalone HTML file, using a "widescreen" aspect ratio (i.e. 1080p resolution or wider). Please let me know of formats/resolutions you would like to be supported in our Discourse Forum.
There are 3 main functions for creating reports: analyze(...), compare(...) and compare_intra(...).
To analyze a single dataframe, simply use the analyze(...) function, then the show_html(...) function:
import sweetviz as sv
my_report = sv.analyze(my_dataframe)
my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"
When run, this will open a 1080p widescreen HTML app in your default browser.
The analyze() function can take multiple other arguments:
analyze(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
target_feat: str = None,
feat_cfg: FeatureConfig = None,
pairwise_analysis: str = 'auto'):
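source: either the dataframe itself or a dataframe/name pair, e.g. my_df or [my_df, "Training"]
target_feat: the name of the feature to mark as the target.
feat_cfg: a FeatureConfig object. Its parameters are skip, force_cat, force_num and force_text. The "force_" arguments override the built-in type detection. A FeatureConfig can be constructed as follows:
feature_config = sv.FeatureConfig(skip="PassengerId", force_text=["Age"])
pairwise_analysis: the default is 'auto'. For datasets with a large number of features, you need to pass pairwise_analysis="on" (or ="off") explicitly, since processing that many features would take a long time. This parameter also covers the generation of the association graphs (based on Drazen Zaric's concept).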
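As a rough sketch of how these arguments fit together, the snippet below wraps analyze() in a small function. The helper choose_pairwise_setting and its max_features limit are hypothetical and not part of Sweetviz; they only illustrate making the pairwise_analysis choice explicit for wide datasets. The column names assume a Titanic-like dataframe such as the one used in the examples here.

```python
def choose_pairwise_setting(n_features: int, max_features: int = 40) -> str:
    """Hypothetical helper: pick a pairwise_analysis value from the column count.

    max_features is an arbitrary limit chosen for this sketch,
    not a Sweetviz default.
    """
    return "off" if n_features > max_features else "auto"


def analyze_titanic(df):
    """Build a report for a Titanic-like dataframe (assumes Sweetviz is installed)."""
    import sweetviz as sv  # imported lazily so the helper above works without it

    # Skip the ID column and force "Age" to be treated as text.
    feature_config = sv.FeatureConfig(skip="PassengerId", force_text=["Age"])
    return sv.analyze(
        [df, "Training"],        # dataframe plus a display name
        target_feat="Survived",  # feature to mark as the target
        feat_cfg=feature_config,
        pairwise_analysis=choose_pairwise_setting(len(df.columns)),
    )
```

Calling analyze_titanic(my_dataframe).show_html() would then render the report as in the single-dataframe example.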
To compare two data sets, simply use the compare() function. Its parameters are the same as analyze(), except with an inserted second parameter to cover the comparison dataframe. It is recommended to use the [dataframe, "name"] format of parameters to better differentiate between the base and compared dataframes (e.g. [my_df, "Train"] vs my_df).
my_report = sv.compare([my_dataframe, "Training Data"], [test_df, "Test Data"], "Survived", feature_config)
Another way to get great insights is to use the comparison functionality to split your dataset into 2 sub-populations.
Support for this is built in through the compare_intra() function. This function takes a boolean series as one of the arguments, as well as an explicit "name" tuple for naming the (true, false) resulting datasets. Note that internally, this creates 2 separate dataframes to represent each resulting group. As such, it is more of a shorthand for doing such processing manually.
my_report = sv.compare_intra(my_dataframe, my_dataframe["Sex"] == "male", ["Male", "Female"], feature_config)
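To make the "shorthand" point concrete, here is a minimal sketch of the manual equivalent: split the dataframe with the boolean series, then hand both halves to compare(). The function names split_by_condition and compare_manually are made up for this illustration, and the result is only assumed to be comparable to what compare_intra() produces, not guaranteed identical.

```python
import pandas as pd


def split_by_condition(df: pd.DataFrame, condition: pd.Series):
    """Return (rows where condition is True, rows where it is False)."""
    return df[condition], df[~condition]


def compare_manually(df: pd.DataFrame, condition: pd.Series, names):
    """Hypothetical manual counterpart of compare_intra (assumes Sweetviz is installed)."""
    import sweetviz as sv  # lazy import: the split above only needs pandas

    true_df, false_df = split_by_condition(df, condition)
    # names is a (true_name, false_name) pair, e.g. ["Male", "Female"]
    return sv.compare([true_df, names[0]], [false_df, names[1]])


# Tiny made-up dataframe to show the split on its own
demo = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [22, 38, 26]})
males, females = split_by_condition(demo, demo["Sex"] == "male")
print(len(males), len(females))  # prints: 2 1
```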
The package contains an INI file for configuration. You can override any setting by providing your own INI file and then calling this before creating a report:
sv.config_parser.read("Override.ini")
You can look into the file sweetviz_defaults.ini for what can be overridden (warning: much of it is a work in progress and not well documented). One example is to remove the logo from the report, so it may be used more readily in a business setting. You would create your own Override.ini and put the following lines in it:
[Layout]
show_logo = 0
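The override mechanism can be tried out with Python's standard configparser module alone. The sketch below assumes sv.config_parser behaves like a regular configparser.ConfigParser, where a later read() replaces values loaded earlier; the default value shown for show_logo is a stand-in, not copied from sweetviz_defaults.ini.

```python
import configparser
import os
import tempfile

# Stand-in for the packaged defaults (the real file has many more settings).
parser = configparser.ConfigParser()
parser.read_string("[Layout]\nshow_logo = 1\n")

# Write an override file containing only the setting to change.
override_path = os.path.join(tempfile.mkdtemp(), "Override.ini")
with open(override_path, "w") as f:
    f.write("[Layout]\nshow_logo = 0\n")

# Reading a second file replaces only the keys it defines.
parser.read(override_path)
print(parser["Layout"]["show_logo"])  # prints: 0
```

Because only the listed keys are replaced, an override file can stay as short as the two lines shown above.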
This is my first open-source project! I built it to be the most useful tool possible and help as many people as possible with their data science work. If it is useful to you, your contribution is more than welcome and can take many forms:
A STAR here on GitHub, and a Twitter or Instagram post are the easiest contribution and can potentially help grow this project tremendously! If you find this project useful, these quick actions from you would mean a lot and could go a long way.
Kaggle notebooks/posts, Medium articles, YouTube video tutorials and other content take more time but will help all the more!
I expect there to be many quirks once the project is used by more and more people with a variety of new (& "unclean") data. If you found a bug, please open a new issue here.
To make Sweetviz as useful as possible we need to hear what you would like it to do, or what it could do better! Head over to our Discourse server and post your suggestions there; no login required!
I definitely welcome the help I can get on this project; simply get in touch on the issue tracker and/or our Discourse forum.
Please note that after a hectic development period, the code itself right now needs a bit of cleanup. :)
I want Sweetviz to be a hub of the best of what's out there, a way to get the most valuable information and visualization, without reinventing the wheel.
As such, I want to point out some of the great resources that inspired, and were integrated into, Sweetviz: