Unless you pick the right tool for the job.

In this article, we’ll dive into each step of an effective EDA process and discuss why you should turn ydata-profiling into your one-stop shop to master it. To demonstrate best practices and investigate insights, we’ll be using the Adult Census Income Dataset, freely available on Kaggle or the UCI Repository (License: CC0: Public Domain).

Step 1: Data Overview and Descriptive Statistics

When we first get our hands on an unknown dataset, one thought pops up right away: What am I working with? We need a deep understanding of our data to handle it efficiently in future machine learning tasks.

As a rule of thumb, we traditionally start by characterizing the data in terms of the number of observations, the number and types of features, the overall missing rate, and the percentage of duplicate observations.

With some pandas manipulation and the right cheatsheet, we could eventually print out the above information with some short snippets of code:

Dataset Overview: Adult Census Dataset. Number of observations, features, feature types, duplicated rows, and missing values. Snippet by Author.

All in all, the output format is not ideal… If you’re familiar with pandas, you’ll also know the standard modus operandi of starting an EDA process - df.describe():

Adult Dataset: Main statistics presented with df.describe().

This, however, only considers numeric features. We could use df.describe(include='object') to print out some additional information on categorical features (count, unique, mode, frequency), but a simple check of existing categories would involve something a little more verbose:

Dataset Overview: Adult Census Dataset. Printing the existing categories and respective frequencies for each categorical feature in data.

However, we can do this - and guess what, all of the subsequent EDA tasks! - in a single line of code, using ydata-profiling:

Profiling Report of the Adult Census Dataset, using ydata-profiling.

The above code generates a complete profiling report of the data, which we can use to move our EDA process forward without writing any more code! We’ll go through the various sections of the report in the following sections. As for the overall characteristics of the data, all the information we were looking for is included in the Overview section:

Ydata-profiling: Data Profiling Report - Dataset Overview.
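The manual pandas overview that Step 1 describes (observations, features, feature types, missing rate, duplicates, and per-category frequencies) could look roughly like the sketch below. It is not the author's original snippet: the helper names `dataset_overview` and `category_frequencies` are hypothetical, and the tiny DataFrame is made-up data standing in for the Adult Census file so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the Adult Census data: a few made-up rows so the
# sketch is self-contained. In practice, df would be loaded with pd.read_csv.
df = pd.DataFrame(
    {
        "age": [39, 50, 38, 50, 28],
        "workclass": ["State-gov", "Private", "Private", "Private", None],
        "hours.per.week": [40, 13, 40, 13, 40],
        "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
    }
)


def dataset_overview(data: pd.DataFrame) -> dict:
    """Summarize size, feature types, missing rate, and duplicate rate."""
    n_rows, n_cols = data.shape
    return {
        "observations": n_rows,
        "features": n_cols,
        # Number of columns per dtype, e.g. how many numeric vs. text columns
        "feature_types": data.dtypes.astype(str).value_counts().to_dict(),
        # Share of all cells that are missing
        "missing_rate": float(data.isna().sum().sum() / data.size),
        # Share of rows that repeat an earlier row exactly
        "duplicate_rate": float(data.duplicated().mean()),
    }


def category_frequencies(data: pd.DataFrame) -> dict:
    """Return the value counts of every non-numeric column."""
    categorical = data.select_dtypes(exclude="number")
    return {
        col: categorical[col].value_counts(dropna=False)
        for col in categorical.columns
    }


overview = dataset_overview(df)
frequencies = category_frequencies(df)

for name, value in overview.items():
    print(f"{name}: {value}")

# Main statistics of the numeric features only
print(df.describe())

for col, counts in frequencies.items():
    print(f"\n{col}\n{counts}")
```

Selecting non-numeric columns with `exclude="number"` is a design choice made here so that text columns are picked up regardless of how pandas stores them; the output is still a series of plain printouts, which is the formatting limitation the article points out.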
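For the ydata-profiling route, a minimal sketch is shown below, assuming the package is installed (`pip install ydata-profiling`). The wrapper `build_profile_report`, the report title, and the output file name are illustrative choices rather than part of the original article; `ProfileReport` and its `to_file` method are the library's entry points for generating and saving a report.

```python
import pandas as pd


def build_profile_report(
    data: pd.DataFrame,
    output_path: str = "adult_census_report.html",  # hypothetical file name
) -> str:
    """Generate an HTML profiling report for `data` and return its path."""
    # Imported inside the function so this module still loads on machines
    # where ydata-profiling is not installed.
    from ydata_profiling import ProfileReport

    # Building the report object is the "single line" the article refers to
    profile = ProfileReport(data, title="Adult Census Profile")

    # Write the full report (overview, variables, missing values, ...) to disk
    profile.to_file(output_path)
    return output_path
```

Calling `build_profile_report(pd.read_csv("adult.csv"))`, where `adult.csv` is a hypothetical local copy of the dataset, would write the report to disk; opening the HTML file in a browser shows the Overview section discussed in the article.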