Welcome! This is the first post in our Data Science series, in which we will work through Data Science topics (Data Visualisation, Clustering Techniques, Regression/Classification/Time Series Models, use cases, Feature Engineering, etc.).
The goal is to cover all the concepts a Data Scientist needs for a successful career and to share the code (Python). And what better way to start this series than to review the fundamentals of statistics? Whatever your level, it is always worth refreshing these key concepts before moving on to complex models.
Find the notebook on our GitHub or by using the following link to get access to the code. The code is in Python and covers measures of central tendency, measures of volatility (dispersion), descriptive statistics functions and correlation concepts.
Find the notebook by clicking here.
While we recommend that you go through each concept in the notebook, the following section lists the most important concepts, the ones that often cause confusion (at least for us):
- Arithmetic mean ≥ geometric mean ≥ harmonic mean for any set of positive values, with equality only when all the values are identical.
- The sample standard deviation is the positive square root of the sample variance.
- The population variance has the same formula as the sample variance but uses 𝑛 in the denominator instead of 𝑛 − 1 (ddof = 0 in NumPy and pandas; ddof = 1 gives the sample version).
- The population standard deviation is the positive square root of the population variance.
- Negative skewness -> longer or fatter tail on the left side. Positive skewness -> longer or fatter tail on the right side. Skewness close to 0 (as a rule of thumb, between −0.5 and 0.5) -> the dataset is considered fairly symmetrical.
- Variance measures the spread of a single dataset around its mean value, while covariance measures the directional relationship between two random variables.
- Covariance indicates the direction of the linear relationship between a pair of variables, but its magnitude depends on the units of those variables, so on its own it is hard to read as a measure of strength. We can think of the correlation coefficient (or Pearson product-moment correlation coefficient) as a standardized covariance: it always lies between −1 and 1, so it shows both the direction and the strength of the linear relationship.
- Correlation coefficient = Covariance / (std_x * std_y)
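The points above can be checked numerically. The snippet below is a minimal sketch that uses only Python's standard library, which is not necessarily what the notebook uses. The sample data and the helper names `skewness`, `sample_covariance` and `pearson_correlation` are made up for illustration. The skewness helper applies no sample-size correction, so its value can differ slightly from libraries that apply one by default.

```python
import math
import statistics as st


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (no sample-size correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def sample_covariance(xs, ys):
    """Sample covariance, with n - 1 in the denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_correlation(xs, ys):
    """Correlation coefficient = covariance / (std_x * std_y)."""
    return sample_covariance(xs, ys) / (st.stdev(xs) * st.stdev(ys))


# Illustrative data (an assumption): positive values with one large value on the right.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 20.0]
n = len(data)

# Arithmetic mean >= geometric mean >= harmonic mean for positive values.
am, gm, hm = st.mean(data), st.geometric_mean(data), st.harmonic_mean(data)
print("AM >= GM >= HM:", am >= gm >= hm)

# Sample variance divides by n - 1, population variance divides by n.
sample_var, pop_var = st.variance(data), st.pvariance(data)
print("sample variance:", sample_var, "population variance:", pop_var)
print("sample std:", st.stdev(data), "population std:", st.pstdev(data))

# The long right tail gives a positive skewness.
print("skewness:", skewness(data))

# Covariance changes with the units of the variables, correlation does not.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
ys_rescaled = [100 * y for y in ys]  # same variable expressed in smaller units
print("covariance:", sample_covariance(xs, ys), sample_covariance(xs, ys_rescaled))
print("correlation:", pearson_correlation(xs, ys), pearson_correlation(xs, ys_rescaled))
```

Rescaling `ys` multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is the easier of the two to compare across datasets.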
This is already the end of this article. Thanks for reading our first post on the fundamentals of Data Science. We will be posting more on these topics in the coming weeks!
If you have any questions or suggestions for other related topics, let us know in the comments! You can also follow us on LinkedIn, Instagram, TikTok and on our mailing list, where we share exclusive content and Weekly Reports containing the top news and investments of the week.