question: what are all topics in this content?

Logistic Regression (Machine Learning, Andrew Ng)

Classification
- Examples: Email: Spam / Not Spam? Online transactions: Fraudulent (Yes / No)? Tumor: Malignant / Benign?
- y takes the value 0 ("negative class", e.g., benign tumor) or 1 ("positive class", e.g., malignant tumor).
- Threshold classifier output at 0.5 (tumor size example): if h_theta(x) >= 0.5, predict "y = 1"; if h_theta(x) < 0.5, predict "y = 0".
- For classification, y = 0 or 1, but a linear regression hypothesis can output values > 1 or < 0; logistic regression keeps 0 <= h_theta(x) <= 1.

Hypothesis Representation
- Logistic regression model: h_theta(x) = g(theta' * x), where g(z) = 1 / (1 + e^(-z)) is the sigmoid (logistic) function.
- Interpretation of the hypothesis output: h_theta(x) is the estimated probability that y = 1 on input x, i.e., P(y = 1 | x; theta), "parameterized by theta". Example: if h_theta(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant.

Decision Boundary
- Predict "y = 1" if h_theta(x) >= 0.5, which happens exactly when theta' * x >= 0; predict "y = 0" when theta' * x < 0.
- The set of points where theta' * x = 0 is the decision boundary; for example, with theta = [-3; 1; 1] we predict "y = 1" if -3 + x1 + x2 >= 0.
- Non-linear decision boundaries arise from polynomial features; for example, predict "y = 1" if -1 + x1^2 + x2^2 >= 0, a circular boundary.

Cost Function
- Given a training set of m examples, how do we choose the parameters theta?
- The squared-error cost used for linear regression becomes "non-convex" when combined with the sigmoid hypothesis, so logistic regression uses a cost that is "convex":
  Cost(h_theta(x), y) = -log(h_theta(x)) if y = 1, and -log(1 - h_theta(x)) if y = 0.

Simplified Cost Function and Gradient Descent
- Combined form: J(theta) = -(1/m) * sum[ y * log(h_theta(x)) + (1 - y) * log(1 - h_theta(x)) ].
- To fit parameters theta: minimize J(theta). To make a prediction given a new x: output h_theta(x).
- Gradient descent: repeat { theta_j := theta_j - alpha * (1/m) * sum[(h_theta(x^(i)) - y^(i)) * x_j^(i)] }, simultaneously updating all theta_j. The algorithm looks identical to linear regression; only the hypothesis h_theta differs.

Advanced Optimization
- Given theta, we have code that can compute J(theta) and the partial derivatives dJ(theta)/dtheta_j (for j = 0, 1, ..., n).
- Optimization algorithms: gradient descent, conjugate gradient, BFGS, L-BFGS. Advantages: no need to manually pick the learning rate alpha; often faster than gradient descent. Disadvantage: more complex.
- Octave example:

function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);

options = optimset('GradObj', 'on', 'MaxIter', '100');
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
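The slides use Octave's fminunc; the following is a rough Python analogue of the same toy problem (not part of the original slides), assuming SciPy is available. The cost returns both the objective and its gradient, which minimize accepts when jac=True.

import numpy as np
from scipy.optimize import minimize

def cost_function(theta):
    # toy cost from the slide, minimized at theta = [5, 5]
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = np.array([2 * (theta[0] - 5), 2 * (theta[1] - 5)])
    return j_val, gradient

initial_theta = np.zeros(2)
result = minimize(cost_function, initial_theta, jac=True,
                  method='BFGS', options={'maxiter': 100})
print(result.x)  # approximately [5., 5.]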
To apply the same pattern to logistic regression, the slides give a general template: with theta = [theta(1); theta(2); ...; theta(n+1)],

function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient(1) = [code to compute dJ(theta)/dtheta_0];
  gradient(2) = [code to compute dJ(theta)/dtheta_1];
  ...
  gradient(n+1) = [code to compute dJ(theta)/dtheta_n];

Multi-class Classification: One-vs-all
- Examples: email foldering/tagging (Work, Friends, Family, Hobby); medical diagnosis (Not ill, Cold, Flu); weather (Sunny, Cloudy, Rain, Snow).
- Binary classification has two classes; multi-class classification has several.
- One-vs-all (one-vs-rest): train a logistic regression classifier h_theta^(i)(x) for each class i to predict the probability that y = i. On a new input x, to make a prediction, pick the class i that maximizes h_theta^(i)(x).

4.1.4.1. Linear Least Squares Regression

Modeling Workhorse
Linear least squares regression is by far the most widely used modeling method. It is what most people mean when they say they have used "regression", "linear regression" or "least squares" to fit a model to their data. Not only is linear least squares regression the most widely used modeling method, but it has been adapted to a broad range of situations that are outside its direct scope. It plays a strong underlying role in many other modeling methods, including the other methods discussed in this section: nonlinear least squares regression, weighted least squares regression and LOESS.

Definition of a Linear Least Squares Model
Used directly, with an appropriate data set, linear least squares regression can be used to fit the data with any function of the form

f(x;β) = β0 + β1x1 + β2x2 + …

in which
1. each explanatory variable in the function is multiplied by an unknown parameter,
2. there is at most one unknown parameter with no corresponding explanatory variable, and
3. all of the individual terms are summed to produce the final function value.
In statistical terms, any function that meets these criteria would be called a "linear function". The term "linear" is used, even though the function may not be a straight line, because if the unknown parameters are considered to be variables and the explanatory variables are considered to be known coefficients corresponding to those "variables", then the problem becomes a system (usually overdetermined) of linear equations that can be solved for the values of the unknown parameters. To differentiate the various meanings of the word "linear", the linear models being discussed here are often said to be "linear in the parameters" or "statistically linear".

Why "Least Squares"?
Linear least squares regression also gets its name from the way the estimates of the unknown parameters are computed. The "method of least squares" that is used to obtain parameter estimates was independently developed in the late 1700's and the early 1800's by the mathematicians Karl Friedrich Gauss, Adrien Marie Legendre and (possibly) Robert Adrain [Stigler (1978)] [Harter (1983)] [Stigler (1986)] working in Germany, France and America, respectively. In the least squares method the unknown parameters are estimated by minimizing the sum of the squared deviations between the data and the model.
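To make the "minimize the sum of squared deviations" idea concrete, here is a minimal Python sketch (not from the handbook) that fits the straight-line model f(x;β) = β0 + β1x with NumPy's least squares solver; the data values are made up for illustration.

import numpy as np

# made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# design matrix: one column per parameter (intercept, slope)
X = np.column_stack([np.ones_like(x), x])

# solve the overdetermined system X @ beta ≈ y in the least squares sense
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [beta0, beta1], roughly [0.14, 1.96] for this made-up data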
The minimization process reduces the overdetermined system of equations formed by the data to a sensible system of p equations in p unknowns (where p is the number of parameters in the functional part of the model). This new system of equations is then solved to obtain the parameter estimates. To learn more about how the method of least squares is used to estimate the parameters, see Section 4.4.3.1.

Examples of Linear Functions
As just mentioned above, linear models are not limited to being straight lines or planes, but include a fairly wide range of shapes. For example, a simple quadratic curve,

f(x;β) = β0 + β1x + β11x^2,

is linear in the statistical sense. A straight-line model in log(x),

f(x;β) = β0 + β1 ln(x),

or a polynomial in sin(x),

f(x;β) = β0 + β1 sin(x) + β2 sin(2x) + β3 sin(3x),

is also linear in the statistical sense because they are linear in the parameters, though not with respect to the observed explanatory variable, x.

Nonlinear Model Example
Just as models that are linear in the statistical sense do not have to be linear with respect to the explanatory variables, nonlinear models can be linear with respect to the explanatory variables, but not with respect to the parameters. For example,

f(x;β) = β0 + β0β1x

is linear in x, but it cannot be written in the general form of a linear model presented above. This is because the slope of this line is expressed as the product of two parameters. As a result, nonlinear least squares regression could be used to fit this model, but linear least squares cannot be used. For further examples and discussion of nonlinear models see the next section, Section 4.1.4.2.

Advantages of Linear Least Squares
Linear least squares regression has earned its place as the primary tool for process modeling because of its effectiveness and completeness. Though there are types of data that are better described by functions that are nonlinear in the parameters, many processes in science and engineering are well-described by linear models. This is because either the processes are inherently linear or because, over short ranges, any process can be well-approximated by a linear model. The estimates of the unknown parameters obtained from linear least squares regression are the optimal estimates from a broad class of possible parameter estimates under the usual assumptions used for process modeling. Practically speaking, linear least squares regression makes very efficient use of the data. Good results can be obtained with relatively small data sets. Finally, the theory associated with linear regression is well-understood and allows for construction of different types of easily-interpretable statistical intervals for predictions, calibrations, and optimizations. These statistical intervals can then be used to give clear answers to scientific and engineering questions.

Disadvantages of Linear Least Squares
The main disadvantages of linear least squares are limitations in the shapes that linear models can assume over long ranges, possibly poor extrapolation properties, and sensitivity to outliers. Linear models with nonlinear terms in the predictor variables curve relatively slowly, so for inherently nonlinear processes it becomes increasingly difficult to find a linear model that fits the data well as the range of the data increases. As the explanatory variables become extreme, the output of the linear model will also always be more extreme.
This means that linear models may not be effective for extrapolating the results of a process for which data cannot be collected in the region of interest. Of course extrapolation is potentially dangerous regardless of the model type. Finally, while the method of least squares often gives optimal estimates of the unknown parameters, it is very sensitive to the presence of unusual data points in the data used to fit a model. One or two outliers can sometimes seriously skew the results of a least squares analysis. This makes model validation, especially with respect to outliers, critical to obtaining sound answers to the questions motivating the construction of the model.

4.1.4.2. Nonlinear Least Squares Regression

Extension of Linear Least Squares Regression
Nonlinear least squares regression extends linear least squares regression for use with a much larger and more general class of functions. Almost any function that can be written in closed form can be incorporated in a nonlinear regression model. Unlike linear regression, there are very few limitations on the way parameters can be used in the functional part of a nonlinear regression model. The way in which the unknown parameters in the function are estimated, however, is conceptually the same as it is in linear least squares regression.

Definition of a Nonlinear Regression Model
As the name suggests, a nonlinear model is any model of the basic form

y = f(x;β) + ε,

in which
1. the functional part of the model is not linear with respect to the unknown parameters β0, β1, …, and
2. the method of least squares is used to estimate the values of the unknown parameters.
Due to the way in which the unknown parameters of the function are usually estimated, however, it is often much easier to work with models that meet two additional criteria:
3. the function is smooth with respect to the unknown parameters, and
4. the least squares criterion that is used to obtain the parameter estimates has a unique solution.
These last two criteria are not essential parts of the definition of a nonlinear least squares model, but are of practical importance.

Examples of Nonlinear Models
Some examples of nonlinear models include:

f(x;β) = (β0 + β1x) / (1 + β2x)
f(x;β) = β1 x^β2
f(x;β) = β0 + β1 exp(−β2x)
f(x;β) = β1 sin(β2 + β3x1) + β4 cos(β5 + β6x2)

Advantages of Nonlinear Least Squares
The biggest advantage of nonlinear least squares regression over many other techniques is the broad range of functions that can be fit. Although many scientific and engineering processes can be described well using linear models, or other relatively simple types of models, there are many other processes that are inherently nonlinear. For example, the strengthening of concrete as it cures is a nonlinear process. Research on concrete strength shows that the strength increases quickly at first and then levels off, or approaches an asymptote in mathematical terms, over time. Linear models do not describe processes that asymptote very well because for all linear functions the function value can't increase or decrease at a declining rate as the explanatory variables go to the extremes. There are many types of nonlinear models, on the other hand, that describe the asymptotic behavior of a process well. Like the asymptotic behavior of some processes, other features of physical processes can often be expressed more easily using nonlinear models than with simpler model types.
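As an illustration of the kind of iterative fit described in this section, here is a minimal Python sketch (not part of the handbook) that uses scipy.optimize.curve_fit to fit the exponential model f(x;β) = β0 + β1 exp(−β2x) from the examples above to data that approaches an asymptote; the data and starting values are made up for illustration.

import numpy as np
from scipy.optimize import curve_fit

def model(x, b0, b1, b2):
    # f(x; beta) = beta0 + beta1 * exp(-beta2 * x)
    return b0 + b1 * np.exp(-b2 * x)

# made-up data that levels off toward an asymptote near 5
x = np.linspace(0, 10, 30)
rng = np.random.default_rng(0)
y = 5.0 - 3.0 * np.exp(-0.7 * x) + rng.normal(0, 0.05, x.size)

# starting values must be reasonably close to the (unknown) estimates
p0 = [4.0, -2.0, 0.5]
params, cov = curve_fit(model, x, y, p0=p0)
print(params)  # roughly [5.0, -3.0, 0.7]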
Being a "least squares" procedure, nonlinear least squares has some of the same advantages (and disadvantages) that linear least squares regression has over other methods. One common advantage is efficient use of data. Nonlinear regression can produce good estimates of the unknown parameters in the model with relatively small data sets. Another advantage that nonlinear least squares shares with linear least squares is a fairly well-developed theory for computing confidence, prediction and calibration intervals to answer scientific and engineering questions. In most cases the probabilistic interpretation of the intervals produced by nonlinear regression are only approximately correct, but these intervals still work very well in practice. Disadvantages of Nonlinear Least Squares The major cost of moving to nonlinear least squares regression from simpler modeling techniques like linear least squares is the need to use iterative optimization procedures to compute the parameter estimates. With functions that are linear in the parameters, the least squares estimates of the parameters can always be obtained analytically, while that is generally not the case with nonlinear models. The use of iterative procedures requires the user to provide starting values for the unknown parameters before the software can begin the optimization. The starting values must be reasonably close to the as yet unknown parameter estimates or the optimization procedure may not converge. Bad starting values can also cause the f(x; β) ⃗ = β1x β2 f(x; β) = + exp(− x) ⃗ β0 β1 β2 f(x;⃗ β) = sin( + ) + cos( + ) ⃗ β1 β2 β3x1 β4 β5 β6x2 software to converge to a local minimum rather than the global minimum that defines the least squares estimates. Disadvantages shared with the linear least squares procedure includes a strong sensitivity to outliers. Just as in a linear least squares analysis, the presence of one or two outliers in the data can seriously affect the results of a nonlinear analysis. In addition there are unfortunately fewer model validation tools for the detection of outliers in nonlinear regression than there are for linear regression. Home Aniruddha Bhandari — Updated On October 27th, 2023 Beginner Data Engineering Python Regression Structured Data Technique Master the Art of Data Science: A Step-by-Step Guide to Becoming an Industry-Ready Data Scientist Download Roadmap × Feature engineering is a critical step in building accurate and effective machine learning models. One key aspect of feature engineering is scaling, normalization, and standardization, which involves transforming the data to make it more suitable for modeling. These techniques can help to improve model performance, reduce the impact of outliers, and ensure that the data is on the same scale. In this article, we will explore the concepts of scaling, normalization, and standardization, including why they are important and how to apply them to different types of data. By the end of this article, you’ll have a thorough understanding of these essential feature engineering techniques and be able to apply them to your own machine learning projects. Table of contents What is Feature Scaling? Why Should we Use Feature Scaling? What is Normalization? What is Standardization? The Big Question – Normalize or Standardize? Implementing Feature Scaling in Python Comparing Unscaled, Normalized, and Standardized Data Applying Scaling to Machine Learning Algorithms Frequently Asked Questions What is Feature Scaling? 
Feature scaling is a data preprocessing technique used to transform the values of features or variables in a dataset to a similar scale. The purpose is to ensure that all features contribute equally to the model and to avoid the domination of features with larger values. Feature scaling becomes necessary when dealing with datasets containing features that have different ranges, units of measurement, or orders of magnitude. In such cases, the variation in feature values can lead to biased model performance or difficulties during the learning process. There are several common techniques for feature scaling, including standardization, normalization, and min-max scaling. These methods adjust the feature values while preserving their relative relationships and distributions. By applying feature scaling, the dataset's features can be transformed to a more consistent scale, making it easier to build accurate and effective machine learning models. Scaling facilitates meaningful comparisons between features, improves model convergence, and prevents certain features from overshadowing others based solely on their magnitude.

Why Should we Use Feature Scaling?
Some machine learning algorithms are sensitive to feature scaling, while others are virtually invariant. Let's explore these in more depth:

1. Gradient Descent Based Algorithms
Machine learning algorithms like linear regression, logistic regression, neural networks, and PCA (principal component analysis), which use gradient descent as an optimization technique, require data to be scaled. Recall the gradient descent update, which for a parameter θj takes the form θj := θj − α · (1/m) · Σ (h(x) − y) · xj. The presence of the feature value xj in the update affects the step size of the gradient descent. The difference in the ranges of features will cause different step sizes for each feature. To ensure that gradient descent moves smoothly towards the minima and that the steps are updated at the same rate for all the features, we scale the data before feeding it to the model. Having features on a similar scale helps gradient descent converge more quickly towards the minima.

2. Distance-Based Algorithms
Distance-based algorithms like KNN, K-means clustering, and SVM (support vector machines) are most affected by the range of features. This is because, behind the scenes, they use distances between data points to determine their similarity. For example, let's say we have data containing high school CGPA scores of students (ranging from 0 to 5) and their future incomes (in thousands of rupees). Since the two features have different scales, there is a chance that higher weightage is given to features with higher magnitudes. This will impact the performance of the machine learning algorithm, and obviously we do not want our algorithm to be biased towards one feature. Therefore, we scale our data before employing a distance-based algorithm so that all the features contribute equally to the result. The effect of scaling is conspicuous when we compare the Euclidean distance between data points for students A and B, and between B and C, before and after scaling: before scaling, the distances are dominated by the income feature, whereas after scaling both features contribute comparably. (The original article shows the computed values of the distances AB and BC before and after scaling.)

3. Tree-Based Algorithms
Tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features. Think about it: a decision tree only splits a node based on a single feature, the one that most increases the homogeneity of the node, and the other features do not influence that split. So the remaining features have virtually no effect on the split, which is what makes tree-based algorithms invariant to the scale of the features!
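To make the distance-based example above concrete, here is a minimal sketch (not from the original article) that computes Euclidean distances on a made-up CGPA/income table before and after min-max scaling; the student values are hypothetical.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical students A, B, C: [CGPA (0-5), income (thousands of rupees)]
X = np.array([[3.0, 40.0],
              [4.5, 65.0],
              [4.0, 90.0]])

def dist(u, v):
    # plain Euclidean distance between two rows
    return np.sqrt(((u - v) ** 2).sum())

# before scaling: the income column dominates the distances
print(dist(X[0], X[1]), dist(X[1], X[2]))

# after min-max scaling: both features contribute comparably
X_scaled = MinMaxScaler().fit_transform(X)
print(dist(X_scaled[0], X_scaled[1]), dist(X_scaled[1], X_scaled[2]))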
What is Normalization?
Normalization is a data preprocessing technique used to adjust the values of features in a dataset to a common scale. This is done to facilitate data analysis and modeling, and to reduce the impact of different scales on the accuracy of machine learning models. Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as min-max scaling. Here's the formula for normalization:

X' = (X − Xmin) / (Xmax − Xmin)

Here, Xmax and Xmin are the maximum and the minimum values of the feature, respectively.
- When the value of X is the minimum value in the column, the numerator is 0, and hence X' is 0.
- On the other hand, when the value of X is the maximum value in the column, the numerator is equal to the denominator, and thus the value of X' is 1.
- If the value of X is between the minimum and the maximum value, then the value of X' is between 0 and 1.

What is Standardization?
Standardization is another scaling method where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero, and the resultant distribution has a unit standard deviation. Here's the formula for standardization:

X' = (X − μ) / σ

where μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that, in this case, the values are not restricted to a particular range.

Now, the big question in your mind must be: when should we use normalization and when should we use standardization? Let's find out!

The Big Question – Normalize or Standardize?
- Normalization rescales values to a range between 0 and 1; standardization centers data around the mean and scales it to a standard deviation of 1.
- Normalization is useful when the distribution of the data is unknown or not Gaussian; standardization is useful when the distribution of the data is Gaussian or unknown.
- Normalization is sensitive to outliers; standardization is less sensitive to outliers.
- Normalization retains the shape of the original distribution; standardization changes the shape of the original distribution.
- Normalization may not preserve the relationships between the data points; standardization preserves the relationships between the data points.
- Normalization equation: (x − min) / (max − min); standardization equation: (x − mean) / standard deviation.

However, at the end of the day, the choice of using normalization or standardization will depend on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized, and standardized data and comparing the performance for the best results. It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This avoids any data leakage during the model testing process. Also, scaling of target values is generally not required.

Implementing Feature Scaling in Python
Now comes the fun part – putting what we have learned into practice. I will be applying feature scaling to a few machine learning algorithms on the Big Mart dataset, which I have taken from the DataHack platform. I will skip the preprocessing steps since they are out of the scope of this tutorial, but you can find them neatly explained in this article. Those steps will enable you to reach the top 20 percentile on the hackathon leaderboard, so that's worth checking out!

So, let's first split our data into training and testing sets:
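The splitting code in the original post sits behind an embedded code widget, so it is not shown in this excerpt. A minimal sketch of what such a split typically looks like is below; the DataFrame name df and the target column Item_Outlet_Sales are assumptions based on the Big Mart problem, not code from the article.

from sklearn.model_selection import train_test_split

# assumed: df is the preprocessed Big Mart DataFrame with target 'Item_Outlet_Sales'
X = df.drop(columns=['Item_Outlet_Sales'])
y = df['Item_Outlet_Sales']

# hold out a test set; random_state just makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)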
Before moving to the feature scaling part, let's glance at the details of our data using the describe() method.

We can see that there is a huge difference in the range of values present in our numerical features: Item_Visibility, Item_Weight, Item_MRP, and Outlet_Establishment_Year. Let's try and fix that using feature scaling!

Note: You will notice negative values in the Item_Visibility feature because I have applied a log transformation to deal with the skewness in the feature.

Normalization Using sklearn (scikit-learn)
To normalize your data, you need to import the MinMaxScaler from the sklearn library and apply it to our dataset. So, let's do that!

# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)

Let's see how normalization has affected our dataset: all the features now have a minimum value of 0 and a maximum value of 1. Perfect!

Next, let's try to standardize our data.

Standardization Using sklearn
To standardize your data, you need to import the StandardScaler from the sklearn library and apply it to our dataset. Here's how you can do it:

# data standardization with sklearn
from sklearn.preprocessing import StandardScaler

# copy of datasets
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()

# numerical features
num_cols = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']

# apply standardization on numerical features
for i in num_cols:
    # fit on training data column
    scale = StandardScaler().fit(X_train_stand[[i]])
    # transform the training data column
    X_train_stand[i] = scale.transform(X_train_stand[[i]])
    # transform the testing data column
    X_test_stand[i] = scale.transform(X_test_stand[[i]])

You would have noticed that I only applied standardization to my numerical columns, not the other one-hot encoded features. Standardizing the one-hot encoded features would mean assigning a distribution to categorical features. You don't want to do that! But why did I not do the same while normalizing the data? Because one-hot encoded features are already in the range between 0 and 1, so normalization would not affect their values.

Right, let's have a look at how standardization has transformed our data: the numerical features are now centered on the mean with a unit standard deviation. Awesome!

Comparing Unscaled, Normalized, and Standardized Data
It is always great to visualize your data to understand the distribution present. We can see the comparison between our unscaled and scaled data using boxplots. You can notice how scaling the features brings everything into perspective: the features are now more comparable and will have a similar effect on the learning models.
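The article shows the boxplot comparison as an image without the plotting code; a minimal sketch of one way to produce such a comparison is below. The choice of the Item_MRP column is an assumption for illustration.

import matplotlib.pyplot as plt
import pandas as pd

# compare one numerical column across the three versions of the training data
col = 'Item_MRP'
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

X_train[col].plot.box(ax=axes[0], title='Original')
pd.DataFrame(X_train_norm, columns=X_train.columns)[col].plot.box(ax=axes[1], title='Normalized')
X_train_stand[col].plot.box(ax=axes[2], title='Standardized')

plt.tight_layout()
plt.show()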
Applying Scaling to Machine Learning Algorithms
It's now time to train some machine learning algorithms on our data to compare the effects of different scaling techniques on the algorithm's performance. I want to see the effect of scaling on three algorithms in particular: K-Nearest Neighbors, Support Vector Regressor, and Decision Tree.

K-Nearest Neighbors
As we saw before, KNN is a distance-based algorithm that is affected by the range of features. Let's see how it performs on our data before and after scaling:

# training a KNN model
from sklearn.neighbors import KNeighborsRegressor
# measuring RMSE score
from sklearn.metrics import mean_squared_error

# knn
knn = KNeighborsRegressor(n_neighbors=7)

rmse = []

# raw, normalized and standardized training and testing data
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]

# model fitting and measuring RMSE
for i in range(len(trainX)):
    # fit
    knn.fit(trainX[i], y_train)
    # predict
    pred = knn.predict(testX[i])
    # RMSE
    rmse.append(np.sqrt(mean_squared_error(y_test, pred)))

# visualizing the result
df_knn = pd.DataFrame({'RMSE': rmse}, index=['Original', 'Normalized', 'Standardized'])
df_knn

You can see that scaling the features has brought down the RMSE score of our KNN model. Specifically, the normalized data performs a tad bit better than the standardized data.

Note: I am measuring the RMSE here because this competition evaluates the RMSE.

Support Vector Regressor
SVR is another distance-based algorithm. So let's check out whether it works better with normalization or standardization:

# training an SVR model
from sklearn.svm import SVR
# measuring RMSE score
from sklearn.metrics import mean_squared_error

# SVR
svr = SVR(kernel='rbf', C=5)

rmse = []

# raw, normalized and standardized training and testing data
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]

# model fitting and measuring RMSE
for i in range(len(trainX)):
    # fit
    svr.fit(trainX[i], y_train)
    # predict
    pred = svr.predict(testX[i])
    # RMSE
    rmse.append(np.sqrt(mean_squared_error(y_test, pred)))

# visualizing the result
df_svr = pd.DataFrame({'RMSE': rmse}, index=['Original', 'Normalized', 'Standardized'])
df_svr

We can see that scaling the features does bring down the RMSE score, and the standardized data has performed better than the normalized data. Why do you think that's the case? The sklearn documentation states that SVM with an RBF kernel assumes that all the features are centered around zero and that the variance is of the same order. This is because a feature with a variance greater than that of others prevents the estimator from learning from all the features. Great!

Decision Tree
We already know that a Decision Tree is invariant to feature scaling.
But I wanted to show a practical example of how it performs on the data:

# training a Decision Tree model
from sklearn.tree import DecisionTreeRegressor
# measuring RMSE score
from sklearn.metrics import mean_squared_error

# Decision tree
dt = DecisionTreeRegressor(max_depth=10, random_state=27)

rmse = []

# raw, normalized and standardized training and testing data
trainX = [X_train, X_train_norm, X_train_stand]
testX = [X_test, X_test_norm, X_test_stand]

# model fitting and measuring RMSE
for i in range(len(trainX)):
    # fit
    dt.fit(trainX[i], y_train)
    # predict
    pred = dt.predict(testX[i])
    # RMSE
    rmse.append(np.sqrt(mean_squared_error(y_test, pred)))

# visualizing the result
df_dt = pd.DataFrame({'RMSE': rmse}, index=['Original', 'Normalized', 'Standardized'])
df_dt

You can see that the RMSE score has not moved an inch on scaling the features. So rest assured when you are using tree-based algorithms on your data!

Build Effective Machine Learning Models
This tutorial covered the relevance of using feature scaling on your data and how normalization and standardization have varying effects on the working of machine learning algorithms. Remember that there is no correct answer to when to use normalization over standardization and vice versa. It all depends on your data and the algorithm you are using. To enhance your skills in feature engineering and other key data science techniques, consider enrolling in our Data Science Black Belt program. Our comprehensive curriculum covers all aspects of data science, including advanced topics such as feature engineering, machine learning, and deep learning. With hands-on projects and mentorship, you'll gain practical experience and the skills you need to succeed in this exciting field. Enroll today and take your data science skills to the next level!

Frequently Asked Questions

Q1. How is standardization different from normalization feature scaling?
A. Standardization centers data around a mean of zero and a standard deviation of one, while normalization scales data to a set range, often [0, 1], by using the minimum and maximum values.

Q2. Why is standardization used in machine learning?
A. Standardization ensures algorithmic stability and prevents sensitivity to the scale of input features, improves optimization algorithms' convergence and search efficiency, and enhances the performance of certain machine learning algorithms.

Q3. Why is normalization used in machine learning?
A. Normalization helps in scaling the input features to a fixed range, typically [0, 1], to ensure that no single feature disproportionately impacts the results. It preserves the relationship between the minimum and maximum values of each feature, which can be important for some algorithms. It also improves the convergence and stability of some machine learning algorithms, particularly those that use gradient-based optimization.

Q4. Why do we normalize values?
A. We normalize values to bring them into a common scale, making it easier to compare and analyze data. Normalization also helps to reduce the impact of outliers and improve the accuracy and stability of statistical models.

Q5. How do you normalize a set of values?
A.
To normalize a set of values, we first calculate the mean and standard deviation of the data. Then, we subtract the mean from each value and divide by the standard deviation to obtain standardized values with a mean of 0 and a standard deviation of 1. Alternatively, we can use other normalization techniques such as min-max normalization, where we scale the values to a range of 0 to 1, or unit-vector normalization, where we scale the values to have a length of 1.

UNDERSTANDING FEATURE ENGINEERING (PART 1): Continuous Numeric Data

"Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data."
— Dr. Jason Brownlee

This gives us an idea of feature engineering as the process of transforming data into features that act as inputs for machine learning models, such that good quality features help in improving overall model performance. Features are also very much dependent on the underlying problem. Thus, even though the machine learning task might be the same in different scenarios, like classification of emails into spam and non-spam or classification of handwritten digits, the features extracted in each scenario will be very different from the other.

Prof. Pedro Domingos from the University of Washington, in his paper titled "A Few Useful Things to Know about Machine Learning", tells us the following.

"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."
— Prof. Pedro Domingos

The final quote which should motivate you about feature engineering is from renowned Kaggler Xavier Conort. Most of you already know that tough real-world machine learning problems are regularly posted on Kaggle, which is usually open to everyone.

"The algorithms we used are very standard for Kagglers. … We spent most of our efforts in feature engineering. … We were also very careful to discard features likely to expose us to the risk of over-fitting our model."
— Xavier Conort

Understanding Features
A feature is typically a specific representation on top of raw data: an individual, measurable attribute, typically depicted by a column in a dataset. Considering a generic two-dimensional dataset, each observation is depicted by a row and each feature by a column, which will have a specific value for an observation.

(Figure: a generic dataset snapshot)

Thus, like in the example in the figure above, each row typically indicates a feature vector, and the entire set of features across all the observations forms a two-dimensional feature matrix, also known as a feature set. This is akin to data frames or spreadsheets representing two-dimensional data. Typically machine learning algorithms work with these numeric matrices or tensors, and hence most feature engineering techniques deal with converting raw data into some numeric representation which can be easily understood by these algorithms.

Features can be of two major types based on the dataset. Inherent raw features are obtained directly from the dataset with no extra data manipulation or engineering. Derived features are usually obtained from feature engineering, where we extract features from existing data attributes. A simple example would be creating a new feature "Age" from an employee dataset containing "Birthdate" by just subtracting their birth date from the current date.
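A minimal pandas sketch of that derived-feature idea (the employee data here is hypothetical and not from the article):

import pandas as pd

# hypothetical employee data with a Birthdate attribute
employees = pd.DataFrame({'Name': ['Asha', 'Ravi'],
                          'Birthdate': pd.to_datetime(['1990-05-17', '1985-11-02'])})

# derived feature: approximate age in years from the birth date
today = pd.Timestamp.today()
employees['Age'] = ((today - employees['Birthdate']).dt.days // 365).astype(int)
print(employees)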
There are diverse types and formats of data, including structured and unstructured data. In this article, we will discuss various feature engineering strategies for dealing with structured continuous numeric data. All these examples are a part of one of my recent books, 'Practical Machine Learning with Python', and you can access relevant datasets and code used in this article on GitHub. A big shout out also goes to Gabriel Moreira, who helped me by providing some excellent pointers on feature engineering techniques.

Feature Engineering on Numeric Data
Numeric data typically represents data in the form of scalar values depicting observations, recordings or measurements. Here, by numeric data, we mean continuous data and not discrete data, which is typically represented as categorical data. Numeric data can also be represented as a vector of values where each value or entity in the vector can represent a specific feature. Integers and floats are the most common and widely used numeric data types for continuous numeric data. Even though numeric data can be directly fed into machine learning models, you would still need to engineer features which are relevant to the scenario, problem and domain before building a model. Hence the need for feature engineering still remains. Let's leverage Python and look at some strategies for feature engineering on numeric data. We load up the following necessary dependencies first (typically in a Jupyter notebook).

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats

%matplotlib inline

Raw Measures
Like we mentioned earlier, raw numeric data can often be fed directly to machine learning models based on the context and data format. Raw measures are typically indicated using numeric variables directly as features, without any form of transformation or engineering. Typically these features can indicate values or counts. Let's load up one of our datasets, the Pokémon dataset, which is also available on Kaggle.

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()

(Figure: snapshot of our Pokémon dataset)

Pokémon is a huge media franchise surrounding fictional characters called Pokémon, which stands for pocket monsters. In short, you can think of them as fictional animals with superpowers! This dataset consists of these characters with various statistics for each character.

Values
If you closely observe the data frame snapshot in the above figure, you can see that several attributes represent numeric raw values which can be used directly. The following snippet depicts some of these features with more emphasis.

poke_df[['HP', 'Attack', 'Defense']].head()

(Figure: features with (continuous) numeric data)

Thus, you can directly use these attributes as features, as depicted in the above data frame. These include each Pokémon's HP (Hit Points), Attack and Defense stats. In fact, we can also compute some basic statistical measures on these fields.

poke_df[['HP', 'Attack', 'Defense']].describe()

(Figure: basic descriptive statistics on numeric features)

With this you can get a good idea about statistical measures in these features like count, average, standard deviation and quartiles.

Counts
Another form of raw measures includes features which represent frequencies, counts or occurrences of specific attributes.
Let's look at a sample of data from the millionsong dataset, which depicts counts or frequencies of songs that have been heard by various users.

popsong_df = pd.read_csv('datasets/song_views.csv', encoding='utf-8')
popsong_df.head(10)

(Figure: song listen counts as a numeric feature)

It is quite evident from the above snapshot that the listen_count field can be used directly as a frequency/count based numeric feature.

Binarization
Often raw frequencies or counts may not be relevant for building a model based on the problem which is being solved. For instance, if I'm building a recommendation system for song recommendations, I would just want to know if a person is interested in or has listened to a particular song. This doesn't require the number of times a song has been listened to, since I am more concerned about the various songs he/she has listened to. In this case, a binary feature is preferred as opposed to a count based feature. We can binarize our listen_count field as follows.

watched = np.array(popsong_df['listen_count'])
watched[watched >= 1] = 1
popsong_df['watched'] = watched

You can also use scikit-learn's Binarizer class from its preprocessing module to perform the same task instead of using numpy arrays.

from sklearn.preprocessing import Binarizer

bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)

(Figure: binarizing song counts)

You can clearly see from the above snapshot that both methods have produced the same result. Thus we get a binarized feature indicating whether the song was listened to or not by each user, which can then be further used in a relevant model.

Rounding
Often when dealing with continuous numeric attributes like proportions or percentages, we may not need the raw values with a high amount of precision. Hence it often makes sense to round off these high precision percentages into numeric integers. These integers can then be directly used as raw values or even as categorical (discrete-class based) features. Let's try applying this concept to a dummy dataset depicting store items and their popularity percentages.

items_popularity = pd.read_csv('datasets/item_popularity.csv', encoding='utf-8')
items_popularity['popularity_scale_10'] = np.array(
    np.round((items_popularity['pop_percent'] * 10)), dtype='int')
items_popularity['popularity_scale_100'] = np.array(
    np.round((items_popularity['pop_percent'] * 100)), dtype='int')
items_popularity

(Figure: rounding popularity to different scales)

Based on the above outputs, you can guess that we tried two forms of rounding. The features now depict the item popularities both on a scale of 1–10 and on a scale of 1–100. You can use these values both as numerical or categorical features based on the scenario and problem.

Interactions
Supervised machine learning models usually try to model the output responses (discrete classes or continuous values) as a function of the input feature variables. For example, a simple linear regression equation can be depicted as y = w1·x1 + w2·x2, where the input features are depicted by the variables x1 and x2 with weights or coefficients w1 and w2 respectively, and the goal is to predict the response y. In this case, this simple linear model depicts the relationship between the output and inputs purely based on the individual, separate input features. However, often in several real-world scenarios, it makes sense to also try and capture the interactions between these feature variables as a part of the input feature set.
A simple depiction of the extension of the above linear regression formulation with interaction features would be y = w1·x1 + w2·x2 + w3·(x1·x2), where the term x1·x2 represents an interaction feature. Let's try engineering some interaction features on our Pokémon dataset now.

atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()

From the output data frame, we can see that we have two numeric (continuous) features, Attack and Defense. We will now build features up to the 2nd degree by leveraging scikit-learn.

from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res

Output
------
array([[   49.,    49.,  2401.,  2401.,  2401.],
       [   62.,    63.,  3844.,  3906.,  3969.],
       [   82.,    83.,  6724.,  6806.,  6889.],
       ...,
       [  110.,    60., 12100.,  6600.,  3600.],
       [  160.,    60., 25600.,  9600.,  3600.],
       [  110.,   120., 12100., 13200., 14400.]])

The above feature matrix depicts a total of five features, including the new interaction features. We can see the degree of each feature in the above matrix as follows.

pd.DataFrame(pf.powers_, columns=['Attack_degree', 'Defense_degree'])

Looking at this output, we now know what each feature actually represents from the degrees depicted here. Armed with this knowledge, we can assign a name to each feature as follows. This is just for ease of understanding, and you should name your features with better, easy to access and simple names.

intr_features = pd.DataFrame(res, columns=['Attack', 'Defense',
                                           'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)

(Figure: numeric features with their interactions)

Thus the above data frame represents our original features along with their interaction features.

Binning
The problem with working with raw, continuous numeric features is that often the distribution of values in these features will be skewed. This signifies that some values will occur quite frequently while some will be quite rare. Besides this, there is also the problem of the varying range of values in any of these features. For instance, view counts of specific music videos could be abnormally large (Despacito, we're looking at you!) and some could be really small. Directly using these features can cause a lot of issues and adversely affect the model. Hence there are strategies to deal with this, which include binning and transformations.

Binning, also known as quantization, is used for transforming continuous numeric features into discrete ones (categories). These discrete values or numbers can be thought of as categories or bins into which the raw, continuous numeric values are binned or grouped. Each bin represents a specific degree of intensity and hence covers a specific range of continuous numeric values. Specific strategies of binning data include fixed-width and adaptive binning. Let's use a subset of data from a dataset extracted from the 2016 FreeCodeCamp Developer/Coder survey, which covers various attributes pertaining to coders and software developers.

fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()

(Figure: sample attributes from the FCC coder survey dataset)

The ID.x variable is basically a unique identifier for each coder/developer who took the survey, and the other fields are pretty self-explanatory.
Fixed-Width Binning
Just like the name indicates, in fixed-width binning we have specific, fixed widths for each of the bins, which are usually pre-defined by the user analyzing the data. Each bin has a pre-fixed range of values which should be assigned to that bin on the basis of some domain knowledge, rules or constraints. Binning based on rounding is one of the ways, where you can use the rounding operation which we discussed earlier to bin raw values.

Let's now consider the Age feature from the coder survey dataset and look at its distribution.

fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3', edgecolor='black', grid=False)
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

(Figure: histogram depicting developer age distribution)

The above histogram depicting developer ages is slightly right skewed as expected (fewer older developers). We will now assign these raw age values into specific bins based on the following scheme.

Age Range : Bin
---------------
 0 -  9  : 0
10 - 19  : 1
20 - 29  : 2
30 - 39  : 3
40 - 49  : 4
50 - 59  : 5
60 - 69  : 6
... and so on

We can easily do this using what we learnt in the Rounding section earlier, where we round off these raw age values by taking the floor value after dividing by 10.

fcc_survey_df['Age_bin_round'] = np.array(np.floor(
    np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]

(Figure: binning by rounding)

You can see that the corresponding bins for each age have been assigned based on rounding. But what if we need more flexibility? What if we want to decide and fix the bin widths based on our own rules/logic? Binning based on custom ranges will help us achieve this. Let's define some custom age ranges for binning developer ages using the following scheme.

Age Range : Bin
---------------
 0 -  15 : 1
16 -  30 : 2
31 -  45 : 3
46 -  60 : 4
61 -  75 : 5
76 - 100 : 6

Based on this custom binning scheme, we will now label the bins for each developer age value, and we will store both the bin range as well as the corresponding label.

bin_ranges = [0, 15, 30, 45, 60, 75, 100]
bin_names = [1, 2, 3, 4, 5, 6]

fcc_survey_df['Age_bin_custom_range'] = pd.cut(
    np.array(fcc_survey_df['Age']), bins=bin_ranges)
fcc_survey_df['Age_bin_custom_label'] = pd.cut(
    np.array(fcc_survey_df['Age']), bins=bin_ranges, labels=bin_names)

# view the binned features
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round',
               'Age_bin_custom_range', 'Age_bin_custom_label']].iloc[1071:1076]

(Figure: custom binning scheme for developer ages)

Adaptive Binning
The drawback of fixed-width binning is that, since we manually decide the bin ranges, we can end up with irregular bins which are not uniform in terms of the number of data points or values which fall in each bin. Some of the bins might be densely populated and some of them might be sparsely populated or even empty! Adaptive binning is a safer strategy in these scenarios, where we let the data speak for itself. That's right, we use the data distribution itself to decide our bin ranges.

Quantile based binning is a good strategy to use for adaptive binning. Quantiles are specific values or cut-points which help in partitioning the continuous valued distribution of a specific numeric field into discrete contiguous bins or intervals. Thus, q-quantiles help in partitioning a numeric attribute into q equal partitions.
Popular examples of quantiles include the 2-quantile, known as the median, which divides the data distribution into two equal bins; the 4-quantiles, known as the quartiles, which divide the data into 4 equal bins; and the 10-quantiles, also known as the deciles, which create 10 equal-sized bins. Let's now look at the data distribution for the developer Income field.

fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3',
                             edgecolor='black', grid=False)
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

(Figure: histogram depicting developer income distribution)

The above distribution depicts a right skew in the income, with fewer developers earning more money and vice versa. Let's take a 4-quantile or quartile based adaptive binning scheme. We can obtain the quartiles easily as follows.

quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles

Output
------
0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0
Name: Income, dtype: float64

Let's now visualize these quantiles in the original distribution histogram!

fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3',
                             edgecolor='black', grid=False)
for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)
ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

(Figure: histogram depicting developer income distribution with quartile values)

The red lines in the distribution above depict the quartile values and our potential bins. Let's now leverage this knowledge to build our quartile based binning scheme.

quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
fcc_survey_df['Income_quantile_range'] = pd.qcut(
    fcc_survey_df['Income'], q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(
    fcc_survey_df['Income'], q=quantile_list, labels=quantile_labels)

fcc_survey_df[['ID.x', 'Age', 'Income',
               'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]

(Figure: quantile based bin ranges and labels for developer incomes)

This should give you a good idea of how quantile based adaptive binning works. An important point to remember here is that the outcome of binning is a set of discrete valued categorical features, and you might need an additional step of feature engineering on the categorical data before using it in any model. We will cover feature engineering strategies for categorical data shortly in the next part!

Statistical Transformations
We briefly talked earlier about the adverse effects of skewed data distributions. Let's now look at a different strategy of feature engineering by making use of statistical or mathematical transformations. We will look at the Log transform as well as the Box-Cox transform. Both of these transform functions belong to the Power Transform family of functions, typically used to create monotonic data transformations. Their main significance is that they help in stabilizing variance, adhering more closely to the normal distribution and making the data independent of the mean based on its distribution.

Log Transform
The log transform belongs to the power transform family of functions. This function can be mathematically represented as log_b(x) = y, which reads as "log of x to the base b is equal to y".
This can then be translated into b^y = x, which indicates to what power the base b must be raised in order to get x. The natural logarithm uses b = e, where e = 2.71828 is popularly known as Euler's number. You can also use base b = 10, used popularly in the decimal system.

Log transforms are useful when applied to skewed distributions, as they tend to expand the values which fall in the range of lower magnitudes and tend to compress or reduce the values which fall in the range of higher magnitudes. This tends to make the skewed distribution as normal-like as possible. Let's use the log transform on our developer Income feature which we used earlier.

fcc_survey_df['Income_log'] = np.log((1 + fcc_survey_df['Income']))
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]

(Figure: log transform on developer income)

The Income_log field depicts the transformed feature after the log transformation. Let's look at the data distribution of this transformed field now.

income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)

fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3',
                                 edgecolor='black', grid=False)
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'mu='+str(income_log_mean), fontsize=10)

(Figure: histogram depicting developer income distribution after log transform)

Based on the above plot, we can clearly see that the distribution is more normal-like or Gaussian as compared to the skewed distribution of the original data.

Box-Cox Transform
The Box-Cox transform is another popular function belonging to the power transform family of functions. This function has a prerequisite that the numeric values to be transformed must be positive (similar to what the log transform expects). In case they are negative, shifting by a constant value helps. Mathematically, the Box-Cox transform function can be denoted as

y = f(x, λ) = (x^λ − 1) / λ   for λ ≠ 0
y = f(x, λ) = ln(x)           for λ = 0

such that the resulting transformed output y is a function of the input x and the transformation parameter λ; when λ = 0, the resultant transform is the natural log transform which we discussed earlier. The optimal value of λ is usually determined using maximum likelihood or log-likelihood estimation. Let's now apply the Box-Cox transform on our developer income feature. First we get the optimal lambda value from the data distribution after removing the null values, as follows.

income = np.array(fcc_survey_df['Income'])
income_clean = income[~np.isnan(income)]
l, opt_lambda = spstats.boxcox(income_clean)
print('Optimal lambda value:', opt_lambda)

Output
------
Optimal lambda value: 0.117991239456

Now that we have obtained the optimal λ value, let us use the Box-Cox transform for two values of λ, namely λ = 0 and λ = λ(optimal), and transform the developer Income feature.

fcc_survey_df['Income_boxcox_lambda_0'] = spstats.boxcox(
    (1 + fcc_survey_df['Income']), lmbda=0)
fcc_survey_df['Income_boxcox_lambda_opt'] = spstats.boxcox(
    fcc_survey_df['Income'], lmbda=opt_lambda)

fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log',
               'Income_boxcox_lambda_0', 'Income_boxcox_lambda_opt']].iloc[4:9]

(Figure: developer income after Box-Cox transform)

The transformed features are depicted in the above data frame. Just like we expected, Income_log and Income_boxcox_lambda_0 have the same values. Let's look at the distribution of the transformed Income feature after transforming with the optimal λ.
Let's look at the distribution of the transformed Income feature after transforming with the optimal λ.

income_boxcox_mean = np.round(
    np.mean(fcc_survey_df['Income_boxcox_lambda_opt']), 2)

fig, ax = plt.subplots()
fcc_survey_df['Income_boxcox_lambda_opt'].hist(bins=30, color='#A9C5D3',
                                               edgecolor='black', grid=False)
plt.axvline(income_boxcox_mean, color='r')
ax.set_title('Developer Income Histogram after Box–Cox Transform', fontsize=12)
ax.set_xlabel('Developer Income (Box–Cox transform)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(24, 450, r'mu='+str(income_boxcox_mean), fontsize=10)

Histogram depicting developer income distribution after Box-Cox transform

The distribution looks more normal-like, similar to what we obtained after the log transform.

Conclusion

Feature engineering is a very important aspect of machine learning and data science and should never be ignored. Even though we now have automated feature engineering methodologies like deep learning, as well as automated machine learning frameworks like AutoML (which still stresses that it requires good features to work well!), feature engineering is here to stay, and even these automated methodologies often require specific engineered features based on the data type, domain and the problem to be solved. We looked at popular strategies for feature engineering on continuous numeric data in this article. In the next part, we will look at popular strategies for dealing with discrete, categorical data, and then move on to unstructured data types in future articles. Stay tuned!

All the code and datasets used in this article can be accessed from my GitHub

The code is also available as a Jupyter notebook

UNDERSTANDING FEATURE ENGINEERING (PART 2)
Categorical Data
Strategies for working with discrete, categorical data
Dipanjan (DJ) Sarkar · Published in Towards Data Science · Jan 6, 2018
Source: https://pixabay.com

Introduction

We covered various feature engineering strategies for dealing with structured continuous numeric data in the previous article in this series. In this article, we will look at another type of structured data which is discrete in nature and is popularly termed categorical data. Dealing with numeric data is often easier than categorical data, because we do not have to deal with the additional complexities of the semantics pertaining to each category value in a categorical attribute. We will use a hands-on approach to discuss several encoding schemes for dealing with categorical data, along with a couple of popular techniques for dealing with large-scale feature explosion, often known as the "curse of dimensionality".

Motivation

I'm sure by now you realize the motivation and the importance of feature engineering; we stress the same in detail in 'Part 1' of this series. Do check it out for a quick refresher if necessary. In short, machine learning algorithms cannot work directly with categorical data, so you do need to do some amount of engineering and transformation on this data before you can start modeling.

Understanding Categorical Data

Let's get an idea about categorical data representations before diving into feature engineering strategies. Typically, any data attribute which is categorical in nature represents discrete values which belong to a specific finite set of categories or classes.
These are also often known as classes or labels in the context of attributes or variables which are to be predicted by a model (popularly known as response variables). These discrete values can be text or numeric in nature (or even unstructured data like images!). There are two major classes of categorical data: nominal and ordinal.

In any nominal categorical data attribute, there is no concept of ordering amongst the values of that attribute. Consider a simple example of weather categories, as depicted in the following figure. We can see that we have six major classes or categories in this particular scenario, without any concept or notion of order (windy doesn't always occur before sunny, nor is it smaller or bigger than sunny).

Weather as a categorical attribute

Similarly, movie, music and video game genres, country names, and food and cuisine types are other examples of nominal categorical attributes.

Ordinal categorical attributes have some sense or notion of order amongst their values. For instance, look at the following figure for shirt sizes. It is quite evident that order, or in this case 'size', matters when thinking about shirts (S is smaller than M, which is smaller than L, and so on).

Shirt size as an ordinal categorical attribute

Shoe sizes, education level and employment roles are some other examples of ordinal categorical attributes. Having a decent idea about categorical data, let's now look at some feature engineering strategies.

Feature Engineering on Categorical Data

While a lot of advancements have been made in various machine learning frameworks to accept complex categorical data types like text labels, any standard feature engineering workflow typically involves some form of transformation of these categorical values into numeric labels and then the application of some encoding scheme to these values. We load up the necessary essentials before getting started.

import pandas as pd
import numpy as np

Transforming Nominal Attributes

Nominal attributes consist of discrete categorical values with no notion or sense of order amongst them. The idea here is to transform these attributes into a more representative numeric format which can be easily understood by downstream code and pipelines. Let's look at a new dataset pertaining to video game sales. This dataset is also available on Kaggle as well as in my GitHub repository.

vg_df = pd.read_csv('datasets/vgsales.csv', encoding='utf-8')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

Dataset for video game sales

Let's focus on the video game Genre attribute as depicted in the above data frame. It is quite evident that this is a nominal categorical attribute, just like Publisher and Platform. We can easily get the list of unique video game genres as follows.

genres = np.unique(vg_df['Genre'])
genres

Output
------
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

This tells us that we have 12 distinct video game genres. We can now generate a label encoding scheme for mapping each category to a numeric value by leveraging scikit-learn.
from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in
                  enumerate(gle.classes_)}
genre_mappings

Output
------
{0: 'Action', 1: 'Adventure', 2: 'Fighting', 3: 'Misc',
 4: 'Platform', 5: 'Puzzle', 6: 'Racing', 7: 'Role-Playing',
 8: 'Shooter', 9: 'Simulation', 10: 'Sports', 11: 'Strategy'}

Thus a mapping scheme has been generated where each genre value is mapped to a number with the help of the LabelEncoder object gle. The transformed labels are stored in the genre_labels value, which we can write back to our data frame.

vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]

Video game genres with their encoded labels

These labels can often be used directly with frameworks like scikit-learn if you plan to use them as response variables for prediction. However, as discussed earlier, we will need an additional step of encoding on these before we can use them as features.

Transforming Ordinal Attributes

Ordinal attributes are categorical attributes with a sense of order amongst the values. Let's consider our Pokémon dataset which we used in Part 1 of this series, focusing more specifically on the Generation attribute.

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df = poke_df.sample(random_state=1,
                         frac=1).reset_index(drop=True)

np.unique(poke_df['Generation'])

Output
------
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'],
      dtype=object)

Based on the above output, we can see there are a total of 6 generations, and each Pokémon typically belongs to a specific generation based on the video games (when they were released); the television series follows a similar timeline. This attribute is typically ordinal (domain knowledge is necessary here) because most Pokémon belonging to Generation 1 were introduced earlier in the video games and the television shows than Generation 2, and so on. Fans can check out the following figure to remember some of the popular Pokémon of each generation (views may differ among fans!).

Popular Pokémon based on generation and type (source: https://www.reddit.com/r/pokemon/comments/2s2upx/heres_my_favorite_pokemon_by_type_and_gen_chart)

Hence ordinal attributes have a sense of order amongst them. In general, there is no generic module or function to map and transform these features into numeric representations based on order automatically, so we can use a custom encoding/mapping scheme.

gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]

Pokémon generation encoding

It is quite evident from the above code that the map(…) function from pandas is quite helpful in transforming this ordinal feature.
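As a side note, and not part of the original workflow above, an ordered pandas Categorical can produce a similar ordinal encoding without writing the mapping dictionary by hand. The sketch below is a minimal illustration with a hand-built list of generations.

import pandas as pd

# minimal sketch: an ordered pandas Categorical as an alternative to a manual map
generations = pd.Series(['Gen 3', 'Gen 1', 'Gen 6', 'Gen 2'])
gen_order = ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']

gen_cat = pd.Categorical(generations, categories=gen_order, ordered=True)
print(gen_cat.codes + 1)  # [3 1 6 2], matching the 1-based mapping used above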
Encoding Categorical Attributes

If you remember what we mentioned earlier, feature engineering on categorical data typically involves a transformation process, which we depicted in the previous section, and a compulsory encoding process where we apply specific encoding schemes to create dummy variables or features for each category value of a specific categorical attribute. You might be wondering: we just converted categories to numerical labels in the previous section, so why on earth do we need this now? The reason is quite simple. Considering video game genres, if we directly fed the GenreLabel attribute as a feature in a machine learning model, it would treat it as a continuous numeric feature and assume that the value 10 (Sports) is greater than 6 (Racing). That is meaningless, because the Sports genre is certainly not bigger or smaller than Racing; these are essentially different values or categories which cannot be compared directly. Hence we need an additional layer of encoding where dummy features are created for each unique value or category out of all the distinct categories per attribute.

One-hot Encoding Scheme

Considering we have the numeric representation of any categorical attribute with m labels (after transformation), the one-hot encoding scheme encodes or transforms the attribute into m binary features which can only contain a value of 1 or 0. Each observation in the categorical feature is thus converted into a vector of size m with only one of the values as 1 (indicating it as active). Let's take a subset of our Pokémon dataset depicting two attributes of interest.

poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]

Subset of our Pokémon dataset

The attributes of interest are Pokémon Generation and their Legendary status. The first step is to transform these attributes into numeric representations based on what we learnt earlier.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# transform and map pokemon generations
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels

# transform and map pokemon legendary status
leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label',
                       'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]

Attributes with transformed (numeric) labels

The features Gen_Label and Lgnd_Label now depict the numeric representations of our categorical features. Let's now apply the one-hot encoding scheme on these features.

# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(
    poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
gen_features = pd.DataFrame(gen_feature_arr,
                            columns=gen_feature_labels)

# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(
    poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label)
                      for cls_label in leg_le.classes_]
leg_features = pd.DataFrame(leg_feature_arr,
                            columns=leg_feature_labels)

In general, you can always encode both features together using the fit_transform(…) function by passing it a two-dimensional array of the two features (check out the documentation!). We encode each feature separately here to make things easier to understand, and it also lets us create separate data frames and label them accordingly. Let's now concatenate these feature frames and see the final result.
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],
               gen_feature_labels,
               ['Legendary', 'Lgnd_Label'],
               leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]

One-hot encoded features for Pokémon generation and legendary status

Thus you can see that 6 dummy variables or binary features have been created for Generation and 2 for Legendary, since those are the total numbers of distinct categories in each of these attributes respectively. The active state of a category is indicated by the value 1 in one of these dummy variables, which is quite evident from the above data frame.

Consider that you built this encoding scheme on your training data, built some model, and now have some new data which has to be engineered for features before predictions, as follows.

new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True],
                            ['CharMyToast', 'Gen 4', False]],
                           columns=['Name', 'Generation', 'Legendary'])
new_poke_df

Sample new data

You can leverage scikit-learn's excellent API here by calling the transform(…) function of the previously built LabelEncoder and OneHotEncoder objects on the new data. Remember our workflow: first we do the transformation.

new_gen_labels = gen_le.transform(new_poke_df['Generation'])
new_poke_df['Gen_Label'] = new_gen_labels

new_leg_labels = leg_le.transform(new_poke_df['Legendary'])
new_poke_df['Lgnd_Label'] = new_leg_labels

new_poke_df[['Name', 'Generation', 'Gen_Label',
             'Legendary', 'Lgnd_Label']]

Categorical attributes after transformation

Once we have the numeric labels, let's apply the encoding scheme now!

new_gen_feature_arr = gen_ohe.transform(new_poke_df[['Gen_Label']]).toarray()
new_gen_features = pd.DataFrame(new_gen_feature_arr,
                                columns=gen_feature_labels)

new_leg_feature_arr = leg_ohe.transform(new_poke_df[['Lgnd_Label']]).toarray()
new_leg_features = pd.DataFrame(new_leg_feature_arr,
                                columns=leg_feature_labels)

new_poke_ohe = pd.concat([new_poke_df, new_gen_features, new_leg_features],
                         axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'],
               gen_feature_labels,
               ['Legendary', 'Lgnd_Label'],
               leg_feature_labels], [])
new_poke_ohe[columns]

Categorical attributes after one-hot encoding

Thus you can see that it's quite easy to apply this scheme to new data by leveraging scikit-learn's powerful API.

You can also apply the one-hot encoding scheme easily by leveraging the get_dummies(…) function from pandas.

gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features],
          axis=1).iloc[4:10]

One-hot encoded features by leveraging pandas

The above data frame depicts the one-hot encoding scheme applied on the Generation attribute, and the results match the earlier results as expected.
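One practical detail worth being aware of: by default, calling transform(…) on data containing a category that was never seen during fit will raise an error. scikit-learn's OneHotEncoder supports handle_unknown='ignore' to emit an all-zero vector for such values instead. The snippet below is a small self-contained sketch of that behaviour; the toy frames here are made up purely for illustration.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# minimal sketch: tolerate unseen categories at prediction time
train = pd.DataFrame({'Generation': ['Gen 1', 'Gen 2', 'Gen 3']})
new = pd.DataFrame({'Generation': ['Gen 2', 'Gen 9']})  # 'Gen 9' was never seen

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train[['Generation']])

# the unseen 'Gen 9' row becomes an all-zero vector instead of raising an error
print(ohe.transform(new[['Generation']]).toarray())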
Dummy Coding Scheme

The dummy coding scheme is similar to the one-hot encoding scheme, except that when applied to a categorical feature with m distinct labels, we get m - 1 binary features; each value of the categorical variable is thus converted into a vector of size m - 1. The extra feature is completely disregarded, so if the category values range over {0, 1, …, m - 1}, either the 0th or the (m - 1)th feature column is dropped, and the corresponding category values are then represented by a vector of all zeros (0). Let's try applying the dummy coding scheme on Pokémon Generation by dropping the first-level binary encoded feature (Gen 1).

gen_dummy_features = pd.get_dummies(poke_df['Generation'],
                                    drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features],
          axis=1).iloc[4:10]

Dummy coded features for Pokémon generation

If you want, you can also choose to drop the last-level binary encoded feature (Gen 6) as follows.

gen_onehot_features = pd.get_dummies(poke_df['Generation'])
gen_dummy_features = gen_onehot_features.iloc[:,:-1]
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features],
          axis=1).iloc[4:10]

Dummy coded features for Pokémon generation

Based on the above depictions, it is quite clear that categories belonging to the dropped feature are represented as a vector of zeros (0), as we discussed earlier.

Effect Coding Scheme

The effect coding scheme is very similar to the dummy coding scheme, except that during the encoding process the encoded feature vector for the category values which would be all 0s in the dummy coding scheme is replaced by -1 in the effect coding scheme. This will become clearer with the following example.

gen_onehot_features = pd.get_dummies(poke_df['Generation'])
gen_effect_features = gen_onehot_features.iloc[:,:-1]
gen_effect_features.loc[np.all(gen_effect_features == 0,
                               axis=1)] = -1.
pd.concat([poke_df[['Name', 'Generation']], gen_effect_features],
          axis=1).iloc[4:10]

Effect coded features for Pokémon generation

The above output clearly shows that Pokémon belonging to Generation 6 are now represented by a vector of -1 values, as compared to zeros in dummy coding.

Bin-counting Scheme

The encoding schemes we have discussed so far work quite well on categorical data in general, but they start causing problems when the number of distinct categories in a feature becomes very large. Essentially, for any categorical feature with m distinct labels you get m separate features. This can easily increase the size of the feature set, causing problems like storage issues and model training problems with regard to time, space and memory. Besides this, we also have to deal with what is popularly known as the 'curse of dimensionality', where, with an enormous number of features and not enough representative samples, model performance starts getting affected, often leading to overfitting. Hence we need to look towards other categorical data feature engineering schemes for features having a large number of possible categories (like IP addresses).

The bin-counting scheme is a useful scheme for dealing with categorical variables having many categories. In this scheme, instead of using the actual label values for encoding, we use probability-based statistical information relating the value to the actual target or response variable which we aim to predict in our modeling efforts. A simple example would be based on past historical data for IP addresses and the ones which were used in DDOS attacks: we can build probability values for a DDOS attack being caused by each of the IP addresses. Using this information, we can encode an input feature which depicts, if the same IP address comes in the future, the probability value of a DDOS attack being caused. This scheme needs historical data as a pre-requisite and is an elaborate one. Depicting it with a complete example would be difficult here, but there are several resources online which you can refer to for the same.
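Although a full bin-counting example is out of scope here, the core idea can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration: the IP addresses and attack labels are entirely made up, and a real implementation would also smooth the estimates and handle unseen categories.

import pandas as pd

# minimal, made-up sketch of bin counting: encode a high-cardinality category
# (here an IP address) by the historical rate of the target event
history = pd.DataFrame({
    'ip':     ['1.1.1.1', '1.1.1.1', '2.2.2.2', '2.2.2.2', '2.2.2.2', '3.3.3.3'],
    'attack': [1,          0,         1,         1,         0,         0]
})

# probability of an attack per IP, estimated from past data
attack_rate = history.groupby('ip')['attack'].mean()

# new traffic is encoded with that probability instead of one dummy column per IP
# (IPs never seen in the history would map to NaN and need a fallback in practice)
new_traffic = pd.Series(['2.2.2.2', '3.3.3.3', '1.1.1.1'], name='ip')
print(new_traffic.map(attack_rate))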
Feature Hashing Scheme

The feature hashing scheme is another useful feature engineering scheme for dealing with large-scale categorical features. In this scheme, a hash function is typically used with the number of encoded features pre-set (as a vector of predefined length), such that the hashed values of the features are used as indices into this predefined vector and the values are updated accordingly. Since a hash function maps a large number of values into a small finite set of values, multiple different values may produce the same hash, which is termed a collision. Typically, a signed hash function is used, so that the sign of the value obtained from the hash is used as the sign of the value stored in the final feature vector at the appropriate index. This should ensure fewer collisions and less accumulation of error due to collisions.

Hashing schemes work on strings, numbers and other structures like vectors. You can think of the hashed outputs as a finite set of b bins, such that when the hash function is applied on the same values/categories, they get assigned to the same bin (or subset of bins) out of the b bins based on the hash value. We can pre-define the value of b, which becomes the final size of the encoded feature vector for each categorical attribute that we encode using the feature hashing scheme. Thus, even if we have over 1000 distinct categories in a feature and we set b=10 as the final feature vector size, the output feature set will still have only 10 features, as compared to 1000 binary features if we used a one-hot encoding scheme. Let's consider the Genre attribute in our video game dataset.

unique_genres = np.unique(vg_df[['Genre']])
print("Total game genres:", len(unique_genres))
print(unique_genres)

Output
------
Total game genres: 12
['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
 'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']

We can see that there are a total of 12 genres of video games. If we used a one-hot encoding scheme on the Genre feature, we would end up with 12 binary features. Instead, we will now use a feature hashing scheme by leveraging scikit-learn's FeatureHasher class, which uses a signed 32-bit version of the Murmurhash3 hash function. We will pre-define the final feature vector size to be 6 in this case.

from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=6, input_type='string')
hashed_features = fh.fit_transform(vg_df['Genre'])
hashed_features = hashed_features.toarray()
pd.concat([vg_df[['Name', 'Genre']],
           pd.DataFrame(hashed_features)], axis=1).iloc[1:7]

Feature Hashing on the Genre attribute

Based on the above output, the Genre categorical attribute has been encoded into 6 features instead of 12 using the hashing scheme. We can also see that rows 1 and 6 denote the same genre of games, Platform, which has rightly been encoded into the same feature vector.

Conclusion

These examples should give you a good idea about popular strategies for feature engineering on discrete, categorical data. If you read Part 1 of this series, you will have seen that it is slightly more challenging to work with categorical data than with continuous, numeric data, but definitely interesting! We also talked about some ways to handle large feature spaces using feature engineering, but you should remember that there are other techniques, including feature selection and dimensionality reduction methods, for handling large feature spaces. We will cover some of those methods in a later article. Next up will be feature engineering strategies for unstructured text data. Stay tuned!
To read about feature engineering strategies for continuous numeric data, check out Part 1 of this series!

All the code and datasets used in this article can be accessed from my GitHub

The code is also available as a Jupyter notebook

In this article, we will look at various feature engineering techniques for extracting useful information using the datetime column. And if you're new to time series, I encourage you to check out the free course below: Creating Time Series Forecast using Python.

Table of contents
Overview
Introduction
Quick Introduction to Time Series
Setting up the Problem Statement for Time Series Data
Feature Engineering for Time Series #1: Date-Related Features
Feature Engineering for Time Series #2: Time-Based Features
Feature Engineering for Time Series #3: Lag Features
Feature Engineering for Time Series #4: Rolling Window Feature
Feature Engineering for Time Series #5: Expanding Window Feature
Feature Engineering for Time Series #6: Domain-Specific Features
Validation Technique for Time Series
Frequently Asked Questions
End Notes

Quick Introduction to Time Series

Before we look at the feature engineering techniques, let's brush up on some basic time series concepts. We'll be using them throughout the article, so it's best to get acquainted with them here.

So, what makes time series projects different from traditional machine learning problems? Let's take a simple example to understand this. If we want to predict today's stock price for a certain company, it would be helpful to have information about yesterday's closing price, right? Similarly, predicting the traffic on a website would be a lot easier if we had data about the last few months or years.

"In a time series, the data is captured at equal intervals and each successive data point in the series depends on its past values."

There's another thing we need to consider - time series data may also have certain trends or seasonality. Take a look at the plot shown below about the number of tickets booked for an airline over the years. We can clearly see an increasing trend. Such information can be useful for making more accurate predictions.

Now, let's take a dataset with date-time variables and start learning about feature engineering!

Setting up the Problem Statement for Time Series Data

We'll be working on a fascinating problem to learn feature engineering techniques for time series. We have the historical data for 'JetRail', a form of public rail transport that uses advanced technology to run rails at a high speed. JetRail's usage has increased recently, and we have to forecast the traffic on JetRail for the next 7 months based on past data. Let's see how we can help JetRail's management team solve this problem. You can go through the detailed problem statement and download the dataset from here.

Let's load the dataset in our notebook:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data.dtypes

We have two columns here - so it's clearly a univariate time series. Also, the data type of the date variable is taken as an object, i.e. it is being treated as a categorical variable. Hence, we will need to convert it into a DateTime variable. We can do this using the appropriately titled to_datetime function in pandas:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')
data.dtypes

Now that we have the data ready, let's look at the different features we can engineer from this variable.
Along with each of these feature engineering techniques, we will discuss different scenarios where that particular technique can be useful.

NOTE: I have taken a simple time series problem to demonstrate the different feature engineering techniques in this article. You can use them on a dataset of your choice as long as a date-time column is present.

Feature Engineering for Time Series #1: Date-Related Features

Have you ever worked in a product company? You'll be intimately familiar with the task of forecasting the sales for a particular product. We can find out the sales pattern for weekdays and weekends based on historical data. Thus, having information about the day, month, year, etc. can be useful for forecasting the values.

Let's get back to our JetRail project. We have to forecast the count of people who will take the JetRail on an hourly basis for the next 7 months. This number could be higher for weekdays and lower for weekends or during the festive seasons. Hence, the day of the week (weekday or weekend) or the month will be an important factor. Extracting these features is really easy in Python:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')

data['year'] = data['Datetime'].dt.year
data['month'] = data['Datetime'].dt.month
data['day'] = data['Datetime'].dt.day

data['dayofweek_num'] = data['Datetime'].dt.dayofweek
data['dayofweek_name'] = data['Datetime'].dt.weekday_name

data.head()

Feature Engineering for Time Series #2: Time-Based Features

We can similarly extract more granular features if we have the time stamp. For instance, we can determine the hour or minute of the day when the data was recorded and compare the trends between business hours and non-business hours.

If we are able to extract the 'hour' feature from the time stamp, we can make more insightful conclusions about the data. We could find out if the traffic on JetRail is higher during the morning, afternoon or evening. Or we could use the value to determine the average hourly traffic throughout the week, i.e. the number of people who used JetRail between 9-10 am, 10-11 am, and so on (throughout the week).

Extracting time-based features is very similar to what we did above when extracting date-related features. We start by converting the column to DateTime format and use the .dt accessor. Here's how to do it in Python:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')

data['Hour'] = data['Datetime'].dt.hour
data['minute'] = data['Datetime'].dt.minute

data.head()

You can run the code above to generate the date and hour features for the given data, and you can pick any of the other .dt attributes in the same way to generate a new feature; a few more of them are sketched below.
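The pandas .dt accessor exposes many more attributes than the ones used above. The snippet below is a small sketch of a few of them on synthetic timestamps (not the JetRail data); the is_weekend column is simply derived from dayofweek and is not a built-in attribute.

import pandas as pd

# minimal sketch of a few more .dt accessors (synthetic hourly timestamps)
stamps = pd.Series(pd.date_range('2012-08-25', periods=5, freq='h'))

features = pd.DataFrame({
    'quarter':        stamps.dt.quarter,         # calendar quarter (1-4)
    'dayofyear':      stamps.dt.dayofyear,       # day number within the year
    'is_month_start': stamps.dt.is_month_start,  # True on the first day of a month
    'is_weekend':     stamps.dt.dayofweek >= 5,  # derived flag for Sat/Sun
})
print(features)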
Feature Engineering for Time Series #3: Lag Features

Here's something most aspiring data scientists don't think about when working on a time series problem - we can also use the target variable for feature engineering!

Consider this - you are predicting the stock price for a company. So, the previous day's stock price is important to make a prediction, right? In other words, the value at time t is greatly affected by the value at time t-1. The past values are known as lags, so t-1 is lag 1, t-2 is lag 2, and so on.

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')

data['lag_1'] = data['Count'].shift(1)
data = data[['Datetime', 'lag_1', 'Count']]
data.head()

Here, we were able to generate the lag one feature for our series. But why lag one? Why not five or seven? That's a good question. If the series has a weekly trend, which means the value last Monday can be used to predict the value for this Monday, you should create lag features for seven days. Getting the drift?

"The lag value we choose will depend on the correlation of individual values with their past values."

We can create multiple lag features as well! Let's say we want lag 1 to lag 7 - we can let the model decide which is the most valuable one. So, if we train a linear regression model, it will assign appropriate weights (or coefficients) to the lag features:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')

data['lag_1'] = data['Count'].shift(1)
data['lag_2'] = data['Count'].shift(2)
data['lag_3'] = data['Count'].shift(3)
data['lag_4'] = data['Count'].shift(4)
data['lag_5'] = data['Count'].shift(5)
data['lag_6'] = data['Count'].shift(6)
data['lag_7'] = data['Count'].shift(7)

data = data[['Datetime', 'lag_1', 'lag_2', 'lag_3', 'lag_4',
             'lag_5', 'lag_6', 'lag_7', 'Count']]
data.head(10)

There is more than one way of determining the lag at which the correlation is significant. For instance, we can use the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots.

ACF: The ACF plot is a measure of the correlation between the time series and a lagged version of itself.
PACF: The PACF plot is a measure of the correlation between the time series and a lagged version of itself, but after eliminating the variations already explained by the intervening comparisons.

For our particular example, here are the ACF and PACF plots:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(data['Count'], lags=10)
plot_pacf(data['Count'], lags=10)

"The partial autocorrelation function shows a high correlation with the first lag and lesser correlation with the second and third lags. The autocorrelation function shows a slow decay, which means that the future values have a very high correlation with the past values."

An important point to note - for however many steps you shift, that many values will be missing from the start of the data. You will see some rows with NaNs at the beginning; that's because the first observation has no lag. You'll need to discard these rows from the training data.

Feature Engineering for Time Series #4: Rolling Window Feature

In the last section, we looked at how we can use the previous values as features.
Here's an awesome gif that explains this idea in a wonderfully intuitive way:

"How about calculating some statistical values based on past values? This method is called the rolling window method because the window is different for every data point."

Since this looks like a window that slides along with every next point, the features generated using this method are called 'rolling window' features.

Now the question we need to address - how are we going to perform feature engineering here? Let's start simple. We will select a window size, take the average of the values in the window, and use it as a feature. Let's implement it in Python:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')

data['rolling_mean'] = data['Count'].rolling(window=7).mean()
data = data[['Datetime', 'rolling_mean', 'Count']]
data.head(10)

Similarly, you can consider the sum, min, max value, etc. (for the selected window) as a feature and try it out on your own machine.

"Recency is an important factor in a time series. Values closer to the current date hold more information."

Thus, we can also use a weighted average, such that higher weights are given to the most recent observations. Mathematically, the weighted average at time t for the past 7 values would be:

w_avg = w1*(t-1) + w2*(t-2) + . . . . + w7*(t-7)

where w1 > w2 > w3 > . . . . > w7. (A short sketch of this weighted version appears after the expanding window section below.)

Feature Engineering for Time Series #5: Expanding Window Feature

This is simply an advanced version of the rolling window technique. In the case of a rolling window, the size of the window is constant while the window slides as we move forward in time; hence, we consider only the most recent values and ignore the older ones. Here's a gif that explains how our expanding window function works:

"The idea behind the expanding window feature is that it takes all the past values into account."

As you can see, with every step the size of the window increases by one as it takes into account every new value in the series. This can be implemented easily in Python by using the expanding() function. Let's code this using the same data:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')

data['expanding_mean'] = data['Count'].expanding(2).mean()
data = data[['Datetime', 'Count', 'expanding_mean']]
data.head(10)

Feel free to change the starting window size and print the results.
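Here is a minimal sketch of the weighted rolling average mentioned in the rolling window section above. The counts and the weights below are arbitrary and only illustrate the idea that more recent observations receive heavier weights.

import numpy as np
import pandas as pd

# minimal sketch of a weighted rolling mean (synthetic counts, arbitrary weights)
counts = pd.Series([8, 2, 6, 2, 8, 7, 1, 2, 2, 9], dtype=float)

# weights for (t-7, ..., t-1): the most recent observation gets the largest weight
weights = np.array([1, 1, 1, 2, 2, 3, 4], dtype=float)
weights = weights / weights.sum()

# shift(1) so that only past values are used for the value at time t
weighted_mean = counts.shift(1).rolling(window=7).apply(
    lambda window: np.dot(window, weights), raw=True)
print(weighted_mean)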
Feature Engineering for Time Series #6: Domain-Specific Features

This is the essence of feature engineering! Want to dive into this more? Let's take an example. Below is the data provided by a retailer for a number of stores and products. Our task is to forecast the future demand for the products.

We can come up with various features, like taking a lag or averaging the past values, among other things. But hold on. Let me ask you a question - would it be right to build lag features from lag(1) to lag(7) throughout the data? Certainly not! There are different stores and products, and the demand for each store and product would be significantly different. In this case, we can create lag features considering each store-product combination. Moreover, if we have knowledge about the products and the trends in the market, we would be able to generate more accurate (and fewer) features.

Not only this, having a good understanding of the domain and data helps in selecting the lag value and the window size. Additionally, based on your domain knowledge, you may be able to pull in external data that adds more value to the model. Here's what I mean - are the sales affected by the weather on the day? Will the sales increase or decrease on a national holiday? If yes, then you can use external datasets and include the list of holidays as a feature.

"Having a good understanding of the problem statement, clarity of the end objective and knowledge of the available data is essential to engineer domain-specific features for the model."

Validation Technique for Time Series

All the feature engineering techniques we have discussed can be used to convert a time series problem into a supervised machine learning problem. Once we have that, we can easily go ahead with machine learning algorithms like linear regression and random forest. But there is one important step that you should know before you jump to the model building process - creating a validation set for time series.

For traditional machine learning problems, we randomly select subsets of data for the validation and test sets. But in time series, each data point is dependent on its past values. If we randomly shuffle the data, we might be training on future data and predicting past values!

"It is important that we carefully build a validation set when working on a time series problem, without destroying the sequential order within the data."

Let's create a validation set for our problem. But first, we must check the duration for which we have the data:

import pandas as pd
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'], format='%d-%m-%Y %H:%M')

data['Datetime'].min(), data['Datetime'].max(), (data['Datetime'].max() - data['Datetime'].min())

(Timestamp('2012-08-25 00:00:00'),
 Timestamp('2014-09-25 23:00:00'),
 Timedelta('761 days 23:00:00'))

We have data for almost 25 months. Let's save three months for validation and use the remaining for training:

data.index = data.Datetime
Train = data.loc['2012-08-25':'2014-06-24']
valid = data.loc['2014-06-25':'2014-09-25']

Train.shape, valid.shape

((16056, 3), (2232, 3))

Great! We have the train and validation sets ready. You can now use these feature engineering techniques and build machine learning models on this data!
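As an alternative to a single hold-out split (not used in the article itself), scikit-learn's TimeSeriesSplit produces several train/validation folds that respect the temporal order. The snippet below is a minimal sketch on a synthetic index.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# minimal sketch: ordered cross-validation folds for time series (synthetic data)
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, valid_idx) in enumerate(tscv.split(X)):
    # every validation fold lies strictly after its training fold in time
    print(fold, 'train ends at', train_idx[-1],
          '-> validate on', valid_idx[0], 'to', valid_idx[-1])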
Frequently Asked Questions

Q1. What are the features of time series?
A. The features of a time series are the characteristics and patterns observed within the data over time. Some of the key features include:
1. Trend: The long-term movement or direction in the data, indicating overall growth or decline.
2. Seasonality: Regular and predictable patterns that repeat at fixed intervals.
3. Cyclic Patterns: Longer-term oscillations with varying periods, not necessarily repeating at fixed intervals.
4. Noise: Random fluctuations or irregularities in the data that do not follow any specific pattern.
5. Autocorrelation: The correlation of a time series with its own past values at different lags.
6. Level: The baseline or starting point of the time series data.
Understanding these features is essential for time series analysis and forecasting.

Q2. What is the time series feature extraction method?
A. Time series feature extraction methods involve transforming raw time series data into a set of relevant and informative features. Techniques like moving averages, exponential smoothing, Fourier transforms, wavelet transforms, and statistical measures (mean, variance, etc.) are used to extract characteristics such as trend, seasonality, periodicity, and statistical properties. These features are then used for time series analysis, classification, and forecasting tasks.

End Notes

Time series is often considered a difficult topic to master. That's understandable, because there are a lot of moving parts when we're working with date and time components. But once you have a handle on the basic concepts and are able to perform feature engineering, you'll be gliding through your projects in no time. In this article, we discussed some simple techniques that you can use to work with time series data. Using these feature engineering techniques, we can convert any time series problem into a supervised learning problem and build regression models.

How to Use ROC Curves and Precision-Recall Curves for Classification in Python
by Jason Brownlee on October 11, 2023 in Probability

It can be more flexible to predict probabilities of an observation belonging to each class in a classification problem rather than predicting classes directly. This flexibility comes from the way probabilities may be interpreted using different thresholds that allow the operator of the model to trade off concerns in the errors made by the model, such as the number of false positives compared to the number of false negatives. This is required when using models where the cost of one error outweighs the cost of other types of errors.

Two diagnostic tools that help in the interpretation of probabilistic forecasts for binary (two-class) classification predictive modeling problems are ROC curves and precision-recall curves. In this tutorial, you will discover ROC curves, precision-recall curves, and when to use each to interpret the prediction of probabilities for binary classification problems.

After completing this tutorial, you will know:

ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

Kick-start your project with my new book Probability for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Update Aug/2018: Fixed bug in the representation of the no skill line for the precision-recall plot. Also fixed typo where I referred to ROC as relative rather than receiver (thanks spellcheck).
Update Nov/2018: Fixed description on interpreting size of values on each axis, thanks Karl Humphries.
Update Jun/2019: Fixed typo when interpreting imbalanced results.
Update Oct/2019: Updated ROC Curve and Precision Recall Curve plots to add labels, use a logistic regression model and actually compute the performance of the no skill classifier.
Update Nov/2019: Improved description of no skill classifier for precision-recall curve.
Update Oct/2023: Minor update on code to make it more Pythonic

How and When to Use ROC Curves and Precision-Recall Curves for Classification in Python
Photo by Giuseppe Milo, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

1. Predicting Probabilities
2. What Are ROC Curves?
3. ROC Curves and AUC in Python
4. What Are Precision-Recall Curves?
5. Precision-Recall Curves and AUC in Python
6. When to Use ROC vs. Precision-Recall Curves?

Predicting Probabilities

In a classification problem, we may decide to predict the class values directly. Alternately, it can be more flexible to predict the probabilities for each class instead. The reason for this is to provide the capability to choose and even calibrate the threshold for how to interpret the predicted probabilities. For example, a default might be to use a threshold of 0.5, meaning that a probability in [0.0, 0.49] is a negative outcome (0) and a probability in [0.5, 1.0] is a positive outcome (1). This threshold can be adjusted to tune the behavior of the model for a specific problem. An example would be to reduce more of one or another type of error.

When making a prediction for a binary or two-class classification problem, there are two types of errors that we could make.

False Positive. Predict an event when there was no event.
False Negative. Predict no event when in fact there was an event.

By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen by the operator of the model. For example, in a smog prediction system, we may be far more concerned with having low false negatives than low false positives. A false negative would mean not warning about a smog day when in fact it is a high smog day, leading to health issues in the public that are unable to take precautions. A false positive means the public would take precautionary measures when they didn't need to.

A common way to compare models that predict probabilities for two-class problems is to use a ROC curve.

What Are ROC Curves?

A useful tool when predicting the probability of a binary outcome is the Receiver Operating Characteristic curve, or ROC curve. It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate.

The true positive rate is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. It describes how good the model is at predicting the positive class when the actual outcome is positive. The true positive rate is also referred to as sensitivity.

The false positive rate is calculated as the number of false positives divided by the sum of the number of false positives and the number of true negatives. It is also called the false alarm rate, as it summarizes how often a positive class is predicted when the actual outcome is negative.
True Positive Rate = True Positives / (True Positives + False Negatives)

Sensitivity = True Positives / (True Positives + False Negatives)

False Positive Rate = False Positives / (False Positives + True Negatives)

The false positive rate is also referred to as the inverted specificity, where specificity is the total number of true negatives divided by the sum of the number of true negatives and false positives. Where:

Specificity = True Negatives / (True Negatives + False Positives)

False Positive Rate = 1 - Specificity

The ROC curve is a useful tool for a few reasons:

The curves of different models can be compared directly in general or for different thresholds.
The area under the curve (AUC) can be used as a summary of the model skill.

The shape of the curve contains a lot of information, including what we might care about most for a problem, the expected false positive rate, and the false negative rate. To make this clear:

Smaller values on the x-axis of the plot indicate lower false positives and higher true negatives.
Larger values on the y-axis of the plot indicate higher true positives and lower false negatives.

If you are confused, remember: when we predict a binary outcome, it is either a correct prediction (true positive) or not (false positive). There is a tension between these options, the same with true negative and false negative.

A skilful model will assign a higher probability to a randomly chosen real positive occurrence than to a negative occurrence on average. This is what we mean when we say that the model has skill. Generally, skilful models are represented by curves that bow up towards the top left of the plot.

A no-skill classifier is one that cannot discriminate between the classes and would predict a random class or a constant class in all cases. A model with no skill is represented at the point (0.5, 0.5). A model with no skill at each threshold is represented by a diagonal line from the bottom left of the plot to the top right and has an AUC of 0.5.

A model with perfect skill is represented at the point (0,1). A model with perfect skill is represented by a line that travels from the bottom left of the plot to the top left and then across the top to the top right.

An operator may plot the ROC curve for the final model and choose a threshold that gives a desirable balance between the false positives and false negatives.

ROC Curves and AUC in Python

We can plot a ROC curve for a model in Python using the roc_curve() scikit-learn function. The function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class. The function returns the false positive rates for each threshold, the true positive rates for each threshold, and the thresholds.

...
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y, probs)

The AUC for the ROC can be calculated using the roc_auc_score() function. Like the roc_curve() function, the AUC function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class. It returns the AUC score between 0.0 and 1.0 for no skill and perfect skill respectively.

...
# calculate AUC
auc = roc_auc_score(y, probs)
print('AUC: %.3f' % auc)

A complete example of calculating the ROC curve and ROC AUC for a Logistic Regression model on a small test problem is listed below.
import matplotlib.pyplot as plt
# roc curve and auc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(testy))]
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(testy, ns_probs)
lr_auc = roc_auc_score(testy, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(testy, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(testy, lr_probs)
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()

Running the example prints the ROC AUC for the logistic regression model and the no skill classifier that only predicts 0 for all examples.

No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.903

A plot of the ROC curve for the model is also created, showing that the model has skill.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

ROC Curve Plot for a No Skill Classifier and a Logistic Regression Model

What Are Precision-Recall Curves?

There are many ways to evaluate the skill of a prediction model. An approach in the related field of information retrieval (finding documents based on queries) measures precision and recall. These measures are also useful in applied machine learning for evaluating binary classification models.

Precision is a ratio of the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class. Precision is referred to as the positive predictive value.

Positive Predictive Power = True Positives / (True Positives + False Positives)

or

Precision = True Positives / (True Positives + False Positives)

Recall is calculated as the ratio of the number of true positives divided by the sum of the true positives and the false negatives. Recall is the same as sensitivity.

Recall = True Positives / (True Positives + False Negatives)

or

Sensitivity = True Positives / (True Positives + False Negatives)

Recall == Sensitivity

Reviewing both precision and recall is useful in cases where there is an imbalance in the observations between the two classes. Specifically, there are many examples of no event (class 0) and only a few examples of an event (class 1). The reason is that typically the large number of class 0 examples means we are less interested in the skill of the model at predicting class 0 correctly, e.g. high true negatives. Key to the calculation of precision and recall is that the calculations do not make use of the true negatives; they are only concerned with the correct prediction of the minority class, class 1.
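To make these definitions concrete, precision and recall can also be computed directly from predicted labels using scikit-learn's precision_score() and recall_score() functions. The snippet below is a small sketch with hand-made labels, not data from the tutorial.

from sklearn.metrics import precision_score, recall_score

# minimal sketch: precision and recall from hand-made labels (illustrative only)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]

# precision = TP / (TP + FP) = 4 / (4 + 1)
print('Precision: %.3f' % precision_score(y_true, y_pred))
# recall = TP / (TP + FN) = 4 / (4 + 2)
print('Recall:    %.3f' % recall_score(y_true, y_pred))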
A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve.

A no-skill classifier is one that cannot discriminate between the classes and would predict a random class or a constant class in all cases. The no-skill line changes based on the distribution of the positive to negative classes. It is a horizontal line with the value of the ratio of positive cases in the dataset. For a balanced dataset, this is 0.5.

While the baseline is fixed with ROC, the baseline of [precision-recall curve] is determined by the ratio of positives (P) and negatives (N) as y = P / (P + N). For instance, we have y = 0.5 for a balanced class distribution …

— The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015.

A model with perfect skill is depicted as a point at (1,1). A skilful model is represented by a curve that bows towards (1,1) above the flat line of no skill.

There are also composite scores that attempt to summarize the precision and recall; two examples include:

F-Measure or F1 score: calculates the harmonic mean of the precision and recall (harmonic mean because the precision and recall are rates).
Area Under Curve: like the ROC AUC, summarizes the integral or an approximation of the area under the precision-recall curve.

In terms of model selection, the F-Measure summarizes model skill for a specific probability threshold (e.g. 0.5), whereas the area under curve summarizes the skill of a model across thresholds, like ROC AUC. This makes precision-recall curves, and their summary measures, useful tools for binary classification problems that have an imbalance in the observations for each class.

Precision-Recall Curves in Python

Precision and recall can be calculated in scikit-learn. The precision and recall can be calculated for thresholds using the precision_recall_curve() function that takes the true output values and the probabilities for the positive class as input and returns the precision, recall and threshold values.

...
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, probs)

The F-Measure can be calculated by calling the f1_score() function that takes the true class values and the predicted class values as arguments.

...
# calculate F1 score
f1 = f1_score(testy, yhat)

The area under the precision-recall curve can be approximated by calling the auc() function and passing it the recall (x) and precision (y) values calculated for each threshold.

...
# calculate precision-recall AUC
auc = auc(recall, precision)

When plotting precision and recall for each threshold as a curve, it is important that recall is provided as the x-axis and precision as the y-axis.

The complete example of calculating precision-recall curves for a Logistic Regression model is listed below.
The complete example of calculating precision-recall curves for a Logistic Regression model is listed below.

```python
# precision-recall curve and f1
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(testX)
lr_precision, lr_recall, _ = precision_recall_curve(testy, lr_probs)
lr_f1, lr_auc = f1_score(testy, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(testy[testy==1]) / len(testy)
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
plt.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
# show the plot
plt.show()
```

Running the example first prints the F1 and area under curve (AUC) for the logistic regression model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

```
Logistic: f1=0.841 auc=0.898
```

The precision-recall curve plot is then created showing the precision/recall for each threshold for a logistic regression model (orange) compared to a no skill model (blue).

Precision-Recall Plot for a No Skill Classifier and a Logistic Regression Model

When to Use ROC vs. Precision-Recall Curves?
Generally, the use of ROC curves and precision-recall curves is as follows:
ROC curves should be used when there are roughly equal numbers of observations for each class.
Precision-Recall curves should be used when there is a moderate to large class imbalance.
The reason for this recommendation is that ROC curves present an optimistic picture of the model on datasets with a class imbalance.

However, ROC curves can present an overly optimistic view of an algorithm's performance if there is a large skew in the class distribution. […] Precision-Recall (PR) curves, often used in Information Retrieval, have been cited as an alternative to ROC curves for tasks with a large skew in the class distribution.
— The Relationship Between Precision-Recall and ROC Curves, 2006.

Some go further and suggest that using a ROC curve with an imbalanced dataset might be deceptive and lead to incorrect interpretations of the model skill.

[…] the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity. [Precision-recall curve] plots, on the other hand, can provide the viewer with an accurate prediction of future classification performance due to the fact that they evaluate the fraction of true positives among positive predictions.
— The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015.

The main reason for this optimistic picture is because of the use of true negatives in the False Positive Rate in the ROC Curve and the careful avoidance of this rate in the Precision-Recall curve.

If the proportion of positive to negative instances changes in a test set, the ROC curves will not change. Metrics such as accuracy, precision, lift and F scores use values from both columns of the confusion matrix. As a class distribution changes these measures will change as well, even if the fundamental classifier performance does not. ROC graphs are based upon TP rate and FP rate, in which each dimension is a strict columnar ratio, so do not depend on class distributions.
— ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, 2003.
We can make this concrete with a short example.
Below is the same ROC Curve example with a modified problem where there is a ratio of about 100:1 of class=0 to class=1 observations (specifically Class0=985, Class1=15).

```python
# roc curve and auc on an imbalanced dataset
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99,0.01], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(testy))]
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(testy, ns_probs)
lr_auc = roc_auc_score(testy, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(testy, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(testy, lr_probs)
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
```

Running the example suggests that the model has skill.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

```
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.716
```

Indeed, it has skill, but all of that skill is measured as making correct true negative predictions and there are a lot of negative predictions to make.
If you review the predictions, you will see that the model predicts the majority class (class 0) in all cases on the test set. The score is very misleading.
A plot of the ROC Curve confirms the AUC interpretation of a skilful model for most probability thresholds.

ROC Curve Plot for a No Skill Classifier and a Logistic Regression Model for an Imbalanced Dataset

We can also repeat the test of the same model on the same dataset and calculate a precision-recall curve and statistics instead.
The complete example is listed below.

```python
# precision-recall curve and f1 for an imbalanced dataset
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99,0.01], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(testX)
# calculate precision and recall for each threshold
lr_precision, lr_recall, _ = precision_recall_curve(testy, lr_probs)
# calculate scores
lr_f1, lr_auc = f1_score(testy, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(testy[testy==1]) / len(testy)
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
plt.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
# show the plot
plt.show()
```

Running the example first prints the F1 and AUC scores.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

```
Logistic: f1=0.000 auc=0.054
```

We can see that the model is penalized for predicting the majority class in all cases. The scores show that the model that looked good according to the ROC Curve is in fact barely skillful when considered using precision and recall that focus on the positive class.
The plot of the precision-recall curve highlights that the model is just barely above the no skill line for most thresholds. This is possible because the model predicts probabilities and is uncertain about some cases. These get exposed through the different thresholds evaluated in the construction of the curve, flipping some class 0 to class 1, offering some precision but very low recall.

Precision-Recall Plot for a No Skill Classifier and a Logistic Regression Model for an Imbalanced Dataset
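A rough back-of-the-envelope calculation (my own illustration, using round numbers close to the test split above) shows why the two views diverge: with so many true negatives available, even a noticeable number of false positives barely moves the false positive rate, yet those same false positives swamp the few true positives and crush precision.

```python
# Illustrative arithmetic only. The test split above has roughly 493 negatives and
# 7 positives; suppose some threshold yields 5 true positives and 10 false positives.
tp, fn = 5, 2
fp, tn = 10, 483
tpr = tp / (tp + fn)        # about 0.71: still looks respectable on a ROC curve
fpr = fp / (fp + tn)        # about 0.02: barely moves along the ROC x-axis
precision = tp / (tp + fp)  # about 0.33: the precision-recall view collapses
print(tpr, fpr, precision)
```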
Further Reading
This section provides more resources on the topic if you are looking to go deeper.

Papers
A critical investigation of recall and precision as measures of retrieval system performance, 1989.
The Relationship Between Precision-Recall and ROC Curves, 2006.
The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015.
ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, 2003.

API
sklearn.metrics.roc_curve API
sklearn.metrics.roc_auc_score API
sklearn.metrics.precision_recall_curve API
sklearn.metrics.auc API
sklearn.metrics.average_precision_score API
Precision-Recall, scikit-learn
Precision, recall and F-measures, scikit-learn

Articles
Receiver operating characteristic on Wikipedia
Sensitivity and specificity on Wikipedia
Precision and recall on Wikipedia
Information retrieval on Wikipedia
F1 score on Wikipedia
ROC and precision-recall with imbalanced datasets, blog.

Summary
In this tutorial, you discovered ROC Curves, Precision-Recall Curves, and when to use each to interpret the prediction of probabilities for binary classification problems.
Specifically, you learned:
ROC Curves summarize the trade-off between the true positive rate and the false positive rate for a predictive model using different probability thresholds.
Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.
Do you have any questions? Ask your questions in the comments below and I will do my best to answer.

A Gentle Introduction to Threshold-Moving for Imbalanced Classification
by Jason Brownlee on January 5, 2021 in Imbalanced Classification

Classification predictive modeling typically involves predicting a class label. Nevertheless, many machine learning algorithms are capable of predicting a probability or scoring of class membership, and this must be interpreted before it can be mapped to a crisp class label. This is achieved by using a threshold, such as 0.5, where all values equal to or greater than the threshold are mapped to one class and all other values are mapped to another class.
For those classification problems that have a severe class imbalance, the default threshold can result in poor performance. As such, a simple and straightforward approach to improving the performance of a classifier that predicts probabilities on an imbalanced classification problem is to tune the threshold used to map probabilities to class labels.
In some cases, such as when using ROC Curves and Precision-Recall Curves, the best or optimal threshold for the classifier can be calculated directly. In other cases, it is possible to use a grid search to tune the threshold and locate the optimal value.
In this tutorial, you will discover how to tune the optimal threshold when converting probabilities to crisp class labels for imbalanced classification.
After completing this tutorial, you will know:
The default threshold for interpreting probabilities to class labels is 0.5, and tuning this hyperparameter is called threshold moving.
How to calculate the optimal threshold for the ROC Curve and Precision-Recall Curve directly.
How to manually search threshold values for a chosen model and model evaluation metric.
Kick-start your project with my new book Imbalanced Classification with Python, including step-by-step tutorials and the Python source code files for all examples.
Let's get started.
Update Feb/2020: Fixed typo in Specificity equation.
Update Jan/2021: Updated links for API documentation.

A Gentle Introduction to Threshold-Moving for Imbalanced Classification. Photo by Bruna cs, some rights reserved.

Tutorial Overview
This tutorial is divided into five parts; they are:
1. Converting Probabilities to Class Labels
2. Threshold-Moving for Imbalanced Classification
3. Optimal Threshold for ROC Curve
4. Optimal Threshold for Precision-Recall Curve
5. Optimal Threshold Tuning

Converting Probabilities to Class Labels
Many machine learning algorithms are capable of predicting a probability or a scoring of class membership. This is useful generally as it provides a measure of the certainty or uncertainty of a prediction. It also provides additional granularity over just predicting the class label, which can be interpreted.
Some classification tasks require a crisp class label prediction. This means that even though a probability or scoring of class membership is predicted, it must be converted into a crisp class label.
The decision for converting a predicted probability or scoring into a class label is governed by a parameter referred to as the "decision threshold," "discrimination threshold," or simply the "threshold." The default value for the threshold is 0.5 for normalized predicted probabilities or scores in the range between 0 and 1.
For example, on a binary classification problem with class labels 0 and 1 and normalized predicted probabilities, a threshold of 0.5 means that values less than 0.5 are assigned to class 0 and values greater than or equal to 0.5 are assigned to class 1.
Prediction < 0.5 = Class 0
Prediction >= 0.5 = Class 1
The problem is that the default threshold may not represent an optimal interpretation of the predicted probabilities. This might be the case for a number of reasons, such as:
The predicted probabilities are not calibrated, e.g. those predicted by an SVM or decision tree.
The metric used to train the model is different from the metric used to evaluate a final model.
The class distribution is severely skewed.
The cost of one type of misclassification is more important than another type of misclassification.
Worse still, some or all of these reasons may occur at the same time, such as the use of a neural network model with uncalibrated predicted probabilities on an imbalanced classification problem.
As such, there is often the need to change the default decision threshold when interpreting the predictions of a model.

… almost all classifiers generate positive or negative predictions by applying a threshold to a score.
The choice of this threshold will have an impact in the trade-offs of positive and negative errors.
— Page 53, Learning from Imbalanced Data Sets, 2018.

Threshold-Moving for Imbalanced Classification
There are many techniques that may be used to address an imbalanced classification problem, such as resampling the training dataset and developing customized versions of machine learning algorithms.
Nevertheless, perhaps the simplest approach to handle a severe class imbalance is to change the decision threshold. Although simple and very effective, this technique is often overlooked by practitioners and research academics alike, as was noted by Foster Provost in his 2000 article titled "Machine Learning from Imbalanced Data Sets."
— Machine Learning from Imbalanced Data Sets 101, 2000.

There are many reasons to choose an alternative to the default decision threshold.
For example, you may use ROC curves to analyze the predicted probabilities of a model and ROC AUC scores to compare and select a model, although you require crisp class labels from your model. How do you choose the threshold on the ROC Curve that results in the best balance between the true positive rate and the false positive rate?
Alternately, you may use precision-recall curves to analyze the predicted probabilities of a model, precision-recall AUC to compare and select models, and require crisp class labels as predictions. How do you choose the threshold on the Precision-Recall Curve that results in the best balance between precision and recall?
You may use a probability-based metric to train, evaluate, and compare models, like log loss (cross-entropy), but require crisp class labels to be predicted. How do you choose the optimal threshold from predicted probabilities more generally?
Finally, you may have different costs associated with false positive and false negative misclassification, a so-called cost matrix, but wish to use and evaluate cost-insensitive models and later evaluate their predictions using a cost-sensitive measure. How do you choose a threshold that finds the best trade-off for predictions using the cost matrix?

The bottom line is that when studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake.
— Page 67, Learning from Imbalanced Data Sets, 2018.

The answer to these questions is to search a range of threshold values in order to find the best threshold. In some cases, the optimal threshold can be calculated directly.
Tuning or shifting the decision threshold in order to accommodate the broader requirements of the classification problem is generally referred to as "threshold-moving," "threshold-tuning," or simply "thresholding."
— Pages 72, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

The process involves first fitting the model on a training dataset and making predictions on a test dataset. The predictions are in the form of normalized probabilities or scores that are transformed into normalized probabilities. Different threshold values are then tried and the resulting crisp labels are evaluated using a chosen evaluation metric. The threshold that achieves the best evaluation metric is then adopted for the model when making predictions on new data in the future.
We can summarize this procedure below.
1. Fit Model on the Training Dataset.
2. Predict Probabilities on the Test Dataset.
3. For each threshold in Thresholds:
  3a. Convert probabilities to Class Labels using the threshold.
  3b. Evaluate Class Labels.
  3c. If Score is Better than Best Score:
    3ci. Adopt Threshold.
4. Use Adopted Threshold When Making Class Predictions on New Data.
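To make the procedure concrete, here is a minimal sketch (my own illustration, not code from the tutorial) that grid-searches the threshold using F1 as an assumed evaluation metric on a synthetic imbalanced dataset:

```python
# A minimal sketch of the threshold-search procedure above; F1 and logistic
# regression are assumed choices and any metric or model could be substituted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# 1. fit model on the training dataset
model = LogisticRegression(solver='lbfgs').fit(trainX, trainy)
# 2. predict probabilities on the test dataset (positive class only)
probs = model.predict_proba(testX)[:, 1]
# 3. for each candidate threshold, convert probabilities to labels and evaluate
best_threshold, best_score = 0.5, -1.0
for threshold in np.arange(0.0, 1.0, 0.001):
    labels = (probs >= threshold).astype(int)  # 3a. convert to class labels
    score = f1_score(testy, labels)            # 3b. evaluate class labels
    if score > best_score:                     # 3c. keep the best threshold so far
        best_threshold, best_score = threshold, score
# 4. use the adopted threshold when making class predictions on new data
print('Best Threshold=%.3f, F1=%.3f' % (best_threshold, best_score))
```

Any other evaluation metric could be swapped in for f1_score, depending on what the project actually needs to optimize.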
Although simple, there are a few different approaches to implementing threshold-moving depending on your circumstance. We will take a look at some of the most common examples in the following sections.

Popular way of training a cost-sensitive classifier without a known cost matrix is to put emphasis on modifying the classification outputs when predictions are being made on new data. This is usually done by setting a threshold on the positive class, below which the negative one is being predicted. The value of this threshold is optimized using a validation set and thus the cost matrix can be learned from training data.

It has been stated that trying other methods, such as sampling, without first trying simply setting the threshold may be misleading.

The threshold-moving method uses the original training set to train [a model] and then moves the decision threshold such that the minority class examples are easier to be predicted correctly.

Optimal Threshold for ROC Curve
A ROC curve is a diagnostic plot that evaluates a set of probability predictions made by a model on a test dataset.
A set of different thresholds are used to interpret the true positive rate and the false positive rate of the predictions on the positive (minority) class, and the scores are plotted in a line of increasing thresholds to create a curve.
The false-positive rate is plotted on the x-axis and the true positive rate is plotted on the y-axis, and the plot is referred to as the Receiver Operating Characteristic curve, or ROC curve. A diagonal line on the plot from the bottom-left to top-right indicates the "curve" for a no-skill classifier (one that predicts the majority class in all cases), and a point in the top left of the plot indicates a model with perfect skill.
The curve is useful to understand the trade-off in the true-positive rate and false-positive rate for different thresholds. The area under the ROC Curve, so-called ROC AUC, provides a single number to summarize the performance of a model in terms of its ROC Curve, with a value between 0.5 (no skill) and 1.0 (perfect skill).
The ROC Curve is a useful diagnostic tool for understanding the trade-off for different thresholds, and the ROC AUC provides a useful number for comparing models based on their general capabilities.
If crisp class labels are required from a model under such an analysis, then an optimal threshold is required. This would be a threshold on the curve that is closest to the top-left of the plot.
Thankfully, there are principled ways of locating this point.
First, let's fit a model and calculate a ROC Curve.
We can use the make_classification() function to create a synthetic binary classification problem with 10,000 examples (rows), 99 percent of which belong to the majority class and 1 percent of which belong to the minority class.

```python
...
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
```

We can then split the dataset using the train_test_split() function and use half for the training set and half for the test set.
```python
...
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
```

We can then fit a LogisticRegression model and use it to make probability predictions on the test set and keep only the probability predictions for the minority class.

```python
...
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
```

```python
...
# calculate scores
lr_auc = roc_auc_score(testy, lr_probs)
```

We can then use the roc_curve() function to calculate the true-positive rate and false-positive rate for the predictions using a set of thresholds that can then be used to create a ROC Curve plot.
We can tie this all together, defining the dataset, fitting the model, and creating the ROC Curve plot. The complete example is listed below.
Running the example fits a logistic regression model on the training dataset then evaluates it using a range of thresholds on the test set, creating the ROC Curve.
We can see that there are a number of points or thresholds close to the top-left of the plot.
Which is the threshold that is optimal?
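For reference, a minimal end-to-end sketch that assembles the snippets above might look as follows; the plotting calls and the closest-to-(0, 1) selection at the end are my own illustrative choices rather than the tutorial's exact listing:

```python
# End-to-end sketch assembling the walkthrough above (illustrative, not the
# tutorial's exact code): fit the model, build the ROC curve, and mark the
# threshold whose point lies closest to the top-left corner (0, 1).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# fit a model and keep the probabilities for the minority class
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
lr_probs = model.predict_proba(testX)[:, 1]
# calculate the roc curve over a range of thresholds
fpr, tpr, thresholds = roc_curve(testy, lr_probs)
# one way to pick an "optimal" point: smallest distance to the top-left corner
ix = np.argmin(np.sqrt(fpr ** 2 + (1 - tpr) ** 2))
print('Threshold closest to top-left=%.3f' % thresholds[ix])
# plot the roc curve for the model and mark the selected point
plt.plot([0, 1], [0, 1], linestyle='--', label='No Skill')
plt.plot(fpr, tpr, marker='.', label='Logistic')
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```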

answer:The optimal threshold for the ROC Curve is the point on the curve that is closest to the top-left corner, representing a good balance between the true positive rate (TPR) and the false positive rate (FPR). This point maximizes the TPR while minimizing the FPR. To find this optimal threshold, you can take the following steps:

1. Calculate the distances from each point on the ROC curve to the top-left corner (point (0,1)).
2. Identify the point on the curve with the minimum distance to the top-left corner.
3. The corresponding threshold for this point would be considered the optimal threshold.

In Python, you can use the `roc_curve` function from scikit-learn's `metrics` module to calculate the TPRs and FPRs for different thresholds, and then use a small amount of custom code to pick the best point. The snippet below uses a closely related criterion, the geometric mean (G-mean) of sensitivity and specificity, which also favours points near the top-left corner:

```python
from sklearn.metrics import roc_curve
import numpy as np

# Assume `testy` contains the true class labels and `lr_probs` contains the predicted probabilities
fpr, tpr, thresholds = roc_curve(testy, lr_probs)

# Calculate the G-mean for each threshold
gmeans = np.sqrt(tpr * (1 - fpr))

# Find the index of the maximum G-mean
ix = np.argmax(gmeans)

# Find the optimal threshold
optimal_threshold = thresholds[ix]
print('Optimal Threshold=%f' % (optimal_threshold))
```

The code calculates the geometric mean (G-mean) for each threshold, which balances the TPR and 1 - FPR (specificity), and selects the threshold with the highest G-mean as the optimal threshold for the ROC Curve. This method is effective when the goal is to balance sensitivity and specificity.

Keep in mind that the definition of "optimal" depends on the specific context and requirements of your classification task. You might prioritize different performance metrics (such as precision, recall, F1-score, etc.) based on the domain and the implications of false positives and false negatives in your application.
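As a small follow-on (not part of the original answer), the tuned threshold can then be used to convert the probabilities into crisp class labels:

```python
import numpy as np

# Apply the tuned threshold from the snippet above to get crisp class labels
yhat = (np.asarray(lr_probs) >= optimal_threshold).astype(int)
```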

question:Alma had a long life. She was born on 18 May of the year 1926. She was able to stay at home for as long as possible. After a two-week stay in the hospital she was admitted to the De Stichel residential care centre. She was there for a month. Her husband Gaby De Saeger died 30 years ago. Three of her children have already passed away. She was at peace with the care centre. The food there was good. Write an introduction for the funeral liturgy. Here is an example from another funeral:

At the hinge moments of life, people seek and find one another again. Also on a day like today, because of this farewell. What seemed so far away is now suddenly very close. Every person has their own story, every person has their own way of mourning, and every person dies their own death. Living, dying, saying goodbye, letting everything go: we each do it in our own unique way; it is a very lonely undertaking. We are gathered here around Jeanne to surround her with respect and to remember with gratitude who she was. On the one hand we feel sadness and grief because Jeanne has been taken from us, but on the other hand there is also a feeling of gratitude for what she meant to us and who she was for us. Jeanne did a great deal for this parish community. She was always ready to help others. Jeanne had a strong personality and was very good at organising. She always used her talents to the full. To keep Jeanne's name in our memory as irreplaceable, to pass on the hope she cherished like a torch, to hand on Jeanne's wisdom of life to one another. We ask for forgiveness.

Also write a homily. This is the gospel: (Matthew 11:25-30) "At that time Jesus said: 'I praise you, Father, Lord of heaven and earth, because you have hidden these things from the wise and the learned and revealed them to simple people. Yes, Father, this is what you willed. Everything has been entrusted to me by my Father, and no one but the Father knows who the Son is, and who the Father is only the Son knows, and anyone to whom the Son wishes to reveal him. Come to me, you who are weary and bowed down under your burdens, and I will give you rest. Take my yoke upon you and learn from me, for I am gentle and humble of heart. Then you will truly find rest, for my yoke is easy and my burden is light.'"

Here is an example of a homily from another funeral:

Someone is taken from us... These are the moments when we sit together like defeated children. Distraught, uprooted, speechless and vulnerable. It is hard to understand what death is. Not only when you are small, but also when you are grown. One day a person will stop living. Sometimes you notice that someone will not live much longer. Death can come quickly. One day the person is still there. The next day he is gone. Through Paula's passing something has fundamentally changed. When someone dies, something essential happens.

Paula Boomans was born in Neder-Over-Heembeek on 2 February of the year 1945. As a young adult she was confronted with the death of her parents and of her brother. During that period she met Gilbert Willems, her future husband. They had two wonderful children: Karen and Peter. Until 1978 Paula lived in La Grande Ferme, her parental home in the Ransbeekstraat. Her mother ran the café there and her father had a sawmill and a transport business there. She then moved to the Trassersweg.

Paula worked for a pharmaceutical company that was closed down. As a result, Paula was forced into retirement at the age of fifty. Paula was a quiet woman. She always stood by her husband in the pursuit of his various hobbies. There were the activities of the Davidsfonds, the folk-dance group Lendrik and the music group Faro. There was Blackstar. Paula went along to her daughter's roller-skating activities. She was a good hostess. There were also the beautiful moments at the seaside. She loved her children and grandchildren. The dog Bo also held a central place in the family. Paula celebrated the important moments of her year together with family. She gave the grandchildren too many sweets. What Paula has given us remains precious and indelible. No one can take it from us.

So, long ago, a despairing people, generation after generation, people who felt lost, who could find no peace with what they saw happening around them, cried out to God. And at that lowest point a very different future was promised to them: "Do not cling to what happened before. Do not stare blindly at what went before. See, I am making something new. It is already budding, did you not know? Yes, I will lay a path through the desert... I give water where there is no water... Something new is happening, it was already budding, and we did not know it..."

But it is not that grief that will irrevocably overshadow our memory; it is the memory of the person she was. Everything she did. Who she was for us. In the passage from the Gospel, Jesus addresses the weary living. The core thought is: "Come to me, you who are weary and bowed down under your burdens, and I will give you rest." "My yoke is easy," says Jesus, "and my burden is light." Jesus does not take the burdens away with a wave of the hand. Note well: the yoke remains and the burdens remain as well. They become bearable. They can be lived with.

In the last fortnight the world of the Willems-Bomans family was turned upside down. Paula learned that she had an aggressive cancer for which nothing more could be done. Her life expectancy became very short. Surrounded by her loved ones, Paula has gone from us. Thank you, Paula, for everything you did, for the person you were, for your indestructible core.

answer:Introduction for the funeral liturgy of Alma:

In the sacred stillness of this gathering, we have come together to take leave of Alma in love and respect: a woman of steadfast determination and a heart filled with love and care. A mother, a grandmother, a friend, a trusted presence who leaves a few pages of her life story behind with us.

Alma's journey began on 18 May of the year 1926, a century of change through which she stood firm as a beacon of light for many. She shared joys and sorrows and carefully kept the memories of her husband Gaby De Saeger, who went before her thirty years ago, and of the three children she had to let go.

Alma spent her last days in the loving embrace of the De Stichel residential care centre, a place of rest that gave her peace of mind and where she was happy with the small things, such as the daily meals.

Gathered here today, we pause the course of our own lives for a moment to unite in remembrance of her. Alma's life was a quiet lesson in resilience and forgiveness, in keeping peace with the world and with herself. Her legacy is one of steadfast love and indelible footprints in the hearts of those who knew her.

Let us now, with a mixture of sorrow at her absence and gratitude for her presence, remember Alma: for the strength she gave us, for the wise lessons, and for the unconditional love she spread wherever she went. In her memory we will strive for the peace and rest she has now found.

Homily:

Beloved, we are gathered here today in the shadow of both loss and light. We pause, heart by heart, at the passing of a dear soul: our Alma, who steadily made her way through life across the ages.

Alma's existence weathered the tides of a full century; her stories are as unmistakable as the changes she saw and felt. As she walked with courage and compassion, we remember her as a pillar of support, a refuge in turbulent times.

When we turn to the Gospel of Matthew, we hear the words of Jesus, an invitation of comfort for those weighed down by burdens. These words capture in a special way the life Alma led and the rest she now offers us from another place. She knew what burdens were; she carried them with a humble heart and always offered a gentle yoke to those around her.

Alma's story, rooted in the simplicity of being and a deep sense of love, reflects Jesus' message and invites us to find rest in hard times. Her life's course, which ended in the De Stichel residential care centre, carried a note of gratitude and acceptance, qualities that Jesus held in high regard.

In our remembrance of Alma we are called to carry our burdens with the same gentleness and humility she radiated. While we mourn and search for meaning, let us acknowledge the beauty of her life and let it be a source of light on our path.

So today we not only say farewell to Alma, but also offer a heartfelt thank you to her unforgettable spirit, to her unshakeable love, and for the peace she now wishes us from her heavenly resting place. Let us honour her life in prayer and meditation and open our hearts to the rest that she, and Jesus, promise us. Amen.

question:(In the school literature clubroom...) Monika: "Natsuki, are you okay?!"

answer:Natsuki: "Yeah, I'm fine. Just had a little argument with my dad earlier, that's all." Monika: "Oh no, what happened? Do you want to talk about it?" Natsuki: "It's nothing, really. He was upset that I spend so much time on manga and baking instead of studying. He just doesn't understand that these are the things I love and enjoy." Monika: "I'm sorry to hear that. It can be tough when our passions clash with other people's expectations. But you know, pursuing your hobbies and interests is important too. It's what makes you happy." Natsuki: "I know, but it's just frustrating when he doesn't seem to support me. I feel like I have to constantly prove myself to him." Monika: "I can understand why you feel that way, but don't let it discourage you. Sometimes, people just need a little time to come around and see the value in what we do. And remember, we're here for you too. The literature club is a place where you can freely express yourself and share your love for manga and baking." Natsuki: "Thanks, Monika. I appreciate that. It's nice to know that there are people who understand." Monika: "Of course, Natsuki. We're a team, after all. We support and encourage each other. And don't forget, your talents shine through in your poems and the delicious treats you bring to our meetings." Natsuki: "Heh, thanks. I'll try to remember that." Monika: "Take your time, Natsuki. We're here for you whenever you need someone to talk to or lend an ear. And remember, your passions are valid, no matter what others might say." Natsuki: "Yeah, you're right. Thanks, Monika."

question:
```python
import os
import subprocess
import numpy as np
import uuid  # Import uuid to generate unique IDs
from moviepy.editor import VideoFileClip
from scipy.io import wavfile

temporary_audio_files = []

# Define video file extensions and the output folder
video_extensions = ['.mp4', '.mkv', '.wmv', '.avi']
output_folder = 'Output'

def ask_directory_preference():
    print("Souhaitez-vous inclure les sous-dossiers dans la recherche des vidéos ?")
    print("1- Oui")
    print("2- Non")
    choice = input("Veuillez entrer le numéro de votre choix (1 ou 2) : ")
    return choice.strip() == '1'  # Returns True if the user chooses '1' (Yes), False otherwise

def calculate_loudness(audio_data):
    if audio_data.ndim == 1:
        volume = audio_data.astype('float32') ** 2
    else:
        volume = np.mean(audio_data.astype('float32') ** 2, axis=1)
    volume_dB = 10 * np.log10(volume + 1e-9)  # +1e-9 to avoid log(0) and convert to dB
    return volume_dB

def find_loudest_moments(audio_filename, num_moments, segment_duration, video_duration, starting_offset, ending_offset):
    rate, audio_data = wavfile.read(audio_filename)
    # Convert stereo to mono if necessary
    if audio_data.ndim == 2:
        audio_data = np.mean(audio_data, axis=1)
    volume_dB = calculate_loudness(audio_data)
    segment_half_duration = segment_duration / 2.0
    start_index = int(starting_offset * rate)
    end_index = int((video_duration - ending_offset) * rate)
    moments = []
    volumes = []
    while len(moments) < num_moments and (end_index - start_index) > 0:
        index = np.argmax(volume_dB[start_index:end_index])
        moment = (start_index + index) / rate
        moment_volume = volume_dB[start_index + index]
        # Skip the moment and set its volume to a very low number if it is too close to the start or the end
        if moment - segment_half_duration < starting_offset or moment + segment_half_duration > video_duration - ending_offset:
            volume_dB[start_index + index] = -np.inf
            continue
        moments.append(moment)
        volumes.append(moment_volume)
        # Clear the volume around the found moment to prevent picking up nearby moments
        clear_range_start = max(0, index - int(rate * segment_half_duration))
        clear_range_end = min(len(volume_dB), index + int(rate * segment_half_duration))
        volume_dB[clear_range_start:clear_range_end] = -np.inf
    return moments, volumes

def extract_segments(video_path, moments, segment_duration, video_duration):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    base_name = os.path.splitext(os.path.basename(video_path))[0]
    half_segment_duration = segment_duration / 2
    for i, moment in enumerate(moments):
        start_time = max(moment - half_segment_duration, 0)
        end_time = min(start_time + segment_duration, video_duration)
        output_filename = f"{base_name}_moment{i + 1}.mp4"
        output_path = os.path.join(output_folder, output_filename)
        command = [
            "ffmpeg",
            "-y",  # Overwrite output files without asking
            "-ss", str(start_time),  # Start time
            "-i", video_path,  # Input file
            "-t", str(min(segment_duration, video_duration - start_time)),  # Duration or remaining video
            "-c:v", "libx264",  # Specify video codec for output
            "-preset", "medium",  # Specify the encoding preset (trade-off between encoding speed and quality)
            "-crf", "23",  # Specify the Constant Rate Factor for quality (lower means better quality)
            "-c:a", "aac",  # Specify audio codec for output
            "-strict", "-2",  # Necessary for some versions of ffmpeg to use experimental aac encoder
            "-b:a", "192k",  # Specify the audio bitrate
            output_path  # Output path
        ]
        try:
            subprocess.run(command, check=True, stderr=subprocess.PIPE)
            print(f"Extracted and re-encoded {output_filename}")
        except subprocess.CalledProcessError as e:
            # Catch the exception for this process, print an error message,
            # but don't break from the for loop.
            err_msg = e.stderr.decode('utf-8') if e.stderr else 'Unknown error'
            print(f"Failed to extract segment from {video_path}: {err_msg}")

def store_segment_info(video_path, moment, volume, order):
    base_name = os.path.splitext(os.path.basename(video_path))[0]
    output_filename = f"{base_name}_moment{order}.mp4"
    output_path = os.path.join(output_folder, output_filename)
    extracted_segments.append({
        'path': output_path,
        'timestamp': moment,
        'volume': volume
    })

def ask_sorting_preference():
    print("Comment souhaitez-vous trier les vidéos extraites ?")
    print("1- Par ordre de lecture de la vidéo")
    print("2- Par ordre inverse de lecture de la vidéo")
    print("3- Par volume croissant")
    print("4- Par volume décroissant")
    print("5- Pas de tri")
    choice = int(input("Veuillez entrer le numéro de votre choix : "))
    return choice

def sort_moments(moments, volumes, choice):
    if choice == 1:  # By video playback order
        zipped = sorted(zip(moments, volumes), key=lambda x: x[0])
    elif choice == 2:  # By reverse playback order
        zipped = sorted(zip(moments, volumes), key=lambda x: x[0], reverse=True)
    elif choice == 3:  # By increasing volume
        zipped = sorted(zip(moments, volumes), key=lambda x: x[1])
    elif choice == 4:  # By decreasing volume
        zipped = sorted(zip(moments, volumes), key=lambda x: x[1], reverse=True)
    else:  # No sorting
        zipped = zip(moments, volumes)
    sorted_moments, sorted_volumes = zip(*zipped)
    return list(sorted_moments), list(sorted_volumes)

def get_video_durations(include_subfolders):
    video_durations = {}
    for root, dirs, files in os.walk('.', topdown=True):
        if not include_subfolders and root != '.':
            dirs[:] = []  # Do not traverse subfolders.
        for file in files:
            if file.lower().endswith(tuple(video_extensions)):
                video_path = os.path.join(root, file)
                try:
                    # Try to open the video file and extract its metadata.
                    video_clip = VideoFileClip(video_path)
                    video_duration = video_clip.duration
                    video_durations[video_path] = video_duration
                except Exception as e:
                    # If an error occurs, print a message and skip this video file.
                    print(f"Error processing video {video_path}: {e}")
                finally:
                    # Make sure to close the clip to free resources.
                    video_clip.close()
    return video_durations

def process_video_files(starting_offset_seconds, ending_offset_seconds, num_moments, segment_duration, video_path):
    video_durations = {}
    for root, _, files in os.walk('.'):
        for file in files:
            if file.lower().endswith(tuple(video_extensions)):
                video_path = os.path.join(root, file)
                unique_id = str(uuid.uuid4())  # Generate a unique ID for this video's audio file
                audio_path = f'temp_audio_{unique_id}.wav'  # Name the temporary file using the unique ID
                temporary_audio_files.append(audio_path)  # Add the temporary file path to the list
                try:
                    video_clip = VideoFileClip(video_path)
                    video_duration = video_clip.duration
                    video_durations[video_path] = video_duration  # Store the duration.
                    video_clip.audio.write_audiofile(audio_path, codec='pcm_s16le', fps=44100, nbytes=2, buffersize=2000)
                    video_clip.close()  # Ensure to close the video clip
                    moments = find_loudest_moments(audio_path, num_moments, segment_duration, video_duration, starting_offset_seconds, ending_offset_seconds)
                    extract_segments(video_path, moments, segment_duration, video_duration)  # Now pass video_duration
                finally:
                    if os.path.exists(audio_path):
                        os.remove(audio_path)
                print(f"Finished processing video {video_path}")
    return video_durations  # Return the dictionary containing the video durations.

if __name__ == "__main__":
    include_subfolders = ask_directory_preference()

    # Get the durations of the videos to process
    video_durations = get_video_durations(include_subfolders)
    average_duration = np.mean(list(video_durations.values())) if video_durations else 0

    # Make sure videos were found before continuing
    if not video_durations:
        print("Aucune vidéo trouvée pour l'analyse.")
    else:
        starting_offset_seconds = float(input("Combien de secondes à ignorer au début pour l'analyse ? "))
        ending_offset_seconds = float(input("Combien de secondes à ignorer à la fin pour l'analyse ? "))
        max_possible_segments = int((average_duration - starting_offset_seconds - ending_offset_seconds) // 1)  # 1 second by default for the initial calculation

        confirmed = False
        while not confirmed:
            segment_duration = float(input("Quelle est la durée (en secondes) de chaque segment vidéo à extraire ? "))
            max_segments = int((average_duration - starting_offset_seconds - ending_offset_seconds) // segment_duration)
            print(f"Avec une durée de segment de {segment_duration} secondes, vous pourrez extraire au maximum {max_segments} segments d'une vidéo moyenne.")
            if max_segments > 0:
                print("1: Confirmer la durée de segment")
                print("2: Entrer une nouvelle durée de segment")
                user_choice = input("Votre choix (1/2) : ")
                confirmed = user_choice.strip() == '1'
            else:
                print("La durée de segment spécifiée est trop longue. Aucun segment ne peut être extrait avec cette configuration.")

        # Ask the user how many segments they want to extract
        num_moments = int(input(f"Combien de moments forts souhaitez-vous extraire pour chaque vidéo (maximum {max_segments}) ? "))
        sorting_preference = ask_sorting_preference()

        for video_path, duration in video_durations.items():
            available_duration = max(duration - starting_offset_seconds - ending_offset_seconds, 0)
            max_segments_per_video = int(available_duration // segment_duration)

            # Skip videos that cannot produce any segments because their duration is too short
            if max_segments_per_video == 0:
                print(f"La vidéo '{video_path}' ne permet pas d'extraire de segments avec la configuration actuelle.")
                continue

            # Compare the desired number of highlights with the maximum number of extractable segments
            moments_to_extract = min(num_moments, max_segments_per_video)
            if moments_to_extract < num_moments:
                print(f"La vidéo '{video_path}' est trop courte pour extraire {num_moments} moments forts. Seuls {moments_to_extract} seront extraits.")

            audio_path = f'temp_audio_{uuid.uuid4().hex}.wav'
            try:
                # Extract the video file's metadata and its audio track
                video_clip = VideoFileClip(video_path)
                video_clip.audio.write_audiofile(audio_path, codec='pcm_s16le', fps=44100, nbytes=2, buffersize=2000)
                moments, volumes = find_loudest_moments(audio_path, moments_to_extract, segment_duration, duration, starting_offset_seconds, ending_offset_seconds)

                # Sort the moments according to the user's preference before extracting the segments
                if sorting_preference != 5:
                    moments, volumes = sort_moments(moments, volumes, sorting_preference)

                # Extract the segments in the sorted order
                extract_segments(video_path, moments, segment_duration, duration)
            finally:
                # Remove the temporary audio file
                if os.path.exists(audio_path):
                    os.remove(audio_path)

            print(f"Traitement terminé pour la vidéo '{video_path}'")

        print("Tous les traitements vidéo ont été effectués.")
```

answer:The script you've provided has several functionalities related to video processing:

1. `ask_directory_preference`: Asks the user whether to include subfolders when looking for video files.
2. `calculate_loudness`: Calculates the loudness of an audio segment.
3. `find_loudest_moments`: Identifies the loudest moments in an audio file.
4. `extract_segments`: Extracts video segments around the specified moments using FFmpeg.
5. `store_segment_info`: Stores information about the extracted segments in a list.
6. `ask_sorting_preference`: Asks the user about their preference for sorting the extracted video segments.
7. `sort_moments`: Sorts moments based on the selected preference.
8. `get_video_durations`: Collects the duration of video files found in a directory.
9. `process_video_files`: Processes video files by extracting audio, finding loud moments, and extracting segments.
10. Main execution block (`if __name__ == "__main__":`): Orchestrates the flow of operations based on user input.

From the current state of the script, there appears to be some overlap or duplication in functionality. For example, `process_video_files` performs similar steps to the main execution block but is never called, and the `video_durations` dictionary is re-defined and re-populated inside it even though it was already assembled by `get_video_durations`. Also, `store_segment_info` is never invoked, and it appends to an `extracted_segments` list that is never defined anywhere in the script.

Here's a rough flow of the script's execution based on user input:

1. It asks the user whether to include subfolders in the video search.
2. It retrieves video durations using `get_video_durations`.
3. The user is prompted to enter the segment duration and the number of moments to extract.
4. It asks the user their preference for sorting the extracted video segments.
5. For each video file found, it performs the following actions:
   - Converts the video's audio track into a WAV file.
   - Identifies the loudest moments using `find_loudest_moments`.
   - Sorts these moments if needed, according to the preference chosen through `ask_sorting_preference`.
   - Extracts the segments from the video around these moments using `extract_segments`.

The script is meant to handle errors such as missing video files or issues during processing without stopping the overall processing of the available video files.

Additionally, the script uses the `moviepy` library to work with video files, `scipy.io.wavfile` to read WAV files, `numpy` for numerical computations, and `ffmpeg` for video and audio processing. It writes a temporary audio file for each video's extracted audio and deletes it at the end of processing that video.

Finally, there are some unfinished or unused parts in this script, such as the `temporary_audio_files` list and the `store_segment_info` function, which suggest that the script is a work in progress or that some refactoring may be necessary to improve efficiency and clarity.
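If `store_segment_info` is meant to be kept, a minimal sketch of how it could be wired in (an assumption about the intended design, not code from the original script) is to define the missing list and record each segment inside the per-video loop after the moments have been found:

```python
# Hypothetical wiring, assuming the functions and loop variables from the script above.
extracted_segments = []  # define the list that store_segment_info() appends to

# ...inside the per-video loop, after find_loudest_moments() and any sorting:
for order, (moment, volume) in enumerate(zip(moments, volumes), start=1):
    store_segment_info(video_path, moment, volume, order)
```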

