The model I created shows an AUC (area under the curve) of 0.75. What I really wanted to see, though, were the coefficients produced by the model: they give an intuitive sense that years of experience is one of the indicators of job movement for a data scientist. StandardScaler removes the mean and scales each feature to unit variance. We will improve the score in the next steps. Determine a suitable metric to rate the performance of the model. Third, we can see that multiple features have a significant amount of missing data (~30%). A company that is active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. I used Random Forest to build the baseline model. The violin plot shows the numeric variable city_development_index (CDI) against target.
Information related to demographics, education, and experience is available from the candidates' signup and enrollment. The dataset has features that are mostly categorical (nominal, ordinal, binary), some with high cardinality. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps to reduce cost and time and to improve the quality of training, course planning, and the categorization of candidates. There are 19,158 observations (rows) in total. This dataset is also designed to help HR research understand the factors that lead a person to leave their current job. For the third model, we used a Gradient Boosting classifier. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. Next, we converted the city attribute to numerical values using the ordinal encode function. Since our purpose is to determine whether a data scientist will change jobs, we set the looking-for-job variable as the label and used the remaining data as training data. As with many Kaggle example datasets, the data had already been cleaned and structured, so the only preparation I needed to do was to identify null values and think of a way to handle them. Second, some of the features are similarly imbalanced, such as gender. To improve candidate selection in its recruitment process, a company collects data and builds a model to predict whether a candidate will continue to work for the company. StandardScaler is fitted on the training dataset and used to transform it, and the same fitted transformation is then applied to the validation dataset.
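The scaling rule mentioned above (fit on the training data, then reuse the same statistics on the validation data) can be sketched without any library. This is a minimal illustration, not the original author's code: the function names `fit_scaler` and `transform` and the toy `train_hours` / `valid_hours` values are hypothetical, and the population standard deviation is used because that is what scikit-learn's StandardScaler uses.

```python
from statistics import fmean, pstdev


def fit_scaler(train_column):
    """Learn mean and population standard deviation from TRAINING data only."""
    mean = fmean(train_column)
    std = pstdev(train_column)
    if std == 0:
        # A constant column would cause division by zero; leave it unscaled.
        std = 1.0
    return mean, std


def transform(column, mean, std):
    """Apply previously learned statistics to any split (train or validation)."""
    return [(x - mean) / std for x in column]


# Hypothetical toy values standing in for a numeric feature.
train_hours = [10.0, 20.0, 30.0, 40.0]
valid_hours = [25.0, 50.0]

mean, std = fit_scaler(train_hours)               # statistics come from train only
train_scaled = transform(train_hours, mean, std)
valid_scaled = transform(valid_hours, mean, std)  # no refitting on validation
```

Reusing the training statistics keeps information from the validation set out of the preprocessing step, which is why the scaler is not refitted there.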
Employees with less than one year, 1 to 5 years, and 6 to 10 years of experience tend to leave their job more often than others. Then I decided to have a quick look at histograms showing which numeric values are present and what they look like. We used this final model to increase our AUC-ROC to 0.8. A big advantage of the gradient boosting classifier is that it calculates the importance of each feature for the model and ranks the features. The dataset is imbalanced and most features are categorical (nominal, ordinal, binary), some with high cardinality. Furthermore, we wanted to understand whether a greater number of job seekers came from developed areas. Note that after imputing, I round the imputed label-encoded categories so they can be decoded as valid categories. I made a stackplot for each categorical feature against target, but for the clarity of the post I am only showing the stackplot for enrolled_course and target. Description of dataset: the dataset I am planning to use is from Kaggle. Some of the features are numeric, others are categorical. Using the missingness matrix, you can very quickly find the pattern of missingness in the dataset. First, the prediction target is severely imbalanced (far more target=0 than target=1). This project includes data analysis, machine learning modeling, and visualization using SHAP, with 13 features and 19,158 rows. To predict which candidates will change jobs, simple statistics are not enough; we need machine learning so the company can categorize candidates into those who are looking for a job change and those who are not. Classification models (CART, RandomForest, LASSO, RIDGE) identified three variables as significant for an employee's decision to leave or to work for the company.
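The rounding step noted above (imputed label-encoded values must become valid category codes again) might look like the following sketch. It is an assumption-laden illustration rather than the author's implementation: `decode_imputed`, the `education_levels` ordering, and the `imputed_codes` values are all hypothetical.

```python
def decode_imputed(values, classes):
    """Round imputed codes to the nearest valid index and map them back to labels."""
    last = len(classes) - 1
    decoded = []
    for value in values:
        code = int(round(value))         # an averaged code such as 1.33 becomes 1
        code = min(max(code, 0), last)   # clip anything outside the valid range
        decoded.append(classes[code])
    return decoded


# Hypothetical ordering of label-encoded classes.
education_levels = ["Primary School", "High School", "Graduate", "Masters", "Phd"]

# Hypothetical output of an imputer that averages neighbouring codes.
imputed_codes = [2.0, 1.3333, 3.6667, 4.7]

labels = decode_imputed(imputed_codes, education_levels)
```

Note that Python's `round` rounds exact halves to the nearest even integer, so a value of exactly 2.5 would decode to index 2, not 3.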
A company is interested in understanding the factors that may influence a data scientist's decision to stay with a company or switch jobs. In this project, I performed an exploratory analysis on the HR Analytics dataset to understand what the data contains, developed an ML pipeline to predict the possibility of an employee changing their job, and visualized my model predictions using a Streamlit web app hosted on Heroku. We achieved an accuracy of 66% and an AUC-ROC score of 0.69. This dataset is designed to help HR research understand the factors that lead a person to leave their current job. The task involves using one or more models to predict the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors that affect the employee's decision. This allows the company to reduce cost and time and to improve the quality of training, course planning, and the categorization of candidates. The goal: predict the probability that a candidate will work for the company. Another interesting observation we made was that, as the city development index of a city increases, a smaller share of the total workforce is looking to change jobs. What is the total number of observations? After splitting the data into train and validation sets, the distribution of class labels shows that the data does not follow the imbalance criterion. Each employee is described with various demographic features. With this, I looked into the odds and the Weight of Evidence that the variables provide.
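As a rough illustration of the odds and Weight of Evidence idea mentioned above, the sketch below computes both per category. The convention used here, WoE = ln(share of target=1 in the category / share of target=0 in the category), is an assumption (some texts use the inverse ratio), and `odds_and_woe` and `toy_counts` are hypothetical names with made-up counts.

```python
import math


def odds_and_woe(counts):
    """counts maps category -> (n_target_1, n_target_0).

    Returns category -> (odds of target=1, Weight of Evidence).
    Assumes every count is non-zero; zero counts would need smoothing.
    """
    total_1 = sum(n1 for n1, _ in counts.values())
    total_0 = sum(n0 for _, n0 in counts.values())
    result = {}
    for category, (n1, n0) in counts.items():
        odds = n1 / n0
        woe = math.log((n1 / total_1) / (n0 / total_0))
        result[category] = (odds, woe)
    return result


# Made-up counts, NOT taken from the dataset.
toy_counts = {"enrolled": (40, 60), "not_enrolled": (60, 240)}

stats = odds_and_woe(toy_counts)
```

With this sign convention, a positive WoE marks a category that is over-represented among job seekers, and a negative WoE marks one that is under-represented.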
Source: https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015

city_development_index: Development index of the city (scaled)
relevent_experience: Relevant experience of candidate
enrolled_university: Type of university course enrolled in, if any
education_level: Education level of candidate
major_discipline: Education major discipline of candidate
experience: Candidate's total experience in years
company_size: Number of employees in current employer's company
lastnewjob: Difference in years between previous job and current job
target: 0 = not looking for a job change, 1 = looking for a job change

Using one or more models built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision. One category accounts for 88% of all major discipline values, so it may override the others when we create our model. Our model could be used to reduce the screening cost and increase the profit of institutions by minimizing investment in employees who are only in it for the short run. In an initial analysis we counted the number of null values in each column. Besides missing values, our data also contained entries that had categorical data in certain columns only. Of course, there is a lot more work that could drive this analysis further if time permits. I formulated the problem as a binary classification problem: predicting whether an employee will stay or switch jobs. Simple countplots and histograms of the features can give us a general idea of how each feature is distributed.
That's because I set the threshold to a relative difference of 50%, so that labels for groups with small differences won't clutter up the plot. The company wants to increase recruitment efficiency by knowing which candidates are looking for a job change in their career, so that they can be hired as data scientists. The relatively small gap in accuracy and AUC scores suggests that the model did not significantly overfit. Question 2: Do years of experience have any effect on the desire for a job change? From the graph, we were able to determine that most people who were satisfied with their job belonged to more developed cities. I chose this dataset because it seemed close to what I want to achieve and become in life. To control for the size of the target groups, I wrote a function that draws a stackplot to visualize correlations between variables. We believed this might help us better understand why an employee would seek another job. In this article, I will showcase visualizing a dataset containing categorical and numerical data, and also build a pipeline that deals with missing data and imbalanced data and predicts a binary outcome. Synthetically sampling the data using the Synthetic Minority Oversampling Technique (SMOTE) results in the best-performing Logistic Regression model, as seen from its having the highest F1 and recall scores. Since SMOTENC, which I use for data augmentation, accepts non-label-encoded data, I need to save the fitted label encoders so I can use them to decode categories after KNN imputation.
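The core idea behind SMOTE referred to above is interpolation between a minority-class point and one of its nearest minority neighbours. The sketch below shows only that idea for purely numeric features. It is not the imbalanced-learn implementation and it does not handle categorical columns the way SMOTENC does; `smote_like`, its parameters, and the toy points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points with two numeric features.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)
```

Because every synthetic point lies on a segment between two real minority points, oversampling this way should only be applied to the training split, never to the validation or test data.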
In this project I want to explore the people who join the company's data science training and whether they intend to change jobs or become data scientists at the company. The dataset contains 14 columns. Note: in the train data, there is one human error in the column company_size. The pipeline I built for the analysis consists of 5 parts. After hyperparameter tuning, I ran the final trained model with the optimal hyperparameters on both the train and the test set to compute the confusion matrix, accuracy, and ROC curves for both. For another recommendation, please check the notebook. This project is a graduation requirement: PandasGroup_JC_DS_BSD_JKT_13_Final Project. A company engaged in big data and data science wants to hire data scientists from among the people who have successfully passed its courses. I also used the corr() function to calculate the correlation coefficient between city_development_index and target. Using the Random Forest model, we were able to increase our accuracy to 78% and our AUC-ROC to 0.785. Many people sign up.
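The correlation coefficient that corr() reports by default is Pearson's r. A minimal sketch of that formula is shown below. The helper `pearson` and the `cdi` / `target` toy values are hypothetical and are not taken from the dataset, so the resulting number is illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up values: higher development index paired with target=0 more often.
cdi = [0.92, 0.90, 0.62, 0.55, 0.80, 0.60]
target = [0, 0, 1, 1, 0, 1]

r = pearson(cdi, target)
```

When one of the two variables is binary, as target is here, Pearson's r is the point-biserial correlation, so a negative value means the target=1 group has the lower mean index.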
The goal is to a) understand the demographic variables that may lead to a job change, and b) predict whether an employee is looking for a job change. This dataset consists of rows of data science employees who either are searching for a job change (target=1) or are not (target=0). This exploratory analysis takes a basic look at the publicly available data to see the behaviour and unravel what is happening in the market, using the HR Analytics job change of data scientists dataset found on Kaggle. Before jumping into the data visualization, it's good to take a look at what each feature means. We can see that the dataset includes numerical and categorical features, some of which have high cardinality. The features do not suffer from multicollinearity, as the pairwise Pearson correlation values are close to 0. Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. The approach to cleaning up the data had 6 major steps. Besides renaming a few columns for better visualization, there were no other apparent issues with our data. This content can be referenced for research and education purposes. Abdul Hamid - abdulhamidwinoto@gmail.com. Information regarding how the data was collected is currently unavailable. Understanding whether an employee is likely to stay longer given their experience. We have seen rampant demand for data-driven technologies in this era, and one of the key careers fuelling it is that of the data scientist, which has gained the title of sexiest job out there. The odds show that candidates with experience or enrolled in university tend to have higher odds of moving, and the Weight of Evidence shows the same for experience and for those enrolled in university.
Choose an appropriate number of iterations by analyzing the evaluation metric on the validation dataset. Will more training reduce attrition? March 9, 2021. The source of this dataset is Kaggle. The hiring process could be time- and resource-consuming if the company targets all candidates based only on their training participation. I also tried some predictions: I used city_development_index and enrollee_id to try to predict training_hours with linear regression, but I got a bad result. Insight: major discipline is the third most important predictor of an employee's decision. We use the ROC AUC score to evaluate model performance. Summarize findings to stakeholders: around 73% of people have no university enrollment. I got -0.34 for the coefficient, indicating a moderate negative relationship, which matches the negative relationship we saw in the violin plot. Next, we need to convert categorical data to numeric format because sklearn cannot handle it directly. Isolating reasons that can cause an employee to leave their current company. Introduction: companies actively involved in big data and analytics spend money to train employees and hire them for data scientist positions. However, according to a survey, it seems some candidates leave the company once trained. There was only a slight increase in accuracy and AUC score from applying LightGBM over XGBoost, but there is a significant difference in the execution time of the training procedure. GitHub link: all code can be found at this link.
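The ROC AUC score used for evaluation above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The function `roc_auc` and the toy labels and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: 1 = looking for a job change, scores are predicted probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

Unlike accuracy, this metric does not depend on a single decision threshold, which makes it a common choice when the target classes are imbalanced.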
For the purposes of exploring, let's just focus on the logistic regression for now. What is the effect of a major discipline? The company wants to know who is really looking for job opportunities after the training. Variable 3: Discipline Major. HR Analytics: Job Change of Data Scientist, by Lim Jie-Ying. Most features are categorical (nominal, ordinal, binary), some with high cardinality. We have seen that experience would be a driver of job change; maybe expectations are different?