Job Change Analysis in Python

This is an individual project of mine, built in a Jupyter Notebook using Python. It involves preprocessing a dataset and applying feature engineering methods. The dataset covers the job changes of data scientists. I use descriptive analytics, inferential analytics, supervised learning, and linear regression in this project.

Data Quality

This dataset is interesting to me because it offers insights into the job changes of data scientists. It is designed to help understand the factors that lead a person to leave their job. It helps a company predict the probability of an employee leaving based on credentials, demographics, and experience data. This helps reduce cost and time, and improves the quality of training, course planning, and the categorization of candidates. The target column takes two values: 0 (not looking for a job change) and 1 (looking for a job change).

Data Preprocessing

For this section, I identified the null values in the dataset. After careful consideration, I decided to remove certain columns entirely because of their large number of null values, rather than lose the many rows that contained them. I then removed the rows with null values in the remaining columns. I analysed the data types of the columns and converted categorical values to numerical values. I used techniques suited to nominal and ordinal categorical variables, combining label encoding with one-hot encoding. I also removed text from some column values and converted them to numerical values. Finally, I standardised the data to fit between 0 and 2.
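The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the tiny data frame, its column names, the 50% null threshold, the hand-written ordinal mapping, and the use of scikit-learn's MinMaxScaler for the 0 to 2 range are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up frame; column names and values are illustrative only.
df = pd.DataFrame({
    "city": ["city_1", "city_2", "city_1", "city_3", "city_2", "city_1"],
    "education_level": ["Graduate", "Masters", "Graduate", None, "Phd", "Masters"],
    "experience": [">20", "5", "<1", "12", "3", "7"],
    "company_type": [None, None, "Pvt Ltd", None, None, None],
    "target": [1, 0, 0, 1, 0, 1],
})


def preprocess(frame, max_null_fraction=0.5, feature_range=(0, 2)):
    """Hypothetical helper: clean, encode, and scale the example frame."""
    # 1. Drop columns whose share of nulls exceeds the (assumed) threshold.
    null_fraction = frame.isnull().mean()
    kept = frame.loc[:, null_fraction <= max_null_fraction]

    # 2. Drop the rows that still contain nulls in the remaining columns.
    kept = kept.dropna().copy()

    # 3. Strip text such as "<" and ">" and convert the column to integers.
    kept["experience"] = (
        kept["experience"].str.replace(r"[<>]", "", regex=True).astype(int)
    )

    # 4. Ordinal variable: encode with an explicit order (assumed mapping).
    order = {"Graduate": 0, "Masters": 1, "Phd": 2}
    kept["education_level"] = kept["education_level"].map(order)

    # 5. Nominal variable: one-hot encode.
    kept = pd.get_dummies(kept, columns=["city"], dtype=int)

    # 6. Rescale every feature column to the chosen range, leaving the target alone.
    features = kept.drop(columns="target")
    scaler = MinMaxScaler(feature_range=feature_range)
    scaled = pd.DataFrame(
        scaler.fit_transform(features),
        columns=features.columns,
        index=features.index,
    )
    scaled["target"] = kept["target"]
    return scaled


clean = preprocess(df)
print(clean)
```

In this sketch the mostly empty company_type column is dropped as a whole, so only the single row with a missing education_level is lost.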

Feature Engineering

For this section, I used supervised machine learning techniques. As I wanted to create a model to predict whether a data scientist will change jobs, I separated the prediction (target) column from the remaining feature columns. I then split the data into train and test sets, with 70% in the train set and 30% in the test set. Evaluating on data the model has not seen during training helps detect overfitting. I then compared the accuracy of three models: a support vector machine (SVM), RandomForestClassifier, and Linear Regression.
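A rough sketch of this workflow with scikit-learn is shown below. It runs on synthetic data from make_classification rather than the real dataset, and the random seeds, default model settings, and the 0.5 cut-off used to turn linear regression outputs into class labels are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features (X) and the target column (y).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 70% of the rows go to the train set and 30% to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}

# The two classifiers predict 0/1 labels directly.
classifiers = {
    "SVM": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Linear regression outputs a continuous value, so it is thresholded at 0.5
# (an assumed cut-off) to obtain a label before computing accuracy.
linear = LinearRegression().fit(X_train, y_train)
linear_labels = (linear.predict(X_test) >= 0.5).astype(int)
scores["LinearRegression"] = accuracy_score(y_test, linear_labels)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because every model is scored on the held-out 30%, a large gap between train and test accuracy would point to overfitting.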

Click here for the dataset and code.