Tech Minds: Analyzing Mental Health Trends in Technology Professionals

Python, Pandas, Scikit-learn
2025-03-05

A comprehensive analysis of the 2014 Mental Health in Tech Survey using Python, Pandas, and Scikit-Learn to uncover employer attitudes and predict treatment-seeking behavior.

The technology industry is known for its innovation and fast-paced environment, but it also presents unique challenges regarding employee well-being. In this post, we dive into the "Tech Minds" project, an analysis based on the 2014 Mental Health in Tech Survey. We will explore employer attitudes toward mental health, the prevalence of disorders among tech workers, and build predictive models to determine who is likely to seek treatment.

1. Business & Data Understanding

The primary goal of this analysis is to measure the pulse of mental health in the tech workplace. Key questions we aim to answer include:

  • Do employers recognize the importance of mental health?
  • How frequently does mental health impact work performance?
  • How do demographics like Age and Gender influence the likelihood of seeking treatment?

The dataset consists of 1,259 responses with features ranging from self-employment status and family history to employer benefits and anonymity protections.

2. Data Processing and Cleaning

Real-world survey data is rarely clean. Our first step involves handling missing values, outliers, and inconsistent categorical data.

Cleaning the Age Column

The Age column contained substantial noise, including impossible values (e.g., -1726, 329, 99999999999).

# Removing values that are clearly erroneous and can't be corrected
values = [-1726, 329, 99999999999, -1, -29]
data = data[~data["Age"].isin(values)]

Standardizing Gender

One of the most significant challenges in this dataset was the Gender column, which contained a wide variety of free-text entries (e.g., "Male", "m", "Male-ish", "Cis Female"). We standardized these into three categories: male, female, and trans.

# Making gender-groups to classify all values
male_str = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man","msle", "mail", "malr","cis man", "Cis Male", "cis male"]
trans_str = ["trans-female", "something kinda male?", "queer/she/they", "non-binary","nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "trans woman", "neuter", "female (trans)", "queer", "ostensibly male, unsure what that really means"]
female_str = ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail"]

# Lowercasing first so entries like "Male" match the lowercase groups above
data['Gender'] = data['Gender'].str.lower()

# Changing groups into three categories
for index, row in data.iterrows():
    if row['Gender'] in male_str:
        data.at[index, 'Gender'] = 'male'
    elif row['Gender'] in female_str:
        data.at[index, 'Gender'] = 'female'
    elif row['Gender'] in trans_str:
        data.at[index, 'Gender'] = 'trans'

# Getting rid of unnecessary values
stk_list = ['a little about you', 'p']
data = data[~data['Gender'].isin(stk_list)]
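The same grouping can also be expressed without a row-by-row loop. The following is a minimal sketch, assuming an abbreviated lookup table (only a few of the entries from the lists above) and a hypothetical helper named `standardize_gender`. It normalizes case and whitespace, then maps each raw entry to its category.

```python
import pandas as pd

# Abbreviated lookup for illustration; the full lists above would be used in practice
GENDER_GROUPS = {
    "male": ["male", "m", "male-ish", "cis male", "man"],
    "female": ["female", "f", "cis female", "woman"],
    "trans": ["trans woman", "non-binary", "genderqueer", "queer"],
}
# Invert the groups into a raw value -> category dictionary
LOOKUP = {raw: group for group, raws in GENDER_GROUPS.items() for raw in raws}

def standardize_gender(series):
    """Lowercase, trim, and map free-text entries; unmapped values become NaN."""
    return series.str.lower().str.strip().map(LOOKUP)

raw = pd.Series(["Male", "m", "Cis Female", " Woman ", "Non-binary", "p"])
result = standardize_gender(raw)
print(result.tolist())
```

Because unmapped entries come back as NaN, they can be inspected with `isna()` before deciding whether to drop them.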

Handling Null Values

We addressed missing data in the self_employed and work_interfere columns by assigning default values.

# Assigning default values for columns with missing values
defaultString = 'NaN'

# Replacing missing values in self_employed with 'No'
# (fillna handles real NaN; replace handles the 'NaN' placeholder string, if one was set earlier)
data['self_employed'] = data['self_employed'].fillna('No').replace(defaultString, 'No')

# Replacing missing values in work_interfere with "Don't know"
data['work_interfere'] = data['work_interfere'].fillna("Don't know").replace(defaultString, "Don't know")
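Before assigning defaults, it helps to check how much is actually missing. Here is a small sketch on toy data (not the survey itself) that counts missing entries per column and then fills both columns in one call. The `defaults` dictionary mirrors the choices described above.

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns discussed above
sample = pd.DataFrame({
    "self_employed": ["Yes", np.nan, "No", np.nan],
    "work_interfere": ["Often", np.nan, "Never", "Sometimes"],
})

# Count missing entries per column before filling
missing_before = sample.isna().sum()

# Per-column defaults, matching the choices described above
defaults = {"self_employed": "No", "work_interfere": "Don't know"}
filled = sample.fillna(value=defaults)

print(missing_before.to_dict())
print(int(filled.isna().sum().sum()))
```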

3. Exploratory Data Analysis (EDA)

With clean data, we visualized key trends using seaborn and plotly.

Mental Health Benefits

One of the first questions we asked was: Does your employer provide mental health benefits?

import plotly.graph_objects as go

colors = ['gold', 'mediumturquoise', 'darkorange']
labels = ["Yes","Don't know","No"]
values = [477, 408, 374]

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent')])
fig.update_traces(title='Does your employer provide mental health benefits?',
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()

Observation: A plurality of respondents (about 38%) report that their employer provides benefits, but nearly a third (about 32%) simply don't know whether they are covered, which points to a communication gap.
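The shares behind this observation follow directly from the counts passed to the pie chart. A quick arithmetic check, using only those three numbers:

```python
# Counts taken from the pie chart above
counts = {"Yes": 477, "Don't know": 408, "No": 374}
total = sum(counts.values())

# Percentage share of each answer, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total)
print(shares)
```

The total matches the 1,259 responses in the dataset, and the "Don't know" share comes out at roughly 32%.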

Treatment Probability by Gender

We analyzed the likelihood of seeking treatment based on gender identity.

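import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,8))
# labelDict contains mapped values from LabelEncoder
labels = labelDict['label_Gender']
# Assumes Gender and treatment are already label-encoded (treatment: 0 = No, 1 = Yes),
# so each bar's height (the mean of treatment) is the share who sought treatment
g = sns.barplot(x="Gender", y="treatment", data=data)
g.set_xticklabels(labels)
plt.title('Probability of Seeking Treatment by Gender')
plt.show()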

Observation: Transgender individuals showed a higher probability of seeking treatment compared to other groups, though the sample size for this group was smaller.
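Since group sizes differ, it is worth reporting the sample size next to each rate. The sketch below uses made-up toy rows (these are not survey results) to show one way of computing the treatment rate and the group size together.

```python
import pandas as pd

# Toy data for illustration only; these are not survey results
sample = pd.DataFrame({
    "Gender": ["male", "male", "male", "male", "female", "female", "trans", "trans"],
    "treatment": ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Share of "Yes" answers and number of respondents per gender group
summary = (
    sample.assign(sought=sample["treatment"].eq("Yes"))
    .groupby("Gender")["sought"]
    .agg(rate="mean", n="size")
)
print(summary)
```

A high rate based on a small `n` should be read with caution, which is the caveat noted in the observation above.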

Age Distribution

We also examined the age distribution of the respondents to understand the demographics of the survey.

plt.figure(figsize=(12,8))
# histplot draws on the figure created above; kde=True overlays the density curve
sns.histplot(data["Age"], bins=24, kde=True)
plt.title("Distribution and Density by Age")
plt.xlabel("Age")
plt.show()

Observation: The majority of respondents are in their late 20s and early 30s, consistent with the general demographics of the tech industry.

4. Predictive Modeling

We aimed to predict the treatment variable (whether a person has sought treatment for a mental health condition). We tested five machine learning models:

  • Logistic Regression
  • k-Nearest Neighbors (KNN)
  • Random Forest
  • Boosting (AdaBoost)
  • Bagging
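As a rough illustration of how these five model families can be compared side by side, here is a sketch on synthetic data from `make_classification`, standing in for the encoded survey features. All models use scikit-learn defaults (apart from `max_iter` and `random_state`), so the project's tuned settings and reported scores are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Default hyperparameters throughout; these are assumptions, not the project's settings
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Boosting": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out data
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```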

Feature Importance

Using an ExtraTreesClassifier, we determined which features contributed most to the prediction.

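import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def feature_importance(X, y, feature_cols):
    # Fit an ensemble of randomized trees and read off impurity-based importances
    forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
    forest.fit(X, y)
    importance = forest.feature_importances_
    # Indices of features sorted from most to least important
    indices = np.argsort(importance)[::-1]
    # Return (feature name, importance) pairs in ranked order
    return [(feature_cols[i], importance[i]) for i in indices]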

Key Drivers: work_interfere, family_history, and care_options were consistently top predictors.

5. Model Evaluation

We split the data (70% train, 30% test) and evaluated models using Accuracy and AUC scores. Here is an example of the Random Forest implementation, which was tuned using RandomizedSearchCV.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def randomForest(X_train, y_train, X_test, y_test, feature_cols, X, y):
    # Parameter tuning logic...
    
    # Building and fitting
    forest = RandomForestClassifier(max_depth=None, min_samples_leaf=8, min_samples_split=2, n_estimators=20, random_state=1)
    my_forest = forest.fit(X_train, y_train)

    # Making class predictions
    y_pred_class = my_forest.predict(X_test)

    # evaluate_model is the project's own scoring helper (not shown here)
    accuracy = evaluate_model(my_forest, y_test, y_pred_class, X, y)
    return accuracy
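The tuning step is elided above, so the following is only a sketch of what a 70/30 split combined with `RandomizedSearchCV` can look like. The search space in `param_dist`, the number of iterations, and the synthetic data are all assumptions made for illustration, not the project's actual configuration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the encoded survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Hypothetical search space; the ranges are illustrative
param_dist = {
    "max_depth": [3, None],
    "min_samples_leaf": randint(1, 9),
    "min_samples_split": randint(2, 9),
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=1),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    random_state=1,
)
search.fit(X_train, y_train)

# Score the best estimator on the held-out 30%
best = search.best_estimator_
accuracy = accuracy_score(y_test, best.predict(X_test))
auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
print(search.best_params_)
print(f"accuracy={accuracy:.3f}, auc={auc:.3f}")
```

AUC is computed from predicted probabilities rather than class labels, which is why `predict_proba` is used for that metric.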

Results

The models achieved the following accuracy scores on the test set:

  • Logistic Regression: ~81.4%
  • KNN: ~81.7%
  • Random Forest: ~83.2%
  • Boosting: ~82.2%
  • Bagging: ~77.9%

The Random Forest Classifier proved to be the most effective model for this dataset, achieving an accuracy of approximately 83% with a robust cross-validated AUC score.
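For readers who want to reproduce a cross-validated AUC, one possible approach is `cross_val_score` with `scoring="roc_auc"`. The sketch below reuses the Random Forest hyperparameters shown earlier, but runs on synthetic data with an assumed 10 folds, so its output says nothing about the survey results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded survey features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Same hyperparameters as the Random Forest shown earlier
forest = RandomForestClassifier(max_depth=None, min_samples_leaf=8, min_samples_split=2,
                                n_estimators=20, random_state=1)

# 10 folds is an assumed choice, not taken from the project
auc_scores = cross_val_score(forest, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC = {auc_scores.mean():.3f} (+/- {auc_scores.std():.3f})")
```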

Conclusion

By cleaning and integrating the Mental Health in Tech Survey data, performing extensive EDA, and building predictive models, we moved from raw data to actionable insights.

Key Takeaways:

  • Mental Health is a Priority: Many employers provide mental health benefits, but roughly a third of respondents don't know whether they are covered, which points to a communication gap.
  • Demographics Matter: Age and Gender influence the likelihood of seeking treatment.
  • Machine Learning Works: Of the five models tested, Random Forest was the most accurate, at roughly 83% on the test set.