Visualizing software development skills with embeddings

Introduction

Folq is a Norwegian company that matches software development consultants with projects. If your company needs a database expert for a project, then you can probably find a suitable candidate on the Folq platform.

One of the ways the developers describe themselves is by using binary (yes/no) tags for skills and roles. There are around 300 skills and 50 roles available to choose from.

  • Examples of skills are: JavaScript, SQL, Scrum, Docker, Agile, DevOps, Linux, Prototyping, …
  • Examples of roles are: Scrummaster, Tech Lead, Data Scientist, Backendutvikler (backend developer), Analytiker (analyst), …

Some skills and roles are Norwegian words, but non-Norwegian readers will likely understand most of them.

We can think of the data as a binary matrix, where each row indicates a person and each column a skill (or role). In the figure below the first and second person are “similar” to each other, since they have almost the same set of skills. The third person is dissimilar to the others, and the “opposite” of the first person in a sense.
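As a minimal sketch of this representation (with made-up developers and tags, not Folq's data), such a binary matrix can be built from a list of (developer, skill) pairs:

    import numpy as np

    # Hypothetical developers, skills and tag pairs, purely for illustration
    developers = ["dev_a", "dev_b", "dev_c"]
    skills = ["JavaScript", "SQL", "Docker", "Figma"]
    tags = [
        ("dev_a", "JavaScript"), ("dev_a", "SQL"), ("dev_a", "Docker"),
        ("dev_b", "JavaScript"), ("dev_b", "SQL"),
        ("dev_c", "Figma"),
    ]

    dev_index = {d: i for i, d in enumerate(developers)}
    skill_index = {s: j for j, s in enumerate(skills)}

    # Binary matrix: one row per developer, one column per skill
    X = np.zeros((len(developers), len(skills)), dtype=np.int8)
    for dev, skill in tags:
        X[dev_index[dev], skill_index[skill]] = 1

    print(X)
    # [[1 1 1 0]
    #  [1 1 0 0]
    #  [0 0 0 1]]

Here the first two rows are similar, while the third row is close to the opposite of the first.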

In this article we’ll learn an embedding of the data: a two-dimensional representation. The embedding will make notions of similarity between persons, skills and roles concrete. We’ll be able to use the embedding (1) as a tool to visualize and analyze the data, (2) as a recommendation algorithm, and (3) as a predictive model.

Results

We’ll start with the results. Later we’ll present the mathematical model and data in more detail.

A subset of skills and roles

Here is a visualization of a subset of skills and roles (click on the image to enlarge).

The distances between the roles seem reasonable. On the bottom left we see developers, on the right graphic designers, and at the top project leaders. Fullstack is between frontend and backend, and frontend developers are closer to graphic designers than any other type of developer is.

The distances between roles and skills are also satisfying. For instance, at the bottom right we see graphic designers, and the nearby skills are UX-design, Adobe and Figma. This makes perfect sense, and other such groupings are observed elsewhere in the figure too.

All roles

The figure below shows every role. Skills are shown in the background, but are not labeled. Both skills and roles have been scaled to indicate their popularity.

The interpretation of closeness is roughly that if a role and skill are close to each other, then the probability that they appear together in the dataset is high.

The backend, fullstack, frontend and mobile developer roles at the bottom of the figure are associated with a lot of skills!

Subsets of skills

There are around 300 skills in total, so plotting and labeling all of them in a single figure is infeasible. Below are 60 popular skills shown in a single figure. The data points are scaled to indicate the frequency of each skill.

Instead of showing the most popular skills, we show a diverse set of skills below:

Using the embedding as a recommender

Every skill (and role) is assigned a point in two-dimensional space. Each software developer is also mapped to a point in two-dimensional space, as shown in the figure below.
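As a rough sketch of how such a figure could be produced (the coordinates below are random placeholders standing in for the fitted embeddings; the sizes and labels are my own choices):

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder 2D coordinates; in practice these come from the trained
    # embedding matrices for developers and for skills/roles
    rng = np.random.default_rng(0)
    dev_xy = rng.normal(size=(2200, 2))   # one point per developer
    tag_xy = rng.normal(size=(350, 2))    # one point per skill/role

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(dev_xy[:, 0], dev_xy[:, 1], s=10, alpha=0.3, label="Developers")
    ax.scatter(tag_xy[:, 0], tag_xy[:, 1], s=40, alpha=0.8, label="Skills and roles")
    ax.legend()
    ax.set_title("Developers and skills/roles in the same embedding space")
    plt.show()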

The problem of embedding persons and skills is very similar to embedding users and movies. In fact, the model we use closely resembles the matrix factorization techniques that were used to win the Netflix Prize, a competition for the best movie recommendation algorithm. The Netflix problem boils down to predicting users’ ratings of movies.

Instead of predicting the user ratings of movies, we predict the probabilities of tags. Our embedding can be used as a recommender: the skills and roles that are close to me in embedding space are recommendations for me.
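As a sketch of how such recommendations could be computed from a fitted model of the kind described later in the article (the function, variable names and toy numbers below are mine, not Folq's code):

    import torch

    def recommend_tags(dev_idx, global_bias, dev_bias, tag_bias, dev_vecs, tag_vecs,
                       tag_names, already_tagged, top_k=10):
        """Rank skills/roles for one developer by predicted tag probability."""
        # Linear predictor for developer dev_idx against every tag at once
        eta = global_bias + dev_bias[dev_idx] + tag_bias + tag_vecs @ dev_vecs[dev_idx]
        probs = torch.sigmoid(eta)
        probs[already_tagged] = -1.0  # never recommend tags already chosen
        top = torch.topk(probs, k=top_k)
        return [(tag_names[j], probs[j].item()) for j in top.indices]

    # Tiny fake example: 4 tags, 1 developer who already has tag 0
    tag_names = ["PyTorch", "Kanban", "Figma", "SQL"]
    recs = recommend_tags(
        dev_idx=0,
        global_bias=torch.tensor(-1.0),
        dev_bias=torch.tensor([0.2]),
        tag_bias=torch.tensor([0.1, -0.3, -0.5, 0.0]),
        dev_vecs=torch.tensor([[1.0, 0.5]]),
        tag_vecs=torch.tensor([[0.9, 0.4], [0.1, 0.2], [-1.0, -0.5], [0.8, 0.3]]),
        tag_names=tag_names,
        already_tagged=torch.tensor([0]),
        top_k=2,
    )
    print(recs)  # [('SQL', 0.537...), ('Kanban', 0.289...)]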

I have a profile on folq.no, so I am in the dataset. Here are the top 10 skills recommended to me:

  1. Hugging Face
  2. Natural Language Processing (NLP)
  3. TIBER
  4. AWS sagemaker
  5. Prompt engineering
  6. Algoritmer og datastrukturer
  7. Large Language Models
  8. Large Language Models (LLM)
  9. Kanban
  10. PyTorch

It’s not a perfect recommendation, but it’s not bad either. The model used in this article was written in PyTorch, so I should add it to my skills!

The data

Here’s some information about the data from Folq used in this article:

  • There are approximately 2200 developers, 300 skills and 50 roles.
  • The median developer has chosen 25 skills. Most developers have chosen between 15 and 35 skills, as measured by the interquartile range.
  • The median developer has chosen 4 roles. Most developers have chosen between 3 and 5 roles.

If we structure the 2200 developers and 300 skills into a matrix, most entries are zero: about 9% of the entries are non-zero. This is consistent with the median developer choosing 25 of roughly 300 skills (\(25/300 \approx 8\%\)). The matrix is quite sparse, but not nearly as sparse as e.g. a matrix of users and movie ratings. Similarly, if we structure the 2200 developers and 50 roles into a matrix, about 9% of the entries are non-zero.

The model

As is often the case in applied math, there is no need to reinvent the wheel: we can combine a few well-known ideas to solve our problem. We pair a matrix factorization model (as seen in recommender systems) with a binary cross-entropy loss function (as seen in logistic regression).

The model is constructed as follows:

\begin{align*} y_{ij} &\sim \text{Bernoulli}(\mu_{ij}) \\ \mu_{ij} &= \text{sigmoid}(\eta_{ij}) \\ \eta_{ij} &= c + b_i + b_j + \boldsymbol{v}_i^T \boldsymbol{v}_j \end{align*}

Here \(y_{ij}\) is a binary variable indicating whether developer \(i\) has tagged skill/role \(j\) on their profile. You may think of it as a coin flip whose probability of success is governed by \(\mu_{ij}\).

The parameter \(\mu_{ij}\) is a probability, so we must force it to be between zero and one. To squish the linear predictor \(\eta_{ij}\) into this range, we apply the sigmoid function \(\text{sigmoid}(\eta) = 1 /(1 + \exp(-\eta))\) to it. The parameter \(c\) is a global bias, \(b_i\) is a bias term for each developer \(i\), and \(b_j\) is a bias term for each skill/role \(j\). The parameters \(\boldsymbol{v}_i\) and \(\boldsymbol{v}_j\) are embedding vectors.

The parameters \(c\), \(b_i\), \(b_j\), \(\boldsymbol{v}_i\) and \(\boldsymbol{v}_j\) are all learned from the data. We use two dimensional embedding vectors \(\boldsymbol{v} \in \mathbb{R}^2\) so we can visualize the results, but in theory nothing stops us from using a higher-dimensional embedding space.
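A minimal PyTorch sketch of this model could look as follows (the class and attribute names are my own; the actual implementation behind this article may differ):

    import torch
    import torch.nn as nn

    class TagEmbeddingModel(nn.Module):
        """Matrix factorization with biases: eta_ij = c + b_i + b_j + v_i . v_j."""

        def __init__(self, num_devs, num_tags, dim=2):
            super().__init__()
            self.global_bias = nn.Parameter(torch.zeros(1))   # c
            self.dev_bias = nn.Embedding(num_devs, 1)         # b_i
            self.tag_bias = nn.Embedding(num_tags, 1)         # b_j
            self.dev_vecs = nn.Embedding(num_devs, dim)       # v_i
            self.tag_vecs = nn.Embedding(num_tags, dim)       # v_j
            # Small random initialization so the logits start near zero
            for emb in (self.dev_bias, self.tag_bias, self.dev_vecs, self.tag_vecs):
                nn.init.normal_(emb.weight, mean=0.0, std=0.01)

        def forward(self, dev_idx, tag_idx):
            dot = (self.dev_vecs(dev_idx) * self.tag_vecs(tag_idx)).sum(dim=-1)
            eta = (self.global_bias
                   + self.dev_bias(dev_idx).squeeze(-1)
                   + self.tag_bias(tag_idx).squeeze(-1)
                   + dot)
            return eta  # logits; apply the sigmoid to get probabilities

The forward pass returns the logits \(\eta_{ij}\) directly, which is what BCEWithLogitsLoss (used below) expects.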

The log-likelihood of a single observation, which we maximize (by minimizing its negative) with gradient-based optimization in PyTorch, is

\begin{align*} L(y, \mu) = y \ln(\mu) + (1 - y) \ln( 1 - \mu), \end{align*}

and this loss, the negative of the log-likelihood above (the binary cross-entropy), is implemented by PyTorch’s BCEWithLogitsLoss, which works directly on the logits \(\eta_{ij}\) for numerical stability. The figure below shows how the loss decreases as we train the model. Notice how the model does not really overfit, since there are few parameters to learn compared to the number of observations. We stop model training when the validation loss starts to increase (it’s not visible in the figure, but it increases ever so slightly).
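A corresponding training sketch, reusing the TagEmbeddingModel sketch above on synthetic stand-in data and taking full-batch Adam steps for simplicity (the real training may iterate over mini-batches):

    import torch
    import torch.nn as nn

    # Synthetic stand-in data: every (developer, tag) pair with a 0/1 label
    num_devs, num_tags = 2200, 350
    dev_idx, tag_idx = torch.meshgrid(
        torch.arange(num_devs), torch.arange(num_tags), indexing="ij"
    )
    dev_idx, tag_idx = dev_idx.reshape(-1), tag_idx.reshape(-1)
    labels = torch.bernoulli(torch.full((num_devs * num_tags,), 0.09))

    # Random train/validation split
    perm = torch.randperm(num_devs * num_tags)
    train, val = perm[: int(0.9 * len(perm))], perm[int(0.9 * len(perm)):]

    model = TagEmbeddingModel(num_devs, num_tags, dim=2)   # from the sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()   # binary cross-entropy on the logits

    best_val = float("inf")
    for epoch in range(50):
        model.train()
        optimizer.zero_grad()
        loss = loss_fn(model(dev_idx[train], tag_idx[train]), labels[train])
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(dev_idx[val], tag_idx[val]), labels[val]).item()
        if val_loss > best_val:   # stop when the validation loss starts to increase
            break
        best_val = val_loss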

Performance and the dimensionality of the embedding space: running Adam with a learning rate of \(10^{-3}\), we obtain the following validation set losses as we vary the embedding dimension.

Embedding dimension    Validation loss
0                      0.237
1                      0.206
2                      0.187
4                      0.179
8                      0.171
16                     0.167
32                     0.165

We observe that embeddings of higher dimensionality lead to lower loss and more performant models. In other words, the skill-space is not really two-dimensional, but we have to squeeze it down to two dimensions to visualize it.

Some other comments on the model and training:

  • Skills and roles are two distinct fields in the dataset, but we treat them equally in the model. Training on skills and roles in the same model places them in sensible positions relative to each other in the same embedding space.
  • With approximately 2200 developers, 300 skills and 50 roles, the model had around \(2200 \times 350 = 770\,000\) data points to train on.
  • One epoch (a full run through the data as the model learns) takes around 10 seconds on my computer. Depending on the learning rate and desired quality, a full model is trained in 5 to 50 epochs.
  • The model has one bias term and one two-dimensional vector for each developer and each skill/role. In total this gives around \(350 \times 3 + 2200 \times 3 = 7650\) parameters (plus the global bias \(c\)).
  • It’s important to initialize parameters randomly around zero with little variance, so as not to saturate the logits and probability estimates.

Summary and references

Our dataset is essentially a large, sparse binary matrix of people and their skills. The method sketched in this article is not restricted to visualizing binary yes/no tags, or skills for that matter. One could analyze and create embeddings for similar data for e.g. movies/users/ratings, songs/users/likes, people/events/attendance, videos/users/thumbs-up, and much more. I previously analyzed a political questionnaire using the SVD, which is similar to the approach outlined in this article.

We chose a matrix factorization model because of its clear interpretation, task specificity and the dual roles played by developers and skills:

  • We can clearly understand what the inner products between vectors mean in the embedding space.
  • We optimize directly for predicting whether a person has a skill, and high-quality embeddings are produced as a side product.
  • Developers and skills can be visualized in the same embedding space.

There are other ways to visualize and understand this data: t-SNE, MDS, hierarchical clustering or association rule learning. The choice of specific method and algorithms depends on exactly what one wants to achieve with the analysis.