Machine Learning Help

hey guys. I am starting to learn how machine learning works and what can be done with it.
So I am working on a project: I scraped a company's job openings from Indeed and I am trying to map each job title to the industry it relates to. For example, I scraped eBay's jobs and I am trying to categorize engineer, Java engineer, and mechanical engineer under engineering. I managed to scrape all the data from Indeed and have entered it in a data frame. It is unlabelled: the data frame only consists of the company name and the job title. I cleaned up the data and everything.
What I am having problems with is understanding how deep learning is used in R to map this. I have been reading a lot about how it can be used to map everything, but I cannot seem to find a tutorial or method to actually use deep learning in RStudio. Can someone help me or direct me to the right place?
I assume I have to use unsupervised learning to label it and create test and training subsets to do it, but I just don't know how to use it. I've been googling it a lot, but no one really gives a tutorial on how to use it; they just post the code they wrote, which isn't helpful and can be confusing for someone who is learning from scratch.

I had an idea, which may or may not be helpful. I did a quick Indeed search for each category: for example, I searched engineering, scraped all of those job titles, and put them in a data frame, and did the same for customer service, sales and marketing, business operations, and leadership. I thought maybe this data frame could help with deep learning since it is labelled: a model could find similarities between job titles and categories and use that information to label the master data frame of eBay's job titles, which is unlabelled, if that makes sense. I don't know if this is a good or valid idea, so some help would be appreciated there too.

Hi. Coupla things

  1. See the tidytext package and book for a good way to process the text.
  2. Given the data, what unknown do you want to estimate using ML?
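To make point 1 concrete, here is a minimal sketch of splitting job titles into one word per row. It uses base R so it runs without extra packages; `tidytext::unnest_tokens()` produces the same one-token-per-row shape. The `jobs` data frame, its column names, and its rows are made up for illustration.

```r
# Hypothetical data: column names and rows are assumptions, not the poster's data
jobs <- data.frame(
  company = c("eBay", "eBay", "eBay"),
  title   = c("Java Engineer", "Mechanical Engineer", "Customer Service Agent"),
  stringsAsFactors = FALSE
)

# Lower-case each title and split on anything that is not a letter
tokens <- strsplit(tolower(jobs$title), "[^a-z]+")

# One row per (title, word) pair, similar in shape to unnest_tokens() output
title_words <- data.frame(
  title = rep(jobs$title, lengths(tokens)),
  word  = unlist(tokens),
  stringsAsFactors = FALSE
)

# Word frequencies across all titles, most common first
word_counts <- sort(table(title_words$word), decreasing = TRUE)
print(word_counts)
```

Words that recur across titles (here "engineer") are the kind of signal any later model would have to rely on.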

For a good walk through of machine learning in R at an introductory level, see Chapter 32 of @rafalab's text for HarvardX datascience

Hey, thanks for replying

I have a pretty good understanding of R. It's just that I have never used machine learning in R, and that is what is confusing me.

So I have a dataframe of the job titles and the company where the job is available. I am trying to map which industry the job is in. With ML I'm trying to predict the industry the job is in


What you're trying to do with the data that you have doesn't sound like the right fit for machine learning:

If I'm understanding you correctly, you have a data frame of Job Title and Company, and you want to associate each posting (row) with an Industry. This might be a reasonable supervised problem to tackle if you had industry labels, and other information about each post (e.g. salary, years experience required, basically other columns in your data frame that described the job posting); specifically you'd be looking for some sort of classification algorithm. Alternatively if you had the other data, but not the industry label, you could use unsupervised learning (some sort of clustering algorithm) to find groups in the data that might naturally separate your data by industry (e.g. if jobs in a certain industry were typically similar in terms of job title and your other data).
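As a rough illustration of the unsupervised route, the sketch below clusters titles using only the words they contain, since that is all the current data frame offers. The titles are invented, and hierarchical clustering on Jaccard distance is just one possible choice of clustering algorithm, not a recommendation.

```r
# Hypothetical job titles, invented for illustration
titles <- c("Java Engineer", "Mechanical Engineer", "Software Engineer",
            "Sales Manager", "Sales Associate", "Regional Sales Lead")

# Build a binary document-term matrix: rows are titles, columns are words
tokens <- strsplit(tolower(titles), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) as.integer(vocab %in% tk)))
dimnames(dtm) <- list(titles, vocab)

# Jaccard distance between titles ("binary" in dist()), then hierarchical clustering
d    <- dist(dtm, method = "binary")
tree <- hclust(d, method = "average")

# Cut the tree into two groups; the group numbers themselves carry no meaning
groups <- cutree(tree, k = 2)
print(split(titles, groups))
```

Note that the clusters come back unnamed: someone still has to look at each group and decide which industry it represents.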

As it stands now, if you're really interested in understanding the industry of each job post, you'd probably be better off doing some sort of look up of the company name to get the Industry. (E.g. Wikipedia has an industry section for major companies, here's an example:

If I were you I'd start with some introductory books to get a handle on what machine learning is, and what it can do.

From a "classic" machine learning perspective I think that "An Introduction to Statistical Learning" would be a good place to start.

"Deep Learning with R" is a good primer for getting started with deep learning (in R).

You could also consider taking an online course, I found the Andrew Ng one on Coursera to be a great start for me.


Well, as @jim89 noted, without a mapping of company to industry, mapping job description to industry isn't going to be in the cards.

What you could do is use an NLP semantic clustering method on the descriptions, then do a KNN clustering of companies, and then do the lookup to see how well it does.

Hey, a lot of that makes sense.
So what I am trying to do, for example, is find all of eBay's job postings on Indeed. I scraped all the job titles from the company's page on Indeed and made a data frame (call this X). Within those 150 job titles there are multiple sectors the jobs are in. I am trying to categorize them into 6 categories: Engineering, Customer Service, Business Operations, Leadership, Sales and Marketing, and Other.

What I also did was a quick Indeed search for each of the categories. For example, I searched Engineering, scraped the first 50 jobs that came up, and labelled those. I did the same for the other 4 categories, but not for "Other". In this table, each job title is labelled with the industry it's in (call this Y). This table is 250 rows long, and I was hoping I could use machine learning on table Y to form a model to apply to table X. I want to form the model because I am trying to predict the industry for job postings for 100 companies, so it's gotta be scalable, if that makes sense.
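The Y-to-X idea can be sketched as a 1-nearest-neighbour classifier on word overlap (Jaccard similarity): each unlabelled title takes the category of the most similar labelled title. Everything here is an assumption for illustration: the rows, the helper names `tokenize`, `jaccard`, and `predict_industry`, and the rule that a title sharing no words with Y falls into "Other".

```r
# Table Y: labelled titles (hypothetical rows; the real table has 250)
Y <- data.frame(
  title    = c("Software Engineer", "Mechanical Engineer",
               "Customer Service Representative", "Sales Account Manager",
               "Marketing Specialist", "Business Operations Analyst"),
  industry = c("Engineering", "Engineering", "Customer Service",
               "Sales and Marketing", "Sales and Marketing", "Business Operations"),
  stringsAsFactors = FALSE
)

# Table X: unlabelled titles scraped for one company (hypothetical rows)
X <- data.frame(
  title = c("Senior Java Engineer", "Sales Manager", "Staff Attorney"),
  stringsAsFactors = FALSE
)

# Split a title into its set of lower-case words
tokenize <- function(x) unique(strsplit(tolower(x), "[^a-z]+")[[1]])

# Jaccard similarity: shared words divided by all distinct words in either title
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Label a title with the industry of its most similar labelled title;
# fall back to "Other" when it shares no words with anything in Y (an assumed rule)
predict_industry <- function(title, labelled) {
  words <- tokenize(title)
  sims  <- vapply(labelled$title,
                  function(lt) jaccard(words, tokenize(lt)),
                  numeric(1))
  if (max(sims) == 0) "Other" else labelled$industry[which.max(sims)]
}

X$industry <- vapply(X$title, predict_industry, character(1),
                     labelled = Y, USE.NAMES = FALSE)
print(X)
```

Because the function only needs a labelled table and a vector of titles, the same call could be repeated per company, which is one way to read the scalability requirement.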

Without a reproducible example, called a reprex, my answer is necessarily general.

Your table Y has two variables, which I'll call title and industry. Formally, you want to know P(industry | title), the probability of each industry given the title. You can of course count each combination in a contingency table and then run a test of association, such as a chi-squared test, to see how far that gets you. A simple chisq.test might show so little association as to make Y unhelpful.
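For categorical variables like these, one way to check for association is to cross-tabulate a feature of the title against industry. The sketch below uses invented rows and a single assumed feature (whether the title contains "engineer"), and runs Fisher's exact test because the counts are small.

```r
# Hypothetical labelled titles, invented for illustration
Y <- data.frame(
  title = c("Software Engineer", "Mechanical Engineer", "Java Engineer",
            "Sales Engineer", "Sales Manager", "Sales Associate",
            "Account Executive", "Marketing Manager"),
  industry = c("Engineering", "Engineering", "Engineering",
               "Sales and Marketing", "Sales and Marketing", "Sales and Marketing",
               "Sales and Marketing", "Sales and Marketing"),
  stringsAsFactors = FALSE
)

# Flag whether each title contains the word "engineer"
Y$has_engineer <- grepl("engineer", Y$title, ignore.case = TRUE)

# 2 x 2 contingency table: word presence against industry
tab <- table(has_engineer = Y$has_engineer, industry = Y$industry)
print(tab)

# Fisher's exact test suits small counts, where the chi-squared approximation is poor
test <- fisher.test(tab)
print(test$p.value)
```

With the full 250-row table the counts would be larger, and `chisq.test(tab)` on the same kind of table becomes a reasonable alternative.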

But let's assume that there is some substantial nexus between the two. Can you model it? Since the outcome variable, industry, is multinomial categorical and the title covariates X_1 ... X_n are too, you can't do OLS. (Well, you can, sort of, but only econometricians go there.) Which means logistic regression, for which you'll need to create a lot of dummy binary variables to represent whether a given title observation is classified as in an industry.
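A minimal sketch of the logistic regression route, assuming keyword dummies as the predictors and `nnet::multinom()` for the multinomial fit (nnet is a recommended package bundled with most R installations). The rows, the keyword list, and the helper `make_dummies` are all invented for illustration; a real model would need far more keywords and data.

```r
library(nnet)  # provides multinom() for multinomial logistic regression

# Hypothetical labelled titles, invented for illustration
Y <- data.frame(
  title = c("Software Engineer", "Mechanical Engineer", "Java Engineer",
            "Sales Manager", "Sales Associate", "Regional Sales Lead",
            "Customer Service Agent", "Customer Support Specialist",
            "Customer Service Representative"),
  industry = factor(rep(c("Engineering", "Sales", "Customer Service"), each = 3)),
  stringsAsFactors = FALSE
)

# Dummy (0/1) predictors: does the title contain a given keyword?
keywords <- c("engineer", "sales", "customer")
make_dummies <- function(titles) {
  m <- sapply(keywords, function(k) as.integer(grepl(k, titles, ignore.case = TRUE)))
  # matrix() keeps one row per title even when only a single title is passed
  as.data.frame(matrix(m, ncol = length(keywords), dimnames = list(NULL, keywords)))
}

train <- cbind(industry = Y$industry, make_dummies(Y$title))

# Multinomial logistic regression of industry on the keyword dummies
fit <- multinom(industry ~ engineer + sales + customer, data = train, trace = FALSE)

# Predict the industry of new, unlabelled titles
new_titles <- c("Senior Java Engineer", "Sales Director")
pred <- predict(fit, newdata = make_dummies(new_titles), type = "class")
print(data.frame(title = new_titles, predicted = pred))
```

In this toy data every keyword separates its class perfectly, so the coefficients are not meaningful; only the predicted classes are worth looking at.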

That would give you a usable model to apply to table X, but the way you described it, your unknown is not industry but job function. Before much progress is likely, the best advice I can offer is to go back to defining the problem more closely.
