As the job market continues to evolve in 2024, staying on top of the latest job openings across multiple companies and industries can be a challenge. Building your own curated job board is an excellent way to aggregate relevant postings and help job seekers find their dream roles.
In this in-depth tutorial, we'll walk through how to leverage web scraping techniques and harness the power of ChatGPT to efficiently build a comprehensive job board. We'll cover best practices for finding company career pages, extracting job listings, and intelligently parsing key information from the scraped content.
Step 1: Generating a List of Companies to Scrape
The first step is finding relevant companies that have job listings you want to include in your job board. An efficient way to do this is by combining Google searches with specific site operators.
For example, if you want to find job postings for remote software engineering roles, you could use the following search query:
remote software engineer site:https://apply.workable.com/*
This will return a list of pages from the domain apply.workable.com that contain the keywords "remote software engineer". You can swap out the domain and keywords to match the types of roles and companies you're targeting.
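To vary the keywords and domains systematically, the query strings can be generated rather than typed by hand. The following is a minimal sketch: the helper name `buildSearchQueries` and the second keyword are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper: build one search query per (keyword, site) pair.
function buildSearchQueries(keywords, sites) {
  const queries = [];
  for (const keyword of keywords) {
    for (const site of sites) {
      queries.push(`${keyword} site:${site}`);
    }
  }
  return queries;
}

const queries = buildSearchQueries(
  ['remote software engineer', 'remote data engineer'], // second keyword is illustrative
  ['https://apply.workable.com/*'],
);
console.log(queries);
```

Each generated string can then be sent to whichever search tool you use.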
To automate the process of extracting URLs from the Google search results, you can use a tool like ScrapingBee's Google Search API. Here's an example of how to retrieve the organic result URLs in Node.js:
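import { ScrapingBeeClient } from 'scrapingbee';
// Reused in the next step to scrape individual career pages
const client = new ScrapingBeeClient('YOUR_API_KEY');
// The Google Search API is a separate endpoint from the page-scraping API.
// Endpoint and parameter names follow ScrapingBee's docs at the time of writing; verify them before relying on this.
const query = new URLSearchParams({
  api_key: 'YOUR_API_KEY',
  search: 'remote software engineer site:https://apply.workable.com/*',
  nb_results: '100',
});
// The global fetch requires Node 18 or later
const response = await fetch(`https://app.scrapingbee.com/api/v1/store/google?${query}`);
if (!response.ok) throw new Error(`Search request failed: ${response.status}`);
const data = await response.json();
// Keep only the URL of each organic result
const urls = (data.organic_results ?? []).map(result => result.url);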
This will give you a list of URLs to career pages on apply.workable.com that you can then scrape to extract individual job listings.
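The search results may contain several URLs for the same company, or links to individual jobs rather than the career page itself. The sketch below reduces a list of URLs to one career page per company. It assumes the company slug is the first path segment of the URL, which you should confirm for the job board you target; `toCareerPages` is a hypothetical helper name.

```javascript
// Assumption: the company slug is the first path segment,
// e.g. https://apply.workable.com/<company>/j/<job-id>/
function toCareerPages(urls) {
  const pages = new Set();
  for (const raw of urls) {
    let parsed;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // skip strings that are not valid URLs
    }
    const [company] = parsed.pathname.split('/').filter(Boolean);
    if (company) pages.add(`${parsed.origin}/${company}`);
  }
  return [...pages];
}
```

Using a `Set` keeps the first-seen order while dropping duplicates, so the output is stable between runs on the same input.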
Step 2: Extracting Job Listings From Career Pages
Once you have the list of company career pages, the next step is to extract the URLs for the individual job listings. On most job boards, each listing will have its own dedicated page with a unique URL.
To find these, you'll need to load the HTML of the career page and parse it to find the relevant links. Many job boards will use common HTML patterns that you can target with CSS selectors.
However, one issue you may run into is location-based redirects or filters. Some sites try to automatically detect the user's region and filter the job results accordingly. To get the full unfiltered list, you may need to set your scraper location and manually disable any location filters.
Here's an example of how to extract job listing links with ScrapingBee:
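const response = await client.get({
  url: 'https://apply.workable.com/company-name',
  // The SDK expects API options under params
  params: {
    // Collect the href of every job title link
    extract_rules: {
      job_links: {
        selector: 'a.job-title',
        type: 'list',
        output: '@href',
      },
    },
    // Clear any pre-applied location filter before the links are extracted
    js_scenario: {
      instructions: [
        { click: '.filters-reset-btn' },
      ],
    },
  },
});
// response.data holds the raw body; decode it and parse the JSON produced by extract_rules
const { job_links: jobLinks } = JSON.parse(new TextDecoder().decode(response.data));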
This clicks the "reset filters" button so that all listings are shown, then uses a CSS selector to find every job title link and return its URL. You can adapt the selectors and JS scenario to match the site you're scraping.
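Extracted `href` values are often relative paths, and the same job can appear more than once on a page. A small sketch of a clean-up step follows, using only the standard `URL` class. The helper name `normalizeJobLinks` is hypothetical.

```javascript
// Resolve each href against the career page URL, drop fragments, and de-duplicate.
function normalizeJobLinks(hrefs, pageUrl) {
  const seen = new Set();
  for (const href of hrefs ?? []) {
    try {
      const absolute = new URL(href, pageUrl);
      absolute.hash = ''; // "#apply" and similar fragments point at the same page
      seen.add(absolute.href);
    } catch {
      // ignore hrefs that cannot be parsed
    }
  }
  return [...seen];
}
```

For example, `normalizeJobLinks(jobLinks, 'https://apply.workable.com/company-name')` would turn a path such as `/company-name/j/ABC123/` (an invented job ID) into a full URL.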
Step 3: Parsing Job Data with ChatGPT
Now that you have the individual job listing pages, you'll want to extract the key information from each one, such as:
- Job title and company name
- Location (remote, hybrid, or office)
- Job description and requirements
- Salary and benefits
- Application link
While some data like the title and company can be easily parsed from the page metadata or HTML, other fields like the full description text can be trickier.
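For the fields that do live in page metadata, one common source is a schema.org `JobPosting` object embedded as JSON-LD. Whether a given site includes one is an assumption you need to check per site. The sketch below looks for such a block with a regular expression, which is fragile compared with a real HTML parser but keeps the example dependency-free; `extractJobPosting` is a hypothetical helper name.

```javascript
// Find the first schema.org JobPosting embedded as JSON-LD, if the page has one.
function extractJobPosting(html) {
  const scriptPattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(scriptPattern)) {
    let data;
    try {
      data = JSON.parse(match[1]);
    } catch {
      continue; // skip malformed JSON-LD blocks
    }
    const candidates = Array.isArray(data) ? data : [data];
    const posting = candidates.find(item => item && item['@type'] === 'JobPosting');
    if (posting) {
      return {
        title: posting.title ?? null,
        company: posting.hiringOrganization?.name ?? null,
      };
    }
  }
  return null; // no JobPosting metadata found
}
```

A `null` result signals that the page needs a different strategy, such as CSS selectors.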
Job descriptions often combine information about the role, candidate requirements, benefits, and company in freeform ways that are hard to consistently parse with rule-based approaches.
This is where ChatGPT comes in extremely handy. By prompting it to find and extract specific pieces of information from the raw job text, you can quickly parse the unstructured data without complex code.
Here's an example of how to extract the salary from a job description using the ChatGPT API:
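// Assumes the API key is provided via the OPENAI_API_KEY environment variable
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
// text-davinci-003 has been retired, so this uses the Chat Completions endpoint instead
const response = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: 'gpt-4o-mini', // substitute any chat model available to your account
    messages: [{
      role: 'user',
      content: `Extract the salary from the following job description into a JavaScript variable:
${jobDescription}
`,
    }],
    max_tokens: 50,
    temperature: 0,
  }),
});
const output = await response.json();
// Chat responses put the generated text in message.content
const salary = output.choices[0].message.content.trim();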
This sends the job description text to the ChatGPT API with a prompt to extract the salary into a JS variable. We can then parse that variable string in our code to get the actual salary value.
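Parsing the returned variable string could look like the sketch below. It assumes the reply contains a line shaped like `const salary = "...";`, which the model is not guaranteed to produce, so anything else is treated as "no salary found". The helper name `parseSalaryVariable` is hypothetical.

```javascript
// Pull the right-hand side out of a reply line such as: const salary = "$120,000 - $150,000";
// The reply format is an assumption; replies without an assignment yield null.
function parseSalaryVariable(reply) {
  const match = reply.match(/=\s*(.+?);?\s*$/m);
  if (!match) return null;
  // Remove one surrounding quote character on each side, if present
  const value = match[1].trim().replace(/^["'`]|["'`]$/g, '');
  if (value === '' || /^(null|undefined)$/i.test(value)) return null;
  return value;
}
```

Returning `null` for unexpected shapes keeps downstream code from storing a whole sentence as a salary.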
You can repeat this for extracting other fields like benefits, requirements, etc. The key is to provide clear instructions in the prompt so the model knows exactly what to look for.
Keep in mind that the ChatGPT API is billed per token (a token is a short chunk of text), and both the prompt you send and the response you get back count. Longer job descriptions consume more tokens and cost more, so look for ways to trim the input text and cap the output size.
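One way to shrink the input is to strip markup and collapse whitespace before building the prompt. The sketch below is a rough approach under stated assumptions: `compactDescription` is a hypothetical helper, the default `maxChars` budget is an arbitrary illustrative number, and HTML entities are left undecoded.

```javascript
// Strip markup, collapse whitespace, then cap the length before prompting.
function compactDescription(html, maxChars = 4000) {
  const text = html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, ' ') // drop embedded scripts and CSS
    .replace(/<[^>]+>/g, ' ') // drop the remaining tags
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .trim();
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```

Truncating by characters is only a proxy for tokens, so treat the budget as approximate.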
Caveats and Considerations
While ChatGPT is incredibly powerful, it's not magic. The output can sometimes be inconsistent or incorrect if the job description is vague or contains conflicting information. It's important to have validation and fallback logic to handle cases where the model fails to find the requested data.
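A sketch of such validation and fallback logic for the salary field follows. It accepts the model's answer only when it looks like a currency amount and appears verbatim in the description, and otherwise falls back to a regular-expression scan. The pattern covers only a few simple formats, and both `SALARY_PATTERN` and `resolveSalary` are hypothetical names.

```javascript
// Matches amounts such as "$90k" or "$120,000 - $150,000" (illustrative, not exhaustive).
const SALARY_PATTERN =
  /[$€£]\s?\d+(?:[,.]\d+)*(?:\s?[kK]\b)?(?:\s?(?:-|\u2013|to)\s?[$€£]?\s?\d+(?:[,.]\d+)*(?:\s?[kK]\b)?)?/;

function resolveSalary(modelAnswer, description) {
  // Trust the model only if its answer is salary-shaped and literally present in the text
  if (modelAnswer && SALARY_PATTERN.test(modelAnswer) && description.includes(modelAnswer)) {
    return { salary: modelAnswer, source: 'model' };
  }
  // Otherwise fall back to the first salary-like string in the description
  const fallback = description.match(SALARY_PATTERN);
  return fallback
    ? { salary: fallback[0].trim(), source: 'regex' }
    : { salary: null, source: 'none' };
}
```

The verbatim check is deliberately strict: it rejects reformatted but correct answers, trading some recall for protection against invented figures.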
Also be aware that job boards change over time. Websites may update their templates, URL structures, or blocking practices in ways that break your scraper. It's a good idea to periodically test your scraping logic and update it as needed.
Finally, be sure to respect the terms of service of any site you scrape. Some job boards permit scraping, while others restrict it. Use caching and rate limiting to avoid overloading servers; rotating proxies spread requests across IP addresses, but they do not reduce the load you place on a site.
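Caching and rate limiting can be combined in a thin wrapper around whatever function fetches a page. The sketch below is one possible shape, not a production implementation: `createPoliteFetcher` is a hypothetical name, the cache lives in memory only, and the one-second default delay is an arbitrary assumption.

```javascript
// Wrap an async fetcher so repeated URLs are served from an in-memory cache
// and uncached requests start at least minDelayMs apart.
function createPoliteFetcher(fetchPage, minDelayMs = 1000) {
  const cache = new Map();
  let nextSlot = 0; // earliest timestamp at which the next request may start

  return function politeFetch(url) {
    if (cache.has(url)) return cache.get(url);
    const now = Date.now();
    const startAt = Math.max(now, nextSlot);
    nextSlot = startAt + minDelayMs;
    const pending = (async () => {
      await new Promise(resolve => setTimeout(resolve, startAt - now));
      return fetchPage(url);
    })();
    cache.set(url, pending); // cache the promise so concurrent callers share one request
    pending.catch(() => cache.delete(url)); // let failed requests be retried later
    return pending;
  };
}
```

A call such as `createPoliteFetcher(url => fetch(url).then(res => res.text()))` would give a drop-in replacement for direct fetching.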
The Final Job Feed
By combining a targeted company list generated from Google, complete job listing links extracted from career pages, and key job details intelligently parsed with ChatGPT, you can create a comprehensive, structured job feed to power your custom job board.
As the job market continues to move online, these techniques will only become more valuable for aggregating and curating job data at scale. By learning the fundamentals now, you'll be well-positioned to adapt as the web scraping and AI landscape evolves.