Google API WebCrawler


It all started when...

We received a file of accounts that appeared to be normal. The accounts had correct addresses, names, and everything else important. However, we could not locate any numbers for the accounts, even though we were positive the locations were correct! So how do we gather more information on an account without using some of the major leaders in information? The web was the best place to start.

So, where on the web did we start? Anyone can Google a name/address/etc. and find that information. But what about the data buried within pages of subpages of the internet? Enter the GoogleApi WebCrawler project.

This little application is built on Google's API library, which covers many different APIs (Maps/Natural Language/YouTube/Google+/etc.). First, we had to decide what framework and programming language to build it in. With so many different options for building this out, we settled on a simple C# .NET Framework application. So, how did we implement this project, you ask?

Once we decided on the framework, we set off to create an "all-in-one" solution for our crawler. First, we started with a bit of code:

using Google.Apis.Customsearch.v1;
using Google.Apis.Customsearch.v1.Data;
using Google.Apis.Services;

These were key in working out how we were going to gather this data. Without these libraries provided by Google (thanks, Google!), we would have had to spend a significant amount of time developing our own code to read each page and its metadata to determine a correct match! Once we had the library packaged, we moved on to creating the code that handles a CustomsearchService request to Google, like so:

        //The API key for our project (redacted)
        private static string API_KEY = "xxxx";

        //The custom search engine identifier (redacted)
        private static string cx = "XXX";

        public static CustomsearchService Service = new CustomsearchService(
            new BaseClientService.Initializer
            {
                ApplicationName = "XXX",
                ApiKey = API_KEY,
            });
From here, we did a bit of magic inside both the code and the Custom Search API to filter our results down to what we deemed acceptable sources! We eventually pulled it all together into a small search query in this function:

        string query = queryObj.Text;
        var results = Search(query);

        foreach (Result result in results)
        {
            Console.WriteLine("Title: {0}", result.Title);
            Console.WriteLine("Link: {0}", result.Link);
        }
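
For readers following along: the Search helper called above was not shown in the original snippets. A minimal sketch of what it might look like, assuming the Service and cx fields defined earlier and the v1 Custom Search client library of that era (pagination and error handling omitted):

```csharp
// Sketch of a Search helper (hypothetical, matching the call above).
// Uses the CustomsearchService ("Service") and engine id ("cx") from earlier.
// The Custom Search API returns at most 10 results per request, so real
// code would page through additional results via the request's Start field.
private static IList<Result> Search(string query)
{
    var listRequest = Service.Cse.List(query); // build the list request
    listRequest.Cx = cx;                       // target our search engine

    var searchResponse = listRequest.Execute(); // synchronous API call
    return searchResponse.Items;                // may be null when no hits
}
```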

Voilà! We have results for our search! So, now that we have the data for each page that MAY contain the account/address we are looking for... what next? We look at our records internally to see if we have any matches, then we look externally at what this data is associated with (with our Google API again!) to see if we have a match. Once we find a match, the information is ready for our account experts to make informed decisions about the validity of the data!
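
To make that internal-match step concrete, here is a simplified, hypothetical illustration (the helper names and normalization rules are ours for this example, not the production logic): normalize an address pulled from a crawled page, then compare it to the address on the account.

```csharp
using System;
using System.Linq;

// Hypothetical illustration of the internal matching step. Real logic
// would be fuzzier and weigh multiple fields (name, address, city, etc.).
static class AddressMatcher
{
    // Lowercase and strip punctuation/extra whitespace so that
    // "123 Main St." and "123  main st" compare equal.
    public static string Normalize(string value)
    {
        var cleaned = new string(value.ToLowerInvariant()
            .Where(c => char.IsLetterOrDigit(c) || c == ' ')
            .ToArray());
        return string.Join(" ",
            cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }

    public static bool IsLikelyMatch(string crawled, string onAccount)
    {
        return Normalize(crawled) == Normalize(onAccount);
    }
}
```

With this in place, AddressMatcher.IsLikelyMatch("123 Main St.", "123 main st") returns true, while clearly different addresses do not match.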

So there you have it: with a bit of coding and some help from Google, we can match accounts almost perfectly and find information on accounts we would normally never be able to find using conventional methods!

Follow us on LinkedIn to find out other interesting topics from Joseph, Mann & Creed!