To recap from previous posts, CyberMapper is an application I'm developing to assist my research efforts here at WSU. It's a web application written in PHP and backed by a MySQL database. It currently runs locally on my computer and lacks a suitable UI, so everything happens at the command line. But I'm planning to get it running live, with a slick UI (maybe pseudo-slick).
OK, so what does CyberMapper do?
The application is designed to retrieve a Google or Google Blog Search result set (News to be added soon). The software then extracts the links, descriptions, and site-title information and loads them into a database. The initial search provides up to 1,000 URLs that serve to seed a more in-depth web crawl. CyberMapper then initiates the crawl, collecting all text, image, and multimedia data contained within each site, parsing out all outbound links, and saving everything to the database.
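To make the extraction step concrete, here is a minimal PHP sketch of how a page's title, description, and outbound links might be pulled out and stored. The table name (seed_pages), column names, and database credentials are my own placeholders, not the actual CyberMapper schema.

```php
<?php
// Minimal sketch of the "extract and store" step. Assumes a table
// seed_pages(url, title, description) and local MySQL credentials;
// the real CyberMapper schema and parsing rules may differ.

$pdo = new PDO('mysql:host=localhost;dbname=cybermapper', 'user', 'password');

function extractPageInfo(string $html, string $baseUrl): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);            // suppress warnings from messy real-world HTML

    // Site title from the <title> tag
    $titleNodes = $doc->getElementsByTagName('title');
    $title = $titleNodes->length ? trim($titleNodes->item(0)->textContent) : '';

    // Description from <meta name="description">, if present
    $description = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
        }
    }

    // All outbound links on the page
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && $href[0] !== '#') {
            $links[] = $href;
        }
    }

    return ['url' => $baseUrl, 'title' => $title,
            'description' => $description, 'links' => $links];
}

// Store one page record; in practice the links would go into their own table.
function savePage(PDO $pdo, array $info): void {
    $stmt = $pdo->prepare(
        'INSERT INTO seed_pages (url, title, description) VALUES (?, ?, ?)'
    );
    $stmt->execute([$info['url'], $info['title'], $info['description']]);
}

$html = file_get_contents('http://example.com/');
savePage($pdo, extractPageInfo($html, 'http://example.com/'));
```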
The user can define the crawl depth, meaning I can run a "search and save" crawl that goes one step away from the seed site, or more. For example, if you set the distance to 4, the crawl will collect pages up to 4 links away from the seed site. The user can also define the number of seed sites (provided via the Google search results) to use when conducting a crawl. I suggest keeping both numbers small, because things can get disastrous quickly unless you are prepared to handle terabytes of data.
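Here is a rough sketch of how the depth-limited crawl could work as a breadth-first traversal. The fetchOutboundLinks() helper and the queueing logic are illustrative assumptions on my part; the real crawler also stores page text, images, and multimedia along the way, which this leaves out.

```php
<?php
// Rough sketch of a depth-limited crawl: breadth-first, tracking each
// page's distance from its seed site. Illustrative only -- no politeness
// delays, robots.txt handling, or content storage.

function fetchOutboundLinks(string $url): array {
    $html = @file_get_contents($url);
    if ($html === false) {
        return [];
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^https?://#', $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Crawl out to $maxDepth links from each seed site.
function crawl(array $seedUrls, int $maxDepth): array {
    $visited = [];
    $queue = [];
    foreach ($seedUrls as $url) {
        $queue[] = [$url, 0];          // [url, distance from seed]
    }

    while ($queue) {
        [$url, $depth] = array_shift($queue);
        if (isset($visited[$url])) {
            continue;                   // already collected this page
        }
        $visited[$url] = $depth;

        if ($depth < $maxDepth) {
            foreach (fetchOutboundLinks($url) as $link) {
                if (!isset($visited[$link])) {
                    $queue[] = [$link, $depth + 1];
                }
            }
        }
    }
    return $visited;                    // url => distance from its seed
}

// Example: two seed sites, collecting pages up to 4 links away.
$pages = crawl(['http://example.com/', 'http://example.org/'], 4);
```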
In addition to what the software already does, I'm also trying to develop a text analysis tool to process the parsed content of the sites. There are software tools already available, of course, but this simple tool could be useful for deciding which of the collected data to load into a package like Atlas.ti. The data collected from each crawl can also be exported to an Excel workbook; in my case, I'm using Stata 10.
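As a sketch of that export step, something like the following would dump crawl results to a CSV file, which opens directly in Excel and reads into Stata 10 with insheet. The crawl_pages table and its columns are placeholders for whatever the actual schema looks like.

```php
<?php
// Sketch of exporting crawl results to CSV. Assumes a crawl_pages table
// with these columns; adjust to the real schema.

$pdo = new PDO('mysql:host=localhost;dbname=cybermapper', 'user', 'password');

$out = fopen('crawl_export.csv', 'w');
fputcsv($out, ['url', 'title', 'description', 'crawl_depth']);   // header row

$stmt = $pdo->query('SELECT url, title, description, crawl_depth FROM crawl_pages');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    fputcsv($out, $row);
}
fclose($out);
// In Stata 10: insheet using crawl_export.csv, comma
```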
But will I meet my own July 15th deadline? Probably not; we are in the middle of moving, and other work responsibilities are calling. But I'm close to being able to share this project with others, yeah!