On 2013-01-08 17:15, Brian Hechinger wrote:
On 1/8/2013 9:24 AM, Johnny Billquist wrote:
Data mining is difficult, since there are different systems, with
different possibilities of extracting it, and in different formats.
Sampsa's goal here is mapping HECnet. My goal is to write a data mining
service that just happens to provide the data that a mapper would use.
To that end, I'll be stashing all this data in a database of some sort
(details unknown as yet, see below)
Right. So, once we have the data, it can be used by other people in various ways. So let's
focus on the data.
A centralized repository of data is nice in many ways, but it is a
headache to manage.
Absolutely it is, but if people are already putting INFO.TXT files out
there, they are doing 99% of the work already. We just need to get the
data in a single place.
I think it's better to keep the information separate. INFO.TXT was created for one
purpose, and this is trying to reuse it for another, additional purpose. Better to create a
separate file. In addition, we can make some better design choices as we go about this.
That said, I could be convinced to set up something semi-automatic.
A reasonable way would be for people to give me machines to poll,
and then I'd set up an automated process to poll those machines for
files in a specific format. I can then create a database out of that,
and make it available through the web, as well as over DECnet, and
also as a summarized file. Anything would be pretty easy if we just
have the data collected.
I think this would be fantastic.
Ok. So let's set about working on this. Anyone else want to join? We should probably
keep this off list, as it will be rather technical, and present a solution when we have
it.
I already have something of a start for this in the form of my
database of nodes in HECnet. I'd need to extend it with more fields,
but that would be pretty easy. It's all in Datatrieve today, and that
should be accessible over DECnet right now (even though I seem to
remember that VMS hosts had some problems with that).
I'll have to learn how to access that db.
It should be trivial. Datatrieve has a programming interface, which should be callable from any
language.
I'm already extracting information from that database for the hecnet
web-page on MIM (accessible as Madame).
So, if we can just decide on what we want, and how to make the
information available, I'll sit down and write the code to fix it.
Do you want to also store my data or should I do that myself? I might do
it myself at least for now until I know what exactly I need/want to save.
Let's start by talking about exactly what we want to store, and how to retrieve it, and
possible uses of it, to make it meaningful. We can then go about how and where to store
it. I even think that it wouldn't be a problem to store it in several places, scrape
the source from several places, and present data and services based on this from several
places.
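To make the poll-and-collect idea above a bit more concrete, here is a minimal sketch of the "parse the per-node files and put them in a database" step. Everything in it is an assumption for illustration only: the `KEY: value` file format, the field names (`name`, `address`, `location`), the helper names `parse_node_info` and `store_records`, the sample node data, and the use of SQLite are all hypothetical and not anything decided in this thread. Fetching the files over DECnet is left out entirely.

```python
import sqlite3


def parse_node_info(text):
    """Parse 'KEY: value' lines into a dict (hypothetical file format).

    Blank lines, '#' comments, and lines without a colon are skipped.
    Keys are lower-cased so 'NAME' and 'name' are treated the same.
    """
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        record[key.strip().lower()] = value.strip()
    return record


def store_records(conn, records):
    """Insert or update one row per node, keyed on the node name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes ("
        "name TEXT PRIMARY KEY, address TEXT, location TEXT)"
    )
    for rec in records:
        if "name" not in rec:
            continue  # a record without a node name cannot be keyed
        conn.execute(
            "INSERT OR REPLACE INTO nodes (name, address, location) "
            "VALUES (?, ?, ?)",
            (rec["name"], rec.get("address"), rec.get("location")),
        )
    conn.commit()


# Made-up sample files standing in for whatever the poller would fetch.
SAMPLE_FILES = [
    "NAME: NODEA\nADDRESS: 1.1\nLOCATION: Somewhere\n",
    "# a comment line\nname: NODEB\naddress: 1.2\n",
]

conn = sqlite3.connect(":memory:")
store_records(conn, [parse_node_info(text) for text in SAMPLE_FILES])
rows = conn.execute(
    "SELECT name, address, location FROM nodes ORDER BY name"
).fetchall()
print(rows)
```

Because each file is parsed independently and rows are keyed on the node name, the same files could be collected in more than one place and loaded into more than one store without the copies conflicting.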
Johnny