This internship included my first attempt at data-mining a large corpus of information. The long and the short of it is that I am not a natural data-miner.
Allow me to expand.
One of the first lessons undergraduates learn about essay writing is that having too little to write is far more crushing than having too much. Having too much to say and not enough room to say it in is immensely frustrating (and, if truth be told, it can sometimes leave you asking “well, why did I bother to research it then?”). On the other hand, the experience of not having enough to say contains all of this, along with the bitter sting of inadequacy. However, this project was the first time I ever had so much data that I was paralysed by it.
Instead of experimenting with the spreadsheets, documents and personal notes the team had provided me with, I became immobile. There just seemed to be so much to tackle, and so many avenues within the information that I could potentially explore, that I struggled to find a starting point. Since I had so little experience in this area, I kept switching between tools and techniques, afraid I was missing something or doing something in an unnecessarily difficult way. This made it a challenge to establish a consistent workflow. This blog post will examine the elements that led to this issue, and the skills I learned from the experience.
Part of the problem was how singular the work of another seemed. In a previous post, I discussed how unnatural it felt to be chopping up another’s work, deciding how it should be stylistically moulded and what was or was not vital to include. That sense of intruding on the work of another was part of the same fear I had while data-mining. This was not just the work of some other student, but of individuals who are experts in their field, and who had been good enough to give me access to their process and results. The combination of workload and a feeling of inadequacy kept me from the data for far longer than it should have, and as such, I am left with the shaming knowledge that I probably didn’t do as much with it as I could have. This was not disastrous, since I had requested some experience with data-mining and it was not a key role in my internship, but it was a personal disappointment.
The information I was working with
It might be useful to give some idea of the scale and content of what I was working with. The overall collection comprises work from five different libraries, with thousands of books in each collection, and the total number kept growing until the eleventh hour. All of this information was broken down into different sections including title, display title, first, third and secondary authors, first, third and secondary languages, graffiti, etc. Within all this I had to discern some pattern or interesting trend to translate into an engaging visualisation, which the team would also find intriguing.
To say I was overwhelmed is putting it lightly.
In addition to struggling with creating and implementing a strategy for working with Big Data, I also got wrapped up in personal interests, for instance tracking the presence of certain authors or exploring the markings left by previous owners. At first these digressions seemed a way into analysing the data. Eventually, however, they became more of a hindrance than a help.
The biggest struggle I had while data-mining was focusing on the most pertinent information and not over-burdening myself with work. The data I was specifically looking for was to do with language, gender and patterns in the work of the most prolific authors within the catalogue. When I finally did dive in, I kept it simple. My general strategy for this work has been to get the simple stuff right, and build from there. Then at least the team and I would both walk away with something: me with some experience of working with Big Data, and them with some basic infographics. I kept this idea in mind. In addition to my own fears, there was the knowledge that the database designer was also working with the data, and had more training and hands-on experience than I did.
What happened in the end?
As time began to run short, I reached out to other individuals on my course who had previous experience with Big Data and data-mining. At first they were full of intricate and dynamic solutions and tools to try… and then they heard how long I had left to complete the project. At which point they laughed and said “forget it… you don’t have long enough.” So I was left to trundle along as best I could. However, they did give me tips on best practice, and were useful to bounce ideas off when deciding which information was most vital to retrieve.
This simple approach worked best for the infographics that I was creating. I saved a list of basic questions to answer in a Word doc, and when I had responses I consolidated them into simple two-line answers. These were the questions I began to answer with tools such as Datahero, which, while not strictly data-mining software, was able to produce high-quality charts and graphs of the data, many of which were included in the final infographics.
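For anyone who would rather script this question-first step than use a charting tool, a minimal sketch might look like the following. It is only an illustration and not what was used in the project: the column names (`first_author`, `first_language`), the helper functions and the sample rows are all hypothetical stand-ins for a catalogue export.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for a catalogue export.
# The column names are assumptions, not the project's real field names.
SAMPLE_CSV = """title,first_author,first_language
Book A,Author One,English
Book B,Author Two,Latin
Book C,Author One,English
Book D,Author Three,French
Book E,Author One,Latin
"""


def count_column(rows, column):
    """Count how often each non-empty value appears in one column."""
    values = ((row.get(column) or "").strip() for row in rows)
    return Counter(value for value in values if value)


def answer_basic_questions(csv_text, top_n=3):
    """Answer two simple questions: which languages and which authors appear most?"""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "languages": count_column(rows, "first_language").most_common(top_n),
        "authors": count_column(rows, "first_author").most_common(top_n),
    }


answers = answer_basic_questions(SAMPLE_CSV)
for question, counts in answers.items():
    print(question, counts)
```

Each question maps to one small count, which keeps the output close to the two-line answers described above and leaves the charting to whichever tool is to hand.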
In hindsight, the biggest mistake I made was trying to do everything at once, which meant that initially I succeeded in doing very little of value. It wasn’t until I began slicing the data up more confidently, and in a way that made sense to me, that I made any headway with it. In terms of project management, I learned the importance of being assertive with yourself, and honest about when you need more help or guidance from more experienced members of your team. And of course, not to let yourself get bullied by Big Data.