I’ve spent much of the past two weeks trying different ways to reduce over 200,000 bubbles (now almost 220,000) into a sensible catalogue. This gets very messy, so I will try to explain what I’ve been up to in stages. The process is called data reduction and, for a crowd-sourced citizen science project like the MWP, it can get complicated. I thought it might interest some of you to see where we currently are in the process of turning your clicks into results.
The key part of the data reduction problem is that we have a very large set of data (the massive number of bubbles that have been drawn) and need to decide which among them are ‘similar’ to each other. We need to keep some flexibility in our definition of similarity because, right now, I’m not sure exactly what ‘similar’ should mean.
Essentially, bubbles are ‘similar’ when two people draw a similarly sized bubble in a similar location. This sounds remarkably easy to say but was hard to do well in code. Comparing 200,000 bubbles with each other, pair by pair, is computationally intensive.
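To give a feel for the scale, here is a quick count of how many distinct pairs a brute-force, all-against-all comparison would involve. It is only an illustrative calculation using the bubble counts quoted above; the function name `pair_count` is made up for this example.

```python
from math import comb


def pair_count(n_bubbles):
    """Number of distinct pairs among n_bubbles, i.e. n * (n - 1) / 2."""
    return comb(n_bubbles, 2)


# The two bubble counts mentioned in the post.
for n in (200_000, 220_000):
    print(f"{n:,} bubbles -> {pair_count(n):,} pairwise comparisons")
```

That is roughly 20 billion pairs for 200,000 bubbles, which is why some way of narrowing down the candidates is needed before comparing them.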
In the end I decided that, since the size of the bubbles was a consideration, I would move across the galaxy looking at ever-decreasing size scales. To do this I split the galaxy into 2×2 degree boxes and take each box in turn. In each box I check whether there are bubbles of the order of the size of the box (meaning they have a maximum diameter between half a box and a whole box). If there are bubbles on that scale, I run a clustering algorithm and pick out groups of these bubbles whose central positions are clustered to within one quarter of the box size. If a cluster is found, those bubbles are saved and removed from the whole list. I then divide the box into four and repeat until no bubbles are found.
This method means that when a box contains no bubbles, we need not continue down in size scale, but when it does contain bubbles we always split and inspect the four child boxes. In this way we move through the galaxy, in ever-decreasing boxes, but in a fairly efficient manner.
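The box-splitting idea described above can be sketched roughly as follows. This is a minimal illustration and not the actual reduction code. The `Bubble` class, the simple greedy `cluster` function (the post does not say which clustering algorithm is used) and the stopping rule (stop when no bubble in the box is small enough to match this scale or a smaller one) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bubble:
    lon: float       # centre longitude, degrees
    lat: float       # centre latitude, degrees
    diameter: float  # maximum diameter, degrees


def cluster(bubbles, radius):
    """Stand-in clustering (an assumption): a bubble joins the first group
    whose seed centre lies within `radius` in both coordinates.
    Only groups with at least two bubbles are returned."""
    groups = []
    for b in bubbles:
        for g in groups:
            seed = g[0]
            if abs(b.lon - seed.lon) <= radius and abs(b.lat - seed.lat) <= radius:
                g.append(b)
                break
        else:
            groups.append([b])
    return [g for g in groups if len(g) > 1]


def reduce_box(bubbles, lon0, lat0, size, found):
    """Search one box, then recurse into its four child boxes.
    Clustered bubbles are appended to `found` and removed from `bubbles`."""
    # Bubbles centred in this box that could still match this scale or a smaller one.
    in_box = [b for b in bubbles
              if lon0 <= b.lon < lon0 + size and lat0 <= b.lat < lat0 + size
              and 0 < b.diameter <= size]
    if not in_box:
        return  # nothing left here, so no need to go down in size scale

    # Bubbles 'of the order of the box': between half a box and a whole box across.
    on_scale = [b for b in in_box if b.diameter >= size / 2]
    for group in cluster(on_scale, size / 4):
        found.append(group)
        taken = {id(b) for b in group}
        bubbles[:] = [b for b in bubbles if id(b) not in taken]

    # Divide the box into four and repeat.
    half = size / 2
    for dlon in (0.0, half):
        for dlat in (0.0, half):
            reduce_box(bubbles, lon0 + dlon, lat0 + dlat, half, found)
```

Calling `reduce_box(bubbles, lon0, lat0, 2.0, found)` once for each 2×2 degree box would then fill `found` with groups of similar bubbles, leaving the unmatched ones in `bubbles`.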
We also have to perform the same analysis with an offset grid. The procedure is exactly the same, but it makes sure we catch bubbles that fell on the borders of boxes in the first pass.
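As an illustration of what an offset grid means here, the sketch below lists the corners of the top-level boxes for a regular grid and for a shifted one. The post does not say how large the offset is; shifting by half a box is an assumption for this example, as are the function name `grid_origins` and the coordinate ranges used.

```python
def grid_origins(lon_min, lon_max, lat_min, lat_max, size=2.0, offset=0.0):
    """Lower-left corners of size x size degree boxes covering the region.
    A non-zero offset shifts the whole grid down and to the left, so the
    shifted grid still covers the original area."""
    origins = []
    lon = lon_min - offset
    while lon < lon_max:
        lat = lat_min - offset
        while lat < lat_max:
            origins.append((lon, lat))
            lat += size
        lon += size
    return origins


regular = grid_origins(0.0, 4.0, -1.0, 1.0)              # box edges at lon = 0, 2, 4
shifted = grid_origins(0.0, 4.0, -1.0, 1.0, offset=1.0)  # box edges at lon = -1, 1, 3, 5
```

With these values, a bubble centred at longitude 2.0 sits on a box edge in the regular grid but in the middle of a box in the shifted grid.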
Once we have passed across the galaxy on all size scales, we need to make sure we’ve cleaned up the duplicates created by the offset grid. We do this by considering our newly created list of ‘clean’ bubbles and running through them in order of size. When we find bubbles of a similar size and location they are combined, according to the number of users that drew that bubble. This can be done more easily now that there are far fewer bubbles (in my tests we have dropped to around 5% of the initial number by this stage).
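A rough sketch of this clean-up step is shown below. It is not the actual code. The thresholds in `similar` are hypothetical, and combining two bubbles as an average weighted by the number of users is only one reading of ‘combined, according to the number of users’.

```python
from dataclasses import dataclass


@dataclass
class CleanBubble:
    lon: float       # centre longitude, degrees
    lat: float       # centre latitude, degrees
    diameter: float  # degrees
    n_users: int     # how many users drew this bubble (assumed > 0)


def similar(a, b, pos_frac=0.25, size_frac=0.5):
    """Hypothetical similarity test: centres within pos_frac of the larger
    diameter, and the smaller diameter at least size_frac of the larger."""
    big = max(a.diameter, b.diameter)
    small = min(a.diameter, b.diameter)
    close = (abs(a.lon - b.lon) <= pos_frac * big
             and abs(a.lat - b.lat) <= pos_frac * big)
    return close and small >= size_frac * big


def merge(a, b):
    """Combine two bubbles, weighting each by its number of users."""
    n = a.n_users + b.n_users

    def weighted(x, y):
        return (x * a.n_users + y * b.n_users) / n

    return CleanBubble(weighted(a.lon, b.lon), weighted(a.lat, b.lat),
                       weighted(a.diameter, b.diameter), n)


def deduplicate(catalogue):
    """Run through the bubbles in order of size (largest first), merging
    each one into the first similar bubble already kept."""
    kept = []
    for b in sorted(catalogue, key=lambda c: c.diameter, reverse=True):
        for i, k in enumerate(kept):
            if similar(k, b):
                kept[i] = merge(k, b)
                break
        else:
            kept.append(b)
    return kept
```

Each bubble is only compared with the ones already kept, which is affordable once the list has shrunk to a few percent of its original size.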
My initial run only looked at bubbles in the longitude range 0-30 degrees. Below are three views of one image from the MWP set (one of my favourites, as lots of people see it differently). First you can see the image as it is shown to MWP users. Below that you see, overlaid in blue, the original bubbles as drawn by the users. In the third image you can see the same, but this time displaying the ‘cleaned’ results. In the original set the bubbles all have the same opacity, so that when they pile up you can see the similarities. The cleaned set gives the bubbles opacities according to their scores (more opaque bubbles mean more users drew them).
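For the curious, turning scores into opacities could be as simple as the sketch below. How the scores are actually defined and scaled is not stated here, so the linear scaling against the highest score and the `min_alpha` floor (which keeps faint bubbles visible) are assumptions for illustration only.

```python
def opacities(scores, min_alpha=0.1):
    """Map positive scores to opacities in [min_alpha, 1.0], scaled linearly
    so that the highest-scoring bubble is fully opaque."""
    if not scores:
        return []
    top = max(scores)
    return [max(min_alpha, s / top) for s in scores]


# Three bubbles drawn by 1, 5 and 10 users respectively.
print(opacities([1, 5, 10]))
```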
It should be noted that the cleaned image does not yet display arcs, but rather always shows an entire ellipse. This is because I am not yet including the bubble cut-outs (which you can make out in the middle image) in the data reduction. These will be included at a later time.
You can see that I’m still getting some duplication at the end of the process; I may need to sweep across the final catalogue repeatedly, looking for similar bubbles, until it converges and all bubbles are ‘unique’. I have been experimenting with this with mixed results but will continue my efforts.
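One way to picture such a repeated sweep is the loop below. It is a sketch under assumptions: `one_pass` stands for any single merging pass over the catalogue (a hypothetical function supplied by the caller), ‘convergence’ is taken to mean that a sweep no longer reduces the number of bubbles, and `max_sweeps` is a safety limit added for the example.

```python
def sweep_until_stable(catalogue, one_pass, max_sweeps=50):
    """Apply one_pass repeatedly until the catalogue stops shrinking.
    Returns the final catalogue and the number of sweeps performed."""
    for sweep in range(1, max_sweeps + 1):
        reduced = one_pass(catalogue)
        if len(reduced) == len(catalogue):
            return reduced, sweep  # nothing was merged this time: converged
        catalogue = reduced
    return catalogue, max_sweeps
```

A pass that merges in a slightly different order each time could still change the positions of bubbles without changing their number, so a stricter test may be needed in practice.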
If you’re still reading, I look forward to reading your comments. As I continue to make adjustments and progress with this reduction, I shall blog the results again. Many members of the science team are also having a go at this problem and so the final result may be quite different in the end as we improve things. I hope that this is an interesting insight into some of what goes on behind the scenes of the MWP.
12 thoughts on “Reducing the Data”
How does your algorithm account for the use of “small bubble” boxes? I seem to use the flag and small bubble/green knot/red fuzzy object quite a bit.
Thank you for your time and effort on this project! I’m a complete novice, but I hope that I am contributing in a small way.
This is really interesting. Thanks for putting this up. I am somewhat disconcerted by seeing all the bubbles that have been drawn on the example. I would have identified only a few of the more obvious bubbles. Am I missing something?
Masla1 – no the current reduction does not yet include the ‘small bubbles’. Everyone’s drawings help us to create a better picture of our galaxy at these wavelengths – you are definitely contributing.
Beth – don’t worry. Everyone sees these things in a different way. It may well be that after reading this you’ll start to see fainter bubbles more easily, but that isn’t my intention.
The point of many Zooniverse projects is that we take the crowd’s view on the data. If only a small percentage of people think a bubble exists at a location then that bubble can be seen as less ‘likely’ to be real than a bubble that all users draw. When dealing with hundreds of thousands of bubbles, we can use these weightings to determine how likely a bubble is to be real.
This is so awesome. Could you crowd-source cleaning the data?
I read somewhere in the Talk section that bubbles were counted as real when a large number of people draw them. I get pretty detailed in my bubble drawing because, as I learn more, I have developed an algorithm for finding bubbles in less than prime representations. I had some gnawing doubts about whether my efforts were way overboard (waste not a minute, they don’t come back!) and not likely to be counted. From your description of the method you use to sift the data, I am reassured that I should just do my own thing and let the data talk. Thanks for the detailed explanation.
Thank you for updating the blog. Getting a view behind the scenes is always impressive. And following how you search for an algorithm makes you more human.
I spent ages drawing the objects and bubbles, and the more I looked the more appeared. Each image (I didn’t study many) seemed to make me tired (eye strain?) and I thought I was producing overkill, but now that I’ve read your comments and the description of your method of data reduction it seems I could have done more!
Thank you for the helpful data. The only thing I don’t understand is that the yellow object hasn’t been overlaid in image 2, yet it is circled in image 3.
When we see yellow objects should we circle them as though they are bubbles?
I am so grateful for the information provided by everyone on this project. I absolutely love viewing the images. However, I do wish the actual images were ‘crisper’ like on Glimpse. I find it much easier to identify things in those images than in our images here.
I too had concerns as to my efforts in searching diligently for all the bubbles within each image. I’m so glad to know even if I go all out on marking the bubbles that it does help in some way. Does anyone know when the ‘small bubble’ option as well as all others except the full bubbles will produce useful information for this project?
~Bright Blessings All!
Tina – look out for a blog post very soon about the small bubbles, star clusters etc!