cd /sparky-scraping/

Cheating Sparky with over 10GB of JSONs, and lots of scripting

· by jill

Scraping images and comparing them to win virtual currency? It's more likely than you think...

Let me set the scene.

Sparky is a Discord bot based around guessing Geometry Dash custom levels and naming them. It's a silly little bot not too many care about - but it's developed sort of a cult following within the community.

Recently, I've been toying around with it in a private server, mostly out of boredom and because, well, to be honest, it's kind of fun. The issue is, I kind of suck at it, and no hints are given to help you learn the levels. It's purely a test of how well you know the game's custom levels.

I started to wonder if there was any way you could cheese this system - the official Geometry Dash server gets a lot of usage from the bot in public channels, so I figured surely you could scrape that to get a database of levels and images, letting you recognize levels automatically.

It was just a theoretical idea floating around my head for the longest while - I feared maybe I shouldn't, since the consequences might not be great. But, one day, I did just that, out of sheer curiosity - and it went so much better than I expected.

Chapter 1: Naive Beginnings

First, I needed to scrape the channels. I used DiscordChatExporter on every Sparky channel in the GD server, saving the history as JSON (regrettably, as you'll see later on) and hoping I could make sense of it later. I figured, surely I could work out the level names just by scanning through the messages, and so I left it running until it eventually errored due to a temporary network failure.

Turns out, it was left on for 12 hours, leaving behind 10GB of JSON. Uh oh.

Yeah, so, I wasn't sure how to deal with this now - the fourth channel download got interrupted by said temporary network failure, and it was still 1.3GB, which is probably larger than any JSON I've ever worked with before.

I... wrote up a quick script, hoping I had enough RAM to just deal with the upfront cost of having it all loaded at once, and...

Yeah, this isn't going to work.

Eventually, after battling my way through the Node.js documentation (since I was, for some reason, stubborn about doing it in Node instead of a more appropriate language), I found out about streams and how I could use those to parse the JSON with a library and manipulate it on the fly. Great! That's that mostly done, but it's still a lot of data...

With a handy library called stream-json, though, I managed to actually go through the entire set of messages. It took a fairly long while, but that's to be expected - at this point the bottleneck was my hard drive's speed.

Filtering this dataset down to messages sent by Sparky alone, we get around 4,000,000 messages. That's quite a few! But we have the raw data set up (or so I thought), so now it shouldn't be too hard to throw it at my computer and let it do its thing, right?

Chapter 2: Throwing It At My Computer And Letting It Do Its Thing

Right, so how do we do this?

Each time you start a guessing game with Sparky, you can make guesses towards a level. Eventually, you either get the level, and it'll award you points, or you'll time out and receive nothing. (You lose.)

When we're going through the data, there are really only two messages we're concerned with:

  1. The image itself (so we can figure out a level from an image later on)
  2. The response from Sparky, letting us know we've gotten it right

We can extract the image URL from the first and store it for later, and pattern match on the second to get the level title, properly capitalized. That part's the easy part!

I wrote a quick funneling script that kept track of the last image and stored said image with the level name whenever there's a match. It worked perfectly and now we have a CSV of image URLs correlated with level names!
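A minimal sketch of what such a funneling step could look like, not the actual script. The message shape (`imageUrl`, `content`) and the wording matched by `CORRECT_GUESS` are assumptions made up for illustration; Sparky's real messages and the exporter's real field names will differ.

```javascript
// Assumed pattern for a "correct guess" response; Sparky's real wording differs.
const CORRECT_GUESS = /guessed the level \*\*(.+?)\*\*/;

// messages: hypothetical simplified shape { imageUrl?: string, content?: string },
// in chronological order for a single channel.
function pairImagesWithLevels(messages) {
  const pairs = [];
  let lastImageUrl = null;
  for (const message of messages) {
    if (message.imageUrl) {
      // A new round started: remember its image until someone gets it right.
      // If the previous round timed out, its image is simply overwritten.
      lastImageUrl = message.imageUrl;
      continue;
    }
    const match = CORRECT_GUESS.exec(message.content ?? '');
    if (match && lastImageUrl) {
      pairs.push({ imageUrl: lastImageUrl, level: match[1] });
      lastImageUrl = null; // one image, one answer
    }
  }
  return pairs;
}

const sample = [
  { imageUrl: 'https://example.invalid/round-1.png' },
  { content: 'wrong guess, try again' },
  { content: 'jill guessed the level **Bloodbath**!' },
];
console.log(pairImagesWithLevels(sample));
```

Each resulting pair would become one row of the CSV.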

Right. What do we do now?

At first I figured I could deduplicate the list, making sure each level only appears once - this seemed to work, bringing the estimated level count to around 2,000, which is about how many you'd approximate by working backwards from your completion percentage.

Once we're done with that, it should be as simple as checking the image URLs, right? Wrong! Sparky is quirky in that it always uploads each image for each guessing game, instead of grabbing them from a single static location. I guess in that sense it's smart, to prevent people like me from doing what I'm currently doing.

Well then, we could do the next best thing - compare checksums. Getting the checksums for 2,000 images was pretty quick, and so it was then time for me to put this to the test!
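As a rough sketch of the checksum idea, assuming SHA-256 over the raw file bytes and an in-memory `Map` as the lookup table: the helper names and the stand-in "image" bytes below are made up for illustration, and only Node's built-in `crypto` module is used.

```javascript
const { createHash } = require('node:crypto');

// Hex-encoded SHA-256 of a Buffer of file bytes.
function sha256Hex(bytes) {
  return createHash('sha256').update(bytes).digest('hex');
}

// entries: [{ bytes: Buffer, level: string }] -> Map from checksum to level name
function buildLookup(entries) {
  const lookup = new Map();
  for (const { bytes, level } of entries) {
    lookup.set(sha256Hex(bytes), level);
  }
  return lookup;
}

// Returns the level name for byte-identical images, or null if unknown.
function guessLevel(lookup, bytes) {
  return lookup.get(sha256Hex(bytes)) ?? null;
}

// Stand-in bytes; a real run would hash the downloaded image files.
const known = buildLookup([
  { bytes: Buffer.from('pretend-png-bytes-1'), level: 'Bloodbath' },
]);
console.log(guessLevel(known, Buffer.from('pretend-png-bytes-1')));
console.log(guessLevel(known, Buffer.from('pretend-png-bytes-2')));
```

The lookup only succeeds when the file is byte-for-byte identical to one that was hashed before; any other file comes back as unknown.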

At first it seemed to work, guessing a level correctly:

But it very frequently couldn't, instead failing instantly:

What I figured was happening was that our hashmap was incomplete, since it was built from the deduplicated level list - if a level's image has been updated, or a level has multiple images, it isn't guaranteed to be guessed correctly. However, deduplicating the list was also the only optimization that kept the whole thing from becoming overly convoluted.

If we do not deduplicate the list, we end up with 1.3 million correct guesses whose image hashes we need to deduplicate. What do we do now?

Chapter 3: Am I Really Going To Have To Download 1.3 Million Images

(Yes, I really am.)

I wrote up another script to scrub through our previous guess list, made it concurrent, and shoved it onto a server for extra performance. It took a hot minute to run, and I had to spend a few hours fixing networking issues that kept crashing the whole thing, but after a while it came back to me with a nice, clean CSV of hash values and level names, now deduplicated by the hash.

Nice! Now we plug it into our previously made guessing bot, and...

...from my testing, it seems to have a 99% success rate. Which would make sense, because the Sparky channels in the GD Discord are highly active, and practically every level has been guessed there at least once. Win!

I decided from here on out I had two options:

  1. Use this For Evil, and
  2. Use this For Good.

Surprisingly, I chose option 2 - I realized I was about to start a very long and convoluted cat-and-mouse game if I were to release any data involved in this, so I kept it to myself.

Right, but what could I use this data for? My first thought was to cheat the game, but that's not very fun whatsoever, and I'd want to do it on a throwaway Discord account (which I didn't have the energy to deal with). I instead decided to set up a hints system, letting you know the number of letters in a level's name, and then revealing the name if it is not guessed correctly.

And so, that's how it worked for a pretty long while. It was very durable, only failing on Legendary levels (which only had a small chance to even appear). Hooray!

Chapter 4: Uh oh

I fired up the bot one day to play a little bit, only to be greeted by...

This was certainly odd. The bot was no longer recognizing a single level, constantly erroring and not finding a single matching hash. I checked manually, and there were 0 results for each image. This all seemed very odd, until I compared two images...

The images had been upscaled from 480p to 720p, inexplicably, so the SHA256 hashes in our hashmap.txt no longer matched. Hm. This kind of invalidated our entire dataset...

Well, fuck! That's kind of silly! Huh! God damn!

For a normal person, this would've just been an "oh well" moment - it's broken, move on in life. For my neurodivergent ass, not so much. I dropped everything I was doing to work on a solution. (Not actually. It took me a week or so.) It was back to the drawing board for me...

One solution I had considered even earlier on was hashing the image based on its contents rather than using a standard file hash. That way, tiny differences in the image would result in tiny differences in the hash, rather than an entirely different hash. However, finding such an algorithm proved rather difficult when I first looked.

I tried again, and found a handy little algorithm implemented by a handy little package. This one uses a clever solution:

  1. Reduce the size of the image, to make the next steps easier; 32x32 works well
  2. Reduce the colors into just plain grayscale
  3. Compute the discrete cosine transform (DCT), which is what JPEG, for instance, uses to decompose an image into a set of frequency coefficients
  4. Only keep the top-left 8x8 block, since those coefficients represent the lowest frequencies
  5. Compute the mean of the DCT for later
  6. Use the mean as a threshold to turn each frequency into a single bit, leaving us with 64 bits, which is our hash!
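The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: steps 1 and 2 are assumed to have been done already by an image library, so the input is a 32x32 array of grayscale values; the DCT is a plain unnormalised DCT-II; and the mean is taken over the whole 8x8 block. Real implementations may scale coefficients differently or leave the first (DC) term out of the mean, so the bits produced here won't match any particular library's output. All function names are made up.

```javascript
const SIZE = 32; // side length after downscaling (step 1)
const LOW = 8;   // side length of the low-frequency block (step 4)

// Unnormalised 1D DCT-II: X[k] = sum over i of x[i] * cos(pi/n * (i + 0.5) * k)
function dct1d(values) {
  const n = values.length;
  const out = new Array(n);
  for (let k = 0; k < n; k++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += values[i] * Math.cos((Math.PI / n) * (i + 0.5) * k);
    }
    out[k] = sum;
  }
  return out;
}

// 2D DCT (step 3): transform every row, then every column of the result.
function dct2d(matrix) {
  const rows = matrix.map(dct1d);
  const out = rows.map((row) => row.slice());
  for (let c = 0; c < rows[0].length; c++) {
    const column = dct1d(rows.map((row) => row[c]));
    for (let r = 0; r < rows.length; r++) out[r][c] = column[r];
  }
  return out;
}

// gray: SIZE x SIZE array of grayscale values. Returns a 64-bit hash as a BigInt.
function perceptualHash(gray) {
  const freq = dct2d(gray);
  const low = [];
  for (let r = 0; r < LOW; r++) {
    for (let c = 0; c < LOW; c++) low.push(freq[r][c]); // step 4
  }
  const mean = low.reduce((a, b) => a + b, 0) / low.length; // step 5
  let hash = 0n;
  for (const value of low) {
    hash = (hash << 1n) | (value > mean ? 1n : 0n); // step 6
  }
  return hash;
}

// Synthetic diagonal gradient standing in for a downscaled screenshot.
const gradient = Array.from({ length: SIZE }, (_, r) =>
  Array.from({ length: SIZE }, (_, c) => (r * 4 + c * 4) % 256));
console.log(perceptualHash(gradient).toString(16).padStart(16, '0'));
```

Because every bit is a comparison against the mean of the same block, uniformly brightening or darkening the image leaves the hash unchanged, and the downscaling in step 1 is what makes a resolution change mostly irrelevant.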

I re-ran the hashing script, now with this algorithm, and used the newly generated hashmap (which ended up being smaller) for the same bot.

And... it works!

This time, I compared hashes using Hamming distance (the number of bits that differ between two hashes) rather than looking for an exact match - this let me pick the closest guess, which is not amazingly useful on its own, but makes the bot more future-proof, even if additional elements get overlaid on the images later.
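A small sketch of nearest-match lookup by Hamming distance, assuming the 64-bit hashes are held as BigInt values. The function names, the entry shape, and the example hash values are all made up for illustration.

```javascript
// Number of bit positions in which two non-negative BigInt hashes differ.
function hammingDistance(a, b) {
  let diff = a ^ b;
  let count = 0;
  while (diff > 0n) {
    count += Number(diff & 1n);
    diff >>= 1n;
  }
  return count;
}

// entries: [{ hash: BigInt, level: string }]
// Returns { level, distance } for the closest entry, or null if entries is empty.
function closestLevel(entries, hash) {
  let best = null;
  for (const entry of entries) {
    const distance = hammingDistance(entry.hash, hash);
    if (best === null || distance < best.distance) {
      best = { level: entry.level, distance };
    }
  }
  return best;
}

// Made-up hashes: the query differs from the first entry in exactly one bit.
const hashmap = [
  { hash: 0xf0f0f0f0f0f0f0f0n, level: 'Bloodbath' },
  { hash: 0x0123456789abcdefn, level: 'Cataclysm' },
];
console.log(closestLevel(hashmap, 0xf0f0f0f0f0f0f0f1n));
```

A linear scan like this is fine for a few thousand hashes. One possible refinement (again an assumption, not something the original bot is known to do) is rejecting matches whose distance exceeds some cutoff, so that a never-before-seen level is reported as unknown rather than as its nearest neighbour.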

I was pretty proud of this solution! So proud, in fact, that I decided to step up the scale of operations. I had ironed out just about everything about this whole operation - so it was about time I upgraded my dataset.

Chapter 5: Throwing It At My Computer And Letting It Do Its Thing, Again

I threw the Discord exporter onto a remote server this time, exporting the cumulative messages from 2 Discord servers rather than just 1, still in JSON format (because there's no other supported format that will give me the embeds' attachments) but now partitioned into 50MB chunks. I also filtered exclusively by Sparky's messages, since that's all I ended up needing.

This gave me around 9.7GB of data - just under what I got the first time from fewer channels, thanks to the filtering. This already meant I could process the raw data a lot faster!

Next, I edited the first script to store the difficulties as well - this gave me every bit of metadata a level could have.

After adjusting the other scripts to account for the newly stored difficulty, I started running every step of the process. It took around 3 days! It was a grueling process due to many unforeseen errors, segmentation faults, and other nonsense. But that left me with one small, albeit very powerful file.

This was a CSV of every unique phash, along with the level name and difficulty. Standing at 7,172 hashes, it was the result of all my hard work. Plugging it into my previously created bot gave me a nearly perfect success rate. (It still didn't get a couple of Legendary levels, but it's likely they've never been guessed before.)

This left me back at the question of what to do with this, however. I started this project as a proof-of-concept, and didn't consider what to do once it worked.

And that's where I stand right now! I've decided I will just keep the code and the results to myself, only sharing the experience of such a unique project. It was fun!