2024-12-13

StoryPark: Download All the Stories!

One of our kids goes to a place that uses a platform called StoryPark, where the teachers share stories about what the kids did at school. It’s been a year already, and apparently there are around 195 stories (+/-5 days x +/-48 weeks). The platform lets us parents download the stories as PDFs, but if I were to do that manually, I don’t know man, it probably would’ve taken me days.

Solution: Find the PDF-generating endpoint and the story IDs, then automate it using Python or your preferred scripting language!
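
To make the "endpoint" part concrete, here is a rough sketch of how the print-preview URL used in the script in this post could be assembled for a single story. The query parameters are copied from that script; the helper name build_pdf_url and the story ID 12345 are made up for illustration, and I'm assuming the server treats the percent-encoded query the same way as the raw one.

```python
import json
from urllib.parse import urlencode

# Endpoint pattern taken from the script in this post
BASE = "https://app.storypark.com/stories/{story_id}/print_preview.pdf"

def build_pdf_url(story_id):
    """Return the print-preview PDF URL for one story ID (hypothetical helper)."""
    params = {
        "pdf": "true",
        "layout_type": "default",
        "fontsize": 15,
        "image_padding": 0,
        "text_width": 426,
        "media_width": 284,
        # Sections to include in the PDF, sent as a compact JSON list
        "filtered_options": json.dumps(
            ["story_content", "comments", "learning_tags", "title", "date", "author"],
            separators=(",", ":"),
        ),
    }
    # urlencode percent-encodes the brackets and quotes in the JSON list
    return BASE.format(story_id=story_id) + "?" + urlencode(params)

# 12345 is a placeholder, not a real story ID
print(build_pdf_url("12345"))
```

Building the query from a dict makes it easier to tweak things like the font size or which sections end up in the PDF without wrestling with escaped quotes.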

And that I did. Of course, in the age of LLMs (Large Language Models), it was easy to spin up a script that helps with this. I used Claude 3.5 Sonnet and asked it to create a script that goes through the list of story IDs and downloads the PDFs. As a bonus feature, to make it faster, I asked the LLM to make it run with multithreading. Man, LLMs make these sorts of tasks way, way easier!

Here’s the code snippet for it:

import requests
import os
from urllib.parse import urlparse, unquote
from datetime import datetime
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def clean_filename(filename):
    filename = filename.strip('"\'')
    if '; filename*=' in filename:
        filename = filename.split('; filename*=')[0]
    filename = unquote(filename)
    unsafe_chars = '<>:"/\\|?*!\''
    for char in unsafe_chars:
        filename = filename.replace(char, '')
    filename = filename.replace(' ', '_')
    return filename.strip()

def download_pdf(url, story_id, pbar):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Cookie': '_session_id=[[ YOUR SESSION ID HERE ]]'
        }

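        # Get response (the 60 s timeout is an arbitrary choice; it keeps a stalled request from hanging a worker thread)
        response = requests.get(url, headers=headers, allow_redirects=True, timeout=60)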
        response.raise_for_status()

        # Try to get filename from Content-Disposition header
        filename = None
        if 'Content-Disposition' in response.headers:
            content_disposition = response.headers['Content-Disposition']
            if 'filename=' in content_disposition:
                filename = content_disposition.split('filename=')[-1].split(';')[0]
                filename = clean_filename(filename)

        if not filename:
            path = urlparse(unquote(url)).path
            filename = os.path.basename(path)

        if not filename or filename == '':
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"{story_id}_document_{timestamp}.pdf"

        # Add story ID to filename before the extension (unless it is already there)
        base_name, ext = os.path.splitext(filename)
        if not base_name.startswith(f"{story_id}_"):
            filename = f"{story_id}_{base_name}{ext}"

        if not filename.lower().endswith('.pdf'):
            filename += '.pdf'

        downloads_dir = 'downloads'
        # exist_ok avoids a race when several threads try to create the folder at once
        os.makedirs(downloads_dir, exist_ok=True)

        filepath = os.path.join(downloads_dir, filename)

        # Write the content to file
        with open(filepath, 'wb') as f:
            f.write(response.content)

        # Verify file size
        if os.path.getsize(filepath) == 0:
            print(f"Warning: Downloaded file is empty: {filename}")
            pbar.update(1)
            return None

        pbar.update(1)
        return filepath

    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")
        pbar.update(1)
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        pbar.update(1)
        return None

def process_id(id, pbar):
    url = f"https://app.storypark.com/stories/{id}/print_preview.pdf?pdf=true&layout_type=default&fontsize=15&image_padding=0&text_width=426&media_width=284&filtered_options=[\"story_content\",\"comments\",\"learning_tags\",\"title\",\"date\",\"author\"]"
    
    downloaded_file = download_pdf(url, id, pbar)
    if downloaded_file:
        file_size = os.path.getsize(downloaded_file)
        print(f"Downloaded: {downloaded_file} ({file_size} bytes)")
    else:
        print(f"Failed to download story ID: {id}")

if __name__ == "__main__":
    # Read all IDs first, skipping blank lines
    with open('ids_sorted', 'r') as f:
        ids = [line.strip() for line in f if line.strip()]

    # Number of concurrent downloads
    max_workers = 10  # Adjust this number based on your system and the server's limits

    # Create progress bar
    with tqdm(total=len(ids), desc="Downloading PDFs") as pbar:
        # Create thread pool and run downloads
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Create partial function with progress bar
            download_with_progress = partial(process_id, pbar=pbar)
            # Execute downloads; list() consumes the results so an exception raised in a worker surfaces here
            list(executor.map(download_with_progress, ids))

The script reads the story IDs from a file named ids_sorted, one ID per line. Make sure to replace [[ YOUR SESSION ID HERE ]] with your actual session ID from StoryPark (the value of the _session_id cookie). I collected the IDs from the website by hand, but I guess you could also use some scripting-fu to automate that!
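
For the scripting-fu part, here is a minimal sketch of one way it could go, not what I actually did. It assumes you have saved the HTML of the page that lists the stories, and that the story links in it look like /stories/<digits>, which matches the URL pattern in the script above. The function name extract_story_ids and the sample HTML are made up for illustration.

```python
import re

# Assumption: story links contain /stories/<numeric id>, as in the PDF URL above
STORY_LINK = re.compile(r"/stories/(\d+)")

def extract_story_ids(html):
    """Return the unique story IDs found in html, sorted numerically."""
    unique_ids = set(STORY_LINK.findall(html))
    return sorted(unique_ids, key=int)

# Made-up sample markup, standing in for a saved StoryPark page
sample = (
    '<a href="/stories/222">Painting day</a> '
    '<a href="/stories/111">Sandpit fun</a> '
    '<a href="/stories/111">Sandpit fun (again)</a>'
)

# One ID per line, the same layout the download script expects in ids_sorted
print("\n".join(extract_story_ids(sample)))
```

Writing that output to ids_sorted would give the download script its input. If the story list loads more entries as you scroll, the saved HTML only contains whatever was loaded at the time you saved it.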

If you have any questions or need further assistance, feel free to reach out!