StoryPark: Download All the Stories!
One of our kids goes to a place that uses a platform called StoryPark, where the teachers share stories about what the kids did at school. It’s been a year already and apparently there are around 195 stories (±5 days × ±48 weeks). The platform allows us parents to download the stories as PDFs, but if we were to do that manually, I don’t know man, it would probably have taken me days.
Solution: find the PDF-generating endpoint and the story IDs, then automate it using Python or your preferred scripting language!
And that I did. Of course, in the age of LLMs (Large Language Models), it was easy to spin up a script to help with this. I used Claude 3.5 Sonnet for this task and asked it to create a script that goes through the list of story IDs and downloads the PDFs. As a bonus feature to make it faster, I asked the LLM to make it run with multithreading. Man, LLMs make these sorts of tasks way, way easier!
Here’s the code snippet for it:
import requests
import os
import re
from urllib.parse import urlparse, unquote
from datetime import datetime
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def clean_filename(filename):
    filename = filename.strip('"\'')
    if '; filename*=' in filename:
        filename = filename.split('; filename*=')[0]
    filename = unquote(filename)
    unsafe_chars = '<>:"/\\|?*!\''
    for char in unsafe_chars:
        filename = filename.replace(char, '')
    filename = filename.replace(' ', '_')
    return filename.strip()

def download_pdf(url, story_id, pbar):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Cookie': '_session_id=[[ YOUR SESSION ID HERE ]]'
        }
        # Get response (the timeout keeps a stalled request from hanging a worker thread)
        response = requests.get(url, headers=headers, allow_redirects=True, timeout=60)
        response.raise_for_status()
        # Try to get filename from Content-Disposition header
        filename = None
        if 'Content-Disposition' in response.headers:
            content_disposition = response.headers['Content-Disposition']
            if 'filename=' in content_disposition:
                filename = content_disposition.split('filename=')[-1].split(';')[0]
                filename = clean_filename(filename)
        # Fall back to the URL path, then to a timestamped name
        if not filename:
            path = urlparse(unquote(url)).path
            filename = os.path.basename(path)
            if not filename or filename == '':
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                filename = f"{story_id}_document_{timestamp}.pdf"
        # Add story ID to filename before the extension
        base_name, ext = os.path.splitext(filename)
        filename = f"{story_id}_{base_name}{ext}"
        if not filename.lower().endswith('.pdf'):
            filename += '.pdf'
        # exist_ok avoids a race when several threads create the folder at once
        downloads_dir = 'downloads'
        os.makedirs(downloads_dir, exist_ok=True)
        filepath = os.path.join(downloads_dir, filename)
        # Write the content to file
        with open(filepath, 'wb') as f:
            f.write(response.content)
        # Verify file size
        if os.path.getsize(filepath) == 0:
            print(f"Warning: Downloaded file is empty: {filename}")
            pbar.update(1)
            return None
        pbar.update(1)
        return filepath
    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")
        pbar.update(1)
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        pbar.update(1)
        return None

def process_id(id, pbar):
    url = f"https://app.storypark.com/stories/{id}/print_preview.pdf?pdf=true&layout_type=default&fontsize=15&image_padding=0&text_width=426&media_width=284&filtered_options=[\"story_content\",\"comments\",\"learning_tags\",\"title\",\"date\",\"author\"]"
    downloaded_file = download_pdf(url, id, pbar)
    if downloaded_file:
        file_size = os.path.getsize(downloaded_file)
        print(f"Downloaded: {downloaded_file} ({file_size} bytes)")
    else:
        print(f"Failed to download story ID: {id}")

if __name__ == "__main__":
    # Read all IDs first
    with open('ids_sorted', 'r') as f:
        ids = [line.strip() for line in f.readlines()]
    # Number of concurrent downloads
    max_workers = 10  # Adjust this number based on your system and the server's limits
    # Create progress bar
    with tqdm(total=len(ids), desc="Downloading PDFs") as pbar:
        # Create thread pool and run downloads
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Create partial function with progress bar
            download_with_progress = partial(process_id, pbar=pbar)
            # Execute downloads
            executor.map(download_with_progress, ids)
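One thing the script doesn't do is keep track of which IDs failed, so with ~195 stories you have to scroll through the output to find them. Here's a rough sketch of how the thread pool part could collect that instead. `run_all` and `worker` are names I made up for illustration; it assumes the worker returns the file path on success and `None` on failure (like `download_pdf` does), which `process_id` as written does not.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(ids, worker, max_workers=10):
    """Run worker(story_id) for every ID in a thread pool.

    Hypothetical helper: assumes worker returns a truthy value (e.g. the
    saved file path) on success and None on failure.
    Returns two lists: (succeeded_ids, failed_ids).
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input IDs,
        # so zipping them back together pairs each ID with its own result
        for story_id, result in zip(ids, executor.map(worker, ids)):
            if result:
                succeeded.append(story_id)
            else:
                failed.append(story_id)
    return succeeded, failed
```

The failed list can then be written to a file and fed back into the script for a second pass, for example after refreshing an expired session ID.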
The script reads the story IDs from a file named ids_sorted, one ID per line. Make sure to replace [[ YOUR SESSION ID HERE ]] with your actual session ID from StoryPark. I scraped the IDs from the website manually, but I guess you could also use some scripting-fu to automate that!
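For the scripting-fu part, here's a minimal sketch of one way it could go: save the stories page from the browser and pull the IDs out with a regex. This assumes the saved page contains links of the form /stories/<numeric id> (the same pattern as the PDF URL in the script); I haven't checked what StoryPark's markup actually looks like, and `extract_story_ids`, `write_ids` and `saved_page.html` are made-up names.

```python
import re

# Assumption: story links look like /stories/<numeric id>,
# matching the URL pattern the download script uses.
STORY_LINK = re.compile(r"/stories/(\d+)")


def extract_story_ids(html):
    """Return the unique story IDs found in the text, sorted numerically."""
    unique_ids = {int(story_id) for story_id in STORY_LINK.findall(html)}
    return [str(story_id) for story_id in sorted(unique_ids)]


def write_ids(ids, path="ids_sorted"):
    """Write one ID per line, the format the download script reads."""
    with open(path, "w") as f:
        f.writelines(f"{story_id}\n" for story_id in ids)


# Example usage (saved_page.html is a hypothetical file saved from the browser):
# with open("saved_page.html", encoding="utf-8") as f:
#     write_ids(extract_story_ids(f.read()))
```

If the stories list loads more entries as you scroll, the saved page will only contain whatever was loaded at the time, so scroll to the bottom before saving.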
If you have any questions or need further assistance, feel free to reach out!