2.0-alpha1: tags dropped, favs dropped, bookmarks dropped, reposts dropped, better async rendering; TODO comments, websub pings, webmentions
parent 112448cf92
commit 4a699ef9f5

7 changed files with 1495 additions and 2234 deletions
README.md — 168 lines changed

The old eight-line README:

```
# NASG: Not Another Statig Generator...

So I ended up writing my static generator and this is (most) of the code for it.

Don't expect anything fancy and please be aware that my Python Fu has much to learn.

I've written about the generic ideas and approaches here in my
[Going Static](https://petermolnar.net/going-static/) entry.
```

was replaced with the following:
# NASG (Not Another Static Generator)

This is a tiny static site generator, written in Python, to scratch my own itches.
It is most probably not suitable for anyone else.

## Why not [insert static generator here]?

- DRY - Don't Repeat Yourself - is good, so instead of side files for images, I'm using XMP metadata, which most of the generators available don't handle well;
- writing a proper plugin for an existing generator - Pelican, Nikola, etc. - might have taken longer, and I wanted to extend my Python knowledge;
- I wanted to use the best available utilities for some tasks, like `Pandoc` and `exiftool`, instead of Python libraries trying to achieve the same;
- I needed to handle webmentions and comments.

Don't expect anything fancy: my Python Fu has much to learn.
## How content is organized

The directory structure of the "source" is something like this:

```
├── content
│   ├── category1 (containing YAML + MD files)
│   ├── category2 (containing YAML + MD files)
│   ├── photo (containing jpg files)
│   ├── _category_excluded_from_listing_1 (containing YAML + MD files)
├── files
│   ├── image (my own pictures)
│   ├── photo -> ../content/photo
│   └── pic (random images)
├── nasg
│   ├── archive.py
│   ├── config.ini
│   ├── db.py
│   ├── LICENSE
│   ├── nasg.py
│   ├── README.md
│   ├── requirements.txt
│   ├── router.py
│   ├── shared.py
│   └── templates
├── static
│   ├── favicon.ico
│   ├── favicon.png
│   └── pgp.asc
└── var
    ├── gone.tsv
    ├── redirects.tsv
    ├── s.sqlite
    ├── tokens.json
    └── webmention.sqlite
```
Content files are either YAML front matter + Markdown, with an `.md` extension, or JPG with embedded metadata, with a `.jpg` extension.
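For illustration, a minimal `.md` content file could look like this - the exact front-matter keys are whatever `nasg.py` expects, so the field names below are only an assumption:

```
---
title: An example entry
published: 2017-05-27T10:00:00+00:00
tags:
- example
---

The Markdown body of the entry goes here.
```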
Inline images in the content are checked against all subdirectories in `files`; their EXIF data is read and displayed as well if the Artist and/or Copyright EXIF fields match the regex set in the configuration.

`gone.tsv` is a simple list of URIs that should return a `410 Gone` response, while `redirects.tsv` is a tab-separated file of `from to` entries that should be `301` redirected. Both end up in a magic.php file, so if the host supports executing PHP, it will take care of them.
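As a sketch, the two files could look like this (the URIs are made up; the separator in `redirects.tsv` is a tab):

```
# gone.tsv - one URI per line
/some-deleted-entry/
/drafts/never-published/

# redirects.tsv - "from to" pairs
/old-slug        /new-slug/
/feed            /category/article/feed/
```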
## Output

`nasg.py` generates a `build` directory which has a directory per entry, each with an `index.html`, so URLs can take the form `https://domain.com/filename/`.

Categories are rendered into `category/category_name`; pagination goes under `category/category_name/page/X`. Each category includes a feed as well, at `category/category_name/feed`, in the form of an `index.atom` ATOM feed.
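So for a hypothetical `example-entry` in `category1`, the relevant part of `build` would look roughly like this (assuming `magic.php` lands in the build root):

```
build
├── example-entry
│   └── index.html
├── category1
│   ├── index.html
│   ├── feed
│   │   └── index.atom
│   └── page
│       └── 2
│           └── index.html
└── magic.php
```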
## Webserver configuration

A minimal nginx configuration for the virtualhost:
```
# --- Virtual Host ---
upstream {{ domain }} {
    server unix:/var/run/php/{{ domain }}.sock;
}

server {
    listen 80;
    server_name .{{ domain }};
    rewrite ^ https://$server_name$request_uri redirect;
    access_log /dev/null;
    error_log /dev/null;
}

server {
    listen 443 ssl http2;
    server_name .{{ domain }};

    ssl_certificate /etc/letsencrypt/live/{{ domain }}/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/{{ domain }}/privkey.pem;
    ssl_dhparam dh.pem;

    add_header X-Frame-Options "SAMEORIGIN";
    add_header X-Content-Type-Options "nosniff";
    add_header X-XSS-Protection "1; mode=block";
    add_header Strict-Transport-Security "max-age=31536000; includeSubdomains;";

    root /[path to root]/{{ domain }};

    location = /favicon.ico {
        log_not_found off;
        access_log off;
    }

    location = /robots.txt {
        log_not_found off;
        access_log off;
    }

    location ~ ^(?<script_name>.+?\.php)(?<path_info>.*)$ {
        try_files $uri $script_name =404;

        fastcgi_param SCRIPT_FILENAME $document_root$script_name;
        fastcgi_param SCRIPT_NAME $script_name;
        fastcgi_param PATH_INFO $path_info;
        fastcgi_param PATH_TRANSLATED $document_root$path_info;
        fastcgi_param QUERY_STRING $query_string;
        fastcgi_param REQUEST_METHOD $request_method;
        fastcgi_param CONTENT_TYPE $content_type;
        fastcgi_param CONTENT_LENGTH $content_length;
        fastcgi_param REQUEST_URI $request_uri;
        fastcgi_param DOCUMENT_URI $document_uri;
        fastcgi_param DOCUMENT_ROOT $document_root;
        fastcgi_param SERVER_PROTOCOL $server_protocol;
        fastcgi_param GATEWAY_INTERFACE CGI/1.1;
        fastcgi_param SERVER_SOFTWARE nginx;
        fastcgi_param REMOTE_ADDR $remote_addr;
        fastcgi_param REMOTE_PORT $remote_port;
        fastcgi_param SERVER_ADDR $server_addr;
        fastcgi_param SERVER_PORT $server_port;
        fastcgi_param SERVER_NAME $server_name;
        fastcgi_param HTTP_PROXY "";
        fastcgi_param HTTPS $https if_not_empty;
        fastcgi_param SSL_PROTOCOL $ssl_protocol if_not_empty;
        fastcgi_param SSL_CIPHER $ssl_cipher if_not_empty;
        fastcgi_param SSL_SESSION_ID $ssl_session_id if_not_empty;
        fastcgi_param SSL_CLIENT_VERIFY $ssl_client_verify if_not_empty;
        fastcgi_param REDIRECT_STATUS 200;

        fastcgi_index index.php;
        fastcgi_connect_timeout 10;
        fastcgi_send_timeout 360;
        fastcgi_read_timeout 3600;
        fastcgi_buffer_size 512k;
        fastcgi_buffers 512 512k;
        fastcgi_keep_conn on;
        fastcgi_intercept_errors on;
        fastcgi_split_path_info ^(?<script_name>.+?\.php)(?<path_info>.*)$;
        fastcgi_pass {{ domain }};
    }

    location / {
        try_files $uri $uri/ $uri.html $uri/index.html $uri/index.xml $uri/index.atom index.php @rewrites;
    }

    location @rewrites {
        rewrite ^ /magic.php?$args last;
    }

    location ~* \.(css|js|eot|woff|ttf|woff2)$ {
        expires 1d;
        add_header Cache-Control "public, must-revalidate, proxy-revalidate";
        add_header "Vary" "Accept-Encoding";
    }

    location ~* \.(png|ico|gif|svg|jpg|jpeg|webp|avi|mpg|mpeg|mp4|mp3)$ {
        expires 7d;
        add_header Cache-Control "public, must-revalidate, proxy-revalidate";
        add_header "Vary" "Accept-Encoding";
    }
}
```
archive.py — 69 lines changed

```
@@ -5,14 +5,16 @@ import glob
 import logging
 import shutil
 import subprocess
+import imghdr
 import arrow

+from pprint import pprint

 from requests_oauthlib import OAuth1Session, oauth1_session, OAuth2Session, oauth2_session
 from oauthlib.oauth2 import BackendApplicationClient
+import db
 import shared


 class Favs(object):
     def __init__(self, confgroup):
         self.confgroup = confgroup

@@ -101,6 +103,7 @@ class FlickrFavs(Favs):
             fav = FlickrFav(photo)
             if not fav.exists:
                 fav.run()
+                #fav.fix_extension()

 class FivehpxFavs(Favs):
     def __init__(self):

@@ -179,6 +182,7 @@ class FivehpxFavs(Favs):
             fav = FivehpxFav(photo)
             if not fav.exists:
                 fav.run()
+                #fav.fix_extension()


 class TumblrFavs(Favs):

@@ -242,7 +246,7 @@ class DAFavs(Favs):
             'https://www.deviantart.com/api/v1/oauth2/collections/folders',
             params={
                 'username': self.username,
-                'calculate_size': 'false',
+                'calculate_size': 'true',
                 'ext_preload': 'false',
                 'mature_content': 'true'
             }

@@ -304,29 +308,29 @@ class DAFavs(Favs):
         has_more = self.has_more(js.get('has_more'))
         offset = js.get('next_offset')
         while True == has_more:
-            logging.info('iterating over DA results with offset %d', offset)
+            #logging.info('iterating over DA results with offset %d', offset)
             paged = self.getpaged(offset)
             new = paged.get('results', [])
             if not len(new):
                 #logging.error('empty results from deviantART, breaking loop')
                 break
-            favs = favs + new
+            favs = [*favs, *new]
             has_more = self.has_more(paged.get('has_more'))
             if not has_more:
                 break
             n = int(paged.get('next_offset'))
             if not n:
                 break
-            offset = offset + n
+            offset = n

         self.favs = favs
         for fav in self.favs:
             f = DAFav(fav)
-            if f.exists:
-                continue
+            if not f.exists:
                 f.fav.update({'meta': self.getsinglemeta(fav.get('deviationid'))})
                 f.run()
+                #f.fix_extension()


 class ImgFav(object):
     def __init__(self):

@@ -349,7 +353,19 @@ class ImgFav(object):
     @property
     def exists(self):
-        return os.path.exists(self.target)
+        maybe = glob.glob(self.target.replace('.jpg', '.*'))
+        if len(maybe):
+            return True
+        return False
+
+    def fix_extension(self):
+        # identify file format
+        what = imghdr.what(self.target)
+        # rename file
+        new = self.target.replace('.jpg', '.%s' % what)
+        if new != self.target:
+            shutil.move(self.target, new)
+            self.target = new

     def pull_image(self):
         logging.info("pulling image %s to %s", self.imgurl, self.target)

@@ -359,8 +375,11 @@ class ImgFav(object):
             r.raw.decode_content = True
             shutil.copyfileobj(r.raw, f)

     def write_exif(self):
+        what = imghdr.what(self.target)
+        if 'jpg' != what or 'png' != what:
+            return
+
         logging.info('populating EXIF data of %s' % self.target)
         tags = list(set(self.meta.get('tags',[])))
         dt = self.meta.get('dt').to('utc')

@@ -387,7 +406,7 @@ class ImgFav(object):
         params = [
             'exiftool',
             '-overwrite_original',
-            '-EXIF:Artist=%s' % author_name[:64],
+            #'-EXIF:Artist=%s' % author_name[:64],
             '-XMP:Copyright=Copyright %s %s (%s)' % (
                 dt.format('YYYY'),
                 author_name,

@@ -501,6 +520,7 @@ class FlickrFav(ImgFav):
             self.photo.get('description', {}).get('_content', '')
         )

+        self.fix_extension()
         self.write_exif()

 class FivehpxFav(ImgFav):

@@ -546,12 +566,14 @@ class FivehpxFav(ImgFav):
         }
         c = "%s" % self.photo.get('description', '')
         self.content = shared.Pandoc('plain').convert(c)
+        self.fix_extension()
         self.write_exif()

 class DAFav(ImgFav):
     def __init__(self, fav):
         self.fav = fav
         self.deviationid = fav.get('deviationid')
+        #logging.info('working on %s', self.deviationid)
         self.url = fav.get('url')
         self.title = fav.get('title', False) or self.deviationid
         self.author = self.fav.get('author').get('username')

@@ -562,9 +584,21 @@ class DAFav(ImgFav):
                 shared.slugfname(self.author)
             )
         )
+        self.imgurl = None
+        if 'content' in fav:
+            if 'src' in fav['content']:
+                self.imgurl = fav.get('content').get('src')
+        elif 'preview' in fav:
+            if 'src' in fav['preview']:
+                self.imgurl = fav.get('preview').get('src')
         self.imgurl = fav.get('content', {}).get('src')

     def run(self):
+        if not self.imgurl:
+            logging.error('imgurl is empty for deviantart %s', self.deviationid)
+            return
+
         self.pull_image()

         self.meta = {

@@ -583,6 +617,7 @@ class DAFav(ImgFav):
         }
         c = "%s" % self.fav.get('meta', {}).get('description', '')
         self.content = shared.Pandoc('plain').convert(c)
+        self.fix_extension()
         self.write_exif()


@@ -600,7 +635,10 @@ class TumblrFav(object):
     @property
     def exists(self):
-        return os.path.exists(self.target.replace('.jpg', '_0.jpg'))
+        maybe = glob.glob(self.target.replace('.jpg', '_0.*'))
+        if len(maybe):
+            return True
+        return False

     def run(self):
         content = "%s" % self.like.get('caption', '')

@@ -635,6 +673,7 @@ class TumblrFav(object):
             img.content = content
             img.meta = meta
             img.pull_image()
+            img.fix_extension()
             img.write_exif()
             icntr = icntr + 1

@@ -681,7 +720,7 @@ class Oauth1Flow(object):
         self.service = service
         self.key = shared.config.get("api_%s" % service, 'api_key')
         self.secret = shared.config.get("api_%s" % service, 'api_secret')
-        self.tokendb = shared.TokenDB()
+        self.tokendb = db.TokenDB()
         self.t = self.tokendb.get_service(self.service)
         self.oauth_init()

@@ -796,7 +835,7 @@ class TumblrOauth(Oauth1Flow):
 if __name__ == '__main__':
-    logging.basicConfig(level=10)
+    logging.basicConfig(level=20)

     flickr = FlickrFavs()
     flickr.run()
```
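The recurring `exists` / `fix_extension` pattern above boils down to: glob for the file regardless of extension, sniff the real format with `imghdr`, and rename if needed. A standalone sketch of the same idea, with hypothetical file names:

```
import glob
import imghdr
import shutil

def fix_extension(target):
    """Rename a downloaded image so its extension matches its actual format."""
    what = imghdr.what(target)  # e.g. 'jpeg', 'png', 'gif'
    if not what:
        return target
    new = "%s.%s" % (target.rsplit('.', 1)[0], what)
    if new != target:
        shutil.move(target, new)
    return new

def exists(target):
    """True if the favourite was already saved under any extension."""
    return bool(glob.glob("%s.*" % target.rsplit('.', 1)[0]))
```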
db.py — new file, 234 lines

```
import os
import json
import sqlite3
import glob
import shared

# TODO sqlite3 cache instead of filesystem ?

class TokenDB(object):
    def __init__(self, uuid='tokens'):
        self.db = shared.config.get('var', 'tokendb')
        self.tokens = {}
        self.refresh()

    def refresh(self):
        self.tokens = {}
        if os.path.isfile(self.db):
            with open(self.db, 'rt') as f:
                self.tokens = json.loads(f.read())

    def save(self):
        with open(self.db, 'wt') as f:
            f.write(json.dumps(
                self.tokens, indent=4, sort_keys=True
            ))

    def get_token(self, token):
        return self.tokens.get(token, None)

    def get_service(self, service):
        token = self.tokens.get(service, None)
        return token

    def set_service(self, service, tokenid):
        self.tokens.update({
            service: tokenid
        })
        self.save()

    def update_token(self,
                     token,
                     oauth_token_secret=None,
                     access_token=None,
                     access_token_secret=None,
                     verifier=None):
        t = self.tokens.get(token, {})
        if oauth_token_secret:
            t.update({
                'oauth_token_secret': oauth_token_secret
            })
        if access_token:
            t.update({
                'access_token': access_token
            })
        if access_token_secret:
            t.update({
                'access_token_secret': access_token_secret
            })
        if verifier:
            t.update({
                'verifier': verifier
            })

        self.tokens.update({
            token: t
        })
        self.save()

    def clear(self):
        self.tokens = {}
        self.save()

    def clear_service(self, service):
        t = self.tokens.get(service)
        if t:
            del(self.tokens[t])
        del(self.tokens[service])
        self.save()


class SearchDB(object):
    tmplfile = 'Search.html'

    def __init__(self):
        self.db = sqlite3.connect(
            "%s" % shared.config.get('var', 'searchdb')
        )
        cursor = self.db.cursor()
        cursor.execute('''CREATE VIRTUAL TABLE IF NOT EXISTS data USING FTS5(
            id,
            corpus,
            mtime,
            url,
            category,
            title
        )''')
        self.db.commit()

    def __exit__(self):
        self.finish()

    def finish(self):
        self.db.close()

    def append(self, id, corpus, mtime, url, category, title):
        mtime = int(mtime)
        cursor = self.db.cursor()
        cursor.execute('''UPDATE data SET corpus=?, mtime=?, url=?, category=?, title=? WHERE id=?;''', (
            corpus,
            mtime,
            url,
            category,
            title,
            id
        ))
        cursor.execute('''INSERT OR IGNORE INTO data (id, corpus, mtime, url, category, title) VALUES (?,?,?,?,?,?);''', (
            id,
            corpus,
            mtime,
            url,
            category,
            title
        ))
        self.db.commit()

    def is_uptodate(self, fname, mtime):
        ret = {}
        cursor = self.db.cursor()
        cursor.execute('''SELECT mtime
            FROM data
            WHERE id = ? AND mtime = ?''',
            (fname, mtime)
        )
        rows = cursor.fetchall()
        if len(rows):
            return True
        return False

    def search_by_query(self, query):
        ret = {}
        cursor = self.db.cursor()
        cursor.execute('''SELECT
            id, category, url, title, highlight(data, 0, '<strong>', '</strong>') corpus
            FROM data
            WHERE data MATCH ?
            ORDER BY category, rank;''', (query,))
        rows = cursor.fetchall()
        for r in rows:
            r = {
                'id': r[0],
                'category': r[1],
                'url': r[2],
                'title': r[3],
                'txt': r[4],
            }
            category = r.get('category')
            if category not in ret:
                ret.update({category: {}})

            maybe_fpath = os.path.join(
                shared.config.get('dirs', 'content'),
                category,
                "%s.*" % r.get('id')
            )
            #fpath = glob.glob(maybe_fpath).pop()
            ret.get(category).update({
                r.get('id'): {
                    #'fpath': fpath,
                    'url': r.get('url'),
                    'title': r.get('title'),
                    'txt': r.get('txt')
                }
            })
        return ret

    def cli(self, query):
        results = self.search_by_query(query)
        for c, items in sorted(results.items()):
            print("%s:" % c)
            for fname, data in sorted(items.items()):
                print("  %s" % data.get('fpath'))
                print("  %s" % data.get('url'))
                print("")

    def html(self, query):
        tmplvars = {
            'results': self.search_by_query(query),
            'term': query
        }
        return shared.j2.get_template(self.tmplfile).render(tmplvars)


class WebmentionQueue(object):
    def __init__(self):
        self.db = sqlite3.connect(
            "%s" % shared.config.get('var', 'webmentiondb')
        )
        cursor = self.db.cursor()
        cursor.execute('''CREATE TABLE IF NOT EXISTS `archive` (
            `id` INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE,
            `received` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
            `processed` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
            `source` TEXT NOT NULL,
            `target` TEXT NOT NULL
        );''')
        cursor.execute('''CREATE TABLE IF NOT EXISTS `queue` (
            `id` INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE,
            `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
            `source` TEXT NOT NULL,
            `target` TEXT NOT NULL
        );''')
        self.db.commit()

    def __exit__(self):
        self.finish()

    def finish(self):
        self.db.close()

    def queue(self, source, target):
        cursor = self.db.cursor()
        cursor.execute(
            '''INSERT INTO queue (source,target) VALUES (?,?);''', (
                source,
                target
            )
        )
        self.db.commit()
```
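A minimal sketch of how these classes are meant to be used - the paths come from `config.ini`, so this assumes a working configuration, and the URLs are placeholders:

```
import db

# queue an incoming webmention for processing at site generation time
wq = db.WebmentionQueue()
wq.queue('https://example.com/a-reply/', 'https://example.net/an-entry/')
wq.finish()

# add a document to the FTS5 index, then query it
s = db.SearchDB()
s.append(id='an-entry', corpus='full text of the entry', mtime=1495700000,
         url='https://example.net/an-entry/', category='article', title='An entry')
print(s.search_by_query('entry'))
s.finish()
```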
requirements.txt

```
@@ -1,30 +1,8 @@
-aiofiles==0.3.1
-appdirs==1.4.3
 arrow==0.10.0
-breadability==0.1.20
-chardet==3.0.3
-decorator==4.0.11
-docopt==0.6.2
-httptools==0.0.9
 Jinja2==2.9.6
 langdetect==1.0.7
-lxml==3.7.3
-MarkupSafe==1.0
-packaging==16.8
-pyparsing==2.2.0
-python-dateutil==2.6.0
-python-frontmatter==0.4.2
-python-magic==0.4.13
-PyYAML==3.12
-requests==2.14.2
-sanic==0.5.4
-similar-text==0.2.0
-six==1.10.0
-ujson==1.35
+requests==2.12.4
+requests-oauthlib==0.8.0
+sanic==0.6.0
 unicode-slugify==0.1.3
-Unidecode==0.4.20
-uvloop==0.8.0
-validators==0.11.3
 Wand==0.4.4
-websockets==3.3
-Whoosh==2.7.4
```
router.py — new file, 86 lines

```
#!/usr/bin/env python3

#import asyncio
#import uvloop
from sanic import Sanic
import sanic.response
import logging
import db
import shared
import validators
import urllib.parse

if __name__ == '__main__':
    logging_format = "[%(asctime)s] %(process)d-%(levelname)s "
    logging_format += "%(module)s::%(funcName)s():l%(lineno)d: "
    logging_format += "%(message)s"

    logging.basicConfig(
        format=logging_format,
        level=logging.DEBUG
    )
    log = logging.getLogger()

    # log_config=None prevents creation of access_log and error_log files
    # since I'm running this from systemctl it already goes into syslog
    app = Sanic('router', log_config=None)
    # this is ok to be read-only
    sdb = db.SearchDB()

    @app.route("/oauth1", methods=["GET"])
    async def oauth1(request):
        token = request.args.get('oauth_token')
        verifier = request.args.get('oauth_verifier')
        tokendb = shared.TokenDB()
        tokendb.update_token(
            token,
            verifier=verifier
        )
        return sanic.response.text("OK", status=200)

    @app.route("/search", methods=["GET"])
    async def search(request):
        query = request.args.get('s')
        r = sdb.html(query)
        response = sanic.response.html(r, status=200)
        return response

    @app.route("/micropub", methods=["POST", "GET"])
    async def micropub(request):
        return sanic.response.text("Not Implemented", status=501)

    @app.route("/webmention", methods=["POST"])
    async def webmention(request):
        source = request.form.get('source')
        target = request.form.get('target')

        # validate urls
        if not validators.url(source):
            return sanic.response.text('Invalide source url', status=400)
        if not validators.url(target):
            return sanic.response.text('Invalide target url', status=400)

        # check if our site is actually the target for the webmention
        _target = urllib.parse.urlparse(target)
        if _target.hostname not in shared.config.get('site', 'domains'):
            return sanic.response.text('target domain is not me', status=400)

        # ignore selfpings
        _source = urllib.parse.urlparse(source)
        if _source.hostname in shared.config.get('site', 'domains'):
            return sanic.response.text('selfpings are not allowed', status=400)

        # it is unfortunate that I need to init this every time, but
        # otherwise it'll become read-only for reasons I'm yet to grasp
        # the actual parsing will be done at site generation time
        wdb = db.WebmentionQueue()
        wdb.queue(source, target)
        response = sanic.response.text("Accepted", status=202)
        return response

    app.run(host="127.0.0.1", port=8008, log_config=None)
```
shared.py — 424 lines changed

```
@@ -5,131 +5,10 @@ import glob
 import logging
 import subprocess
 import json
-import requests
-from urllib.parse import urlparse, urlunparse
-
-from whoosh import fields
-from whoosh import analysis
+import sqlite3
 from slugify import slugify
+import jinja2

-LLEVEL = {
-    'critical': 50,
-    'error': 40,
-    'warning': 30,
-    'info': 20,
-    'debug': 10
-}
-
-
-def __expandconfig(config):
-    """ add the dirs to the config automatically """
-    basepath = os.path.expanduser(config.get('common','base'))
-    config.set('common', 'basedir', basepath)
-    for section in ['source', 'target']:
-        for option in config.options(section):
-            opt = config.get(section, option)
-            config.set(section, "%sdir" % option, os.path.join(basepath,opt))
-    config.set('target', 'filesdir', os.path.join(
-        config.get('target', 'builddir'),
-        config.get('source', 'files'),
-    ))
-    config.set('target', 'commentsdir', os.path.join(
-        config.get('target', 'builddir'),
-        config.get('site', 'commentspath'),
-    ))
-    return config
-
-
-def baseN(num, b=36, numerals="0123456789abcdefghijklmnopqrstuvwxyz"):
-    """ Used to create short, lowercase slug for a number (an epoch) passed """
-    num = int(num)
-    return ((num == 0) and numerals[0]) or (
-        baseN(
-            num // b,
-            b,
-            numerals
-        ).lstrip(numerals[0]) + numerals[num % b]
-    )
-
-
-def slugfname(url):
-    return "%s" % slugify(
-        re.sub(r"^https?://(?:www)?", "", url),
-        only_ascii=True,
-        lower=True
-    )[:200]
-
-
-ARROWISO = 'YYYY-MM-DDTHH:mm:ssZ'
-STRFISO = '%Y-%m-%dT%H:%M:%S%z'
-
-URLREGEX = re.compile(
-    r'\s+https?\:\/\/?[a-zA-Z0-9\.\/\?\:@\-_=#]+'
-    r'\.[a-zA-Z0-9\.\/\?\:@\-_=#]*'
-)
-
-EXIFREXEG = re.compile(
-    r'^(?P<year>[0-9]{4}):(?P<month>[0-9]{2}):(?P<day>[0-9]{2})\s+'
-    r'(?P<time>[0-9]{2}:[0-9]{2}:[0-9]{2})$'
-)
-
-MDIMGREGEX = re.compile(
-    r'(!\[(.*)\]\((?:\/(?:files|cache)'
-    r'(?:\/[0-9]{4}\/[0-9]{2})?\/(.*\.(?:jpe?g|png|gif)))'
-    r'(?:\s+[\'\"]?(.*?)[\'\"]?)?\)(?:\{(.*?)\})?)'
-    , re.IGNORECASE)
-
-schema = fields.Schema(
-    url=fields.ID(
-        stored=True,
-        unique=True
-    ),
-    category=fields.TEXT(
-        stored=True,
-    ),
-    date=fields.DATETIME(
-        stored=True,
-        sortable=True
-    ),
-    title=fields.TEXT(
-        stored=True,
-        analyzer=analysis.FancyAnalyzer()
-    ),
-    weight=fields.NUMERIC(
-        sortable=True
-    ),
-    img=fields.TEXT(
-        stored=True
-    ),
-    content=fields.TEXT(
-        stored=True,
-        analyzer=analysis.FancyAnalyzer()
-    ),
-    fuzzy=fields.NGRAMWORDS(
-        tokenizer=analysis.NgramTokenizer(4)
-    ),
-    mtime=fields.NUMERIC(
-        stored=True
-    )
-    #slug=fields.NGRAMWORDS(
-        #tokenizer=analysis.NgramTokenizer(4)
-    #),
-    #reactions=fields.NGRAMWORDS(
-        #tokenizer=analysis.NgramTokenizer(4)
-    #),
-    #tags=fields.TEXT(
-        #stored=False,
-        #analyzer=analysis.KeywordAnalyzer(
-            #lowercase=True,
-            #commas=True
-        #),
-    #),
-)
-
-config = configparser.ConfigParser(
-    interpolation=configparser.ExtendedInterpolation(),
-    allow_no_value=True
-)
-config.read('config.ini')
-config = __expandconfig(config)

 class CMDLine(object):
     def __init__(self, executable):

@@ -138,7 +17,6 @@ class CMDLine(object):
             raise OSError('No %s found in PATH!' % executable)
         return
-
     @staticmethod
     def _which(name):
         for d in os.environ['PATH'].split(':'):

@@ -148,33 +26,6 @@ class CMDLine(object):
         return None

-    def __enter__(self):
-        self.process = subprocess.Popen(
-            [self.executable, "-stay_open", "True", "-@", "-"],
-            universal_newlines=True,
-            stdin=subprocess.PIPE,
-            stdout=subprocess.PIPE,
-            stderr=subprocess.PIPE
-        )
-        return self
-
-    def __exit__(self, exc_type, exc_value, traceback):
-        self.process.stdin.write("-stay_open\nFalse\n")
-        self.process.stdin.flush()
-
-    def execute(self, *args):
-        args = args + ("-execute\n",)
-        self.process.stdin.write(str.join("\n", args))
-        self.process.stdin.flush()
-        output = ""
-        fd = self.process.stdout.fileno()
-        while not output.endswith(self.sentinel):
-            output += os.read(fd, 4096).decode('utf-8', errors='ignore')
-        return output[:-len(self.sentinel)]

 class Pandoc(CMDLine):
     """ Pandoc command line call with piped in- and output """

@@ -254,23 +105,68 @@
         return stdout.decode('utf-8').strip()


-class HeadlessChromium(CMDLine):
-    def __init__(self, url):
-        super().__init__('chromium-browser')
-        self.url = url
-
-    def get(self):
+class ExifTool(CMDLine):
+    def __init__(self, fpath):
+        self.fpath = fpath
+        super().__init__('exiftool')
+
+    @staticmethod
+    def exifdate2iso(value):
+        """ converts and EXIF date string to ISO 8601 format
+
+        :param value: EXIF date (2016:05:01 00:08:24)
+        :type arg1: str
+        :return: ISO 8601 string with UTC timezone 2016-05-01T00:08:24+0000
+        :rtype: str
+        """
+        if not isinstance(value, str):
+            return value
+        match = REGEX['exifdate'].match(value)
+        if not match:
+            return value
+        return "%s-%s-%sT%s+0000" % (
+            match.group('year'),
+            match.group('month'),
+            match.group('day'),
+            match.group('time')
+        )
+
+    def read(self):
         cmd = (
             self.executable,
-            '--headless',
-            '--disable-gpu',
-            '--disable-preconnect',
-            '--dump-dom',
-            '--timeout 60',
-            '--save-page-as-mhtml',
-            "%s" % self.url
+            '-sort',
+            '-json',
+            '-MIMEType',
+            '-FileType',
+            '-FileName',
+            '-ModifyDate',
+            '-CreateDate',
+            '-DateTimeOriginal',
+            '-ImageHeight',
+            '-ImageWidth',
+            '-Aperture',
+            '-FOV',
+            '-ISO',
+            '-FocalLength',
+            '-FNumber',
+            '-FocalLengthIn35mmFormat',
+            '-ExposureTime',
+            '-Copyright',
+            '-Artist',
+            '-Model',
+            '-GPSLongitude#',
+            '-GPSLatitude#',
+            '-LensID',
+            '-LensSpec',
+            '-Lens',
+            '-ReleaseDate',
+            '-Description',
+            '-Headline',
+            '-HierarchicalSubject',
+            self.fpath
         )
-        logging.debug('getting URL %s with headless chrome', self.url)
+        logging.debug('reading EXIF from %s', self.fpath)
         p = subprocess.Popen(
             cmd,
             stdin=subprocess.PIPE,

@@ -280,113 +176,111 @@
         stdout, stderr = p.communicate()
         if stderr:
-            logging.error(
-                "Error getting URL:\n\t%s\n\t%s",
-                cmd,
-                stderr
-            )
-        return stdout.decode('utf-8').strip()
+            logging.error("Error reading EXIF:\n\t%s\n\t%s", cmd, stderr)
+
+        exif = json.loads(stdout.decode('utf-8').strip()).pop()
+        if 'ReleaseDate' in exif and 'ReleaseTime' in exif:
+            exif['DateTimeRelease'] = "%s %s" % (exif.get('ReleaseDate'), exif.get('ReleaseTime')[:8])
+            del(exif['ReleaseDate'])
+            del(exif['ReleaseTime'])
+
+        for k, v in exif.items():
+            exif[k] = self.exifdate2iso(v)
+
+        return exif
+
+
+def __expandconfig():
+    c = configparser.ConfigParser(
+        interpolation=configparser.ExtendedInterpolation(),
+        allow_no_value=True
+    )
+    c.read('config.ini')
+    for s in c.sections():
+        for o in c.options(s):
+            curr = c.get(s, o)
+            if 'photo' == s and 'regex' == o:
+                REGEX.update({'photo': re.compile(curr)})
+            c.set(s, o, os.path.expanduser(curr))

-class wget(CMDLine):
-    def __init__(self, url, dirname=None):
-        super().__init__('wget')
-        self.url = url
-        self.slug = dirname or slugfname(self.url)
-        self.saveto = os.path.join(
-            config.get('source', 'offlinecopiesdir'),
-            self.slug
-        )
-
-    def archive(self):
-        cmd = (
-            self.executable,
-            '-e',
-            'robots=off',
-            '--timeout=360',
-            '--no-clobber',
-            '--no-directories',
-            '--adjust-extension',
-            '--span-hosts',
-            '--wait=1',
-            '--random-wait',
-            '--convert-links',
-            #'--backup-converted',
-            '--page-requisites',
-            '--directory-prefix=%s' % self.saveto,
-            "%s" % self.url
-        )
-        logging.debug('getting URL %s with wget', self.url)
-        p = subprocess.Popen(
-            cmd,
-            stdin=subprocess.PIPE,
-            stdout=subprocess.PIPE,
-            stderr=subprocess.PIPE,
-        )
-        stdout, stderr = p.communicate()
-        if stderr:
-            logging.error(
-                "Error getting URL:\n\t%s\n\t%s",
-                cmd,
-                stderr
-            )
-        return stdout.decode('utf-8').strip()
-
-def find_realurl(url):
-    headers = requests.utils.default_headers()
-    headers.update({
-        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
-    })
-    try:
-        r = requests.get(
-            url,
-            allow_redirects=True,
-            timeout=60,
-            headers=headers
-        )
-    except Exception as e:
-        logging.error('getting real url failed: %s', e)
-        return (None, 400)
-
-    finalurl = list(urlparse(r.url))
-    finalurl[4] = '&'.join(
-        [x for x in finalurl[4].split('&') if not x.startswith('utm_')])
-    finalurl = urlunparse(finalurl)
-
-    return (finalurl, r.status_code)
-
-def find_archiveorgurl(url):
-    url, status = find_realurl(url)
-    if status == requests.codes.ok:
-        return url
-
-    try:
-        a = requests.get(
-            "http://archive.org/wayback/available?url=%s" % url,
-        )
-    except Exception as e:
-        logging.error('Failed to fetch archive.org availability for %s' % url)
-        return None
-
-    if not a:
-        logging.error('empty archive.org availability for %s' % url)
-        return None
-
-    try:
-        a = json.loads(a.text)
-        aurl = a.get(
-            'archived_snapshots', {}
-        ).get(
-            'closest', {}
-        ).get(
-            'url', None
-        )
-        if aurl:
-            logging.debug("found %s in archive.org for %s", aurl, url)
-            return aurl
-    except Exception as e:
-        logging.error("archive.org parsing failed: %s", e)
-
-    return None
+def baseN(num, b=36, numerals="0123456789abcdefghijklmnopqrstuvwxyz"):
+    """ Used to create short, lowercase slug for a number (an epoch) passed """
+    num = int(num)
+    return ((num == 0) and numerals[0]) or (
+        baseN(
+            num // b,
+            b,
+            numerals
+        ).lstrip(numerals[0]) + numerals[num % b]
+    )
+
+
+def slugfname(url):
+    return "%s" % slugify(
+        re.sub(r"^https?://(?:www)?", "", url),
+        only_ascii=True,
+        lower=True
+    )[:200]
+
+
+def __setup_sitevars():
+    SiteVars = {}
+    section = 'site'
+    for o in config.options(section):
+        SiteVars.update({o: config.get(section, o)})
+
+    # add site author
+    section = 'author'
+    SiteVars.update({section: {}})
+    for o in config.options(section):
+        SiteVars[section].update({o: config.get(section, o)})
+
+    # add extra sections to author
+    for sub in config.get('author', 'appendwith').split():
+        SiteVars[section].update({sub: {}})
+        for o in config.options(sub):
+            SiteVars[section][sub].update({o: config.get(sub, o)})
+
+    # push the whole thing into cache
+    return SiteVars
+
+
+ARROWFORMAT = {
+    'iso': 'YYYY-MM-DDTHH:mm:ssZ',
+    'display': 'YYYY-MM-DD HH:mm'
+}
+
+LLEVEL = {
+    'critical': 50,
+    'error': 40,
+    'warning': 30,
+    'info': 20,
+    'debug': 10
+}
+
+REGEX = {
+    'exifdate': re.compile(
+        r'^(?P<year>[0-9]{4}):(?P<month>[0-9]{2}):(?P<day>[0-9]{2})\s+'
+        r'(?P<time>[0-9]{2}:[0-9]{2}:[0-9]{2})$'
+    ),
+    'cleanurl': re.compile(r"^https?://(?:www)?"),
+    'urls': re.compile(
+        r'\s+https?\:\/\/?[a-zA-Z0-9\.\/\?\:@\-_=#]+'
+        r'\.[a-zA-Z0-9\.\/\?\:@\-_=#]*'
+    ),
+    'mdimg': re.compile(
+        r'(?P<shortcode>\!\[(?P<alt>[^\]]+)\]\((?P<fname>[^\s]+)'
+        r'(?:\s[\'\"](?P<title>[^\"\']+)[\'\"])?\)(?:\{(?P<css>[^\}]+)\})?)',
+        re.IGNORECASE
+    )
+}
+
+config = __expandconfig()
+
+j2 = jinja2.Environment(
+    loader=jinja2.FileSystemLoader(
+        searchpath=config.get('dirs', 'tmpl')
+    ),
+    lstrip_blocks=True
+)
+
+site = __setup_sitevars()
```
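A quick sketch of the new `ExifTool` reader - the file path is hypothetical, and the `exiftool` binary has to be on `PATH`, since the class shells out to it:

```
import shared

# read a curated set of EXIF fields as a dict
exif = shared.ExifTool('/path/to/photo.jpg').read()
print(exif.get('Model'), exif.get('CreateDate'))

# EXIF timestamps come back normalised to ISO 8601
print(shared.ExifTool.exifdate2iso('2016:05:01 00:08:24'))
# -> 2016-05-01T00:08:24+0000
```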