Sunday, November 29, 2009

spin detector

Unlike the previous Wikipedia spin detector, this would be a web tool that can analyze any web page or editorial content for spin or bias toward a particular subject matter or ideology.

The mechanism should be simple. Isolate words and phrases which indicate a clear bias towards a particular opinion and assign them 'tags'. The tags will indicate whether something is more liberal or conservative, Republican or Democrat, racist or politically correct, patriotic or revolutionary, etc. A series of filters can first identify the spin words individually, then cluster them into increasingly broad groups. Eventually the page can be highlighted to show the larger groupings of content that have a greater affinity towards a certain subject or bias, and summaries can be generated.
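
As a toy illustration of that first filter stage only, here's a minimal shell sketch. The word lists (liberal.txt, conservative.txt) and the crude scoring are hypothetical placeholders; the real tool would tag phrases, group them, and highlight the page rather than just count hits.

    #!/bin/sh
    # crude first-pass spin counter: strip markup, then count hits
    # against one-phrase-per-line word lists (hypothetical files)
    curl -s "$1" | sed 's/<[^>]*>//g' > /tmp/page.txt
    lib=$(grep -o -i -f liberal.txt /tmp/page.txt | wc -l)
    con=$(grep -o -i -f conservative.txt /tmp/page.txt | wc -l)
    echo "liberal-tagged phrases: $lib"
    echo "conservative-tagged phrases: $con"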

Certainly this could be abused to the point that people make snap judgments or rule out content based on its perceived bias, but at the same time one could run the tool over a large swath of content and determine, in a general way, whether the majority of it leans one way or the other. I'd partly like to use this to flag content that I don't want to read, but also as an indicator of whether something is filled with cruft. I don't care whether it's left-leaning or right-leaning so much as I just want to read unbiased opinions that aren't littered with mindless rhetoric.

http/socks proxies and ssh tunneling

It seems there exists no tool to simply convert one type of tunnel into another. SSH supports both TCP port forwarding and a built-in SOCKS proxy, both of which are incredibly useful, but it lacks a native HTTP proxy. The sad truth is that most applications today only support HTTP proxies, or are only beginning to support SOCKS. Until SOCKS proxies are universally supported in networking apps, I need an HTTP proxy for my SSH client.
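
For reference, both of the SSH features mentioned above are a single flag each (hostnames and port numbers here are just examples):

    # dynamic (SOCKS5) proxy on localhost:1080, no remote command, background
    ssh -D 1080 -N -f user@colobox
    # plain TCP port forward: local port 8443 -> port 443 on an internal host
    ssh -L 8443:internalhost:443 -N -f user@colobox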

There are some LD_PRELOAD apps which intercept network operations and send them through a proxy (ProxyChains, for example). That seems unportable and hackier than I'd like (are you going to change all your desktop links for 'audacious' to be prefixed with 'proxychains'?). I am willing to write an HTTP-to-SOCKS proxy but don't have the time just yet. CPAN seems to have a pure-Perl HTTP proxy server, and I can probably leverage IO::Socket::Socks to connect to the SOCKS5 server in ssh.

In any case, my immediate needs are fulfilled: I can use 'curl' to tunnel through ssh's SOCKS proxy for most of my needs. This blog entry on tunneling svn was a useful quick hack to commit my changes through an ssh-forwarded port, but it's more of a hack than I'm willing to commit to; I'd rather just enable or disable an HTTP proxy.
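
A minimal example of the curl workaround, assuming the dynamic forward from above is listening on localhost:1080:

    # fetch through ssh's built-in SOCKS5 proxy
    curl --socks5 127.0.0.1:1080 http://example.com/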

Sunday, November 22, 2009

checkup: a minimal configuration management tool

This is a braindump of an idea. I want a minimal, simple 'tool' to manage my configuration across multiple hosts. I have configs I share between work and home and my colo box, all with minor differences. Cfengine and Puppet take too much time for me to set up properly, so I'll just write something which will take slightly longer but be simpler in the end. Added later: Above all, this tool should not make silly assumptions for you about how you want to use it or what it should do - it should just do what you tell it to, without needing to know finicky magic or special parameters/syntax.

I already have Subversion set up so I'll stick with that; otherwise I'd use CVS. I need the following functionality:

* auto-update sources (run 'svn up' for me)
* check if destination is up to date, and if not, apply configuration
* search-and-replace content in files
* try to maintain state of system services
* send an alert if an error occurs
* run an external command to perform some action or edit a file post-delivery (added later)

That's about it for now. Mostly I just want it to copy files based on the host "class" (colo, home, laptop, work). Since I want it to run repeatedly, the alert is only needed because it'll be backgrounded; otherwise it will obviously report errors on stderr. It should also be able to run from the command line, in the background, etc., and lock itself appropriately. Logging all actions taken will be crucial to making sure shit is working and to debugging errors.

I don't want it to turn into a full-fledged system monitoring agent. The 'maintain state of system services' item is more of a "check if X service is enabled, and if not, enable it" thing - not starting sendmail if sendmail isn't running. On Slackware this is about as complicated as "is the rc.d file there? is it executable?", but on systems like Red Hat it's more like "does chkconfig show this as enabled for my runlevel?". I don't know what tool I'll use to monitor services; I need one, but it isn't in scope for this project.
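
For the record, both of those "is it enabled" checks are one-liners (sshd is just an example service):

    # Slackware-style: enabled means the rc script exists and is executable
    [ -x /etc/rc.d/rc.sshd ] && echo "sshd enabled"
    # Red Hat-style: ask chkconfig about the current runlevel
    rl=$(runlevel | awk '{print $2}')
    chkconfig --list sshd | grep -q "${rl}:on" && echo "sshd enabled"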

Now for a name... it's going to be checking that my configuration is sane, so let's go with "checkup". There doesn't seem to be an open source name conflict.

For config files I'm usually pretty easygoing. Lately I've been digging on .ini files, so I'll continue the trend. To make the code more flexible to change I'll make sections match subroutines, so I can just add a new .ini section and a subroutine any time I want to expand functionality. Syntax will be straightforward and plain-English, with no strange punctuation unless it makes things more readable. Multiple files will be supported, though if a file fails its syntax check it will be ignored by default, a warning thrown, and the rest of the files scanned. If any file references anything which is missing, an error will be thrown and the exit status will be non-zero.

Each section will also be free-form: the contents of the section determine what it's doing. If it defines hostnames and a hostgroup name, the name of the section should probably implicitly be a hostgroup class name. A list of files and their permissions, etc. would mean that section imposes those restrictions on those files. A set of simple logic conditionals will determine whether that class can be evaluated (e.g. "logic = if $host == 'peteslaptop'"; there is no "then blah blah" because this is just a conditional to be evaluated, and if it's false that class is ignored). Added later: the conditionals will evaluate left-to-right and include 'and' and 'or' chaining (I don't know what that's actually called in a language). While we're at it, a regex is probably acceptable to include here ('=~' in addition to '=='). Hey, it is written in Perl :)
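
To make that concrete, here's a hypothetical sketch of what a couple of sections might look like. The section names, keys, and values are all made up for illustration; only the 'logic' conditional syntax comes from the description above.

    [laptops]
    hosts = peteslaptop, workbook

    [bashrc]
    logic = if $host == 'peteslaptop' or $hostgroup == 'laptops'
    file = /home/pete/.bashrc
    mode = 0644
    owner = pete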

Added later: The tool should be able to be run by a normal user, like any other Unix tool. It should be able to be fed configs via stdin, for example. It will not serve files as a daemon, as that is completely out of scope for the tool; in fact, it shouldn't really do any network operations at all, save for tasks performed as part of the configs. In addition, all operations should be performed locally without needing to retrieve files or other information - all updates should happen before any configs are parsed. If for some reason that poses a scalability or other problem, we may allow the configs to be parsed first, then files copied to local disk, then the system examined and the configs executed, etc.

The only thing I'm not really sure about is roll-back. I always want roll-back, but it's hard to figure out how to perform such a thing when you've basically just got a lot of rules that say how stuff should be configured right now - not how it should look at some point in time. Rolling back a change to the tool's configs does not necessarily mean your system will end up that way, unless you wrote your configs explicitly so they always overwrite the current setting. For example, you might have a command that edits a file in place - how will you be able to verify that edit is correct in the future, and how would you be able to go back to what it was before you made your edit?

Probably the easiest way to get at this would be to back up everything - make a record of exactly the state of something before you change it and make a backup copy. stat() it, then if it's a file or directory, mark that you'll back it up before doing a write operation. Really, the whole system should determine exactly what it's going to do before it does it, instead of executing things as it goes down the parsed config.
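
A rough sketch of the back-up-before-write idea (the paths are placeholders, and a real implementation would record the stat() data alongside the copy):

    target=/etc/motd
    backup_root=/var/lib/checkup/backup/$(date +%Y%m%d%H%M%S)
    if [ -e "$target" ]; then
        mkdir -p "$backup_root$(dirname "$target")"
        cp -a "$target" "$backup_root$target"   # preserve perms/owner/times
    fi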

Let's say you've got your configs and you're running an update. One rule says to create file A and make sure it's empty. The next rule says to copy something into file B. The next rule says to copy file B into file A. Well wait a minute - we already did an operation on file A. What the fuck is wrong with you? Was this overwrite intentional or not? Why have the rule to create file A if you were just going to copy file B on top of it? This is where the 'safety' modes come in. In 'paranoid' mode, any operation such as this results in either a warning or an error (you should be able to set it to one or the other globally). Alternatively there will be a 'trusting' mode wherein anything that happens is assumed to happen for a reason. In 'trusting' mode, if one rule explicitly conflicts with another - you told it to set permissions to 0644 in one rule and 0755 in another - warnings will still be emitted, but not nearly as many as in 'paranoid' warning mode. Of course, all of this is separate from the inherent logging of every test or operation performed, which is there for debugging purposes.

All this is a bit grandiose already for what I need, but I might as well put it in now rather than have to add it later.

Added later: Config sections can be "include"d into other sections to provide defaults or a set of instructions wherever they are included. In this way we can modularize or reuse sections. In order to keep 'paranoid mode' above happy, we may also need to add a set of commands to perform basic I/O operations. We should also be able to define files or directories which should be explicitly backed up before an operation (such as a mysterious 'exec' call) might modify them. Of course, different variables and data types may be necessary. Besides the global '$VARIABLE' data type, we may need to provide lists or arrays of data and be able to reproduce them within a section. '@ARRAY' is one obvious choice, though for data which applies only to a given section at evaluation time (such as an array that changes each time it is evaluated), we may need a more specific way to specify that data type and its values.

Added later: The initial configuration layout has changed to a sort of filesystem overlay. The idea is to lay out your configs in the same place as they'd be on the filesystem of your target host(s), and place checkup configs in the directories to determine how files get delivered. I started out with a Puppet-like breakout of configs and templates, modularizing everything, etc., but it quickly became tedious to figure out the best way to organize it all. Putting everything in the place you'd expect it on disk is the simplest. You want an apache config? Go edit the files in checkup/filesystem/etc/apache/. You want to set up a user's home directory configs? Go to checkup/filesystem/home/user/. Just edit a .checkuprc in the directory you want and populate files as they'd be on the filesystem (your .checkuprc will have to reference them for them to be copied over, though).
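
So a hypothetical tree might look like this (the .checkuprc contents are still the .ini-style sections described above; the apache and user files are just examples):

    checkup/
        filesystem/
            etc/
                apache/
                    .checkuprc      # how/where to deliver the files below
                    httpd.conf
            home/
                user/
                    .checkuprc
                    .bashrc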

Thursday, November 19, 2009

repeating your mistakes

We ran into an issue at work again where poor planning ended up biting us in the ass. The computer does not have bugs - the program written by the human has bugs. In this case our monitoring agent couldn't send alerts from individual hosts because the MTA wasn't running, and we had no check to ensure the MTA was running.

This should have been fixed in the past. When /var would fill up, the MTA couldn't deliver mail. We added checks to alert before /var fills up (which is really stupid if you ask me; a process can create a file, seek to the end of the filesystem, and write something, and /var is full instantly, so it's possible this alert could be missed too).

So the fix here is to add a check on another host for whether the MTA is running. Great. Now we just need to assume nothing else prevents the MTA from delivering the message and we're all good. But what's the alternative? Remote syslog, plus a remote check to see if the host is down, and when it comes back up determine why it was down and reap the unreceived syslog entries? I could be crazy, but something based on Spread seems a little more lightweight and just about as reliable, and because you're removing the requirement of a mail spool (the client keeps the logs if it can't deliver the message), it reduces the complexity a tad.

At the end of the day we should have learned from our mistake the first time. Somebody should have sat down, thought of all the ways we might miss alerts in the future, worked out solutions, documented them, and assigned someone to implement them. But our architect didn't work this way, and now we lack any architect at all. Nobody is tending the light and we're doomed to repeat our mistakes over and over.

Also, we shouldn't have reinvented a whole monitoring agent when cron scripts, Spread (or collectd), and Nagios could maintain alerts just as well, and a lot more easily and quickly.

quick and dirty sandboxing using unionfs

On occasion I've wanted to do some dev work using a base system and not care what happens to the system. Usually VM images are the easiest way to do this: keep a backup copy of the image and restore it over the working one when you want to go back. But what about debugging? And diffing changes? Using a union filesystem overlay you can keep a base system and redirect its writes to a separate location without affecting the base system.

Herein lies a guide to setting up a union sandbox for development purposes using unionfs-fuse. This is the quickest, dirtiest way to perform operations in a sandbox which will not affect the base system. All writes will end up in a single directory which can be cleaned out between uses. With debugging enabled one can see any write operations that take place in the sandbox, allowing a more fine-grained look at the effects of an application on a system.

Note that unionfs-fuse is not as production-ready as a kernel-mode unionfs (aufs is an alternative), but this method does not require kernel patching. Also note that this setup may produce unexpected results on a "root" filesystem.

Also note that this guide is for a basic 'chroot' environment. The process table and devices are shared with the host system, so anything done by a process could kill the host system's processes or damage hardware. Always use caution when in a chroot environment. A safer method is replicating the sandbox in a LiveDVD with writes going to a tmpfs filesystem. The image could be booted from VMware to speed development.

Unfortunately it seems the current unionfs-fuse does not handle files which need to be mmap()'d. A kernel solution may be a better long-term fix, but for the short term there is a workaround included below (step 5).

  1. set up unionfs
     # Make sure kernel-* is not excluded from yum.conf
     yum -y install kernel-devel dkms dkms-fuse fuse fuse-devel
     /etc/init.d/fuse start
     yum -y install fuse-unionfs

  2. cloning a build box
     # copy the base system, then create the directories that weren't copied
     mkdir sandbox
     cd sandbox
     rsync --progress -a /.autofsck /.autorelabel /.bash_history /bin /boot /dev /etc \
       /home /lib /lib64 /mnt /opt /sbin /selinux /srv /usr /var .
     mkdir proc sys tmp spln root
     chmod 1777 tmp
     cd ..

  3. setting up the unionfs
     # writes/ catches all changes; sandbox/ stays read-only underneath
     mkdir writes mount
     unionfs -o cow -o noinitgroups -o default_permissions -o allow_other -o use_ino \
       -o nonempty `pwd`/writes=RW:`pwd`/sandbox=RO `pwd`/mount

  4. using the sandbox
     mount -o bind /proc `pwd`/mount/proc
     mount -o bind /sys `pwd`/mount/sys
     mount -o bind /dev `pwd`/mount/dev
     mount -t devpts none `pwd`/mount/dev/pts
     mount -t tmpfs none `pwd`/mount/dev/shm
     chroot `pwd`/mount /bin/bash --login

  5. handling mmap()'d files
     # rpm's database is mmap()'d, so bind-mount a real copy over the union mount
     mkdir mmap-writes
     cp -a --parents sandbox/var/lib/rpm mmap-writes/
     mount -o bind `pwd`/mmap-writes/sandbox/var/lib/rpm `pwd`/mount/var/lib/rpm

Wednesday, November 4, 2009

Bad excuses for bad security

This document explains Pidgin's policies on storing passwords. In effect, they are:

  • Most passwords sent over IM services are plain-text, so a man-in-the-middle can sniff them.
  • Other IM clients are equally insecure.
  • You should not save the password at all because then nobody can attempt to decipher it.
  • Obfuscating passwords isn't secure. (Even though a real encrypted stored password isn't obfuscation)
  • You shouldn't store sensitive data if there is a possibility someone might try to access it.
  • It won't kill you to type your password every time you log in.
  • We would rather you use a "desktop keyring" which isn't even portable or finished being written yet.

These explanations are really a verbose way of saying "we don't feel like implementing good security." I've been using Mozilla and Thunderbird for years now with a master password, which works similarly to a desktop keychain.

The idea is simple: encrypt the passwords in a database with a central key. When the application is opened or a login is attempted, ask the user for the master password. If it is correct, unlock the database and get the credentials you need. This way only the user of the current session of the application can access the stored passwords.
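
This isn't Pidgin's or Mozilla's actual implementation, but the on-disk half of the idea is roughly what you get from openssl's symmetric encryption, with the passphrase standing in for the master password (file names here are placeholders):

    # encrypt the credential store; openssl prompts for the master password
    openssl enc -aes-256-cbc -salt -in accounts.txt -out accounts.db
    rm accounts.txt
    # later, unlock it for the running session (prompts again)
    openssl enc -d -aes-256-cbc -in accounts.db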

What you gain is "security on disk": the data on the hard drive is secure. There are still plenty of ways to extract the passwords from a running system, but an attacker who only has access to the disk is far less likely to recover them. This is most useful for laptops and corporate workstations, where you don't necessarily control access to the hard drive.

Policies like the one described above should not be tolerated in the open-source community. It's clear to anyone who actually cares about the integrity of their data that these developers are simply refusing to implement a modicum of good security because they have issues with people's perception of security. I don't agree with obfuscating passwords - if you're just scrambling them on disk without a master password, that's no security at all. But a master password allows true encryption of the password database and thus secures the data on disk.

It would be nice if we could all have encrypted hard drives and encrypted home directories. Alas, not every environment is so flexible.