q - Text as Data: run SQL-like queries against .csv and .tsv files.
```shell
# Fedora
dnf install q

# OS X
brew install q

# RPM package
rpm -ivh https://github.com/harelba/packages-for-q/raw/master/rpms/q-text-as-data-2.0.9-1.x86_64.rpm
# or, to upgrade an existing installation:
rpm -U https://github.com/harelba/packages-for-q/raw/master/rpms/q-text-as-data-2.0.9-1.x86_64.rpm

# DEB package (download first; dpkg cannot install directly from a URL)
wget https://github.com/harelba/packages-for-q/raw/master/deb/q-text-as-data_2.0.9-2_amd64.deb
sudo dpkg -i q-text-as-data_2.0.9-2_amd64.deb

# Windows installer:
#   https://github.com/harelba/packages-for-q/raw/master/windows/setup-q-2.0.9.exe
# tar.gz: https://github.com/harelba/q/archive/2.0.9.tar.gz
# zip:    https://github.com/harelba/q/archive/2.0.9.zip
```
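After installing, you can verify that q is on your PATH (`--version` is documented in the options at the end of this page):

```shell
# print the installed q version to confirm the installation
q --version
```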
q is a command-line tool that runs SQL-like queries directly on CSV/TSV files (and any other tabular text file). q treats an ordinary file as a database table and supports all SQL constructs, such as WHERE, GROUP BY, JOIN, and so on. It supports automatic column name and column type detection, and has full support for multiple encodings.
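To see the automatic column name/type detection in action, the `-A` flag (documented in the options below) prints the inferred column analysis instead of running the query. A minimal sketch, assuming the tab-delimited clicks.csv sample used in the examples below:

```shell
# show the detected column names and types without executing the query
q -H -t -A "SELECT * FROM ./clicks.csv"
```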
Non-English users: q fully supports all kinds of encodings. Use `-e data-encoding` to set the input data encoding, `-Q query-encoding` to set the query encoding, and `-E output-encoding` to set the output encoding. All three parameters have sensible defaults. If you run into any problems, contact the q developers.
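For example, a sketch of reading a GBK-encoded CSV while forcing UTF-8 output; the file name data-gbk.csv is hypothetical:

```shell
# read GBK-encoded input and write UTF-8 output (data-gbk.csv is a made-up example file)
q -H -d',' -e gbk -E utf-8 "SELECT * FROM ./data-gbk.csv"
```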
Files with a BOM: Python's csv module does not properly support files that contain a BOM (Byte Order Mark). q includes a workaround that allows reading UTF-8 files containing a BOM: pass `-e utf-8-sig`. Proper BOM support for all encodings is planned.
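A minimal sketch of that workaround; the file name bom.csv is hypothetical:

```shell
# utf-8-sig strips the BOM on read, so it does not pollute the first column name
q -H -d',' -e utf-8-sig "SELECT * FROM ./bom.csv"
```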
The basic syntax is `q <flags> "<query>"`. The simplest invocation is `q "SELECT * FROM myfile"`, which prints the entire contents of myfile.
Query a .csv file by specifying `,` as the delimiter:

```shell
q -d',' "SELECT * from path/to/file"
```
Query a .tsv file (`-t` is shorthand for `-d <tab>`):

```shell
q -t -H "SELECT * from path/to/file"
```
Count the distinct UUIDs in clicks.csv:

```shell
q -H -t "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"
229    <-- 229 distinct UUIDs in total
```
Query clicks.csv with a WHERE condition:

```shell
q -H -t "SELECT request_id,score FROM ./clicks.csv WHERE score > 0.7 ORDER BY score DESC LIMIT 5"
2cfab5ceca922a1a2179dc4687a3b26e 1.0
f6de737b5aa2c46a3db3208413a54d64 0.986665809568
766025d25479b95a224bd614141feee5 0.977105183282
2c09058a1b82c6dbcf9dc463e73eddd2 0.703255121794
```
Read input from standard input and compute the total size per user/group in the /tmp subtree:

```shell
sudo find /tmp -ls | q "SELECT c5,c6,sum(c7)/1024.0/1024 AS total FROM - GROUP BY c5,c6 ORDER BY total desc"
mapred hadoop 304.00390625
root root 8.0431451797485
smith smith 4.34389972687
```
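Gzipped data can be piped in the same way with `-z` (documented in the options below). A sketch assuming a hypothetical access.log.gz:

```shell
# -z tells q that the stdin stream is gzipped (access.log.gz is a made-up example file)
cat access.log.gz | q -z "SELECT c1,COUNT(*) cnt FROM - GROUP BY c1 ORDER BY cnt DESC LIMIT 5"
```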
JOIN queries with q:

```shell
q "SELECT myfiles.c8,emails.c2 FROM exampledatafile myfiles JOIN group-emails-example emails ON (myfiles.c4 = emails.c1) WHERE myfiles.c8 = 'ppp'"
ppp dip.1@otherdomain.com
ppp dip.2@otherdomain.com
```
Using the column names from the header row, list the top 3 user IDs by number of owned processes, in descending order. Note the use of the automatically detected column name UID in the query:

```shell
ps -ef | q -H "SELECT UID,COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 3"
root 152
harel 119
avahi 2
```
The full `q --help` output:

```
Usage:
        q allows performing SQL-like statements on tabular text data.

        Its purpose is to bring SQL expressive power to manipulating text data
        using the Linux command line.

        Basic usage is q "<sql-like query>" where table names are just regular
        file names (Use - to read from standard input)

        When the input contains a header row, use -H, and column names will be
        set according to the header row content. If there isn't a header row,
        then columns will automatically be named c1..cN.

        Column types are detected automatically. Use -A in order to see the
        column name/type analysis.

        Delimiter can be set using the -d (or -t) option. Output delimiter can
        be set using -D

        All sqlite3 SQL constructs are supported.

        Examples:

          Example 1: ls -ltrd * | q "select c1,count(1) from - group by c1"
            This example would print a count of each unique permission string
            in the current folder.

          Example 2: seq 1 1000 | q "select avg(c1),sum(c1) from -"
            This example would provide the average and the sum of the numbers
            in the range 1 to 1000

          Example 3: sudo find /tmp -ls | q "select c5,c6,sum(c7)/1024.0/1024 as total from - group by c5,c6 order by total desc"
            This example will output the total size in MB per user+group in
            the /tmp subtree

        See the help or https://github.com/harelba/q/ for more details.

Options:
  -h, --help            show this help message and exit
  -v, --version         Print version
  -V, --verbose         Print debug info in case of problems
  -S SAVE_DB_TO_DISK_FILENAME, --save-db-to-disk=SAVE_DB_TO_DISK_FILENAME
                        Save database to an sqlite database file
  --save-db-to-disk-method=SAVE_DB_TO_DISK_METHOD
                        Method to use to save db to disk. 'standard' does not
                        require any deps, 'fast' currently requires manually
                        running `pip install sqlitebck` on your python
                        installation. Once packing issues are solved, the
                        fast method will be the default.

  Input Data Options:
    -H, --skip-header   Skip header row. This has been changed from earlier
                        version - Only one header row is supported, and the
                        header row is used for column naming
    -d DELIMITER, --delimiter=DELIMITER
                        Field delimiter. If none specified, then space is
                        used as the delimiter.
    -t, --tab-delimited
                        Same as -d <tab>. Just a shorthand for handling
                        standard tab delimited file You can use $'\t' if you
                        want (this is how Linux expects to provide tabs in
                        the command line
    -e ENCODING, --encoding=ENCODING
                        Input file encoding. Defaults to UTF-8. set to none
                        for not setting any encoding - faster, but at your
                        own risk...
    -z, --gzipped       Data is gzipped. Useful for reading from stdin. For
                        files, .gz means automatic gunzipping
    -A, --analyze-only  Analyze sample input and provide information about
                        data types
    -m MODE, --mode=MODE
                        Data parsing mode. fluffy, relaxed and strict. In
                        strict mode, the -c column-count parameter must be
                        supplied as well
    -c COLUMN_COUNT, --column-count=COLUMN_COUNT
                        Specific column count when using relaxed or strict
                        mode
    -k, --keep-leading-whitespace
                        Keep leading whitespace in values. Default behavior
                        strips leading whitespace off values, in order to
                        provide out-of-the-box usability for simple use
                        cases. If you need to preserve whitespace, use this
                        flag.
    --disable-double-double-quoting
                        Disable support for double double-quoting for
                        escaping the double quote character. By default, you
                        can use "" inside double quoted fields to escape
                        double quotes. Mainly for backward compatibility.
    --disable-escaped-double-quoting
                        Disable support for escaped double-quoting for
                        escaping the double quote character. By default, you
                        can use \" inside double quoted fields to escape
                        double quotes. Mainly for backward compatibility.
    --as-text           Don't detect column types - All columns will be
                        treated as text columns
    -w INPUT_QUOTING_MODE, --input-quoting-mode=INPUT_QUOTING_MODE
                        Input quoting mode. Possible values are all, minimal
                        and none. Note the slightly misleading parameter
                        name, and see the matching -W parameter for output
                        quoting.
    -M MAX_COLUMN_LENGTH_LIMIT, --max-column-length-limit=MAX_COLUMN_LENGTH_LIMIT
                        Sets the maximum column length.
    -U, --with-universal-newlines
                        Expect universal newlines in the data. Limitation: -U
                        works only with regular files for now, stdin or .gz
                        files are not supported yet.

  Output Options:
    -D OUTPUT_DELIMITER, --output-delimiter=OUTPUT_DELIMITER
                        Field delimiter for output. If none specified, then
                        the -d delimiter is used if present, or space if no
                        delimiter is specified
    -T, --tab-delimited-output
                        Same as -D <tab>. Just a shorthand for outputting tab
                        delimited output. You can use -D $'\t' if you want.
    -O, --output-header
                        Output header line. Output column-names are
                        determined from the query itself. Use column aliases
                        in order to set your column names in the query. For
                        example, 'select name FirstName,value1/value2
                        MyCalculation from ...'. This can be used even if
                        there was no header in the input.
    -b, --beautify      Beautify output according to actual values. Might be
                        slow...
    -f FORMATTING, --formatting=FORMATTING
                        Output-level formatting, in the format X=fmt,Y=fmt
                        etc, where X,Y are output column numbers (e.g. 1 for
                        first SELECT column etc.
    -E OUTPUT_ENCODING, --output-encoding=OUTPUT_ENCODING
                        Output encoding. Defaults to 'none', leading to
                        selecting the system/terminal encoding
    -W OUTPUT_QUOTING_MODE, --output-quoting-mode=OUTPUT_QUOTING_MODE
                        Output quoting mode. Possible values are all,
                        minimal, nonnumeric and none. Note the slightly
                        misleading parameter name, and see the matching -w
                        parameter for input quoting.

  Query Related Options:
    -q QUERY_FILENAME, --query-filename=QUERY_FILENAME
                        Read query from the provided filename instead of the
                        command line, possibly using the provided query
                        encoding (using -Q).
    -Q QUERY_ENCODING, --query-encoding=QUERY_ENCODING
                        query text encoding. Experimental. Please send your
                        feedback on this
```
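As a closing sketch combining a few of the output options above, assuming the tab-delimited clicks.csv sample from the earlier examples: `-O` emits a header row named by the query's column aliases, `-D` sets the output delimiter, and `-b` beautifies the result:

```shell
# comma-delimited output with a header row, aligned with -b
q -H -t -O -b -D',' "SELECT uuid AS id,score FROM ./clicks.csv LIMIT 3"
```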