Writing queries in ClickHouse using GitHub data

This dataset contains all commits and changes for the ClickHouse repository. It can be generated with the native git-import tool distributed with ClickHouse.

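The generated data provides one TSV file for each of the following tables:

commits - commits with statistics.
file_changes - files changed in every commit, with information about the change and statistics.
line_changes - every changed line in every changed file in every commit, with full information about the line and information about the previous change to that line.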

As of November 8th, 2022, each TSV has approximately the following size and number of rows:

commits - 7.8M - roughly 62.78 thousand rows
file_changes - 53M - 266,051 rows
line_changes - 2.7G - 7,535,157 rows

Generating the data

This step is optional. The data is freely available; see Downloading and inserting the data.

git clone git@github.com:ClickHouse/ClickHouse.git
cd ClickHouse
clickhouse git-import --skip-paths 'generated\.cpp|^(contrib|docs?|website|libs/(libcityhash|liblz4|libdivide|libvectorclass|libdouble-conversion|libcpuid|libzstd|libfarmhash|libmetrohash|libpoco|libwidechar_width))/' --skip-commits-with-messages '^Merge branch '

For the ClickHouse repository, this takes around 3 minutes to complete (as of November 8th, 2022, measured on a MacBook Pro 2021).

A full list of available options can be obtained from the tool's built-in help.
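The --skip-paths option in the command above takes a regular expression, and it can be hard to see at a glance which files it excludes. The query below is a minimal sketch for trying the pattern against a few paths. It assumes that git-import applies the pattern as an unanchored (partial) match, which is how ClickHouse's match() function behaves; the sample paths are hypothetical and chosen only for illustration.

```sql
-- Sketch: test a few sample paths against the --skip-paths pattern.
-- Assumption: git-import treats the pattern as a partial regex match.
-- The backslash is doubled because of SQL string escaping.
SELECT
    path,
    match(path, 'generated\\.cpp|^(contrib|docs?|website|libs/(libcityhash|liblz4|libdivide|libvectorclass|libdouble-conversion|libcpuid|libzstd|libfarmhash|libmetrohash|libpoco|libwidechar_width))/') AS skipped
FROM
(
    SELECT arrayJoin([
        'contrib/zlib-ng/zlib.h',                      -- hypothetical path
        'docs/en/index.md',                            -- hypothetical path
        'src/Parsers/Lexer.generated.cpp',             -- hypothetical path
        'src/Storages/StorageReplicatedMergeTree.cpp'  -- directory is assumed
    ]) AS path
);
```

A value of 1 in the skipped column means the pattern matches the path, so under the stated assumption the file would be left out of the import.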

clickhouse git-import -h

This help also provides the DDL for each of the above tables, for example:

CREATE TABLE git.commits
(
    hash String,
    author LowCardinality(String),
    time DateTime,
    message String,
    files_added UInt32,
    files_deleted UInt32,
    files_renamed UInt32,
    files_modified UInt32,
    lines_added UInt32,
    lines_deleted UInt32,
    hunks_added UInt32,
    hunks_removed UInt32,
    hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

These queries should work on any repository. Feel free to explore and share your findings. Some guidelines on execution times, as of November 2022:

Linux - ~/clickhouse git-import - 160 mins

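Downloading and inserting the data

The following data can be used to reproduce a working environment. Alternatively, this dataset is available on play.clickhouse.com; see Queries for further details.

Generated files for the following repositories are available:

ClickHouse (November 8th, 2022)
Linux (November 8th, 2022)

To insert this data, prepare the database by executing the following queries: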

DROP DATABASE IF EXISTS git;
CREATE DATABASE git;

CREATE TABLE git.commits
(
    hash String,
    author LowCardinality(String),
    time DateTime,
    message String,
    files_added UInt32,
    files_deleted UInt32,
    files_renamed UInt32,
    files_modified UInt32,
    lines_added UInt32,
    lines_deleted UInt32,
    hunks_added UInt32,
    hunks_removed UInt32,
    hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

CREATE TABLE git.file_changes
(
    change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
    path LowCardinality(String),
    old_path LowCardinality(String),
    file_extension LowCardinality(String),
    lines_added UInt32,
    lines_deleted UInt32,
    hunks_added UInt32,
    hunks_removed UInt32,
    hunks_changed UInt32,

    commit_hash String,
    author LowCardinality(String),
    time DateTime,
    commit_message String,
    commit_files_added UInt32,
    commit_files_deleted UInt32,
    commit_files_renamed UInt32,
    commit_files_modified UInt32,
    commit_lines_added UInt32,
    commit_lines_deleted UInt32,
    commit_hunks_added UInt32,
    commit_hunks_removed UInt32,
    commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

CREATE TABLE git.line_changes
(
    sign Int8,
    line_number_old UInt32,
    line_number_new UInt32,
    hunk_num UInt32,
    hunk_start_line_number_old UInt32,
    hunk_start_line_number_new UInt32,
    hunk_lines_added UInt32,
    hunk_lines_deleted UInt32,
    hunk_context LowCardinality(String),
    line LowCardinality(String),
    indent UInt8,
    line_type Enum('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),

    prev_commit_hash String,
    prev_author LowCardinality(String),
    prev_time DateTime,

    file_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
    path LowCardinality(String),
    old_path LowCardinality(String),
    file_extension LowCardinality(String),
    file_lines_added UInt32,
    file_lines_deleted UInt32,
    file_hunks_added UInt32,
    file_hunks_removed UInt32,
    file_hunks_changed UInt32,

    commit_hash String,
    author LowCardinality(String),
    time DateTime,
    commit_message String,
    commit_files_added UInt32,
    commit_files_deleted UInt32,
    commit_files_renamed UInt32,
    commit_files_modified UInt32,
    commit_lines_added UInt32,
    commit_lines_deleted UInt32,
    commit_hunks_added UInt32,
    commit_hunks_removed UInt32,
    commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

Insert the data using INSERT INTO SELECT and the s3 function. For example, below we insert each of the ClickHouse files into its respective table:

commits

INSERT INTO git.commits SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/commits.tsv.xz', 'TSV', 'hash String, author LowCardinality(String), time DateTime, message String, files_added UInt32, files_deleted UInt32, files_renamed UInt32, files_modified UInt32, lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32')

0 rows in set. Elapsed: 1.826 sec. Processed 62.78 thousand rows, 8.50 MB (34.39 thousand rows/s., 4.66 MB/s.)

file_changes

INSERT INTO git.file_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/file_changes.tsv.xz', 'TSV', 'change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6), path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32, commit_hash String, author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32')

0 rows in set. Elapsed: 2.688 sec. Processed 266.05 thousand rows, 48.30 MB (98.97 thousand rows/s., 17.97 MB/s.)

line_changes

INSERT INTO git.line_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/line_changes.tsv.xz', 'TSV', 'sign Int8, line_number_old UInt32, line_number_new UInt32, hunk_num UInt32, hunk_start_line_number_old UInt32, hunk_start_line_number_new UInt32, hunk_lines_added UInt32, hunk_lines_deleted UInt32, hunk_context LowCardinality(String), line LowCardinality(String), indent UInt8, line_type Enum(\'Empty\' = 0, \'Comment\' = 1, \'Punct\' = 2, \'Code\' = 3), prev_commit_hash String, prev_author LowCardinality(String), prev_time DateTime, file_change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6), path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), file_lines_added UInt32, file_lines_deleted UInt32, file_hunks_added UInt32, file_hunks_removed UInt32, file_hunks_changed UInt32, commit_hash String, author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32')

0 rows in set. Elapsed: 50.535 sec. Processed 7.54 million rows, 2.09 GB (149.11 thousand rows/s., 41.40 MB/s.)
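Once the three inserts have finished, a quick sanity check is to compare the row count of each table with the "Processed ... rows" figures reported above. The following is a minimal sketch, assuming the tables were created in the git database exactly as shown earlier.

```sql
-- Sketch: one row per table with its row count.
-- The counts should be close to the figures in the insert output above.
SELECT 'commits' AS table_name, count() AS row_count FROM git.commits
UNION ALL
SELECT 'file_changes', count() FROM git.file_changes
UNION ALL
SELECT 'line_changes', count() FROM git.line_changes;
```

If a count is far lower than expected, the corresponding insert was probably interrupted and can be re-run after truncating that table.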

Queries

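The tool suggests several queries in its help output. We answer these here, along with some additional supplementary questions of interest. The queries are ordered by roughly increasing complexity rather than in the tool's arbitrary order.

This dataset is available on play.clickhouse.com in the git_clickhouse database. We provide a link to this environment for every query, adapting the database name as required. Note that because the data was collected at different times, results on play may differ from those presented here.

History of a single file

The simplest of queries. Here we look at all commit messages for StorageReplicatedMergeTree.cpp. Since the most recent messages are likely to be more interesting, we sort so that they come first.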
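Adapting the database name means changing only the prefix in front of the table name. The sketch below shows the same count against a local instance and against the play environment; it assumes the play tables keep the same names as the local ones (commits, file_changes, line_changes), which this text does not state explicitly.

```sql
-- Local instance, using the git database created earlier:
SELECT count() FROM git.commits;

-- play.clickhouse.com, using the git_clickhouse database
-- (table name assumed to match the local one):
SELECT count() FROM git_clickhouse.commits;
```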
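The query text itself does not appear at this point, so the following is a sketch that matches the description rather than the original query. It uses only columns from the git.file_changes DDL above; the full path of the file and the LIMIT value are assumptions made for illustration.

```sql
-- Sketch: most recent changes to a single file, newest first.
-- Assumptions: the file lives under src/Storages/, and 10 rows are enough.
SELECT
    time,
    substring(commit_hash, 1, 11) AS commit,  -- shortened hash for readability
    change_type,
    author,
    lines_added,
    lines_deleted,
    commit_message
FROM git.file_changes
WHERE path = 'src/Storages/StorageReplicatedMergeTree.cpp'
ORDER BY time DESC
LIMIT 10;
```

This filter follows the file only under its current name. If the file was renamed at some point, earlier history would presumably be recorded under the value found in old_path.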

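Generating the data

Downloading and inserting the data

Queries

History of a single file

Find the current active files

List files with the most modifications

What day of the week do commits usually occur?

History of a subdirectory/file - number of lines, commits and contributors over time

List files with the maximum number of authors

Oldest lines of code in the repository

Files with the longest history

Distribution of contributors with respect to docs and code over the month

Authors with the most diverse impact

Favorite files for an author

Largest files with the lowest number of authors

Commits and lines of code distribution by time; by weekday, by author; for specific subdirectories

Matrix of authors that shows which authors tend to rewrite other authors' code

Who is the highest percentage contributor per day of week?

Distribution of code age across the repository

What percentage of an author's code has been removed by other authors?

List files that were rewritten the most

On what weekday does code have the highest chance of staying in the repository?

Files sorted by average code age

Who tends to write more tests / CPP code / comments?

How does the code/comment ratio of an author's commits change over time?

What is the average time before code is rewritten, and the median (half-life of code decay)?

What is the worst time to write code, in the sense that the code has the highest chance of being rewritten?

Which authors' code is the most sticky?

Most consecutive days of commits by an author

Line-by-line commit history of a file

Unsolved questions

Git blame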
